
Backtesting & data — how to not lie to yourself

Walk-forward, Deflated Sharpe, look-ahead bias, cost modeling.


FX Backtesting Bible: Frameworks, Data, and Why Most Retail Backtests Lie

A practical, opinionated reference for anyone building an AI-assisted FX trading system. If your backtest is wrong, everything downstream — sizing, risk, capital allocation — is wrong with it. Read this before you trust a single equity curve.


1. The Pitfalls That Kill 90% of Retail Backtests

1.1 Look-ahead bias (the silent killer)

Look-ahead bias is the use of information at time t that was not actually available at time t. In FX, it shows up in subtle ways:

  • Indicator computed mid-bar. You compute a 20-period SMA using close of the current bar, then trade at the open of the same bar. The bar's close didn't exist yet when you traded. Result: massively inflated win rate.
  • Higher-timeframe contamination. A 1H strategy that reads the current daily candle's high/low. The daily bar isn't closed until 5pm NY — you've embedded the whole day's range into a 9am decision.
  • Resample-then-shift errors. Pandas resample("1H").last() labels each bar by its start time, so the row stamped 09:00 already contains the 09:59 price. If you don't shift indicators by one bar, you're trading on data from the future.
  • Fill semantics. Limit fills assumed at the bar's low/high without checking sequence. If your stop was inside the bar and your TP was also inside the bar, which hit first? Tick data is the only honest answer.

Defensive pattern (pandas):

# WRONG — uses current bar to decide on current bar
df["sma"] = df["close"].rolling(20).mean()
df["signal"] = (df["close"] > df["sma"]).astype(int)
df["ret"] = df["signal"] * df["close"].pct_change()

# RIGHT — decide on t-1, execute at t open
df["sma"] = df["close"].rolling(20).mean().shift(1)
df["signal"] = (df["close"].shift(1) > df["sma"]).astype(int)
df["ret"] = df["signal"] * (df["open"].shift(-1) / df["open"] - 1)

Freqtrade ships a lookahead-analysis command that compares full-history vs incremental backtests; if results diverge, you have leakage. Run it before trusting anything.
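The higher-timeframe case can be guarded in a similar way. The following is a minimal sketch, assuming daily bars are stamped with the time they closed (22:00 UTC here, which is 17:00 New York in winter); the frames, column names, and prices are hypothetical.

```python
import pandas as pd

# Hypothetical daily bars, each stamped with the moment it CLOSED.
daily = pd.DataFrame({
    "close_time": pd.to_datetime(["2024-01-02 22:00", "2024-01-03 22:00"], utc=True),
    "day_high": [1.0960, 1.0975],
})

# Hypothetical hourly decision times.
hourly = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-01-03 14:00", "2024-01-03 22:00", "2024-01-03 23:00"], utc=True
    ),
})

# direction="backward" attaches the last daily bar whose close_time <= time,
# so an hourly decision never sees a daily bar that is still forming.
safe = pd.merge_asof(
    hourly, daily, left_on="time", right_on="close_time", direction="backward"
)
```

The 14:00 row receives the previous day's high, not the high of the day in progress; only rows at or after the daily close see the new bar.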

1.2 Survivorship bias

In FX this is small but non-zero. Pairs do become untradeable (RUB suspended by most retail brokers in 2022, TRY pairs restricted or re-margined during the 2018 lira crisis, exotic crosses that brokers drop quietly). If you backtest "all G10 + EM crosses" using a current symbol list, you're already biased. Treat the symbol universe as a time-indexed set, not a static list.

1.3 Overfitting / curve-fitting

The classic tell: equity curve climbs in-sample like a staircase, then breaks the moment you cross the out-of-sample boundary. If you optimized 8 parameters on 2 years of data, you fit noise.

Three rules:

  1. Fewer parameters than you think you need. Every free parameter is another degree of freedom for fitting noise; as a rough heuristic, treat each one as halving your effective sample.
  2. Parameter plateaus, not peaks. If SMA=27 works and SMA=26 or SMA=28 blow up, you found a hole in noise, not a signal.
  3. Count your trials. The backtest-overfitting paper by Bailey, Borwein, López de Prado & Zhu (Pseudo-Mathematics and Financial Charlatanism, 2014) shows that with 5 years of daily data, trying ~45 independent strategy variants is enough for the expected best in-sample Sharpe to reach 1.0, purely by chance.
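The trial-count effect in rule 3 can be checked numerically. This is a rough sketch, assuming independent zero-skill trials whose annualized Sharpe estimates have a standard deviation of about 1/√years; the function name is illustrative.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649

def expected_max_sharpe(n_trials: int, years: float) -> float:
    """Approximate expected best annualized in-sample Sharpe among
    n_trials independent zero-skill strategies backtested over `years`."""
    inv = NormalDist().inv_cdf
    z = ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
         + EULER_GAMMA * inv(1 - 1 / (n_trials * e)))
    # ASSUMPTION: under the null, an annualized Sharpe estimate has
    # a standard deviation of roughly 1 / sqrt(years).
    return z / sqrt(years)

best_of_45 = expected_max_sharpe(45, 5.0)
best_of_1000 = expected_max_sharpe(1000, 5.0)
```

Under these assumptions, 45 trials over 5 years already put the expected best Sharpe close to 1.0, and the figure keeps growing with the number of trials.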

1.4 In-sample / out-of-sample discipline

Split rules:

  • Reserve at least 30% of the most recent data as untouched OOS. Never look at it during development.
  • If you peek even once and tweak, that data is now in-sample. Burn it and use new OOS.
  • For FX, prefer regime-spanning splits: train through 2015–2019, OOS 2020–2022 (COVID + low-rate), holdout 2023–2025 (hiking + carry regime).

1.5 Walk-forward analysis

Anchored or rolling-window re-optimization. Pseudocode:

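oos_returns = []
for train_end in train_end_dates:
    train = data[:train_end]                    # anchored; slice a fixed-length window instead for rolling
    test  = data[train_end : train_end + step]  # must start strictly after the last training bar
    params = optimize(train)                    # fit on training data only
    oos_returns.append(run(test, params))       # keep only the out-of-sample result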

Then evaluate only the stitched OOS series. If your in-sample Sharpe is 2.5 but the stitched OOS Sharpe is 0.4, the strategy doesn't exist — you just keep refitting noise.

1.6 The multiple-testing problem

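Run 1,000 random strategies on EURUSD and test each at the 5% level (two-sided). By chance alone, about 25 will look significantly profitable and another 25 significantly unprofitable; over a two-year sample the luckiest will show an annualized Sharpe above 2. They are noise. The fix is the Deflated Sharpe Ratio (§5).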
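A small simulation makes the point concrete. This sketch assumes zero-skill strategies whose daily returns are IID normal noise; the volatility, sample length, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_days = 1000, 504          # about two years of daily returns

# Zero-skill strategies: daily returns are pure noise with zero mean.
returns = rng.normal(loc=0.0, scale=0.005, size=(n_strategies, n_days))

daily_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
annual_sr = daily_sr * np.sqrt(252)
t_stat = daily_sr * np.sqrt(n_days)       # t-statistic of the mean daily return

# Strategies that look "significantly profitable" purely by luck.
n_significant = int((t_stat > 1.96).sum())
best_sr = float(annual_sr.max())
```

None of these strategies has any edge, yet a couple of dozen pass the significance filter and the best one looks like a strong track record.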

1.7 Spread / slippage / commission

Ignoring costs is the single biggest source of backtest fraud. Concrete numbers:

| Pair | Typical retail spread | Stress / news spread | Slippage on market order |
|---|---|---|---|
| EURUSD | 0.6–1.0 pips | 3–8 pips | 0.2–1 pips |
| GBPJPY | 1.8–2.5 pips | 8–20 pips | 0.5–3 pips |
| Exotics (USDTRY) | 8–30 pips | 100+ pips | 5–20 pips |

A scalping strategy showing 20% annual with 0 cost can become -15% after 0.8 pip spread + 0.3 pip slippage. Always model:

  • Half-spread on entry, half-spread on exit (bid/ask asymmetry).
  • Latency slippage — quote you saw isn't the quote you get, even on ECN.
  • Funding/swap for overnight holds. For carry strategies this is the strategy.
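The per-trade arithmetic behind that warning is short. This sketch assumes the slippage figure applies to each fill (entry and exit); the function names and the 1.0 pip gross edge are hypothetical.

```python
def round_trip_cost_pips(spread_pips: float, slippage_pips: float) -> float:
    """Cost of one entry plus one exit, in pips.

    Half the spread is paid on each side (one full spread per round trip).
    ASSUMPTION: slippage hits both the entry fill and the exit fill.
    """
    return 2 * (spread_pips / 2) + 2 * slippage_pips

def net_pips_per_trade(gross_pips: float, spread_pips: float,
                       slippage_pips: float) -> float:
    """Average edge per trade after spread and slippage."""
    return gross_pips - round_trip_cost_pips(spread_pips, slippage_pips)

# A scalper averaging +1.0 pip per trade before costs (hypothetical figure).
cost = round_trip_cost_pips(spread_pips=0.8, slippage_pips=0.3)
net = net_pips_per_trade(gross_pips=1.0, spread_pips=0.8, slippage_pips=0.3)
```

With these inputs the round trip costs 1.4 pips, so a 1.0 pip gross edge becomes a 0.4 pip loss per trade. Overnight swap would be a further term on any position held through rollover.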

1.8 Timezone bugs (the FX-specific landmine)

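  • Most MT4/MT5 brokers run their servers on GMT+2 (GMT+3 during daylight saving) so that daily candles close at 5pm New York time.
  • HistData timestamps are fixed EST (UTC-5) with no daylight-saving adjustment. Dukascopy's raw tick files are in UTC; any daily aggregation depends on the timezone set in the export tool.
  • TrueFX uses UTC.
  • OANDA's v20 API returns UTC timestamps but aligns daily candles to 17:00 New York by default (the dailyAlignment and alignmentTimezone parameters), and their MT4 server is NY-aligned.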

Mix two of these and you get 6 daily candles per week, a phantom Sunday bar, or a session filter that's off by an hour twice a year (DST). Always store ticks in UTC and apply session masks explicitly.
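One way to apply the session boundary explicitly is sketched below. It assumes bars are stamped in UTC with their start time and that the trading day ends at 17:00 New York; the column names are illustrative.

```python
import pandas as pd

# Hypothetical hourly bar timestamps in UTC, spanning the US DST change on 10 Mar 2024.
utc = pd.date_range("2024-03-07 00:00", "2024-03-13 23:00",
                    freq=pd.Timedelta(hours=1), tz="UTC")
bars = pd.DataFrame({"utc": utc})

# New York wall-clock time with the timezone dropped, so the
# arithmetic below is done in local hours.
ny_wall = utc.tz_convert("America/New_York").tz_localize(None)

# Adding 7 hours moves 17:00 New York to midnight, so the calendar date
# of the shifted time is the FX trading day the bar belongs to.
bars["trading_day"] = (ny_wall + pd.Timedelta(hours=7)).normalize()

# Last bar of each trading day, expressed in UTC.
last_bar_utc = bars.groupby("trading_day")["utc"].last()
```

The last bar of the trading day starts at 21:00 UTC before the clock change and at 20:00 UTC after it, which is exactly the one-hour drift a fixed UTC session filter would miss.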


2. Backtesting Frameworks — Honest 2026 Comparison

| Framework | FX fit | Speed | Realism | Maintained 2026 | Learning curve |
|---|---|---|---|---|---|
| backtesting.py | OK | Medium | Low | Yes (light) | Easy |
| vectorbt | Good for research | Extreme | Low–medium | Yes (active) | Medium |
| nautilus_trader | Excellent | High | Very high | Yes (very active, bi-weekly) | Hard |
| backtrader | OK | Slow | Medium | No (community only) | Medium |
| zipline-reloaded | Poor (equities-first) | Medium | Medium | Yes (community) | Hard |
| QuantConnect / LEAN | Excellent | High (cloud) | High | Yes (active) | Medium |
| MT4/MT5 Strategy Tester | Native | Medium | Variable (99% only with proper tick CSV) | Yes | Easy → expert curve |

backtesting.py — Great for first prototypes. Single-asset, no portfolio, no live trading. Use it for sanity checks on signal logic, not final validation.

vectorbt (and PRO) — King of parameter sweeps. Millions of backtests in seconds via NumPy + Numba. Catastrophic if your model has path dependence (trailing stops, dynamic sizing) — you'll need event-driven mode and you lose most of the speed advantage. FX gotcha: vectorbt assumes instantaneous fills at bar prices; you must manually inject spread/slippage as a cost vector.

nautilus_trader — Rust core, Python API, nanosecond resolution, identical code for backtest and live. Ships a simulated FX ECN venue with quote-tick support, fixed-margin model, and leverage handling. Best choice if you ever plan to go live. Steep curve — you'll spend a week on the data catalog and Cython types before your first run.

backtrader — Widely written about, but the original author stopped maintaining it years ago. Community forks exist but they're patchy. Avoid for new projects.

zipline-reloaded — Revival of Quantopian's engine. Heavy equities bias (Quantopian DNA), forex bolt-on works but feels grafted. Bundles and ingestion are a pain. Skip unless you specifically want the Pipeline API.

QuantConnect / LEAN — Open-source C# engine, Python API, free tick-level FX history back ~20 years. The "go live without leaving the platform" story is the cleanest in the industry. Downsides: cloud lock-in unless you self-host LEAN (which is doable but non-trivial), free tier compute is throttled.

MetaTrader Strategy Tester — Only honest at "every tick based on real ticks" mode with a proper third-party tick CSV import (Tickstory, Dukascopy → .fxt). Out-of-the-box "99% modeling quality" is misleading: the percentage measures interpolation completeness, not whether the underlying ticks match what you'd execute on. Useful if you're shipping an EA to a broker; do not trust it as a primary research environment.

Recommended stack: vectorbt for parameter sweeps → nautilus_trader (or LEAN) for high-fidelity validation → broker demo for forward test. Same strategy code on both endpoints is the only honest path.


3. FX Data Sources

| Source | Cost | Resolution | Coverage | Spread data | Quality |
|---|---|---|---|---|---|
| Dukascopy | Free | Tick (bid+ask) | ~2003-present, ~70 pairs | Yes (bid+ask separately) | Industry reference for free data |
| HistData.com | Free | 1m, tick (1 pair/month/IP throttle) | 2000-present, ~70 pairs | Mid only (ask=bid usually) | OK; some gaps, no weekend |
| TrueFX | Free | Tick (bid+ask) | 2009-present, majors only (~10 pairs) | Yes | Excellent for what's there; gaps documented |
| OANDA v20 API | Free w/ account | Tick / candles | ~2005-present (account-dependent) | Bid+ask | Single venue (your broker's quotes) |
| Polygon.io | Paid ($79+/mo) | Tick | 2009-present | Bid+ask | Consolidated; good cross-checks |
| Tickdata.com | Paid ($$$) | Tick | Decades | Yes, multi-venue | Institutional grade |

Practical rules:

  1. Backtest on Dukascopy, validate on your broker's API. Dukascopy is a Swiss ECN — wider, more honest spreads than a market-maker retail broker. If your strategy works on Dukascopy ticks and your broker's recent quotes, it's robust to venue.
  2. TrueFX for majors-only HFT-style research. Their tick stream is the cleanest free interbank-style feed; great for spread-modeling sanity.
  3. Never mix sources in a single series. Dukascopy and OANDA disagree on every tick. Splice points create false signals.
  4. Always store with explicit bid/ask, never mid. Spread is a feature, not a nuisance.
  5. Use duka or dukascopy-node for ingestion. Both are maintained 2026 wrappers around Dukascopy's binary .bi5 format.

Storage format: Parquet partitioned by year/month/symbol, columns timestamp_utc_ns, bid, ask, bid_volume, ask_volume. Avoid CSV — at tick resolution EURUSD alone is hundreds of GB.


4. Walk-Forward & Cross-Validation Done Right

Why k-fold CV is broken for FX

Standard shuffled k-fold randomly assigns observations to folds. On a time series, this trains on the future to predict the past. Even sequential (unshuffled) k-fold leaks because of:

  • Autocorrelation — adjacent bars share information.
  • Label horizon overlap — a label computed over a 5-day forward window at time t overlaps with the same label at t+1.

López de Prado's Purged K-Fold CV

From Advances in Financial Machine Learning (Wiley, 2018), Chapter 7. Two corrections:

  1. Purging — remove training observations whose labels overlap (in time) with the test set.
  2. Embargo — add a buffer (typically 0.01 * T to 0.02 * T) after each test fold before training data resumes, to neutralize serial correlation.
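The two corrections can be sketched as an index generator. This is an illustrative simplification, not López de Prado's reference code: it assumes observation i carries a label built from bars i to i + label_horizon, and the function name and defaults are hypothetical.

```python
import numpy as np

def purged_kfold_indices(n_obs, n_splits=5, label_horizon=5, embargo_pct=0.01):
    """Yield (train_idx, test_idx) for contiguous test folds.

    ASSUMPTION: observation i has a label spanning bars i..i+label_horizon.
    """
    embargo = int(np.ceil(n_obs * embargo_pct))
    indices = np.arange(n_obs)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        # Purge: drop training rows whose label window overlaps the test labels.
        purge_lo = start - label_horizon
        purge_hi = stop + label_horizon            # exclusive upper bound
        # Embargo: extra buffer after the test fold for serial correlation.
        blocked_hi = purge_hi + embargo
        train_mask = (indices < purge_lo) | (indices >= blocked_hi)
        yield indices[train_mask], test_idx

splits = list(purged_kfold_indices(n_obs=1000, n_splits=5,
                                   label_horizon=5, embargo_pct=0.01))
train_2, test_2 = splits[1]
```

For the second fold (rows 200 to 399), training rows 195 to 414 are excluded: five on each side for label overlap, plus a ten-row embargo after the fold.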

Sklearn-style implementation lives in mlfinlab (community fork) and López de Prado's own snippets.

Combinatorial Purged CV (CPCV)

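Instead of one walk-forward path, split the data into N groups and use every combination of k groups as the test set, giving C(N, k) train/test splits. With N=10 and k=2 that is 45 splits, and because each group appears in a test set C(N−1, k−1) = 9 times, the splits recombine into 9 complete backtest paths (φ = k/N · C(N, k)). The paths share training data, so they are not independent, but aggregate statistics across them give a far more honest picture than a single OOS run.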
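The split and path counts can be verified by enumeration. A minimal sketch, with an illustrative helper name:

```python
from itertools import combinations
from math import comb

def cpcv_counts(n_groups: int, k_test: int) -> tuple[int, int]:
    """Return (number of train/test splits, number of backtest paths)."""
    n_splits = comb(n_groups, k_test)
    # Each split tests k groups; a full path needs every group tested once.
    n_paths = n_splits * k_test // n_groups
    return n_splits, n_paths

n_splits, n_paths = cpcv_counts(10, 2)

# Cross-check by brute force: how often does group 0 land in a test set?
all_splits = list(combinations(range(10), 2))
appearances_of_group_0 = sum(1 for s in all_splits if 0 in s)
```

Each group is tested as many times as there are paths, which is why the splits can be stitched into that many full-length out-of-sample series.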

Anchored vs rolling walk-forward

  • Anchored: train window grows; OOS window slides forward. Better when the data-generating process is stationary.
  • Rolling: fixed-size train window slides. Better for FX, where regimes change (carry vs risk-off vs hiking cycles).

Default to rolling with re-optimization every 3–6 months of OOS.
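The difference between the two schemes comes down to where the training window starts. A minimal sketch over integer bar positions, with hypothetical window sizes:

```python
from typing import Iterator

def walk_forward_windows(n_obs: int, train_size: int, test_size: int,
                         anchored: bool = False) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges; each test block starts where training ends."""
    test_start = train_size
    while test_start + test_size <= n_obs:
        # Anchored keeps the first bar forever; rolling drops old data.
        train_start = 0 if anchored else test_start - train_size
        yield (range(train_start, test_start),
               range(test_start, test_start + test_size))
        test_start += test_size

rolling = list(walk_forward_windows(1000, train_size=500, test_size=100))
anchored = list(walk_forward_windows(1000, train_size=500, test_size=100,
                                     anchored=True))

# The stitched OOS series is the concatenation of all test blocks.
stitched_oos = [i for _, test in rolling for i in test]
```

Both variants produce the same test blocks; only the training history differs, so any performance gap between them is a direct read on how much old regimes help or hurt.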


5. Statistical Significance — Stop Lying With Sharpe

Lo (2002) — The Statistics of Sharpe Ratios

Andrew Lo's Financial Analysts Journal paper derived the asymptotic distribution of the Sharpe ratio estimator. The key result: for IID returns,

Var(SR_hat) ≈ (1 + 0.5 · SR²) / T

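Here SR is the per-period (for example daily) Sharpe ratio and T the number of observations; annualize the resulting standard error by √252. A backtest with an annualized SR of 1.5 over 500 daily observations (~2 years) has a standard error of about 0.71 (roughly 1/√years), so the 95% CI runs from roughly 0.1 to 2.9. That CI straddles "excellent strategy" and "noise." Two years is simply not enough.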
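The interval can be reproduced directly from Lo's IID formula. This sketch assumes daily returns and a 252-day year; the function name is illustrative.

```python
from math import sqrt

def sharpe_confidence_interval(annual_sr: float, n_obs: int,
                               periods_per_year: int = 252, z: float = 1.96):
    """CI for an annualized Sharpe ratio under Lo's IID approximation."""
    sr = annual_sr / sqrt(periods_per_year)         # per-period Sharpe
    se = sqrt((1 + 0.5 * sr ** 2) / n_obs)          # Lo (2002), IID case
    se_annual = se * sqrt(periods_per_year)         # back to annual units
    return annual_sr - z * se_annual, annual_sr + z * se_annual

low, high = sharpe_confidence_interval(1.5, 500)
low_10y, high_10y = sharpe_confidence_interval(1.5, 2520)
```

Comparing the two calls shows how slowly the interval tightens: ten years of data narrows it, but only by a factor of about √5.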

For serially correlated returns, Lo's correction inflates the variance further — typical FX strategies have positive lag-1 autocorrelation, so naive Sharpe overstates significance.

Probabilistic Sharpe Ratio (PSR)

Bailey & López de Prado (2012):

PSR(SR*) = Φ( (SR_hat - SR*) · √(T-1) / √(1 - γ₃·SR_hat + ((γ₄-1)/4)·SR_hat²) )

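Here SR_hat and SR* are per-period (not annualized) Sharpe ratios, T is the number of return observations, γ₃ is sample skewness, and γ₄ is sample kurtosis (3 for normal returns, not excess kurtosis). PSR answers: "What is the probability that the true Sharpe exceeds threshold SR* given my observed SR, sample size, and return distribution?"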

Deflated Sharpe Ratio (DSR)

The PSR threshold SR* is replaced by the expected maximum Sharpe under N null trials:

SR₀ = √Var(SR) · [ (1−γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e)) ]

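Where Var(SR) is the variance of the Sharpe estimates across the trials, γ ≈ 0.5772 is the Euler-Mascheroni constant (unrelated to the skewness γ₃ and kurtosis γ₄ above), and N is the number of independent strategy variants you actually tried (estimate by clustering correlated trials). DSR collapses the multiple-testing problem and the non-normality problem into one number.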

Rule of thumb: if you tested 100 strategy variants and your best has annualized SR = 1.5 over 2 years, DSR is often well below 0.95 — meaning not statistically significant.
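The rule of thumb can be worked through with the formulas above. This is a sketch under stated assumptions: returns are roughly normal, and the spread of Sharpe estimates across trials is taken to be that of zero-skill noise (1/√T per period), which in practice should be measured from your own trial log.

```python
from math import e, sqrt
from statistics import NormalDist

NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649

def expected_max_sr(sr_std: float, n_trials: int) -> float:
    """SR0: expected maximum Sharpe among n_trials null strategies."""
    return sr_std * ((1 - EULER_GAMMA) * NORMAL.inv_cdf(1 - 1 / n_trials)
                     + EULER_GAMMA * NORMAL.inv_cdf(1 - 1 / (n_trials * e)))

def psr(sr: float, sr_star: float, n_obs: int,
        skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probabilistic Sharpe Ratio; sr and sr_star are per-period values."""
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NORMAL.cdf((sr - sr_star) * sqrt(n_obs - 1) / denom)

n_obs, n_trials = 504, 100                 # two years daily, 100 variants
sr_daily = 1.5 / sqrt(252)                 # annualized 1.5 in daily units
sr_std = sqrt(1 / n_obs)                   # ASSUMPTION: null-like trial spread
sr0 = expected_max_sr(sr_std, n_trials)
dsr = psr(sr_daily, sr0, n_obs)            # DSR = PSR evaluated at SR0
```

With these inputs the hurdle SR0 is higher than the observed Sharpe, so the DSR lands well under 0.5, nowhere near the 0.95 bar.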

Minimum Track Record Length

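MinTRL = 1 + (1 − γ₃·SR_hat + ((γ₄−1)/4)·SR_hat²) · (Φ⁻¹(confidence) / (SR_hat − SR*))²

This is the PSR formula solved for T: SR_hat is the observed per-period Sharpe and SR* the threshold it must beat. A strategy with an annualized SR of 0.95 and roughly normal returns needs ~3 years of daily returns to clear 95% confidence against a zero-skill null (SR* = 0). This matches the "3-year track record" hedge-fund convention; it isn't tradition, it's math.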
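The track-record figure can be checked numerically. This sketch assumes daily returns, a 252-day year, and a zero-skill benchmark; the skewed, fat-tailed inputs in the second call are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def min_track_record_length(sr: float, sr_star: float = 0.0,
                            skew: float = 0.0, kurt: float = 3.0,
                            confidence: float = 0.95) -> float:
    """Observations needed for PSR(sr_star) to reach `confidence`.

    sr and sr_star are per-period Sharpe ratios; kurt is raw kurtosis (3 = normal).
    """
    z = NormalDist().inv_cdf(confidence)
    return 1 + (1 - skew * sr + (kurt - 1) / 4 * sr ** 2) * (z / (sr - sr_star)) ** 2

sr_daily = 0.95 / sqrt(252)

n_days_normal = min_track_record_length(sr_daily)
n_years_normal = n_days_normal / 252

# Negative skew and fat tails (hypothetical values) lengthen the requirement.
n_days_skewed = min_track_record_length(sr_daily, skew=-1.0, kurt=8.0)
```

The normal-returns case comes out at about three years; carry-style return profiles with negative skew need longer.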


6. Forward Testing / Paper Trading Protocol

Backtest → paper → live is non-negotiable. Suggested protocol:

Phase 1 — Demo, full strategy size, broker's live data feed.

  • Duration: minimum 3 months for swing/intraday, 1 month for higher-frequency (M5/M1) systems.
  • Minimum 100 trades spanning ≥2 distinct regimes (trending + ranging).
  • Track: slippage realized vs modeled, fill rate, spread paid vs modeled, trade-by-trade PnL deviation from backtest.

Phase 2 — Live, 25% of intended size.

  • 50+ live trades.
  • Reconcile: if live Sharpe is within ~30% of OOS backtest Sharpe and slippage is within 2x modeled, proceed.

Phase 3 — Scale to 50%, then 100%.

  • Step up only after each tranche delivers consistent statistics.
  • Never scale on the back of a winning streak — variance shrinks slower than ego inflates.

Kill criteria during forward test:

  • Drawdown exceeds 1.5x worst backtest DD.
  • Win rate drops more than 2 standard errors below backtest.
  • Average slippage exceeds 3x modeled.
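The kill criteria are mechanical enough to automate. A minimal sketch, assuming the win-rate standard error follows a binomial model; the function name, argument names, and sample numbers are hypothetical.

```python
from math import sqrt

def kill_switch(live_dd, backtest_worst_dd, live_wins, live_trades,
                backtest_win_rate, live_slippage, modeled_slippage):
    """Return the list of kill criteria that have been breached."""
    breaches = []
    if live_dd > 1.5 * backtest_worst_dd:
        breaches.append("drawdown")
    # Standard error of a win rate over n trades (binomial assumption).
    se = sqrt(backtest_win_rate * (1 - backtest_win_rate) / live_trades)
    if live_wins / live_trades < backtest_win_rate - 2 * se:
        breaches.append("win_rate")
    if live_slippage > 3 * modeled_slippage:
        breaches.append("slippage")
    return breaches

result = kill_switch(live_dd=0.09, backtest_worst_dd=0.08,
                     live_wins=42, live_trades=100, backtest_win_rate=0.55,
                     live_slippage=0.4, modeled_slippage=0.3)
```

In this example only the win-rate rule fires: 42% over 100 trades is more than two standard errors (about 10 percentage points) below the 55% backtest figure.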

7. FX-Specific Challenges

24/5 market with low-volume Sunday open

Spreads widen 3–10x around the weekend: from about Friday 16:30 New York time into the 17:00 close, and again from the Sunday 17:00 reopen until about 18:00. The first hour after the Sunday open is statistically toxic for mean-reversion strategies because gap risk dominates. Either filter it out or model it explicitly with a "Sunday open" spread multiplier.

Weekend gap risk

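EURUSD weekend gaps >50 pips happen ~1–3x/year. Event-driven dislocations are far larger: the 2016 Brexit vote and the 2015 SNB unpeg moved the affected pairs by more than 1,000 pips, and both struck while markets were open, so gap risk is not confined to weekends. Backtests that assume continuous price action will miss these. Two mitigations: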

  • Model weekend gaps as a stochastic shock distribution sampled from historical Friday-close-to-Sunday-open jumps.
  • Cap weekend exposure to a fraction of weekday exposure (OANDA enforces this on retail accounts already).
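The first mitigation amounts to a bootstrap over observed gaps. A minimal sketch, assuming you have already measured signed Friday-close-to-Sunday-open gaps in pips; the history below is made up for illustration.

```python
import random

def sample_weekend_gaps(historical_gaps_pips, n_weekends, seed=None):
    """Draw weekend gaps with replacement from the empirical distribution."""
    rng = random.Random(seed)
    return [rng.choice(historical_gaps_pips) for _ in range(n_weekends)]

# Hypothetical signed gaps in pips; replace with gaps measured from your data.
history = [-3.0, 1.5, 0.0, 8.0, -12.0, 2.5, -55.0, 4.0, 0.5, 61.0]

# One simulated year of weekends to apply to any position held over the close.
simulated = sample_weekend_gaps(history, n_weekends=52, seed=7)
worst_gap = min(simulated)
```

Resampling from the empirical distribution keeps the fat tails that a normal approximation would smooth away; repeating the draw with many seeds gives a distribution of worst-case weekends for sizing.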

Broker quote disagreement

Your Dukascopy tick backtest shows entry at 1.0832; your OANDA execution prints at 1.0834. Over thousands of trades this is not random — venues have systematic biases (OANDA is a market maker, Dukascopy is an ECN). Mitigations:

  • Buffer your entries: simulate fills at quote ± k·spread where k ≈ 0.5–1.0.
  • Run a "venue-shift" sensitivity: re-run the backtest with all prices shifted by ±0.3 pip. If returns collapse, your edge is below transaction noise.

Rollover / swap

Carry strategies live or die on swap accuracy. Most brokers charge triple swap at the Wednesday rollover, because spot settles T+2 and the value date rolls from Friday to Monday. Negative-rate currencies (CHF until 2022, JPY until 2024) flip the sign. Get your broker's actual published swap table and bake it into the backtest. Generic "interest differential" models miss the broker's markup on swaps (often 0.5–1.5% annualized).


8. References

  • Bailey, D. H. & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management 40(5): 94–107.
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 7 (CV) and 11–14 (backtesting).
  • López de Prado, M. (2018). The 10 Reasons Most Machine Learning Funds Fail. Journal of Portfolio Management.
  • Lo, A. W. (2002). The Statistics of Sharpe Ratios. Financial Analysts Journal 58(4): 36–52.
  • Bailey, D. H., Borwein, J., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism. Notices of the AMS 61(5).

Bottom line: A 2-year backtest with Sharpe 1.5, optimized over 50 parameter combinations, on mid-prices, with no slippage, on a single venue — is approximately worthless. The number you actually need is DSR > 0.95 over an OOS walk-forward path, on bid/ask tick data, with broker-realistic costs, validated forward for 3+ months at scale. Anything less is gambling with confidence.
