Lead-Lag for Feature Selection

Demo of my bachelor's thesis (TFG). We visualise lead–lag relationships between time series and show how they can be used for feature selection in machine learning.

Project Background

Forecasting noisy financial time series becomes easier once we recognise that some variables consistently move first. My bachelor thesis shows that, by detecting these leader–lagger pairs, shifting the leader forward by its lag \(\ell\), and feeding the aligned data into CatBoost, one can cut mean absolute error (MAE) by roughly one third on synthetic panels and lift out-of-sample \(R^{2}\) on daily S&P 500 stocks.

The demo below recreates the perfect-correlation sandbox used in the thesis: the red series \(X_t\) perfectly anticipates the blue target \(Y_t\) by \(\ell\) steps. Slide the Lag and Horizon controls and watch the prediction gap collapse whenever \(\ell \ge h\), i.e. whenever the leader value that determines the forecast target has already been observed.

What Is a Lead–Lag Relationship?

Given two series \(X_t\) and \(Y_t\), we say that \(X\) leads \(Y\) by \(\ell\) if \(Y_{t+\ell}=X_t\) (or, in practice, if their differenced returns show a peak cross-correlation at lag \(\ell\)). Intuitively, every wiggle you see in the leader re-appears in the lagger after a fixed delay.
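To make the practical definition concrete, here is a minimal sketch (not the thesis code; the function name, the `max_lag` parameter and the plain Pearson estimator are assumptions) that scans cross-correlations of differenced series and returns the lag with the strongest link:

```python
import numpy as np

def estimate_lead_lag(x, y, max_lag=20):
    """Return the lag at which x best anticipates y, plus that correlation.

    Works on first differences (returns) and scans the Pearson
    cross-correlation corr(dx_t, dy_{t+ell}) for ell = 1..max_lag.
    """
    dx, dy = np.diff(x), np.diff(y)
    best_lag, best_corr = 0, 0.0
    for ell in range(1, max_lag + 1):
        # Align the leader at time t with the lagger at time t + ell.
        c = np.corrcoef(dx[:-ell], dy[ell:])[0, 1]
        if abs(c) > abs(best_corr):
            best_lag, best_corr = ell, c
    return best_lag, best_corr

# Toy check: y reproduces x after 5 steps, so the peak should sit at lag 5.
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))
y = np.roll(x, 5)                 # y_{t+5} = x_t (the wrapped edge is harmless here)
print(estimate_lead_lag(x, y))    # -> (5, ~1.0)
```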

Why Leveraging It Improves Forecasts

If today’s leader value already contains tomorrow’s target information, conditioning on the leader shrinks the uncertainty set for the forecast. In the toy GBM derivation from the thesis, knowing \(X_t\) lowers the conditional variance of \(Y_{t+\ell}\) by a factor \(1-\rho^2\); with correlations above 0.9, this almost eliminates the noise.
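For jointly Gaussian variables this is the standard conditional-variance identity; assuming, for simplicity, that leader and target are standardised with correlation \(\rho\):

\[
\operatorname{Var}\!\left(Y_{t+\ell}\mid X_t\right)=\left(1-\rho^{2}\right)\operatorname{Var}\!\left(Y_{t+\ell}\right),
\qquad \rho=\operatorname{corr}\!\left(X_t,\,Y_{t+\ell}\right).
\]

With \(\rho = 0.9\) only \(1-0.81 = 19\%\) of the original variance remains; with \(\rho = 0.95\) it drops below \(10\%\).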

Visual Lead–Lag Explorer

Slide the controls below to adjust the lag ℓ, prediction horizon h, zoom window, and time τ.

Problem Statement & Metric

We study a single leading series \(X_t\) that perfectly anticipates the target \(Y_t\) by a fixed lag \(\ell\) so that \(Y_{t+\ell}=X_t\). The sliders above let you vary \(\ell\) (Lag), the forecast horizon \(h\) and the zoom window to visualise how the lead–lag gap affects predictability.
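A minimal sketch of this sandbox forecaster (the names and the simulated random-walk leader are illustrative assumptions, not the demo's actual code): when \(\ell \ge h\) the required leader value has already been observed, so the forecast is exact; otherwise only a naive carry-forward is available.

```python
import numpy as np

def sandbox_forecast(x, ell, h, t):
    """Forecast y_{t+h} in the perfect-lead sandbox where y_{t+ell} = x_t.

    The forecast needs x_{t+h-ell}; it is only observed when ell >= h.
    """
    idx = t + h - ell
    if idx <= t:          # leader value already observed -> exact forecast
        return x[idx]
    return x[t]           # fallback: carry forward the last observed leader value

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=300))          # simulated leader path
ell, h, t = 10, 5, 200
y_true = x[t + h - ell]                      # by construction y_{t+h} = x_{t+h-ell}
print(sandbox_forecast(x, ell, h, t) == y_true)   # True whenever ell >= h
```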

Directionality is quantified through the balanced magnitude‑weighted lead‑lag score \(S^{\mathrm{final}}_{XY}(\ell)\), which rewards both high absolute correlation and clear asymmetry.
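The exact formula is defined in the thesis. As an illustrative assumption only, one way to reward both high absolute correlation and clear asymmetry is to combine the forward cross-correlation with its margin over the reversed direction:

```python
import numpy as np

def lead_lag_score(x, y, ell):
    """Illustrative stand-in for S_final_XY(ell) -- NOT the thesis formula.

    Combines the strength of corr(dx_t, dy_{t+ell}) with how much it
    dominates the reversed direction corr(dy_t, dx_{t+ell}).
    """
    dx, dy = np.diff(x), np.diff(y)
    fwd = np.corrcoef(dx[:-ell], dy[ell:])[0, 1]   # X leading Y by ell
    bwd = np.corrcoef(dy[:-ell], dx[ell:])[0, 1]   # Y leading X by ell
    asymmetry = abs(fwd) - abs(bwd)                # positive when X is the leader
    return abs(fwd) * max(asymmetry, 0.0)
```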

Three‑Stage Pipeline

  1. Candidate discovery – compute \(S^{\mathrm{final}}_{XY}(\ell)\) across lags and keep the maximiser.
  2. Rolling stability check – verify the lead persists in 1‑year windows; discard erratic links.
  3. Feature extraction → CatBoost – shift \(X_t\) by the selected lag, build the design matrix and fit a CatBoost regressor using iterations=500, depth=6, learning_rate=0.05 (a minimal sketch follows this list).
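A minimal sketch of stage 3 under assumed inputs (a pandas DataFrame `prices` with 'leader' and 'target' columns, plus one autoregressive feature); the CatBoost hyperparameters are the ones listed above, everything else is illustrative:

```python
import pandas as pd
from catboost import CatBoostRegressor

def fit_lead_lag_model(prices: pd.DataFrame, ell: int, h: int) -> CatBoostRegressor:
    """Shift the leader by its lag, build the design matrix, fit CatBoost."""
    df = pd.DataFrame({
        "leader_shifted": prices["leader"].shift(ell),  # leader value from ell steps ago, aligned on the target's clock
        "target_lag_1": prices["target"].shift(1),      # simple autoregressive feature
        "y": prices["target"].shift(-h),                # forecast target h steps ahead
    }).dropna()

    model = CatBoostRegressor(
        iterations=500, depth=6, learning_rate=0.05, verbose=False
    )
    model.fit(df[["leader_shifted", "target_lag_1"]], df["y"])
    return model
```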

Results & Visual Insights

  • Synthetic panels: MAE ↓ ≈ 33 % on average; when the structural lag exceeds the forecast horizon, MAE gains soar to ~70 %.
  • S&P 500 (2010‑2020): daily MAE drops 15 %, RMSE 10 %; out‑of‑sample \(R^{2}\) climbs 0.37 → 0.49.
  • This visual sandbox: as soon as the sliders satisfy \(\ell \ge h\), predictions become nearly deterministic, mirroring the thesis' Case 1 results.

Skills Demonstrated

  • Formal lead–lag modelling & metric design.
  • Deterministic, linear‑time feature filtering.
  • Gradient‑boosted forecasting with CatBoost.
  • Interactive Plotly visualisation & responsive frosted‑glass UI.