Continuous Program Search (CPS) proposes evolving trading strategies inside a neural network’s latent space, steered by a learned mutation model. Over eight weeks we rebuilt the whole system from the paper text alone and tested every headline claim — then kept going and stress-tested the idea of strategy search itself. This page is the plain-language account of what we did and what we found. The full technical dossier — every experiment with parameters, deviations, and evidence — lives at cps.fincli.ai.
A trading strategy can be written as a small program: “go long when the 20-day average crosses above the 100-day average,” and so on. The paper’s idea is to make that discrete world continuous so a machine can search it smoothly.
Concretely, a strategy is four signal blocks — when to enter and exit a long, when to enter and exit a short — each a small comparison of indicators. Here is the classic moving-average crossover in the paper’s program language:
    STRATEGY ma_cross_20_100
      LONG-ENTRY  : SMA(close, 20) > SMA(close, 100)   ← fast average crosses above slow → buy
      LONG-EXIT   : SMA(close, 20) < SMA(close, 100)
      SHORT-ENTRY : SMA(close, 20) < SMA(close, 100)   ← and the mirror for shorts
      SHORT-EXIT  : SMA(close, 20) > SMA(close, 100)
Hold this example in mind — it ends up carrying the whole story. This exact rule family is one of the
only honest survivors in Part 5, and the strategy the paper’s latent space turns out to be unable
to represent: encode it to z and decode it back, and both legs come out as SMA(close, 20) —
a rule that compares a thing to itself and never trades.
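To make the tautology concrete, here is a minimal sketch of the long-entry condition in plain Python. It is not the project's backtester: the helper names `sma` and `long_entry_signal` and the synthetic uptrending price series are assumptions made for illustration only.

```python
def sma(values, period):
    """Simple moving average; None until `period` observations exist."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= period:
            running -= values[i - period]  # drop the bar that left the window
        out.append(running / period if i >= period - 1 else None)
    return out


def long_entry_signal(close, fast, slow):
    """True on bars where SMA(close, fast) > SMA(close, slow)."""
    fast_ma, slow_ma = sma(close, fast), sma(close, slow)
    return [
        f is not None and s is not None and f > s
        for f, s in zip(fast_ma, slow_ma)
    ]


# Synthetic, steadily rising prices (illustrative data, not market data).
close = [100 + 0.5 * i for i in range(150)]

intended = long_entry_signal(close, 20, 100)   # the rule as written
collapsed = long_entry_signal(close, 20, 20)   # both legs decoded as 20

print(any(intended))   # the 20/100 rule fires in an uptrend
print(any(collapsed))  # a value is never greater than itself
```

With both legs set to the same period, the comparison is between identical numbers, so the signal can never be true on any price series.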
Claim 1 — the VAE works (Table 2). Sample random points in latent space and ~97.5% decode into valid, runnable strategy programs. This is the foundation: without it there is no space to search.
Claim 2 — learned mutations beat random ones (Table 1). Strategies evolved with the learned “flow” operator reach a median out-of-sample Sharpe of 1.152, vs 1.005 for plain random (isotropic) mutation and 0.890 for a structured variant (“dual”) — while using only 13.7% of the evaluation budget. This is the paper’s central result.
Sharpe ratio, if you need it: risk-adjusted return — average return divided by its volatility. Zero means no edge; a sustained out-of-sample Sharpe above ~1 is genuinely good. The catch, and the theme of this whole page: an in-sample Sharpe above 1 is easy — you just have to try enough things (Part 5 lets you feel this).
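As a rough sketch of that definition: the function below divides the mean of a return series by its sample standard deviation. The annualisation factor of 252 trading days and the zero risk-free rate are assumptions common in practice, not conventions stated on this page, and `sharpe_ratio` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=252):
    """Mean return over volatility, scaled to a yearly figure.

    Assumes a zero risk-free rate and at least two observations.
    """
    volatility = stdev(returns)  # sample standard deviation
    if volatility == 0:
        return 0.0  # no variation: treat as no measurable edge
    return mean(returns) / volatility * math.sqrt(periods_per_year)


# Illustrative daily returns, not real data.
daily = [0.004, -0.002, 0.001, 0.003, -0.001]
print(round(sharpe_ratio(daily), 2))
```

Passing `periods_per_year=1` returns the raw, unannualised ratio.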
Paper: “Continuous Program Search” (arXiv 2602.07659). No author code was available — everything below is reimplemented from the PDF.
A full reimplementation: tokenizer, transformer VAE (verified hyperparameter-for-hyperparameter identical to the paper’s §4.1 architecture), the rectified-flow mutation model, the evolution loop, and a walk-forward backtesting harness over five futures markets. Verdict-deciding runs were pre-registered — the pass/fail rule written down before the run — and every deviation from the paper is documented in the dossier.
| Latent size | Paper (Table 2) | Ours |
|---|---|---|
| d = 16 | 89.6% | 100% |
| d = 32 | 95.4% | 98.8% |
| d = 64 | 97.5% | 100% |
| d = 128 | 97.8% | 98.9–100% |
The catch that took a month to find: the result depends entirely on the training corpus, not the architecture. On a paper-style corpus of simple strategies, the unmodified recipe reaches 100% validity. On our deeper, more complex corpus the very same recipe collapses to 0–76% — we needed one documented training deviation (a σ-floor) to recover it. Corpus depth was the whole gap; the architecture is faithful either way.
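The exact form of the σ-floor is recorded in the dossier, not here. One plausible reading, offered only as a sketch, is a lower bound on the posterior standard deviation applied during the VAE's reparameterised sampling step. The names `floored_sigma` and `sample_latent`, and the floor value of 0.1, are hypothetical.

```python
import math
import random

SIGMA_FLOOR = 0.1  # hypothetical value; the page does not state the real one


def floored_sigma(logvar, sigma_floor=SIGMA_FLOOR):
    """Posterior standard deviation, never allowed below the floor."""
    return max(math.exp(0.5 * logvar), sigma_floor)


def sample_latent(mu, logvar, rng, sigma_floor=SIGMA_FLOOR):
    """Reparameterised sample z = mu + sigma * eps with floored sigma."""
    return [
        m + floored_sigma(lv, sigma_floor) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


rng = random.Random(0)
z = sample_latent([0.0, 1.0], [-20.0, 0.0], rng)
print(len(z))
```

The intent of such a floor is that the decoder keeps seeing noisy codes during training even when the encoder would otherwise shrink its posterior toward a point.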
| Run | Flow − Iso (best Sharpe) |
|---|---|
| seed 900001 | −0.051 |
| seed 900002 | −0.140 |
| seed 900003 | −0.263 |
| seed 900004 | −0.273 |
| mean | −0.182 |
Axes differ: the paper reports a median across runs (an implied edge of +0.147, i.e. 1.152 − 1.005); it does not publish per-run deltas. Ours are per-run and pre-registered, with 0/4 positive.
The one apparent positive dissolved on contact. Early on, a weakly-trained flow showed a +0.22 median edge on one market — the closest we ever got to the paper’s result. Retraining the same flow properly (better fit, lower validation loss) erased the edge entirely. That is the signature of a regularizer — an accidental “don’t move too far” brake — not of a model that has learned which direction is better. The budget claim fails too: in our traces, plain random mutation reached strong solutions in fewer evaluations, not 7× more.
The learned operator is trained on records of past mutations: “from code z, with context φ, the step Δz improved fitness.” We asked the direct question: is the improving step predictable from the state? Across all 104,907 recorded mutations the answer is no. Predicting Δz from (z, φ) does no better than predicting zero, in all 128 latent dimensions, on both corpora we built. A model trained on directionless data can’t supply direction.
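The shape of that test can be illustrated with a one-dimensional toy: fit a predictor on one half of the data, then compare its held-out error with the error of always predicting zero. This is a sketch on synthetic data, not the project's 128-dimensional experiment. The linear model and the names `fit_linear` and `skill_vs_zero` are assumptions.

```python
import random


def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a + b * x in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b


def skill_vs_zero(x_train, y_train, x_test, y_test):
    """1 - MSE(model) / MSE(predict zero) on held-out data.

    Near 0 or below: the state tells you nothing about the step.
    """
    a, b = fit_linear(x_train, y_train)
    mse_model = sum((y - (a + b * x)) ** 2 for x, y in zip(x_test, y_test))
    mse_zero = sum(y ** 2 for y in y_test)
    return 1.0 - mse_model / mse_zero


rng = random.Random(0)
state = [rng.gauss(0, 1) for _ in range(4000)]

# Case 1: steps independent of state (the "directionless" situation).
noise_steps = [rng.gauss(0, 1) for _ in state]
# Case 2: steps that really do depend on state, for contrast.
real_steps = [s + 0.5 * rng.gauss(0, 1) for s in state]

half = len(state) // 2
directionless = skill_vs_zero(state[:half], noise_steps[:half],
                              state[half:], noise_steps[half:])
structured = skill_vs_zero(state[:half], real_steps[:half],
                           state[half:], real_steps[half:])
print(directionless, structured)
```

A skill score indistinguishable from zero on held-out data is what "does no better than predicting zero" means operationally. The contrast case shows the same test does detect structure when it exists.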
A null like this is only convincing if you try hard to break it. We closed every escape hatch we — or the paper — could think of:
Each row is a separate pre-registered experiment with artifacts in the dossier. One residual we can’t close from outside: the authors’ exact checkpoint. Our architecture is verified identical to theirs, so the residual is “luckier training run,” not a spec difference.
If the operator isn’t the bottleneck, what is? We built the harness the field says you should use and rarely does: nested walk-forward evaluation (train → validate → test with a 10-day embargo, no lookahead) plus the Deflated Sharpe Ratio — which discounts a backtest by how many things you tried before picking it. Then we fed it everything.
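The splitting logic can be sketched as a rolling window in which train, validate, and test segments are separated by embargo gaps, so no segment borders the data used to fit or select. This is a simplified, single-level illustration. The window lengths, the roll-forward step, and the function name `walk_forward_splits` are assumptions, and the project's nested scheme may differ in detail.

```python
def walk_forward_splits(n_days, train, validate, test, embargo=10):
    """Yield (train, validate, test) index ranges in time order.

    Each segment is separated from the previous one by `embargo` days
    that belong to no segment.
    """
    span = train + embargo + validate + embargo + test
    start = 0
    while start + span <= n_days:
        tr = range(start, start + train)
        va = range(tr.stop + embargo, tr.stop + embargo + validate)
        te = range(va.stop + embargo, va.stop + embargo + test)
        yield tr, va, te
        start += test  # roll forward by one test window


folds = list(walk_forward_splits(n_days=1000, train=500, validate=100, test=100))
for tr, va, te in folds:
    print(f"train {tr.start}-{tr.stop - 1}  "
          f"validate {va.start}-{va.stop - 1}  "
          f"test {te.start}-{te.stop - 1}")
```

Rolling forward by exactly one test window keeps the test segments from overlapping, so each out-of-sample day is scored once.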
Backtest a bunch of strategies with no real edge at all (coin flips dressed up as trading rules) and keep the best one. How good does that best backtest look? This expected maximum, tabulated below, is the exact quantity the Deflated Sharpe Ratio subtracts back out.
| Strategies tried (N) | Expected best backtest Sharpe |
|---|---|
| 10 | 0.50 |
| 100 | 0.80 |
| 1,000 | 1.03 |
| 1,320 | 1.05 |
| 2,000 | 1.09 |
1,320 is the true trial count behind our archived search champions: at that N, pure luck is expected to hand you a Sharpe-1.05 backtest. That is the bar every searched champion was held to in the validation funnel, and why none survived it. The hand-written rules passed because for them N = 1: nobody went shopping for them.
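For readers who want to compute a curve of this kind, the sketch below uses the closed-form approximation from the Deflated Sharpe Ratio literature for the expected maximum of N zero-edge Sharpe estimates. The dispersion of Sharpe estimates across trials (`sharpe_std`) is an input this page does not state, so the 0.32 used in the example is a hypothetical value, and `expected_max_sharpe` is an illustrative name, not the project's function.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate E[max Sharpe] over n_trials strategies with no true edge.

    sharpe_std is the standard deviation of Sharpe estimates across trials.
    """
    if n_trials < 2:
        return 0.0  # a single zero-edge trial has expected Sharpe 0
    inv = NormalDist().inv_cdf
    return sharpe_std * (
        (1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
        + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e))
    )


for n in (10, 100, 1000, 1320, 2000):
    print(n, round(expected_max_sharpe(n, sharpe_std=0.32), 2))
```

The bar grows slowly but without limit as N rises, which is why the trial count has to be recorded honestly: undercounting N lowers the hurdle a lucky backtest must clear.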
Why couldn’t the search even find the euro-FX trend rules that survive? Because the latent space can’t represent them. Every surviving rule compares an indicator at two different periods (say, 20-day vs 100-day averages). A probe shows the VAE’s per-signal code stores the indicator type and roughly one period — decode “MA-cross 20/100” and it re-emits 20 for both legs (first period recovered 4/4, second 0/4), a tautology that trades on nothing. Round-trip the four survivors through the latent and 4/4 alive becomes 1/4.
And it’s not our bug. We cross-checked every architecture hyperparameter against the paper’s §4.1: identical, including the 32-dim per-signal block where the capacity runs out. The published architecture cannot faithfully represent the very strategies that work. That one defect quietly explains a lot: the missing directional field, the search’s failure on euro-FX, and why seeding through the latent was doomed.
The four surviving trend rules look like a modest, honest edge on euro-FX. Condition them on market regime and the edge concentrates sharply, in every surviving cell.
Found 2026-07-03 by our automated research loop (below) — its first externally-sourced proposal, tested end-to-end. The “6E trend premium” is really a falling-euro/high-vol phenomenon, invisible to a flat backtest read.
No author code. Tokenizer, VAE, flow operator, evolution loop, and backtester rebuilt from the paper text; every ambiguity logged as a question.
The paper-literal recipe yields 45–76% valid programs on our corpus, not 97.5%. Weeks of controlled ablations rule out loss conventions, schedules, checkpoints.
A one-line training-time floor on the posterior width lifts validity from 76% → 98.8% without collapsing diversity. All four Table-2 anchors eventually beaten.
The unmodified paper recipe reaches 100% on a paper-style (simpler) corpus. Corpus depth was the entire gap. Same day: the pre-registered paper-faithful operator test lands — null, 0/3 seeds.
E[Δz | z, φ] ≈ 0: improving steps are unpredictable from state on both corpora. The learned operator has nothing to learn. A brief +0.22 flicker on an undertrained flow becomes the “regularizer, not compass” diagnosis.
Nested walk-forward + Deflated Sharpe built (189 tests). Archived champions: 0/75 survive. Fresh CMA-ES vs random: no edge, 0/20 survive. Hand-authored trend rules: 4/40 survive, all euro-FX.
Latent round-trip kills the survivors (period collapse) — and the defect is the paper’s own architecture. AWR, on-policy, ρ-conditioning (EXP-08), seeded robust-selection search (EXP-07): all null. Verdict: drop the latent; optimization itself is the enemy.
The project becomes an autonomous research pipeline: an arXiv scout proposes techniques, a feasibility gate filters them, minimal decisive tests run against pre-registered bars, survivors go to a shadow ledger. Its first proposal produced the regime-concentration finding above.
Everything above is backed by a numbered record — question, parameters, pre-registered rule, outcome, artifacts. Each opens in full in the technical dossier.
The validation harness. The single keeper. Nested walk-forward + Deflated Sharpe at the true trial count, with cost stress — it correctly rejected all 120 searched champions and passed a small set of honest ones. Any strategy we ever run money against goes through it.
A narrow, regime-contingent edge. Long-horizon trend on euro-FX futures — held with the asterisk it earned, tracked in a paper-trading shadow ledger, never searched-over.
A method. Replication-first, pre-registered verdicts, every null written down so it can’t be re-litigated by accident. The whole loop now runs semi-autonomously: scout → feasibility → decisive minimal test → operator judgment → shadow ledger.
The authors’ exact checkpoint (our residual is “same architecture, luckier training run”); the paper’s exploration-noise term σ_out > 0, never run in any arm; Table-2’s novelty/consistency columns; and the cost-convention re-score. Author questions are collected in the dossier, which also exposes a machine-readable data.json.