We reproduced an AI-trading paper. The most valuable result is what didn’t survive.

Part 1

The paper, in sixty seconds

A trading strategy can be written as a small program: “go long when the 20-day average crosses above the 100-day average,” and so on. The paper’s idea is to make that discrete world continuous so a machine can search it smoothly.

Strategy program: entry/exit rules built from indicators

→ Encode (VAE): compress the program into 128 numbers, “z”

→ Mutate z: nudge the code, randomly or with a learned model

→ Decode: turn the new z back into a program

→ Backtest & select: keep the fittest, repeat for 100 generations

The “Mutate z” step is the paper’s contribution: replace random nudges with a “flow” model trained on past mutations, so evolution moves in promising directions.
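The loop above can be sketched in a few lines. This is a toy illustration only, not the paper's code or ours: `toy_fitness` is a hypothetical stand-in for "decode z and backtest the program", the population size, step size and 30 generations are arbitrary, and the mutation is the plain isotropic (random) baseline. A learned operator would replace the single line that adds Gaussian noise.

```python
import random

def evolve(fitness, dim=8, pop_size=20, generations=30, step=0.3, seed=0):
    """Toy latent-space evolution with isotropic (random) mutation.

    `fitness` stands in for decode-then-backtest: it maps a latent
    vector z to a score. Returns the best score after each generation.
    """
    rng = random.Random(seed)
    population = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Mutate: nudge every parent by directionless Gaussian noise.
        children = [[x + rng.gauss(0, step) for x in z] for z in population]
        # Select: keep the fittest pop_size among parents and children.
        pool = sorted(population + children, key=fitness, reverse=True)
        population = pool[:pop_size]
        history.append(fitness(population[0]))
    return history

def toy_fitness(z):
    # Hypothetical smooth objective with its peak at z = (1, 1, ..., 1).
    return -sum((x - 1.0) ** 2 for x in z)

history = evolve(toy_fitness)
print(history[0], history[-1])
```

Because parents stay in the selection pool, the best score can never get worse from one generation to the next. On a real backtest that same property is what lets search inflate in-sample fitness.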

What a “strategy program” actually looks like

Concretely, a strategy is four signal blocks — when to enter and exit a long, when to enter and exit a short — each a small comparison of indicators. Here is the classic moving-average crossover in the paper’s program language:

STRATEGY ma_cross_20_100
  LONG-ENTRY  : SMA(close, 20)  >  SMA(close, 100)   ← fast average crosses above slow → buy
  LONG-EXIT   : SMA(close, 20)  <  SMA(close, 100)
  SHORT-ENTRY : SMA(close, 20)  <  SMA(close, 100)   ← and the mirror for shorts
  SHORT-EXIT  : SMA(close, 20)  >  SMA(close, 100)

Hold this example in mind; it ends up carrying the whole story. This exact rule family is among the few honest survivors in Part 5, and it is also the strategy the paper’s latent space turns out to be unable to represent: encode it to z and decode it back, and both legs come out as SMA(close, 20), a rule that compares a thing to itself and never trades.
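To make the tautology concrete, here is a minimal sketch of the long-entry leg in plain Python. The price path is made up for illustration, and `sma` and `long_entry_days` are hypothetical helper names, not part of the paper's program language.

```python
def sma(values, period):
    """Simple moving average; None until `period` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
        else:
            window = values[i + 1 - period : i + 1]
            out.append(sum(window) / period)
    return out

def long_entry_days(close, fast, slow):
    """Indices of days where SMA(close, fast) > SMA(close, slow)."""
    f, s = sma(close, fast), sma(close, slow)
    return [i for i in range(len(close))
            if f[i] is not None and s[i] is not None and f[i] > s[i]]

# Hypothetical price path: 150 days drifting down, then 150 days drifting up.
close = [100 - 0.1 * i for i in range(150)] + [85 + 0.2 * i for i in range(150)]

healthy = long_entry_days(close, 20, 100)    # the intended 20/100 rule
collapsed = long_entry_days(close, 20, 20)   # both legs decoded as SMA(close, 20)
print(len(healthy), len(collapsed))
```

The 20/100 rule fires once the uptrend is established. The collapsed rule compares a series with itself under a strict inequality, so it can never fire, whatever the prices are.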

The two headline claims we set out to check

Claim 1 — the VAE works (Table 2). Sample random points in latent space and ~97.5% decode into valid, runnable strategy programs. This is the foundation: without it there is no space to search.

Claim 2 — learned mutations beat random ones (Table 1). Strategies evolved with the learned “flow” operator reach a median out-of-sample Sharpe of 1.152, vs 1.005 for plain random (isotropic) mutation and 0.890 for a structured variant (“dual”) — while using only 13.7% of the evaluation budget. This is the paper’s central result.

Sharpe ratio, if you need it: risk-adjusted return — average return divided by its volatility. Zero means no edge; a sustained out-of-sample Sharpe above ~1 is genuinely good. The catch, and the theme of this whole page: an in-sample Sharpe above 1 is easy — you just have to try enough things (Part 5 lets you feel this).
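As a reference point, a minimal Sharpe calculation looks like the sketch below. It assumes daily returns, a zero risk-free rate and 252 trading days per year; these conventions are our assumptions for the example, and the paper's exact convention may differ.

```python
import math
import statistics

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean daily return over its sample standard deviation, annualized."""
    mean = statistics.fmean(daily_returns)
    vol = statistics.stdev(daily_returns)
    if vol == 0:
        return 0.0  # no variation, so no meaningful ratio
    return mean / vol * math.sqrt(periods_per_year)

# Made-up return series: one alternates around zero, one has a small drift.
no_edge = [0.01, -0.01] * 50
small_edge = [0.011, -0.009] * 50
print(annualized_sharpe(no_edge), annualized_sharpe(small_edge))
```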

Paper: “Continuous Program Search” (arXiv 2602.07659). No author code was available — everything below is reimplemented from the PDF.

Part 2

What we built to test it

A full reimplementation: tokenizer, transformer VAE (verified hyperparameter-for-hyperparameter identical to the paper’s §4.1 architecture), the rectified-flow mutation model, the evolution loop, and a walk-forward backtesting harness over five futures markets. Verdict-deciding runs were pre-registered — the pass/fail rule written down before the run — and every deviation from the paper is documented in the dossier.

8 numbered experiment records (EXP-01…08), plus dozens of ablations

104,907 recorded mutations in the operator-training corpus

189 tests in the honest-validation harness we built for Part 4

5 futures markets (silver, crude, nat-gas, euro-FX, S&P), daily data 2008–2025

8 weeks, May → July 2026: one operator + AI agents + a rented spot-GPU fleet

<$1 compute cost for several decisive verdicts (spot GPUs are cheap; rigor is the expensive part)

Part 3

Scoreboard: claim by claim

Claim 1 — VAE validity ✓ Reproduced

“~97.5% of random latent samples decode into valid programs.”

Prior-sample validity by latent size: ours meets or beats every anchor

% of 256 random latent samples that decode into valid strategy programs

Latent size	Paper (Table 2)	Ours
d = 16	89.6%	100%
d = 32	95.4%	98.8%
d = 64	97.5%	100%
d = 128	97.8%	98.9–100%

The catch that took a month to find: the result depends entirely on the training corpus, not the architecture. On a paper-style corpus of simple strategies, the unmodified recipe reaches 100% validity. On our deeper, more complex corpus the very same recipe collapses to 0–76% — we needed one documented training deviation (a σ-floor) to recover it. Corpus depth was the whole gap; the architecture is faithful either way.

Claim 2 — learned mutations beat random ✗ Not reproduced

“Flow 1.152 > Iso 1.005 > Dual 0.890 median out-of-sample Sharpe, at 13.7% of the budget.”

What the paper claims: median out-of-sample Sharpe, Table 1

Operator	Median out-of-sample Sharpe
Flow (learned)	1.152
Iso (random)	1.005
Dual	0.890

What we measured: flow minus random, 4 pre-registered runs

Δ best Sharpe per run (flow − iso), paper-faithful config; every run negative. For reference, the paper’s implied edge is ≈ +0.147.

Run	Flow − Iso (best Sharpe)
seed 900001	−0.051
seed 900002	−0.140
seed 900003	−0.263
seed 900004	−0.273
mean (4 runs)	−0.182

The comparison is not like-for-like: the paper reports a median across runs (an implied edge of +0.147) and does not publish per-run deltas. Ours are per-run, pre-registered, and 0/4 positive.

The one apparent positive dissolved on contact. Early on, a weakly-trained flow showed a +0.22 median edge on one market — the closest we ever got to the paper’s result. Retraining the same flow properly (better fit, lower validation loss) erased the edge entirely. That is the signature of a regularizer — an accidental “don’t move too far” brake — not of a model that has learned which direction is better. The budget claim fails too: in our traces, plain random mutation reached strong solutions in fewer evaluations, not 7× more.

Part 4

Why it fails: there is no direction to learn

The learned operator is trained on records of past mutations: “from code z, with context φ, the step Δz improved fitness.” We asked the direct question — is the improving step predictable from the state? Across all 104,907 recorded mutations the answer is no: predicting Δz from (z, φ) does no better than predicting zero, in all 128 latent dimensions, on both corpora we built. A model trained on directionless data can’t supply direction — and you can see it behave accordingly:

How far each operator actually moves the code

Average latent step length ‖Δz‖ when deployed. The learned flow barely moves: a brake, not a compass.

Operator	Avg ‖Δz‖
Random (iso)	1.13
Learned flow	0.16–0.20 (≈0.19)
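The predictability probe described above can be sketched on synthetic data. This is a deliberately reduced illustration, not the probe we ran: it uses one latent dimension instead of 128, drops the context φ, fits a simple linear model, and the data are generated on the spot. The skill score compares held-out error against the "predict zero" baseline.

```python
import random

def probe_skill(z, dz, split=0.5):
    """Held-out skill of a linear probe dz ~ a*z + b versus predicting zero.

    Returns 1 - MSE(probe) / MSE(zero); about 0 means no predictable direction.
    """
    n_train = int(len(z) * split)
    zt, dt = z[:n_train], dz[:n_train]
    mean_z = sum(zt) / n_train
    mean_d = sum(dt) / n_train
    var_z = sum((x - mean_z) ** 2 for x in zt)
    cov = sum((x - mean_z) * (d - mean_d) for x, d in zip(zt, dt))
    a = cov / var_z
    b = mean_d - a * mean_z
    # Score on the held-out half only.
    zv, dv = z[n_train:], dz[n_train:]
    mse_probe = sum((d - (a * x + b)) ** 2 for x, d in zip(zv, dv)) / len(zv)
    mse_zero = sum(d ** 2 for d in dv) / len(dv)
    return 1.0 - mse_probe / mse_zero

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(4000)]
directionless = [rng.gauss(0, 1) for _ in z]        # step independent of state
directed = [0.5 * x + rng.gauss(0, 1) for x in z]   # planted direction

print(probe_skill(z, directionless), probe_skill(z, directed))
```

When the step is independent of the state, the skill sits near zero. When a direction is planted, the same probe picks it up clearly, which is the contrast the real probe failed to find in the recorded mutations.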

A null like this is only convincing if you try hard to break it. We closed every escape hatch we — or the paper — could think of:

“The flow was undertrained.”

✗ Null

Retrained to the best validation loss we ever achieved → the apparent edge disappears; ties random, loses to dual.

“Your extra fitness-weighting term hurt it.”

✗ Null

Removed it (λ=0) across seeds → flow stays at the fitness floor.

“The latent space was too broken (76% validity).”

✗ Null

Re-ran everything on the fixed 98.8–100%-validity substrate → same null; the probe says the improving step is unpredictable there too.

“Weight the training corpus toward improving moves (AWR).”

✗ Null

Conditional-mean drift pinned at ~0.20 vs random’s 1.13 across every temperature — there is no directional field to amplify.

“Generate the corpus from the operator’s own moves (on-policy / DAgger).”

✗ Null

Creating the data instead of amplifying it doesn’t help: drift ~0.12; the one hopeful metric was an artifact of its own shuffled control.

“You never fed it ρ, the requested step size (the last faithfulness gap).”

✗ Null

EXP-08: ρ is mathematically redundant given φ; a flow with ρ as a real input fits no better and its step is flat across every ρ bin.

“Use a stronger optimizer instead (CMA-ES).”

✗ Null

20 runs, GPU-decoded, honest harness: no edge over plain random mutation. The operator was never the bottleneck.

Each row is a separate pre-registered experiment with artifacts in the dossier. One residual we can’t close from outside: the authors’ exact checkpoint. Our architecture is verified identical to theirs, so the residual is “luckier training run,” not a spec difference.

Part 5

The bigger finding: honest validation kills everything searched

If the operator isn’t the bottleneck, what is? We built the harness the field says you should use and rarely does: nested walk-forward evaluation (train → validate → test with a 10-day embargo, no lookahead) plus the Deflated Sharpe Ratio — which discounts a backtest by how many things you tried before picking it. Then we fed it everything.
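A minimal sketch of the window layout, assuming rolling train → validate → test segments separated by embargo gaps. The window lengths are hypothetical, chosen only to make the example concrete; the 10-day embargo is the one stated above.

```python
def walk_forward_splits(n_days, train, validate, test, embargo=10):
    """Yield rolling (train, validate, test) index ranges with embargo gaps.

    Each range is a half-open (start, stop) pair over day indices. The
    embargo leaves `embargo` unused days between consecutive segments.
    """
    span = train + embargo + validate + embargo + test
    start = 0
    while start + span <= n_days:
        tr = (start, start + train)
        va = (tr[1] + embargo, tr[1] + embargo + validate)
        te = (va[1] + embargo, va[1] + embargo + test)
        yield tr, va, te
        start += test  # roll forward by one test window

splits = list(walk_forward_splits(n_days=2000, train=750, validate=250, test=250))
for tr, va, te in splits:
    print(tr, va, te)
```

Rolling forward by exactly one test window makes the test segments tile the timeline without overlap, so every out-of-sample day is scored once.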

Survivors under honest validation

Strategies passing nested walk-forward + Deflated Sharpe (at the true trial count) + cost stress

Candidate pool	Survivors	Rate
Archived search champions (17 iso + 48 shuffled + 10 SI/CL)	0 / 75	0%
Fresh latent search (CMA-ES + iso, 2 markets)	0 / 20	0%
Latent search on euro-FX (the signal-bearing market)	0 / 10	0%
Seeded, robustness-selected program search (EXP-07)	0 / 15	0%
Hand-authored trend/MA-cross rules, n_trials = 1	4 / 40	10%

“Optimizing in-sample fitness is the enemy. Search inflates apparent fitness by trying hundreds of variations; deflation takes exactly that gain back. The only strategies that survived were the ones nobody optimized.”

We tested each axis separately: the operator (flow ≈ CMA-ES ≈ random), the search space (latent and raw programs), the selection rule (even selecting on robustness during search), and seeding the search with known-good rules. Null on every axis, including on the one market where signal provably exists. A search that starts from surviving rules optimizes them back into overfit ones. And the negative control passed: on a no-signal market the harness rejected everything, so it isn’t just cynical.

Feel the mechanism: how good does zero skill look?

Backtest a bunch of strategies with no real edge at all — coin flips dressed up as trading rules — and keep the best one. How good does that best backtest look? Drag the slider. This expected-maximum curve is the exact quantity the Deflated Sharpe Ratio subtracts back out.

Expected best backtest Sharpe among N zero-edge strategies

Ten years of daily data; expected maximum of N independent tries

Strategies tried (N)	Expected best backtest Sharpe
10	0.50
100	0.80
1,000	1.03
1,320	1.05
2,000	1.09

1,320 is the true trial count behind our archived search champions — at that N, pure luck is expected to hand you a Sharpe-1.05 backtest. That is the bar every searched champion was held to in the funnel above, and why none survived it. The hand-written rules passed because for them N = 1: nobody went shopping for them.
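The curve can be approximated in a few lines. The sketch below uses the extreme-value approximation for the expected maximum of N independent standard normal draws from Bailey and López de Prado's Deflated Sharpe Ratio work, scaled by the standard error of an annualized Sharpe estimate when the true Sharpe is zero, taken here as sqrt(1/years). Independence of the trials and that standard-error formula are assumptions of the sketch; whether the page's chart uses exactly this formula is also an assumption, although the values land close to the table.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_best_sharpe(n_trials, years=10.0):
    """Approximate expected maximum annualized Sharpe of n zero-edge trials."""
    if n_trials < 2:
        return 0.0  # a single zero-edge try has expected Sharpe 0
    inv = NormalDist().inv_cdf
    # Expected maximum of n independent standard normal draws (approximation).
    z_max = ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
             + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e)))
    # Scale by the standard error of an annualized Sharpe under zero true edge.
    return z_max * math.sqrt(1 / years)

for n in (1, 10, 100, 1000, 1320, 2000):
    print(n, round(expected_best_sharpe(n), 2))
```

The N = 1 case returns zero, which is the arithmetic behind the hand-written rules facing no deflation penalty.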

The root cause, found in the latent

Why couldn’t the search even find the euro-FX trend rules that survive? Because the latent space can’t represent them. Every surviving rule compares an indicator at two different periods (say, 20-day vs 100-day averages). A probe shows the VAE’s per-signal code stores the indicator type and roughly one period — decode “MA-cross 20/100” and it re-emits 20 for both legs (first period recovered 4/4, second 0/4), a tautology that trades on nothing. Round-trip the four survivors through the latent and 4/4 alive becomes 1/4.

And it’s not our bug. We cross-checked every architecture hyperparameter against the paper’s §4.1: identical, including the 32-dim per-signal block where the capacity runs out. The published architecture cannot faithfully represent the very strategies that work. That one defect quietly explains a lot: the missing directional field, the search’s failure on euro-FX, and why seeding through the latent was doomed.

Even the survivor comes with an asterisk

The four surviving trend rules look like a modest, honest edge on euro-FX. Condition them on market regime and the edge concentrates sharply — in every surviving cell:

One regime carries the euro-FX trend premium

The falling-euro / high-volatility regime, share of days vs share of profits (all 4 surviving cells)

Regime (momentum-down / high-vol)	Value
Share of out-of-sample days	32%
Share of total PnL (range across 4 cells)	72–77%
Other 3 regimes	flat to negative

Found 2026-07-03 by our automated research loop (below) — its first externally-sourced proposal, tested end-to-end. The “6E trend premium” is really a falling-euro/high-vol phenomenon, invisible to a flat backtest read.

Part 6

How it unfolded

Early May 2026

Reproduction starts from the PDF

No author code. Tokenizer, VAE, flow operator, evolution loop, and backtester rebuilt from the paper text; every ambiguity logged as a question.

May → early June

The validity fight

The paper-literal recipe yields 45–76% valid programs on our corpus, not 97.5%. Weeks of controlled ablations rule out loss conventions, schedules, checkpoints.

Jun 04

σ-floor breakthrough — 98.8%

A one-line training-time floor on the posterior width lifts validity from 76% → 98.8% without collapsing diversity. All four Table-2 anchors eventually beaten.

Jun 10

Validity mystery solved: it was the corpus

The unmodified paper recipe reaches 100% on a paper-style (simpler) corpus. Corpus depth was the entire gap. Same day: the pre-registered paper-faithful operator test lands — null, 0/3 seeds.

Jun 11–13

The mechanism probe

E[Δz | z, φ] ≈ 0: improving steps are unpredictable from state on both corpora. The learned operator has nothing to learn. A brief +0.22 flicker on an undertrained flow becomes the “regularizer, not compass” diagnosis.

Jun 17

The honest harness — and everything dies

Nested walk-forward + Deflated Sharpe built (189 tests). Archived champions: 0/75 survive. Fresh CMA-ES vs random: no edge, 0/20 survive. Hand-authored trend rules: 4/40 survive, all euro-FX.

Jun 17–18

Root cause + last hatches closed

Latent round-trip kills the survivors (period collapse) — and the defect is the paper’s own architecture. AWR, on-policy, ρ-conditioning (EXP-08), seeded robust-selection search (EXP-07): all null. Verdict: drop the latent; optimization itself is the enemy.

Jul 02–03

The pivot goes live

The project becomes an autonomous research pipeline: an arXiv scout proposes techniques, a feasibility gate filters them, minimal decisive tests run against pre-registered bars, survivors go to a shadow ledger. Its first proposal produced the regime-concentration finding above.

The paper trail: eight experiment records

Everything above is backed by a numbered record — question, parameters, pre-registered rule, outcome, artifacts. Each opens in full in the technical dossier.

ID	Experiment	Outcome
EXP-01	Paper-style VAE reproduction (Table 2 anchors)	✓ Matched
EXP-02	Paper-faithful flow vs random mutation, median re-score	✗ Null
EXP-03	Conditional-mean probe: is the improving step predictable?	⚠ E[Δz]≈0
EXP-04	Honest-test of the one positive: regularizer, not direction	⚠ Artifact
EXP-05	Direction ablation: does the learned direction add anything?	✗ Zero signal
EXP-06	Validation harness + operator bake-off (CMA-ES vs random)	✗ 0/20 survive
EXP-07	Seeded program search selecting on robustness itself	✗ 0/15 survive
EXP-08	ρ-conditioning: the last faithfulness gap, closed	✗ Null

Part 7

Where this leaves us

What survived

The validation harness. The single keeper. Nested walk-forward + Deflated Sharpe at the true trial count, with cost stress — it correctly rejected all 120 searched champions and passed a small set of honest ones. Any strategy we ever run money against goes through it.

A narrow, regime-contingent edge. Long-horizon trend on euro-FX futures — held with the asterisk it earned, tracked in a paper-trading shadow ledger, never searched-over.

A method. Replication-first, pre-registered verdicts, every null written down so it can’t be re-litigated by accident. The whole loop now runs semi-autonomously: scout → feasibility → decisive minimal test → operator judgment → shadow ledger.

Still open

The authors’ exact checkpoint (our residual is “same architecture, luckier training run”); the paper’s exploration-noise term σ_out>0, never run in any arm; Table-2’s novelty/consistency columns; and the cost-convention re-score. Author questions are collected in the dossier, which also exposes a machine-readable data.json.

Part 8

Fair questions

Are you saying the paper is wrong, or worse?

Neither is claimed. We reproduced one of its two headline results cleanly and failed to reproduce the other despite closing every gap we could find — including verifying our architecture is identical to theirs, hyperparameter for hyperparameter. The one thing we cannot test from outside is their exact trained checkpoint; “same architecture, luckier training run” remains possible. Our open questions for the authors are listed in the dossier, which is written to be answerable, not adversarial.

Why should anyone trust a null result?

Because of how it was produced: verdict rules pre-registered before the deciding runs; negative controls that passed (on a no-signal market the harness correctly rejected everything); seven independent rescue attempts for the operator, each pre-registered and each null; and artifacts published for every experiment. A null you tried hard to break is evidence. A null you wanted is just a mood.

What would change your mind?

The authors’ checkpoint showing Flow > Iso under our harness; any corpus — theirs or anyone’s — where the improving step E[Δz|z,φ] is measurably predictable; or a searched strategy from any method surviving nested walk-forward plus Deflated Sharpe at its true trial count. Each is a concrete, runnable test, and we’d run any of them.

Is the surviving euro-FX trend edge tradable?

It’s modest, it’s concentrated in one regime (falling euro, high volatility — 72–77% of the profit from 32% of the days), and it survived costs and deflation only because nobody optimized it. We track it in a paper-trading shadow ledger rather than claiming it. Nothing on this page is investment advice.

Why does this matter beyond one paper?

Because the failure mode isn’t specific to this paper. Any pipeline that searches over strategies and reports the best backtest — genetic programming, latent-space evolution, LLM-generated strategies, a grad student with a for-loop — manufactures Sharpe out of trial count (Part 5’s slider is the whole mechanism). The transferable result is the harness: deflate by what you actually tried, validate nested, and expect most of the literature’s edges to evaporate.

Glossary

VAE — an autoencoder: a neural net that compresses a thing into a short code (here, 128 numbers) and decompresses it back.

Latent space — the space of those codes; “nearby” codes should decode to similar strategies.

Isotropic mutation — the random baseline: nudge the code by plain directionless noise.

Flow / GCM — the paper’s learned mutation model, trained on records of past mutations.

Sharpe ratio — return per unit of risk; 0 = no edge, sustained out-of-sample >1 = very good.

Out-of-sample (OOS) — performance on data the strategy was never fit to — the only kind that counts.

Walk-forward, nested — evaluate in rolling train→validate→test windows with an embargo gap, so no information leaks backward.

Deflated Sharpe (DSR) — a Sharpe estimate discounted by how many candidates were tried before picking this one.