A lot of effort goes into reporting decoder accuracy improvements, but much less into whether those results remain reproducible over time, or whether they are still safe to compare after underlying assumptions change.
In practice, small shifts in noise behavior, detector mapping, or measurement stability can quietly invalidate earlier conclusions. Often everything still “looks reasonable,” so regressions go unnoticed until much later.
It feels similar to early distributed systems work, before reproducibility, rollback, and auditability became normal engineering expectations.
I’m curious how people here think about:

- replaying historical syndrome data against newer decoders
- surfacing stability or confidence in decoder outputs, not just accuracy
- deciding when results shouldn’t be compared at all
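To make the first and third points concrete, here is a minimal sketch of what I mean by "replayable": archived syndrome records carry a hash of their experimental context (noise model, detector mapping, calibration), and the replay harness refuses to score runs whose context has drifted. All names here (`SyndromeRecord`, `replay`, `context_hash`) are hypothetical, not any real decoder library's API.

```python
# Illustrative sketch, not a real framework: replay archived syndromes
# against a decoder, but refuse comparison when the context has changed.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class SyndromeRecord:
    syndrome: Tuple[int, ...]  # measured syndrome bits
    observable: int            # recorded logical observable, used for scoring
    context_hash: str          # hash of noise model + detector map + calibration

def replay(records: List[SyndromeRecord],
           decode: Callable[[Tuple[int, ...]], int],
           expected_context: str) -> float:
    """Return the fraction decoded correctly; raise if contexts are incomparable."""
    for r in records:
        if r.context_hash != expected_context:
            raise ValueError("context mismatch: results are not comparable")
    hits = sum(decode(r.syndrome) == r.observable for r in records)
    return hits / len(records)

# Toy usage with a trivial "decoder" that always predicts 0.
records = [SyndromeRecord((0, 1), 0, "ctx-v1"),
           SyndromeRecord((1, 1), 1, "ctx-v1")]
print(replay(records, lambda s: 0, "ctx-v1"))  # prints 0.5
```

The hard part, of course, is deciding what belongs in that context hash and when a change should invalidate comparison rather than just annotate it.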
Is this already well handled in some parts of the field, or is it still an open gap?