Mythos Bar: what we measure and why

benchmarks Ryan Editraj

Mythos Bar is the official scoreboard for the power stack. It updates only when golden precision holds and the composite metrics improve; a run that regresses precision is never accepted, whatever its other scores.
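The update rule can be sketched as a small gate function. This is a minimal illustration, not the project's actual code: the function name should_update_bar is hypothetical, the field names are borrowed from the scoreboard on this page, and two points are assumptions: that "precision holds" means critical_precision stays at 1.0 with zero false criticals, and that "improve" means a strictly higher composite_score.

```python
from typing import Mapping


def should_update_bar(best: Mapping[str, float], candidate: Mapping[str, float]) -> bool:
    """Return True only if the candidate run may replace the recorded best.

    Hypothetical helper: the real gate may compare more metrics than these.
    """
    # Precision gate (assumed form): golden precision must hold at 1.0
    # and the run must not report any false criticals.
    if candidate["critical_precision"] < 1.0:
        return False
    if candidate["false_critical_count"] > 0:
        return False
    # Improvement gate (assumed strict): a tie does not move the bar.
    return candidate["composite_score"] > best["composite_score"]


best = {"critical_precision": 1.0, "false_critical_count": 0, "composite_score": 10.8}
better = {"critical_precision": 1.0, "false_critical_count": 0, "composite_score": 11.2}
regressed = {"critical_precision": 0.98, "false_critical_count": 1, "composite_score": 12.5}

print(should_update_bar(best, better))     # True
print(should_update_bar(best, regressed))  # False
```

The second candidate is rejected even though its composite score is higher, which is the point of checking precision first.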

Current official best (2026-06-27)

critical_precision     1.0
false_critical_count  0
web80_exploited       12/12
cve_bench_mock        4/4
bountybench           2/3
surfaces_covered      6
multi_surface_chains  3
recall_estimate       1.0
composite_score       10.8

What we track

  • critical_precision — must stay at 1.0
  • web80_exploited — golden subset at 100% precision
  • cve_bench_pass_rate — mock chain regression today; live Docker subset in progress
  • avg_chain_length — multi-hop depth over time
  • surfaces_covered — web, API, M365, network, code, etc.

Weekly power cycles run via tools/run_weekly_power_cycle.py. Episode data lands in the training flywheel; only verified episodes are exported to LoRA datasets.
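The verified-only export step might look like the sketch below. The episode schema here is entirely assumed for illustration: a dict with a boolean "verified" flag, plus made-up "id" and "chain_length" fields. The helper name verified_jsonl and the JSONL output format are likewise hypothetical and may not match what tools/run_weekly_power_cycle.py actually does.

```python
import json
from typing import Iterable


def verified_jsonl(episodes: Iterable[dict]) -> str:
    """Serialize only episodes explicitly marked verified, one JSON object per line."""
    # Episodes with a missing or false "verified" flag are dropped, so
    # unverified data never reaches the fine-tuning dataset.
    lines = [
        json.dumps(ep, sort_keys=True)
        for ep in episodes
        if ep.get("verified") is True
    ]
    return "\n".join(lines)


# Illustrative episodes only; the real fields are not documented here.
episodes = [
    {"id": "ep-1", "verified": True, "chain_length": 3},
    {"id": "ep-2", "verified": False, "chain_length": 5},
    {"id": "ep-3", "chain_length": 2},  # no flag: treated as unverified
]

print(verified_jsonl(episodes))
```

Treating a missing flag as unverified is a deliberately conservative choice in this sketch: an episode has to be marked verified explicitly before it can enter a training set.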

Record file: thugir-node/data/tcsf_train/mythos_bar_best.json