AI FORECAST LEDGER
Public accountability ledger

Every AI forecast, graded against reality.

We log the loudest public predictions about AI, then grade each one against dated, linkable evidence. No vibes — only receipts.

Updated 2026-06  ·  every grade carries a dated, linkable receipt

The board over time
placed by target date  ·  colour = verdict
6 claims  ·  1 Hit  ·  2 Overdue  ·  3 Pending  ·  1 graded
The ledger
Hit · benchmarks

An AI built before the contest earns a gold medal at the International Mathematical Olympiad by 2025

Hit · resolved by 2025

IMO score vs. the gold cutoff (35/42 in 2025)
Google DeepMind's Gemini Deep Think scored 35/42, gold-medal standard, at the 2025 IMO

Verified blind: DeepSeek V4 + Grok — agree

git receipt: 98a195276d

How it's graded

Met when, at an IMO through 2025, an AI built beforehand scores at or above the gold-medal cutoff under competition conditions

Receipts · 3
  • An advanced version of Gemini Deep Think solved five out of the six IMO problems perfectly, earning 35 total points, and achieving gold-medal level performance.

    Google DeepMind archived↗  ·  2025-07-21

  • In July 2025, we reached gold medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model (35/42 points).

    OpenAI archived↗  ·  2026-02-20

  • So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.

    LessWrong: IMO challenge bet with Eliezer (Paul Christiano) archived↗  ·  context · not model-verified  ·  2022-02-25

Ledger history
  • 2026-06-24: seeded
  • 2026-06-25: demoted pending bulletproof re-grade
  • 2026-06-25: The claim targets 2025. Google DeepMind's official blog (source 0) confirms an advanced Gemini Deep Think scored 35/42 at IMO 2025, meeting the gold-medal cutoff. OpenAI's First Proof blog (source 1)
  • 2026-06-26: added the Yudkowsky–Christiano bet-odds source (LessWrong, >16% vs <8%) to the receipt; context only — claim, grade, and the 2-model verification binding are unchanged
Overdue · labor

Deep learning outperforms radiologists within five years, so we should stop training them now

Overdue · deadline 2021 passed, awaiting grade

US radiologist workforce headcount vs. 2016
US radiologist count rose 17.3% (30,723 to 36,024) from 2014 to 2023, amid a persistent shortage

How it's graded

Met if AI had displaced human radiologists (falling demand / headcount) by ~2021; failed if the workforce kept growing and AI became a complement

Receipts · 2
Ledger history
  • 2026-06-24: seeded
  • 2026-06-25: demoted pending bulletproof re-grade
Overdue · capabilities

Within three to six months AI will be writing 90% of code, and within a year essentially all of it

Overdue · deadline Mar 2026 passed, awaiting grade

Share of production code generated by AI vs. written by humans
Contested: Amodei cites high internal use at Anthropic, but there is no verified industry-wide 90% figure and independent analyses dispute it

How it's graded

Met if, broadly across software, AI is generating ~90% of code by late 2025; failed if human-written code stays the clear majority of production software

Trajectory · unverified indicators, not graded receipts
  • Sep 2025 · By late 2025: AI writing ~90% of code (the 3–6 month call) · missed
    Six months on, no verified industry-wide 90% figure materialized; independent analyses put AI's share far lower.
    assessed Jun 2026 · indication↗
  • Mar 2026 · By Mar 2026: AI writing essentially all code (the 1-year call) · missed
    A year on, human-written code remains the clear majority of production software; even Anthropic's internal 90% claim is disputed.
    assessed Jun 2026 · indication↗
Receipts · 1
  • Anthropic CEO Dario Amodei said AI could be writing '90% of the code' within three to six months.

    Yahoo Finance  ·  2025-03

Ledger history
  • 2026-06-24: seeded
  • 2026-06-25: demoted pending bulletproof re-grade
  • 2026-06-26: added trajectory checkpoints — both 3–6mo and 1-yr milestones missed
Pending · capabilities

A superhuman coder exists by end of 2027

Pending · due Dec 2027 · On track

SWE-bench Verified %
Top coding agents now exceed 85% on SWE-bench Verified (mid-2026), up from ~70% a year earlier, but still short of autonomous senior-eng PRs

How it's graded

Met when a model autonomously completes a non-trivial PR end-to-end at senior-eng level

Trajectory · unverified indicators, not graded receipts
  • Jun 2025 · Mid-2025: stumbling agents, first usable AI coding agents appear · hit
    Coding agents emerged in 2025 and now autonomously resolve real GitHub issues on SWE-bench Verified.
    assessed Jun 2026 · indication↗
  • Jan 2026 · Early 2026: coding automation accelerates · on pace
    Top agents now exceed 85% on SWE-bench Verified, up from ~70% a year earlier: fast progress, still short of autonomous senior-eng PRs.
    assessed Jun 2026 · indication↗
  • Dec 2027 · End 2027: a superhuman coder exists (the target) · pending
    The claim's target milestone, not yet due.
    indication↗
Receipts · 2
Ledger history
  • 2026-06-20: seeded (#1)
  • 2026-06-24: replaced placeholder evidence with the SWE-bench Verified leaderboard; verdict held on-track
  • 2026-06-25: demoted pending bulletproof re-grade
  • 2026-06-26: added trajectory checkpoints (AI-2027 milestones, on track) + refreshed SWE-bench measurement to >85%
Pending · timelines

The first weakly general AI system is publicly announced around 2028

Pending · due 2028

Public announcement date vs. the live community median
unknown, pending a direct read of the live Metaculus median

How it's graded

Met if a system meeting the Metaculus weakly-general-AI resolution criteria is publicly announced near the community median

Receipts

No verified evidence yet.

Ledger history
  • 2026-06-24: seeded; awaiting a verifiable read of the Metaculus median before grading
Pending · economy

Autonomous agents run measurable revenue by 2026

Pending · due Dec 2026

Attributed revenue %
unknown

How it's graded

Met when a public company attributes >1% revenue to autonomous agents

Receipts

No verified evidence yet.

Ledger history
  • 2026-06-20: seeded (#1)
How grading works
  • Hit: the prediction came true on its terms.
  • On track: trending toward true before its date.
  • At risk: trending toward false, or contested.
  • Missed: its date passed and it did not come true.
  • Overdue: deadline passed, not yet graded.
  • Pending: not yet gradable, or evidence still unverified.

Logged from public predictions → graded against dated, linkable evidence → receipts kept in git, drafted by an agent and merged by a human.

Field Notes

Field Notes — commentary, not a graded receipt

How claims are verified

Every resolved claim requires at least 2 independent sources, each with an archived snapshot, before a verdict is assigned. Grading is done blind: DeepSeek V4 and Grok each evaluate the claim independently, and both models must agree. If they disagree, the claim is marked contested and excluded from all headline stats until a human reviewer resolves it.
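
The rule above can be sketched as a small gate. This is an illustration only, not the ledger's actual code: the `Source` record, the function names, and the status strings are hypothetical, and "independent" is approximated here as "distinct publishers", which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    publisher: str               # who published the evidence
    url: str                     # live link
    archive_url: Optional[str]   # archived snapshot, if one exists


def eligible_for_verdict(sources: list[Source]) -> bool:
    """At least 2 independent sources, each with an archived snapshot."""
    archived = [s for s in sources if s.archive_url]
    # Assumption: independence is approximated by distinct publishers.
    return len({s.publisher for s in archived}) >= 2


def resolve(sources: list[Source], grade_a: str, grade_b: str) -> str:
    """Combine two blind, independent grades into one ledger status."""
    if not eligible_for_verdict(sources):
        return "pending"      # not enough archived evidence to grade
    if grade_a != grade_b:
        return "contested"    # held out of headline stats until human review
    return grade_a            # both graders agree
```

A disagreement never resolves to either grader's verdict in this sketch; it only yields `contested`, matching the rule that a human reviewer settles it.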

Each graded claim carries a git receipt that binds the claim text and its cited sources to a commit in version history; grade integrity is additionally enforced by the validation gate and the per-model verification records.
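
One way such a binding could work is sketched below. The ledger's real receipt format is not described here, so everything in this block is an assumption: the `Receipt` fields, the SHA-256 digest, and the canonical JSON layout are hypothetical choices for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


def content_digest(claim_text: str, source_urls: list[str]) -> str:
    """Hash the claim text and its sources in a canonical form."""
    payload = json.dumps(
        {"claim": claim_text, "sources": sorted(source_urls)},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Receipt:
    commit: str   # abbreviated git commit id (hypothetical field)
    digest: str   # digest of claim text + sources recorded at that commit


def still_bound(receipt: Receipt, claim_text: str, source_urls: list[str]) -> bool:
    """True if claim text and sources are unchanged since the receipt was made."""
    return receipt.digest == content_digest(claim_text, source_urls)
```

Sorting the source URLs before hashing means reordering receipts does not invalidate the binding, while any edit to the claim text or the source list does.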

Unverified and contested claims appear on the board for transparency but are not counted in the accuracy score. Trajectory checkpoints are dated indications of progress, not graded verdicts — they never affect the accuracy score.
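
A minimal sketch of that exclusion rule follows. The score's exact formula is not stated on this page, so the definition used here (hits divided by hits plus misses) is an assumption, as are the dict keys `status` and `checkpoints`.

```python
from typing import Optional


def accuracy_score(claims: list[dict]) -> Optional[float]:
    """Share of graded claims that were hits; None if nothing is graded.

    Assumption: only "hit" and "missed" count as graded verdicts.
    Pending, overdue, and contested claims are skipped, and any
    "checkpoints" attached to a claim are never read.
    """
    graded = [c for c in claims if c["status"] in ("hit", "missed")]
    if not graded:
        return None
    hits = sum(1 for c in graded if c["status"] == "hit")
    return hits / len(graded)
```

Returning `None` rather than 0 when nothing is graded keeps an empty board from reading as a 0% accuracy record.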