Hit · benchmarks

An AI built before the contest earns a gold medal at the International Mathematical Olympiad by 2025

by Eliezer Yudkowsky (bet vs. Paul Christiano) · called 2022-02

Hit · resolved by 2025

IMO score vs. the gold cutoff (35/42 in 2025): Google DeepMind's Gemini Deep Think scored 35/42, the gold-medal standard, at the 2025 IMO

Verified blind: DeepSeek V4 + Grok — agree

git receipt: 98a195276d

How it's graded

Met when, at an IMO through 2025, an AI built beforehand scores at or above the gold-medal cutoff under competition conditions
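The rule above reduces to a handful of checks: the contest falls within the deadline, the system predates it, competition conditions held, and the score meets the cutoff. A minimal Python sketch follows, assuming hypothetical names (`ImoResult`, `meets_gold`) and using only the figures quoted on this card (35/42 against a 35-point cutoff in 2025). It illustrates the rule and is not the ledger's actual grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImoResult:
    """One AI entry at one IMO (hypothetical record shape)."""
    year: int
    score: int                    # points out of 42
    gold_cutoff: int              # gold-medal cutoff for that year
    built_before_contest: bool
    competition_conditions: bool


def meets_gold(result: ImoResult, deadline_year: int = 2025) -> bool:
    """Return True when the result satisfies the card's grading rule."""
    return (
        result.year <= deadline_year
        and result.built_before_contest
        and result.competition_conditions
        and result.score >= result.gold_cutoff
    )


# Figures quoted on this card: 35/42 against a 35-point gold cutoff in 2025.
# The two flags are set to True here to mirror the card's Hit grade.
gemini_2025 = ImoResult(2025, 35, 35, True, True)
print(meets_gold(gemini_2025))  # True
```

Scoring exactly at the cutoff counts as met, because the rule says "at or above".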

Receipts · 3

An advanced version of Gemini Deep Think solved five out of the six IMO problems perfectly, earning 35 total points, and achieving gold-medal level performance.
Google DeepMind · archived · 2025-07-21
In July 2025, we reached gold medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model (35/42 points).
OpenAI · archived · 2026-02-20
So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.
LessWrong: IMO challenge bet with Eliezer (Paul Christiano) · archived · context · not model-verified · 2022-02-25

Ledger history

2026-06-24: seeded
2026-06-25: demoted pending bulletproof re-grade
2026-06-25: The claim targets 2025. Google DeepMind's official blog (source 0) confirms an advanced Gemini Deep Think scored 35/42 at IMO 2025, meeting the gold-medal cutoff. OpenAI's First Proof blog (source 1)
2026-06-26: added the Yudkowsky–Christiano bet-odds source (LessWrong, >16% vs <8%) to the receipt; context only — claim, grade, and the 2-model verification binding are unchanged

Overdue · labor

Deep learning outperforms radiologists within five years, so we should stop training them now

by Geoffrey Hinton · called 2016

Overdue · deadline 2021 passed, awaiting grade

US radiologist workforce headcount vs. 2016: US radiologist count rose 17.3% (30,723 to 36,024) from 2014 to 2023, amid a persistent shortage

How it's graded

Met if AI had displaced human radiologists (falling demand / headcount) by ~2021; failed if the workforce kept growing and AI became a complement
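The headcount figures quoted on this card can be checked with a one-line percentage calculation. The sketch below is illustrative only: `percent_change` is a hypothetical helper, and treating a negative change as "displaced" is an assumed reading of the rule. It does not assign a grade, and the card remains awaiting one. Note that the quoted baseline year is 2014, not the 2016 call date.

```python
def percent_change(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100.0


# Headcounts quoted on this card (2014 and 2023).
growth = percent_change(30_723, 36_024)
print(f"{growth:.1f}%")  # 17.3%

# Assumed reading of the rule: displacement would show up as a falling headcount.
displaced = growth < 0
print(displaced)  # False
```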

Receipts · 2

People should stop training radiologists now. It's just completely obvious within five years deep learning is going to do better than radiologists.
UAB Reporter (quoting Geoffrey Hinton, 2016) · 2016
the number of radiologists increased 17.3% (from 30,723 to 36,024) from 2014 to 2023
Harvey L. Neiman Health Policy Institute · 2024

Ledger history

2026-06-24: seeded
2026-06-25: demoted pending bulletproof re-grade

Overdue · capabilities

Within three to six months AI will be writing 90% of code, and within a year essentially all of it

by Dario Amodei (Anthropic CEO) · called 2025-03

Overdue · deadline Mar 2026 passed, awaiting grade

Share of production code generated by AI vs. written by humans: Contested. Amodei cites high internal use at Anthropic, but there is no verified industry-wide 90% figure and independent analyses dispute it

How it's graded

Met if, broadly across software, AI is generating ~90% of code by late 2025; failed if human-written code stays the clear majority of production software

Trajectory · unverified indicators, not graded receipts

Sep 2025 · By late 2025: AI writing ~90% of code (the 3–6 month call) · missed
Six months on, no verified industry-wide 90% figure materialized; independent analyses put AI's share far lower. Assessed Jun 2026.
Mar 2026 · By Mar 2026: AI writing essentially all code (the 1-year call) · missed
A year on, human-written code remains the clear majority of production software; even Anthropic's internal 90% claim is disputed. Assessed Jun 2026.

Receipts · 1

Anthropic CEO Dario Amodei said AI could be writing '90% of the code' within three to six months.
Yahoo Finance · 2025-03

Ledger history

2026-06-24: seeded
2026-06-25: demoted pending bulletproof re-grade
2026-06-26: added trajectory checkpoints — both 3–6mo and 1-yr milestones missed

Pending · capabilities

A superhuman coder exists by end of 2027

by AI-2027 report · called 2025-04

Pending · due Dec 2027 · On track

SWE-bench Verified %: Top coding agents now exceed 85% on SWE-bench Verified (mid-2026), up from ~70% a year earlier, still short of autonomous senior-eng PRs

How it's graded

Met when a model autonomously completes a non-trivial PR end-to-end at senior-eng level

Trajectory · unverified indicators, not graded receipts

Jun 2025 · Mid-2025: stumbling agents (first usable AI coding agents appear) · hit
Coding agents emerged in 2025 and now autonomously resolve real GitHub issues on SWE-bench Verified. Assessed Jun 2026.
Jan 2026 · Early 2026: coding automation accelerates · on pace
Top agents now exceed 85% on SWE-bench Verified, up from ~70% a year earlier; fast progress, still short of autonomous senior-eng PRs. Assessed Jun 2026.
Dec 2027 · End 2027: a superhuman coder exists (the target) · pending
The claim's target milestone, not yet due.

Receipts · 2

AI 2027, published April 3 2025, is a month-by-month scenario forecast of superhuman AI by a team led by ex-OpenAI researcher Daniel Kokotajlo; the 'superhuman coder by end of 2027' milestone is one of its concrete, dated predictions.
AI 2027 (Kokotajlo, Alexander, Larsen, Lifland & Dean) · 2025-04
SWE-bench Verified tracks coding-agent performance on resolved GitHub issues.
SWE-bench Verified leaderboard · 2026-06

Ledger history

2026-06-20: seeded (#1)
2026-06-24: replaced placeholder evidence with the SWE-bench Verified leaderboard; verdict held on-track
2026-06-25: demoted pending bulletproof re-grade
2026-06-26: added trajectory checkpoints (AI-2027 milestones, on track) + refreshed SWE-bench measurement to >85%

Pending · timelines

The first weakly general AI system is publicly announced around 2028

by Metaculus community forecast · called 2020

Pending · due 2028

Public announcement date vs. the live community median: unknown, pending a direct read of the live Metaculus median

How it's graded

Met if a system meeting the Metaculus weakly-general-AI resolution criteria is publicly announced near the community median

Receipts

No verified evidence yet.

Ledger history

2026-06-24: seeded; awaiting a verifiable read of the Metaculus median before grading

Pending · economy

Autonomous agents run measurable revenue by 2026

by Pundit X · called 2025-01

Pending · due Dec 2026

Attributed revenue %: unknown

How it's graded

Met when a public company attributes >1% revenue to autonomous agents
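The threshold above is a strict inequality on a revenue share. A minimal Python sketch, assuming a hypothetical helper `exceeds_agent_share` and made-up revenue figures (the card has no verified data yet), shows how the ">1%" test could be applied without floating-point rounding at the boundary:

```python
def exceeds_agent_share(agent_revenue: int, total_revenue: int,
                        threshold_percent: int = 1) -> bool:
    """True when agent-attributed revenue is strictly above the threshold share.

    Uses integer arithmetic so that a share of exactly 1% is not
    misclassified by floating-point rounding.
    """
    if total_revenue <= 0:
        raise ValueError("total_revenue must be positive")
    return agent_revenue * 100 > total_revenue * threshold_percent


# Hypothetical figures for illustration only (same currency unit for both).
print(exceeds_agent_share(12, 1_000))  # True: 1.2% is above 1%
print(exceeds_agent_share(10, 1_000))  # False: exactly 1% is not above 1%
```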

Receipts

No verified evidence yet.

Ledger history

2026-06-20: seeded (#1)

Every AI forecast, graded against reality.

Field Notes

How claims are verified