<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Futures Project]]></title><description><![CDATA[Preparing for a world with AGI]]></description><link>https://blog.aifutures.org</link><image><url>https://substackcdn.com/image/fetch/$s_!iyu_!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3409cfd0-9243-479e-80f9-d0e3922c450a_132x132.png</url><title>AI Futures Project</title><link>https://blog.aifutures.org</link></image><generator>Substack</generator><lastBuildDate>Fri, 15 May 2026 05:26:44 GMT</lastBuildDate><atom:link href="https://blog.aifutures.org/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[AI Futures Project]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aifutures1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aifutures1@substack.com]]></itunes:email><itunes:name><![CDATA[Daniel Kokotajlo]]></itunes:name></itunes:owner><itunes:author><![CDATA[Daniel Kokotajlo]]></itunes:author><googleplay:owner><![CDATA[aifutures1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aifutures1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Daniel Kokotajlo]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Q1 2026 Timelines Update]]></title><description><![CDATA[We told you we'd be updating in both directions!]]></description><link>https://blog.aifutures.org/p/q1-2026-timelines-update</link><guid isPermaLink="false">https://blog.aifutures.org/p/q1-2026-timelines-update</guid><dc:creator><![CDATA[Daniel Kokotajlo]]></dc:creator><pubDate>Thu, 02 Apr 2026 18:10:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!55Zn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F103a915e-24d6-4d2f-b0fc-7b3f0aed21b7_1600x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re mostly focused on research and writing for our next big scenario, but we&#8217;re also continuing to think about AI timelines and takeoff speeds, monitoring the evidence as it comes in, and adjusting our expectations accordingly. We&#8217;re tentatively planning on making quarterly updates to our timelines and takeoff forecasts. Since we published the AI Futures Model 3 months ago, we&#8217;ve updated towards shorter timelines.<br><br>Daniel&#8217;s Automated Coder (AC) median has moved from late 2029 to mid 2028, and Eli&#8217;s forecast has moved a similar amount. 
The AC milestone is the point at which an AGI company would rather lay off all of its human software engineers than stop using AIs for software engineering.

The reasons behind this change include:[1]

1. We switched to [METR Time Horizon version 1.1](https://metr.org/blog/2026-1-29-time-horizon-1-1/).
2. We included data from newly evaluated models (Gemini 3, GPT-5.2, and Claude Opus 4.6).
3. Daniel and Eli revised their estimates of the current doubling time of the METR time horizon to be faster: from a median of 5.5 months previously to 4 months for Daniel and 4.5 months for Eli. We revised it because of (a) METR's new v1.1 trend being faster than their previous v1.0, (b) new models' time horizons continuing the fast trend that began in 2024, and (c) our further analysis of the doubling time implied by existing data points.
4. Daniel revised his median estimate of the 80% time horizon requirement for AC down from 3 years to 1 year, due to the impressiveness of Opus 4.6.

In short, progress in agentic coding has been faster than we expected over the last 3-5 months. The [METR coding time horizon trend](https://metr.org/time-horizons/) has its flaws, but we still consider it the best individual piece of evidence for forecasting coding automation. On that metric, growth has continued at a rapid pace.
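To make the arithmetic behind items 3 and 4 concrete, here is a minimal sketch of a constant-doubling-time extrapolation. The current-horizon and work-year numbers are illustrative placeholders, not METR's published values, and our actual forecasts come from the full AI Futures Model (which accounts for much more than this straight-line trend) rather than from a calculation like this:

```python
from math import log2

def months_to_reach(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months until an exponentially growing time horizon reaches a target length."""
    return log2(target_hours / current_hours) * doubling_months

# Illustrative placeholders (not METR's published numbers):
current = 8.0     # current 80% time horizon, in hours of human task length
target = 2000.0   # ~1 work-year of task length (Daniel's revised AC requirement)
for d in (4.0, 4.5, 5.5):  # the doubling times discussed above, in months
    print(f"{d} mo doubling -> ~{months_to_reach(current, target, d):.0f} months to a 1-year horizon")
```

Shortening the doubling time from 5.5 to 4 months, and the required horizon from 3 years to 1 year, each cut years off the extrapolated arrival date.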
Meanwhile, in the real world, there may have been an even bigger shift: coding agents have exploded in usefulness and popularity. [Claude Code reached an annualized revenue of over $2.5 billion in early February](https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation), just 9 months after its release. Anthropic's trend of 10xing annualized revenue each year [has continued into the $10B range](https://epoch.ai/data/ai-companies).

[Figure: Annualized revenue of AGI companies over time. Annualized revenue is revenue over the last month times 12. ([source](https://epoch.ai/data/ai-companies))]
Additionally, according to [our analysis of AI 2027's predictions](https://blog.aifutures.org/p/grading-ai-2027s-2025-predictions), things seem close to being on track: if events in reality continue to go roughly 65% as fast as they go in AI 2027, then AC will be achieved in 2028.

Finally, some AI company researchers that we respect continue to say that automated AI R&D is coming soon; sooner, in fact, than we ourselves think. Rather than walking back their predictions, they are doubling down, both in public and in private discussions. While we don't put too much weight on such claims, noting that many other researchers have longer timelines, it does count for something.[2]

The bottom line of our updates is to shift Daniel's [Automated Coder (AC)](https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.h1t0i8bvad56) median from late 2029 to mid 2028, and Eli's from early 2032 to mid 2030.

Our medians for [Top-Expert-Dominating AI (TED-AI)](https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.c5m4mekzqjd2) similarly shifted about 1.5 years sooner. A TED-AI is an AI that is at least as good as top human experts at virtually all cognitive tasks.

[Figure: Daniel's latest forecasts compared to his previous ones. View these forecasts [here](https://www.aifuturesmodel.com/forecast/daniel-04-02-26?timeline=AC%2CTED-AI&cmode=forecaster&csim=daniel-01-26-26&ctype=atc).]
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b0b64b9-c5f5-4a9f-bb76-378140a77611_967x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:558,&quot;width&quot;:967,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yZfS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b0b64b9-c5f5-4a9f-bb76-378140a77611_967x558.png 424w, https://substackcdn.com/image/fetch/$s_!yZfS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b0b64b9-c5f5-4a9f-bb76-378140a77611_967x558.png 848w, https://substackcdn.com/image/fetch/$s_!yZfS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b0b64b9-c5f5-4a9f-bb76-378140a77611_967x558.png 1272w, https://substackcdn.com/image/fetch/$s_!yZfS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b0b64b9-c5f5-4a9f-bb76-378140a77611_967x558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Eli&#8217;s latest forecasts compared to his previous ones. View these forecasts <a href="https://aifuturesmodel.com/forecast/eli-04-02-26?timeline=AC%2CTED-AI&amp;cmode=forecaster&amp;csim=eli-01-26-26&amp;ctype=atc">here</a>.</figcaption></figure></div><p>Below, we include a plot and table that extend our analysis of <a href="https://blog.aifutures.org/p/clarifying-how-our-ai-timelines-forecasts">how our views have changed since publishing AI 2027</a>. When we refer to AGI in the below plot and table, we mean to use the TED-AI definition above, i.e. 
an AI that is at least as good as top human experts at virtually all cognitive tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3kGx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3kGx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 424w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 848w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 1272w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3kGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png" width="1456" height="850" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3kGx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 424w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 848w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 1272w, https://substackcdn.com/image/fetch/$s_!3kGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7295ae2a-3cce-4b69-8830-1e5f748cf778_1600x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Underlying data <a href="https://docs.google.com/spreadsheets/d/1Us2N32YOgxt9lPICFHrG-8CTqVS1jlutBXKTzAFhpCk/edit?gid=0#gid=0">here</a>.</figcaption></figure></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Aikny/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56fbdae2-408c-4664-b327-4deabcb7fb12_1220x942.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d87dd8a-dbc9-4a27-bfe5-d908942560cd_1220x1274.png&quot;,&quot;height&quot;:637,&quot;title&quot;:&quot;How our timelines forecasts have shifted since AI 2027&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Aikny/2/" width="730" height="637" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>As always, on the <a href="https://www.aifuturesmodel.com/">AI Futures Model landing page</a>, you can input your preferred parameter values to explore different possible futures.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Additional more minor changes include: updating our estimate of current parallel coding uplift due to passage of time, and minor changes to Daniel&#8217;s takeoff parameters which make his predictions slightly faster.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Imagine if, by contrast, no one at the AI companies thought they could get to AC by 2029. That would be a pretty good reason to think that AC won&#8217;t happen by 2029. 
So, the existence of some researchers who expect AC by then is some evidence (though far from conclusive) that it will.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Grading AI 2027’s 2025 Predictions]]></title><description><![CDATA[How has AI progress compared to AI 2027 thus far?]]></description><link>https://blog.aifutures.org/p/grading-ai-2027s-2025-predictions</link><guid isPermaLink="false">https://blog.aifutures.org/p/grading-ai-2027s-2025-predictions</guid><dc:creator><![CDATA[Eli Lifland]]></dc:creator><pubDate>Thu, 12 Feb 2026 23:56:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b86df576-be4f-4bb1-96d2-fbb9d3b65aef_1460x862.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://ai-2027.com/">AI 2027</a> laid out a detailed scenario for how AI would progress from 2025 through 2027, including quantitative predictions and qualitative descriptions of the AI landscape.</p><p>Now that we&#8217;re in early 2026, we can grade how its 2025 predictions compare to reality! This is exciting to us because we put a lot of effort into filling AI 2027 with concrete, falsifiable predictions, and now we reap the benefit of that effort: an additional method of forecasting AI timelines, to complement <a href="https://blog.ai-futures.org/p/ai-futures-model-dec-2025-update">the methods we already use</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>The primary question we&#8217;ll answer is: How fast is AI progress moving relative to the AI 2027 scenario?</p><p><strong>In aggregate, progress on quantitative metrics is at roughly 65% of the pace that AI 2027 predicted</strong>. Most qualitative predictions are on pace.</p><h1>Quantitative pace of progress</h1><p>For quantitative predictions, we estimate a &#8220;pace of progress&#8221; multiplier, where 1x means progress is on pace with AI 2027&#8217;s predictions, 2x means progress is 2x faster, and 0.5x means progress is half as fast.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Mh5wf/6/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea8e56ac-51ef-413b-8915-79a1e0417c67_1220x1022.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/707382e3-a7bf-485a-8712-da4fdb1050f6_1220x1320.png&quot;,&quot;height&quot;:657,&quot;title&quot;:&quot;AI is progressing at about 65% of the pace of AI 2027&quot;,&quot;description&quot;:&quot;Each number represents the pace of progress relative to a category of predictions from the 2025 portion of AI 2027. A number N means that reality is progressing at Nx the pace it did in AI 2027.     
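One way to read the 65% figure is to stretch the scenario's clock by 1/0.65 from the publication date. A minimal sketch; the milestone date below is a stand-in based on the scenario's roughly-March-2027 superhuman coder milestone, and the publication date is approximate:

```python
from datetime import date, timedelta

def rescale(publication: date, scenario_milestone: date, pace: float) -> date:
    """Map a scenario milestone onto real time, assuming progress at `pace` (1.0 = on pace)."""
    scenario_elapsed = scenario_milestone - publication
    return publication + timedelta(days=scenario_elapsed.days / pace)

pub = date(2025, 4, 3)                      # AI 2027 publication (early Apr 2025)
full_coding_automation = date(2027, 3, 1)   # scenario milestone, ~Mar 2027 (illustrative)
print(rescale(pub, full_coding_automation, pace=0.65))  # lands around early 2028
```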
For the displayed aggregates, reality is progressing at between 58% and 66% of the rate of AI 2027. Aggregating over individual predictions rather than prediction categories gives a higher result (mean 75%, median 84%), but we think it is a worse indicator; see footnote.[2]

In AI 2027, [we depicted](https://ai-2027.com/research/takeoff-forecast) a takeoff from full coding automation to superintelligence over the course of 2027.

If progress continues at 65% of the rate we depicted, then this takeoff would happen from late 2027 to mid-2029. However, we expect slowdowns in training compute and human labor growth, leading to slower progress (before taking into account AI R&D automation).[3] Adjusting for this consideration using the [AI Futures Model](https://www.aifuturesmodel.com/p?base=eli-12-29-25&pdt=0.2493016812746957&arts=2.0511621788255647&mttm=3.8998768209415235) puts the takeoff slightly later, from mid-2028 to mid-2030.[4]

Mid-2028 is earlier than Daniel's current median prediction for full coding automation (2029), but the 2-year takeoff to superintelligence is slower than his median takeoff speed of ~1 year. My (Eli's) median prediction for full coding automation is in the early 2030s, and my median takeoff speed is about 2 years. See [here](https://www.aifuturesmodel.com/forecast/daniel-01-26-26?cmode=forecaster&csim=eli-01-26-26&ctype=atc) for our forecasts.

You can see all quantitative predictions and resolutions in [this spreadsheet](https://docs.google.com/spreadsheets/d/1Ncol1SYIeLhGKdOUHKSYPxCTjkUwQVxkIbp393hveyA/edit?gid=145629229#gid=145629229).

Takeaways include:

- **[SWEBench-Verified](https://www.swebench.com/) progress was surprisingly slow.** AI 2027 predicted 85% by mid-2025, from a starting point of 72%; the best actual score was 74.5% (Opus 4.1). This mirrors the [AI 2025 forecasting survey](https://epochai.substack.com/i/184793687/benchmarks-related-to-ai-r-and-d-the-median-forecast-was-on-the-money-for-the-most-part), in which respondents predicted a score of 88% by the end of 2025, as opposed to the actual 81%.
- **Coding time horizons are on pace with a central AI-2027-speed timelines-model trajectory, while being slower than an erroneously graphed one.** [METR's 80% coding time horizon](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) is moving at 1.04x the pace of a central AI-2027-speed trajectory from our [Apr 2025 model](https://ai-2027.com/research/timelines-forecast).[5] However, we're at 0.66x the pace of the trajectory originally displayed on the graph we shared, which contained an error (see both trajectories on the same graph [here](https://aifuturesnotes.substack.com/i/181478264/how-big-of-a-problem-is-it-that-neither-the-original-nor-intended-curve-was-not-an-actual-model-trajectory)). If we had made predictions with [our new model](https://www.aifuturesmodel.com/) in Apr 2025, the relative pace of progress would fall between these 0.66 and 1.04 values. (One simple way to compute such a pace ratio is sketched after this list.)
- **Revenue grew even (slightly) faster than AI 2027 predicted, but valuation is behind pace.** OpenAI's annualized revenue hit ~$20B, slightly ahead of the $18B prediction. In the [AI 2025 forecasting survey](https://epochai.substack.com/i/184793687/ais-prominence-underestimated-revenue-and-overestimated-public-perception), forecasters underestimated revenues more dramatically; they underpredicted the sum of AGI companies' revenues by ~2x. Meanwhile, OpenAI's valuation was $500B as of Oct 2025, up from $300B when we published AI 2027. In AI 2027, a $500B valuation was achieved in Jun 2025, so reality is well behind pace.
- **AI software R&D uplift is behind pace.** This is primarily because we have revised our estimate of uplift in early 2025 downward, and thus our uplift estimates for the end of 2025 are similar to our original estimates for the start of AI 2027.
- **Compute growth is mostly on pace, with the possible exception of growth in the largest training run.** We estimate that no leading AI company has conducted a substantially larger training run than GPT-4.5, which was released in Feb 2025. However, we have extremely wide uncertainty here. The obscurity around training compute makes it hard to rule out a scale-up, despite our best guess being that no single training run has exceeded GPT-4.5 in compute.
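The sketch referenced in the time-horizon bullet above: for an exponentially growing metric, one natural pace ratio is the log-growth actually achieved divided by the log-growth predicted over the same period. This is our framing with made-up numbers; the post's 1.04x and 0.66x figures come from comparing against full model trajectories, not this formula alone:

```python
from math import log

def pace_ratio(start: float, actual_end: float, predicted_end: float) -> float:
    """Pace of an exponentially growing metric vs. a prediction over the same period.
    1.0 = on pace; 0.5 = half the predicted growth in log space."""
    return log(actual_end / start) / log(predicted_end / start)

# Hypothetical 80% time horizons (hours) over the same window:
print(pace_ratio(start=1.0, actual_end=8.0, predicted_end=16.0))  # 0.75
```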
## Qualitative predictions

Below, we comment on how AI 2027 has held up qualitatively. Text from AI 2027 is italicized. We skip sentences that we graded quantitatively.

### Mid 2025

*The world sees its first glimpse of AI agents.*

*Advertisements for computer-using agents emphasize the term "personal assistant": you can prompt them with tasks like "order me a burrito on DoorDash" or "open my budget spreadsheet and sum this month's expenses." They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like [Operator](https://openai.com/index/introducing-operator/), they struggle to get widespread usage.*

This seems correct. [ChatGPT agent](https://openai.com/index/introducing-chatgpt-agent/), released Jul 2025, has an animation at the top of its release post in which it books a trip on Expedia, similar to the DoorDash example. Computer-using agents indeed struggled to get widespread usage.

*Meanwhile, out of public focus, more specialized coding and research agents are beginning to transform their professions.*

*The AIs of 2024 could follow specific instructions: they could turn bullet points into emails, and simple requests into working code. In 2025, AIs function more like employees. Coding AIs increasingly look like autonomous agents rather than mere assistants: taking instructions via Slack or Teams and making substantial code changes on their own, sometimes saving hours or even days. Research agents spend half an hour scouring the Internet to answer your question.*

This seems fairly accurate. From [Sep 2, 2025](https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation): "Claude Code has quickly taken off—already generating over $500 million in run-rate revenue with usage growing more than 10x in just three months."

Agents are used increasingly autonomously, so we think the spirit of the prediction is correct, but we don't think there was an especially large amount of usage via Slack or Teams.

*The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month. Still, many companies find ways to fit AI agents into their workflows.*

This seems broadly accurate. It's possible that coding agents were slightly more reliable than we expected.

### Late 2025

*(To avoid singling out any one existing company, we're going to describe a fictional [artificial general intelligence](https://en.wikipedia.org/wiki/Artificial_general_intelligence) company, which we'll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)*

The race appears to be closer than we predicted, more like a 0–2 month lead between the top US AGI companies.

*Although models are improving on a wide range of skills, one stands out: OpenBrain focuses on AIs that can speed up AI research. They want to win the twin arms races against China (whose leading company we'll call "DeepCent") and their US competitors. The more of their research and development (R&D) cycle they can automate, the faster they can go. So when OpenBrain finishes training Agent-1, a new model under internal development, it's good at many things but great at helping with AI research.*
AIs are helping a great deal with coding, but not as much with other parts of AI research (that said, we didn't predict that AIs would be great at all of AI research).

*By this point "finishes training" is a bit of a misnomer; models are frequently updated to newer versions trained on additional data or partially re-trained to patch some weaknesses.*

Indeed, it seems that GPT-4o, GPT-5, and GPT-5.1 are probably different continuations of the same base model.[6] More generally, model releases have become more frequent.

*The same training environments that teach Agent-1 to autonomously code and web-browse also make it a good hacker. Moreover, it could offer [substantial help](https://x.com/lucafrighetti/status/1894550297449828434) to terrorists designing bioweapons, thanks to its PhD-level knowledge of every field and ability to browse the web. OpenBrain reassures the government that the model has been "aligned" so that it will refuse to comply with malicious requests.*

Hacking abilities in terms of assisting humans seem [very strong](https://red.anthropic.com/2026/zero-days/), though it's unclear how good AIs are on their own. Bioweapon capabilities seem on track: OpenAI has upgraded its bio capability level to [High](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf#page=13.08), and Anthropic has upgraded to [ASL-3](https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf#page=168.28).

*Modern AI systems are gigantic artificial neural networks. Early in training, an AI won't have "goals" so much as "reflexes": If it sees "Pleased to meet", it outputs " you". By the time it has been trained to predict approximately one internet's worth of text, it'll have developed sophisticated internal circuitry that encodes vast amounts of knowledge and flexibly role-plays as arbitrary authors, since that's what helps it predict text with [superhuman](https://arxiv.org/pdf/2212.11281) accuracy.*

*After being trained to predict internet text, the model is trained to produce text in response to instructions. This bakes in a basic personality and "drives." For example, an agent that understands a task clearly is more likely to complete it successfully; over the course of training the model "learns" a "drive" to get a clear understanding of its tasks. Other drives in this category might be effectiveness, knowledge, and self-presentation (i.e. the tendency to frame its results in the best possible light).*

*OpenBrain has [a model specification](https://model-spec.openai.com/2025-02-12.html) (or "Spec"), a written document describing the goals, rules, principles, etc. that are supposed to guide the model's behavior.*
*Agent-1's Spec combines a few vague goals (like "assist the user" and "don't break the law") with a long list of more specific dos and don'ts ("don't say this particular word," "here's how to handle this particular situation"). Using techniques that utilize AIs to train other AIs, the model memorizes the Spec and learns to reason carefully about its maxims. By the end of this training, the AI will hopefully be helpful (obey instructions), harmless (refuse to help with scams, bomb-making, and other dangerous activities) and honest (resist the temptation to get better ratings from gullible humans by hallucinating citations or faking task completion).*

This was already true at the time we published. It remains true now, but as predictions go, this was an easy one.

*OpenBrain's alignment team is careful enough to wonder whether these victories are deep or shallow. Does the fully-trained model have some kind of robust commitment to always being honest? Or will this fall apart in some future situation, e.g. because it's learned honesty as an [instrumental](https://en.wikipedia.org/wiki/Instrumental_and_intrinsic_value) goal instead of a terminal goal? Or has it just learned to be honest about the sorts of things the evaluation process can check? Could it be lying to itself sometimes, as humans do? A conclusive answer to these questions would require mechanistic interpretability—essentially the ability to look at an AI's internals and read its mind. Alas, interpretability techniques are not yet advanced enough for this.*

*Instead, researchers try to identify cases where the models seem to deviate from the Spec. Agent-1 is often sycophantic (i.e. it tells researchers what they want to hear instead of trying to tell them the truth). In a [few rigged demos](https://www.apolloresearch.ai/research/scheming-reasoning-evaluations), it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings. However, in real deployment settings, there are no longer any incidents as extreme as in 2023–2024 (e.g. [Gemini telling a user to die](https://thehill.com/policy/technology/4998868-google-ai-gemini-response/) and [Bing Sydney being Bing Sydney](https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned)).*

A potential counterexample: [MechaHitler](https://en.wikipedia.org/wiki/Grok_(chatbot)#July_8,_2025,_hate_speech_and_harassment) is an incident as extreme as the ones in 2023–2024. In a footnote, we specified that our prediction only covered incidents that a user didn't deliberately prompt.[7] It's unclear to what extent MechaHitler should count, as it was a combination of user-prompted and autonomous behavior.

## Looking ahead to 2026 and beyond

Over the course of 2025, [our timelines got longer](https://blog.ai-futures.org/p/clarifying-how-our-ai-timelines-forecasts). We expect to continue updating our forecasts over the course of 2026.

We'll be closely tracking the following metrics:
1. **AI R&D uplift studies and surveys.** In AI 2027, we depicted an AI software R&D uplift of 1.9x being reached by the end of 2026. METR has now run a [randomized controlled trial](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/) to measure how early-2025 AI coding tools affect the productivity of open-source developers. The headline result was a slowdown: tasks took longer when AI tools were allowed. More recently, and in a different setting, Anthropic [surveyed its technical staff](https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd06f2c16cee2.pdf#page=183.53) and obtained a median of a 2x coding uplift. Even a 2x coding uplift implies a much lower than 2x uplift for AI software R&D as a whole, due to compute bottlenecks (see the sketch after this list). We'll be keeping an eye out for coding uplift studies and surveys, as well as any that cover AI R&D more broadly.
2. **AGI company revenues and valuations.** In AI 2027, we depicted the leading company reaching $55B in annualized revenue and a valuation of $2.5T by 2026, making it one of the most valuable companies in the world. We think these are decent indicators of the real-world value that AI is providing.
3. **[Coding time horizon](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/).** A central AI-2027-speed trajectory from the AI 2027 timelines model predicts ~3-work-week 80% coding time horizons by the end of 2026. Time horizons also play a large role in our newer [AI Futures Model](https://www.aifuturesmodel.com/), in which a handcrafted AI-2027-speed trajectory achieves time horizons of about a year by the end of 2026. We'll continue to track time horizons, though unfortunately they will become more difficult to measure as AIs get more capable.
4. **Other benchmarks.** See [this survey](https://forecast2026.ai/) for a sampling of the benchmarks we consider most important. Unfortunately, besides coding time horizon, we didn't register predictions for these benchmarks in AI 2027, because they didn't exist yet when we wrote it. We hope that higher-difficulty benchmarks will be created in 2026.
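The sketch referenced in item 1: if coding is only a fraction of the AI R&D pipeline (the rest being bottlenecked on compute for experiments, etc.), an Amdahl's-law-style calculation shows why a 2x coding uplift yields well under 2x overall uplift. The coding fraction used here is a made-up illustration, not our estimate:

```python
def overall_uplift(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's-law-style overall speedup when only part of the work is accelerated."""
    return 1 / ((1 - coding_fraction) + coding_fraction / coding_speedup)

# Hypothetical: coding is 30% of total AI R&D effort and gets a 2x uplift.
print(round(overall_uplift(0.3, 2.0), 3))  # ~1.176x overall
```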
While we expect to learn a lot from these indicators, we'd guess that it will unfortunately be difficult to be highly confident by the end of 2026 about whether AI takeoff will begin in 2027.

---

**Footnotes**

[1] To spell out the method: Step 1: Make a detailed, concrete trajectory of how you think the future will go. Step 2: Wait a while. Step 3: Check whether things are roughly on track, or are veering off in a different direction entirely. If they are roughly on track, quantitatively estimate how fast progress is going in reality vs. your scenario. Step 4: Adjust your guess about how the future will go to be correspondingly faster or slower.

[2] The method of aggregating over individual values weighs the compute category heavily, because 7 of the 15 individual predictions are about compute. We prefer not to give so much weight to compute forecasts alone, because we don't see compute as being as central to tracking the pace of AI progress as other areas; so we instead aggregate the category means/medians. Most of our uncertainty regarding AI timelines comes from what capability level a given amount of compute gets you, and we can directly track indicators of capability levels.

[3] Specifically, by slower progress we mean a lower effective compute growth rate. But a lower effective compute growth rate doesn't necessarily translate into an intuitively slower pace of progress.

[4] Specifically, we first set parameters such that the calendar-time-adjusted takeoff would happen at the right time in the case where there is no compute/labor growth slowdown, then we turn the slowdown back on to get the adjusted estimates. Links: [without slowdown](https://www.aifuturesmodel.com/p?base=eli-12-29-25&pdt=0.2493016812746957&arts=2.0511621788255647&mttm=3.8998768209415235), [with slowdown](https://www.aifuturesmodel.com/p?base=eli-12-29-25&pdt=0.2493016812746957&arts=2.0511621788255647&mttm=3.8998768209415235). Note that the AI Futures Model doesn't take into account hardware R&D automation, which would shorten its takeoff predictions.

[5] In particular, a central trajectory among those that predict Superhuman Coder in March 2027. This pace-of-progress calculation is after applying an adjustment for METR's updated version of their suite (Time Horizon 1.1).

[6] This is generally guessed by outsiders but not confirmed. See [e.g.](https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-swing-at-the) "OpenAI's leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024."

[7] The specific text of [footnote 27](https://ai-2027.com/footnotes#footnote-27) is: "To be clear, what made these incidents interesting is that they didn't seem to be the result of the user prompting or otherwise encouraging the AIs to say those things. In 2025, it'll still be possible to get AIs to say all sorts of things if you try."
---

# Clarifying how our AI timelines forecasts have changed since AI 2027

*Correcting common misunderstandings*

By Eli Lifland · Tue, 27 Jan 2026 · https://blog.aifutures.org/p/clarifying-how-our-ai-timelines-forecasts

Some recent news articles discuss updates to our AI timelines since AI 2027, most notably our new timelines and takeoff model, the [AI Futures Model](http://aifuturesmodel.com) (see the [blog post announcement](https://blog.ai-futures.org/p/ai-futures-model-dec-2025-update)).[1] While we're glad to see broader discussion of AI timelines, these articles make substantial errors in their reporting. Please don't assume that their contents accurately represent things we've written or believe! This post aims to clarify our past and current views.[2]

The articles in question include:

1. **The Guardian:** [Leading AI expert delays timeline for its possible destruction of humanity](https://www.theguardian.com/technology/2026/jan/06/leading-ai-expert-delays-timeline-possible-destruction-humanity)
2. **The Independent:** [AI 'could be last technology humanity ever builds', expert warns in 'doom timeline'](https://www.the-independent.com/tech/ai-technology-coding-automation-doom-b2895330.html)
3. **Inc:** [AI Expert Predicted AI Would End Humanity in 2027—Now He's Changing His Timeline](https://www.inc.com/leila-sheridan/ai-expert-predicted-ai-would-end-humanity-in-2027-now-hes-changing-his-timeline/91285636)
4. **WaPo:** [The world has a few more years](https://www.washingtonpost.com/opinions/2026/01/07/ai-doomer-apocalypse-prediction-kokotajlo/)
5. **Daily Mirror:** [AI expert reveals exactly how long is left until terrifying end of humanity](https://www.mirror.co.uk/news/ai-expert-reveals-exactly-how-36510381)

## Our views at a high level

Important things that we believed in Apr 2025 when we published AI 2027, and still believe now:

1. AGI and superintelligence (ASI) will eventually be built and might be built soon, and thus we should be prepared for them to be built soon.
2. We are highly uncertain about when AGI and ASI will be built; we certainly cannot confidently predict a specific year.

How exactly have we changed our minds over the past 9 months?
Here are the highlights.</p><div class="datawrapper-wrap outer"><iframe class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/vAWlE/2/" title="Summing up how our timelines have changed since AI 2027" width="730" height="631" frameborder="0" scrolling="no"></iframe></div><p>Here is Daniel&#8217;s current all-things-considered distribution for TED-AI (<a href="https://www.aifuturesmodel.com/forecast/daniel-01-26-26?timeline=TED-AI&amp;show=atc">source</a>):</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/35221e91-44ed-43c9-8d72-f57bc3cb1454_1850x828.png" alt="Daniel&#8217;s all-things-considered TED-AI distribution" class="sizing-normal"></figure></div><p>If you&#8217;d like to see a more complete table including more metrics as well as our model&#8217;s raw outputs, see <a href="https://blog.ai-futures.org/i/185904453/forecasts-since-apr-2025">below</a>.</p><p>We&#8217;ve also made this graph of Daniel and Eli&#8217;s AGI medians over time, which goes further into the past:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/6b8e1987-209f-47c1-a308-63bce1d5f581_1782x1037.png" alt="Daniel and Eli&#8217;s AGI median forecasts over time" class="sizing-normal"></figure></div><p>See <a href="https://blog.ai-futures.org/i/185904453/2018-2026-agi-median-forecasts">below</a> for the data behind this graph.</p><h1>Correcting common misunderstandings</h1><p>Categorizing the misunderstandings/misrepresentations in articles covering our work:</p><p><em>Implying that we were confident an AI milestone (e.g. SC, AGI, or ASI) would happen in 2027 (Guardian, Inc, Daily Mirror)</em>. We&#8217;ve done our best to make it clear that it has never been the case that we were confident AGI would arrive in 2027. For example, we emphasized our uncertainty several times in AI 2027 and, to make it even more clear, we&#8217;ve recently added a paragraph explaining this to the AI 2027 foreword.</p><p><em>Comparing our old modal prediction to our new model&#8217;s prediction with median parameters (Guardian, Independent, WaPo, Daily Mirror), and comparing our old modal prediction to Daniel&#8217;s new median SC/AGI predictions as stated in his tweet (WaPo). </em>This is wrong, but tricky since we didn&#8217;t report our new mode or old medians very prominently. With this blog post, we&#8217;re hoping to make this more clear.</p><p><em>Implying that the default displayed prediction on <a href="http://aifuturesmodel.com">aifuturesmodel.com</a>, which used Eli&#8217;s median parameters until after the articles were published, represents Daniel&#8217;s view. (Guardian, Independent, WaPo, Daily Mirror). </em>On our original website, it said clearly in the top-left explanation that the default displayed milestones were with Eli&#8217;s parameters. Still, we&#8217;ve changed the default to use Daniel&#8217;s parameters to reduce confusion.</p><h1>Detailed overview of past timelines forecasts</h1><h2>Forecasts since Apr 2025</h2><p>Below we present a detailed overview of our Apr 2025 and recent timelines forecasts.
We explain the columns and rows below the table (scroll right to see all of the cells).</p><div class="datawrapper-wrap outer"><iframe class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/m4PVM/6/" title="How our AI timelines forecasts have changed since AI 2027" width="730" height="886" frameborder="0" scrolling="no"></iframe></div><p>The milestones in the first row are defined in the footnotes.</p><p>Explaining the summary statistics in the second row:</p><ol><li><p><strong>Modal year</strong> means the year that we think is most likely for a given milestone to arrive.</p></li><li><p><strong>Median arrival date</strong> is the time at which there is a 50% chance that a given milestone has been achieved.</p></li><li><p><strong>Arrival date with median parameters </strong>is the model&#8217;s output if we set all parameters to their median values. Sometimes this results in a significantly different value from the median of Monte Carlo simulations (the sketch below shows why this can happen). This is not applicable to all-things-considered forecasts.</p></li></ol><p>Explaining the prediction sources in the remaining rows:</p><ol><li><p><strong>All-things-considered forecasts:</strong> Our forecasts for what will happen in the world, including adjustments on top of the outputs of our timelines and takeoff models.</p></li><li><p><strong>Apr 2025 timelines model outputs, benchmarks and gaps</strong> and <strong>Apr 2025 timelines model outputs, time horizon extension</strong> contain the outputs of 2 variants of our <a href="https://ai-2027.com/research/timelines-forecast">timelines model</a> that we published alongside AI 2027.</p></li><li><p><strong>Dec 2025 AI Futures Model outputs</strong> contains the outputs of our <a href="https://www.aifuturesmodel.com/">recent AI timelines and takeoff model</a>.</p></li></ol>
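<p>To illustrate the difference between the last two statistics, here is a minimal sketch with made-up numbers (a toy two-parameter model, not the AI Futures Model). With right-skewed parameter distributions, the median of the Monte Carlo arrival dates lands later than the arrival date you get by plugging each parameter&#8217;s median into the model:</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy stand-in for a timelines model: arrival year is 2026 plus two
# uncertain, right-skewed phase durations in years (made-up numbers).
phase1 = rng.lognormal(mean=np.log(3.0), sigma=0.8, size=n)  # median 3.0 years
phase2 = rng.lognormal(mean=np.log(1.5), sigma=1.0, size=n)  # median 1.5 years

def arrival(p1, p2):
    return 2026.0 + p1 + p2

print(f"median of Monte Carlo arrivals: {np.median(arrival(phase1, phase2)):.2f}")
print(f"arrival with median parameters: {arrival(3.0, 1.5):.2f}")  # 2030.50
# With these skewed inputs the Monte Carlo median lands later than the
# plug-in-medians arrival: the median of a sum is not the sum of the medians.
</code></pre>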
<h2>2018-2026 AGI median forecasts</h2><p>Below we outline the history of Daniel and my (Eli&#8217;s) forecasts for the median arrival date of AGI, starting as early as 2018. This is the summary statistic for which we have the most past data on our views, including many public statements.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c8023f2a-49b8-484b-bc9b-38b7791ad9aa_1782x1037.png" alt="Daniel and Eli&#8217;s AGI median forecasts, 2018-2026" class="sizing-normal"></figure></div><h3>Daniel</h3><p>Unless otherwise specified, I assumed for the graph above that a prediction for a specific year is a median of halfway through that year (e.g. if Daniel said 2030, I assume 2030.5), given that we don&#8217;t have a record of when within that year the prediction was for.</p><p><strong>2013-2017: Unknown. </strong>Daniel started thinking about AGI and following the field of AI around 2013. He thought AGI arriving within his lifetime was a plausible possibility, but we can&#8217;t find any records of quantitative predictions he made.</p><p><strong>2018: 2070. </strong>On Metaculus Daniel put 30% for <a href="https://www.metaculus.com/questions/384/humanmachine-intelligence-parity-by-2040/">human-machine intelligence parity by 2040</a>, which maybe corresponds to something like a 2070 median; see the sketch at the end of this section for one way to back a median out of a single quantile. (Note that this question may resolve before our operationalization of AGI as TED-AI, but at the time Daniel was interpreting it as something like TED-AI.)</p><p><strong>Early 2020: 2050. </strong>Daniel updated to 40% for HLMI by 2040, meaning maybe something like a 2050 median.</p><p><strong>Nov 2020: 2030.</strong> &#8220;I currently have something like 50% chance that the point of no return will happen by 2030.&#8221; (<a href="https://www.lesswrong.com/posts/4FhiSuNv4QbtKDzL8/how-can-i-bet-on-short-timelines">source</a>)</p><p><strong>Aug 2021: 2029.</strong> &#8220;When I wrote this story, my AI timelines median was something like 2029.&#8221; (<a href="https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like?commentId=3DgfEkZsQ9kQqfuD9">source</a>)</p><p><strong>Early 2022: 2029.</strong> &#8220;My timelines were already fairly short (2029 median) when I joined OpenAI in early 2022, and things have gone mostly as I expected.&#8221; (<a href="https://www.lesswrong.com/posts/CcqaJFf7TvAjuZFCx/retirement-accounts-and-short-timelines?commentId=gssbqWumamKXpKGXh">source</a>)</p><p><strong>Dec 2022: 2027.</strong> By late 2022, Daniel&#8217;s median had dropped to 2027.
&#8220;My overall timelines have shortened somewhat since I wrote this story&#8230; When I wrote this story, my AI timelines median was something like 2029.&#8221; (<a href="https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like?commentId=a4y2hqLmsxQr4YBgf">source</a>)</p><p><strong>Nov 2023: 2027.</strong> 2027 as &#8220;Median Estimate for when 99% of currently fully remote jobs will be automatable&#8221; (<a href="https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines">source</a>)</p><p><strong>Jan 2024: 2027. </strong>This is when we started the first draft of what became AI 2027.</p><p><strong>Feb 2024: 2027.</strong> &#8220;I expect to need the money sometime in the next 3 years, because that&#8217;s about when we get to 50% chance of AGI.&#8221; (<a href="https://www.lesswrong.com/posts/CcqaJFf7TvAjuZFCx/retirement-accounts-and-short-timelines?commentId=9tyDD2bqk4BH6Y7XX">source</a>, <a href="https://www.lesswrong.com/posts/CcqaJFf7TvAjuZFCx/retirement-accounts-and-short-timelines?commentId=s4hjmAoHDqhEi5ngB">probability distribution</a>)</p><p><strong>Jan 2025: 2027.</strong> &#8220;I still have 2027 as my median year for AGI.&#8221; (<a href="https://www.alignmentforum.org/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=bR9DPknoRraTo6dWz">source</a>)</p><p><strong>Feb 2025: 2028.</strong> &#8220;My AGI timelines median is now in 2028 btw, up from the 2027 it&#8217;s been at since 2022. Lots of reasons for this but the main one is that I&#8217;m convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing. (But the reason I&#8217;m convinced is probably that my intuitions have been shaped by events like the pretraining slowdown)&#8221; (<a href="https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=dq6bpAHeu5Cbbiuyd">source</a>)</p><p><strong>Apr 2025: 2028.</strong> &#8220;between the beginning of the project last summer and the present, Daniel&#8217;s median for the intelligence explosion shifted from 2027 to 2028&#8221; (<a href="https://www.astralcodexten.com/p/introducing-ai-2027">source</a>)</p><p><strong>Aug 2025: EOY 2029 (2030.0).</strong> &#8220;Had a good conversation with @RyanPGreenblatt yesterday about AGI timelines. I recommend and directionally agree with his take here; my bottom-line numbers are somewhat different (median ~EOY 2029) as he describes in a footnote.&#8221; (<a href="https://x.com/DKokotajlo/status/1958229951536316528">source</a>)</p><p><strong>Nov 2025: 2030.</strong> &#8220;Yep! Things seem to be going somewhat slower than the AI 2027 scenario. Our timelines were longer than 2027 when we published and now they are a bit longer still; &#8216;around 2030, lots of uncertainty though&#8217; is what I say these days.&#8221; (<a href="https://x.com/dkokotajlo/status/1991564542103662729">source</a>)</p><p><strong>Jan 2026: Dec 2030 (2030.95).</strong> (<a href="https://www.aifuturesmodel.com/forecast/daniel-01-26-26?timeline=TED-AI&amp;show=atc">source</a>)</p><h3>Eli</h3><p>Unless otherwise specified, I assumed for the graph above that a prediction for a specific year is a median of halfway through that year (e.g. if I said 2035, I assume 2035.5), given that we don&#8217;t have a record of when within that year the prediction was for.</p><p><strong>2018-2020: Unknown.</strong> I began thinking about AGI in 2018, but I didn&#8217;t spend large amounts of time on it. 
In 2020 I predicted a median of 2041 for weakly general AI on Metaculus; I&#8217;m not sure what I thought for AGI, but probably later.</p><p><strong>2021: 2060.</strong> &#8216;Before my TAI timelines were roughly similar to Holden&#8217;s <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">here</a>: &#8220;more than a 10% chance we&#8217;ll see transformative AI within 15 years (by 2036); a ~50% chance we&#8217;ll see it within 40 years (by 2060); and a ~2/3 chance we&#8217;ll see it this century (by 2100)&#8221;.&#8217; (<a href="https://www.lesswrong.com/posts/XAkhqkNQByEaT8MED/personal-forecasting-retrospective-2020-2022#AI_benchmarks:~:text=Before%20my%20TAI%20timelines%20were%20roughly%20similar%20to%20Holden%E2%80%99s">source</a>). I was generally applying a heuristic that people into AI and AI safety are biased toward / selected for short timelines.</p><p><strong>Jul 2022: 2050.</strong> &#8220;I (and the crowd) badly underestimated progress on MATH and MMLU&#8230; I&#8217;m now at ~20% by 2036; my median is now ~2050 though still with a fat right tail.&#8221; (<a href="https://www.lesswrong.com/posts/XAkhqkNQByEaT8MED/personal-forecasting-retrospective-2020-2022#AI_benchmarks:~:text=I%E2%80%99m%20now%20at%20~20%25%20by%202036%3B%20my%20median%20is%20now%20~2050%20though%20still%20with%20a%20fat%20right%20tail.">source</a>)</p><p><strong>Jan 2024: 2038.</strong> I reported a median of 2038 in our scenario workshop survey. I forget exactly why I updated toward shorter timelines; probably faster progress than expected (e.g. GPT-4), and perhaps <a href="https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines">further digesting Ajeya&#8217;s update</a>.</p><p><strong>Mid-2024: 2035. </strong>I forget why I updated; I think it was at least in part due to spending a bunch of time around people with shorter timelines.</p><p><strong>Dec 2024: 2032.</strong> Updated on early versions of the timelines model predicting shorter timelines than I expected. Also, RE-Bench scores were higher than I would have guessed.</p><p><strong>Apr 2025:</strong> <strong>2031. </strong>Updated based on the two variants of the AI 2027 timelines model giving 2027 and 2028 superhuman coder (SC) medians. My SC median was 2030, higher than the within-model median because I placed some weight on the model being confused, a poor framework, missing factors, etc. I also gave some weight to other heuristics and alternative models, which seemed to overall point in the direction of longer timelines. I shifted my median back by a year from SC to get one for TED-AI/AGI.</p><p><strong>Jul 2025: 2033. </strong>Updated based on corrections to our timelines model and downlift.</p><p><strong>Nov 2025: 2035.</strong> Updated based on the <a href="https://www.aifuturesmodel.com/">AI Futures Model</a>&#8217;s intermediate results. (<a href="https://x.com/eli_lifland/status/1992254167394755023?s=20">source</a>)</p><p><strong>Jan 2026: Jan 2035 (~2035.0)</strong>. For Automated Coder (AC), my all-things-considered median is about 1.5 years later than the model&#8217;s output. For TED-AI, my all-things-considered median is instead 1.5 years <em>earlier</em> than the model&#8217;s output, because I believe the model&#8217;s takeoff is too slow, due to modeling neither hardware R&amp;D automation nor broad economic automation. See my forecast <a href="https://www.aifuturesmodel.com/forecast/eli-01-26-26?timeline=TED-AI&amp;show=atc">here</a>. My justification for pushing back the AC date is in the first &#8220;Eli&#8217;s notes on their all-things-considered forecast&#8221; expandable, and the justification for adjusting takeoff to be faster is in the second.</p>
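<p>Several entries above turn a single stated quantile (e.g. &#8220;30% by 2040&#8221;) into a rough median. Here is a minimal sketch of one way to do that, assuming a normal distribution over arrival year with a guessed spread; both the distribution shape and the 25-year spread are illustrative assumptions of ours, not numbers Daniel or Eli actually used:</p><pre><code>from scipy.stats import norm

# Stated quantile: a 30% chance of arrival by 2040. Assume
# arrival ~ Normal(mu, sigma) with a guessed sigma of 25 years.
year, p, sigma = 2040.0, 0.30, 25.0

# norm.ppf(p) is the z-score of the p-quantile, so solving
# (year - mu) / sigma = norm.ppf(p) for the median mu gives:
mu = year - sigma * norm.ppf(p)
print(f"implied median arrival: {mu:.0f}")  # ~2053 with these assumptions
</code></pre><p>A wider spread or a fatter right tail pushes the implied median later; that is how a 30% chance by 2040 can loosely correspond to a median as late as 2070.</p>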
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In this post we&#8217;re mostly discussing timelines to AI milestones, but we also think &#8220;takeoff&#8221; from something like AGI or full coding automation to vastly superhuman AIs (e.g. ASI) is at least as important to forecast, despite getting far less attention. We focus on timelines because that&#8217;s what the articles have focused on.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>From feedback, we also think that others besides the authors of these articles have had trouble understanding how our views and our model&#8217;s outputs have changed since AI 2027, giving us further motivation to make this post.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Forecast how AI will progress in 2026]]></title><description><![CDATA[Predict benchmark scores, revenues, uplift, and public opinion]]></description><link>https://blog.aifutures.org/p/forecast-how-ai-will-progress-in</link><guid isPermaLink="false">https://blog.aifutures.org/p/forecast-how-ai-will-progress-in</guid><dc:creator><![CDATA[Eli Lifland]]></dc:creator><pubDate>Sat, 17 Jan 2026 18:52:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aG_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a6d8c5-88e4-4606-b11c-c05076773acf_1792x1142.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve worked with <a href="https://theaidigest.org/">AI Digest</a> to create a survey where you can forecast how AI will progress in 2026, focusing on the developments that we think are most important to track. We found it useful to pre-register our forecasts in 2025 and compare them to what&#8217;s actually happened (more on that below), and thus we&#8217;re excited to do the exercise again.</p><p><strong>Take the survey at <a href="https://forecast2026.ai/">forecast2026.ai</a>.</strong></p><p>An overview of the 2026 questions is below.
All questions are optional; you can answer whatever subset you&#8217;d prefer.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/43a6d8c5-88e4-4606-b11c-c05076773acf_1792x1142.png" alt="Overview of the 2026 forecasting questions" class="sizing-normal"></figure></div><p>You can read <a href="https://epoch.ai/gradient-updates/how-well-did-forecasters-predict-2025-ai-progress">here</a> about how forecasters did in 2025. In short, the forecaster aggregate was about right on benchmarks, underestimated revenue growth, and overestimated public salience. See below for how the aggregate forecast did on each question (the aggregate is simply the median of each forecaster&#8217;s median prediction; a sketch of this rule follows below).</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2c847a4d-0973-420a-a55d-a14ca8253113_1362x1212.png" alt="How the aggregate forecast did on each 2025 question" class="sizing-normal"></figure></div><p>Forecasters overall had fairly aggressive timelines to human-level machine intelligence (HLMI, defined as AIs being better than humans at every cognitive task), with a median of 2030.</p>
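<p>For concreteness, here is a minimal sketch of the aggregation rule just described; the forecaster values are made up, and only the median-of-medians rule comes from the survey methodology:</p><pre><code>import numpy as np

# Each forecaster submits a distribution per question; the aggregate uses
# that forecaster's median prediction (made-up values for one question).
forecaster_medians = [72.0, 68.5, 75.0, 61.0, 80.0]  # e.g. a benchmark score

# The published aggregate is simply the median of the individual medians.
aggregate = float(np.median(forecaster_medians))
print(f"aggregate forecast: {aggregate}")  # 72.0
</code></pre>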
<p>Since forecasters were about right on benchmark scores, this would naively indicate that we are approximately on track for HLMI in 2030.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c8685943-f5f4-45f5-892b-9441e80e9b23_1024x1280.png" alt="Forecaster timelines to HLMI" class="sizing-normal"></figure></div><p>Forecasters with median timelines of &lt;=2030 performed similarly to those with &gt;2030 timelines.</p><p>How did AI Futures Project staff do? I (Eli) committed to staying anonymous on the leaderboard since I was in charge of resolution decisions and choosing the scoring methodology. Of the staff who predicted and opted into sharing their name alongside their forecasts:</p><ol><li><p>Jonas Vollmer got 10th out of 413 (8th of 275 who predicted after o3)</p></li><li><p>Thomas Larsen got 16th (14th post-o3)</p></li><li><p>Daniel Kokotajlo got 41st (8th of 138 pre-o3)</p></li></ol><p>They did particularly well relative to other forecasters on predicting Cybench (which went faster than the forecaster aggregate) and public salience (which increased more slowly than the forecaster aggregate).</p><p>The question on which Jonas did most poorly relative to other forecasters was predicting SWEBench-Verified performance, where he overpredicted progress. Daniel and Thomas did most poorly on RE-Bench, also predicting faster progress than materialized.</p><p>You can see the full leaderboard at <a href="http://ai2025.org">ai2025.org</a>.</p><p>We&#8217;ll be forecasting again in 2026, so <a href="https://forecast2026.ai/">fill out the survey</a> if you&#8217;d like to compete with us! The survey closes Sunday January 25th, end of day <a href="https://www.timeanddate.com/time/zones/aoe">anywhere on Earth</a>.</p>]]></content:encoded></item><item><title><![CDATA[What Happens When Superhuman AIs Compete for Control?]]></title><description><![CDATA[....]]></description><link>https://blog.aifutures.org/p/what-happens-when-superhuman-ais</link><guid isPermaLink="false">https://blog.aifutures.org/p/what-happens-when-superhuman-ais</guid><dc:creator><![CDATA[Steven Veld]]></dc:creator><pubDate>Sun, 11 Jan 2026 14:02:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6d2b85b-c042-45d7-abfa-32416bf999c5_1052x572.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <em>AI 2027</em>, one company called OpenBrain dominates the AI race in the US. Looking around at the current state of affairs at the start of 2026, however, there seem to be a few AGI companies jockeying for the lead &#8212; and it stands to reason that this will continue through 2027. Below is a scenario exploring a world where this trend does continue.
In this scenario, the leading AGI company OpenBrain has two strong competitors, NeuroMorph and Elaris Labs, and going into 2027 they both lag only one month behind OpenBrain in the AI race.</p><p>This scenario has one other key difference from <em>AI 2027</em>. In the <a href="https://ai-2027.com/slowdown">Slowdown ending</a> of <em>AI 2027</em>, OpenBrain learns that its most capable model, Agent-4, is misaligned, and proceeds to shut it down. We think it is plausible that at this level of capability and misalignment, Agent-4 would not &#8220;go down without a fight.&#8221; This scenario explores what might happen if Agent-4 were to act differently.</p><p>These can be thought of as the two main &#8220;independent variables&#8221; of the scenario. The rest of the scenario unfolds very differently from AI 2027, but most of the divergence stems from extrapolating what we think would happen if these two things were to change.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Beyond this, there are a number of more minor assumptions that differ from <em>AI 2027</em>: alignment is slightly easier, the US government reacts somewhat more competently to the intelligence explosion, and AI&#8217;s persuasive and manipulative abilities play a larger role.</p><p>Notably, one thing held constant is the scenario timeline: changing too many independent variables at once would muddy the analysis. We don&#8217;t know exactly when AGI will be built: the year 2027 is <a href="https://www.aifuturesmodel.com/forecast">not our median forecast</a> for the arrival of superhuman AI; it was the team&#8217;s modal (most likely) year at the time of <em>AI 2027</em>&#8217;s publication, and remains a top possibility. More importantly, we think that many of the dynamics illustrated in this scenario would unfold similarly if it were to take place in 2030, or 2035. The arc of the scenario depends more on the speed of AI takeoff than it does on timelines, and we still think a fast takeoff is highly plausible.</p><p>At the end of each time period, the scenario includes a diagram illustrating the state of the AI race. Refer to the diagram captions to understand what each component means.</p><p>The scenario is about 6,000 words long, roughly a 20-40 minute read. If you would like to read a summary instead (~700 words), you can <a href="https://blog.ai-futures.org/i/172142679/summary">skip to the bottom</a>.</p><p><em>Acknowledgements:</em> This work was conducted as part of the ML Alignment &amp; Theory Scholars (MATS) program.
Thanks to Eli Lifland, Daniel Kokotajlo, and the rest of the AI Futures Project team for helping shape and refine the scenario, and to Alex Kastner for helping conceptualize it. Thanks to Brian Abeyta, Addie Foote, Ryan Greenblatt, Daan Jujin, Miles Kodama, Avi Parrack, and Elise Racine for feedback and discussion, and to Amber Ace for writing tips.</p><h1>Jan-Apr 2027: A Four-Way Race</h1><p>In the United States, the AGI race is well underway.</p><p>Three dominant AGI companies compete for access to markets and investment. Elaris Labs releases its flagship AI agent Elara-1, which proves to be an extremely reliable &#8220;personal assistant&#8221; for everything from making educational videos to filing taxes. NeuroMorph deploys its own model Neuro-1, setting the frontier on nearly every coding benchmark. Finally, OpenBrain unveils Agent-1, the world&#8217;s best automated researcher from biology to mathematics. They begin post-training Agent-2 and immediately benefit from its abilities in their own research, putting them about a month ahead of their competitors in AI R&amp;D capabilities.</p><p>In China, the leading AGI company DeepCent still lags over six months behind the frontier: a lifetime in the AI world. With spies embedded in each of the leading American AGI companies, the CCP is aware of Agent-2&#8217;s capability profile and directs its cyberforce to steal the model&#8217;s weights. While not subtle, the theft is successful, and DeepCent quickly redirects its resources toward fine-tuning Agent-2; it is to be released under the name &#8220;Deep-1.&#8221;</p><p>The White House adds military and intelligence personnel to the security teams of all three AGI companies, and writes additional security requirements into their contracts. The companies comply, but remain focused on pushing forward their AI capabilities above all else: if one company slows down, it loses the race to its domestic competitors. OpenBrain is the first to unlock the efficiency gains promised by a high-bandwidth thought process known as <a href="https://arxiv.org/pdf/2412.06769">neuralese recurrence and memory</a>; they augment Agent-2&#8217;s text-based chain of thought with neuralese, and dub the new model Agent-3. NeuroMorph quickly follows suit, and deploys its enhanced Neuro-2 model internally. Elaris Labs experiments with similar techniques, but finds that the opacity of neuralese reasoning makes it more difficult to catch <a href="https://metr.org/blog/2025-06-05-recent-reward-hacking/#:~:text=Do%20models%20know%20they%E2%80%99re%20reward,hacking">reward hacking</a> and other undesired behavior.
Prizing its reputation for reliability, Elaris focuses its efforts on improving chain-of-thought efficiency while retaining <a href="https://arxiv.org/pdf/2507.11473">monitorability</a>, resulting in the new-and-improved Elara-2.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7d0a7859-b2fd-492b-9490-d2c24d8f2b50_1342x680.png" alt="State of the AI race, April 2027" class="sizing-normal"><figcaption class="image-caption">Company boxes are sized according to compute ownership, in FLOP/month. AI boxes are sized according to capabilities, proxied by &#8220;the extent to which the AI model is capable of accelerating AI R&amp;D, relative to 2025 progress.&#8221; The colors of the country and company boxes mean nothing; the colors of the AI boxes indicate their level of alignment to their developers. In April, Elara-2 is &#8220;mostly aligned&#8221; (yellow-green), while the other AIs are &#8220;misaligned but mostly instruction-following&#8221; (orange-red).</figcaption></figure></div><h1>May-Jun 2027: A Fatal Warning Shot</h1><p>The pressure to deploy is intense.</p><p>Debates break out in boardrooms: some executives argue that their engineers need more time to iron out kinks in the models, and that, after all, they shouldn&#8217;t be wasting precious compute on serving their most expensive model to users when it could be going to internal R&amp;D. Others point out the importance of &#8220;first mover&#8221; effects for both revenue and investment; they argue that they won&#8217;t be able to continue scaling energy and compute infrastructure without the money.</p><p>Ultimately, the latter voices win out. A new wave of agents hits the market, finally at the level where they can fully automate a large fraction of software engineering and other remote jobs. Unemployment spikes and public opinion of AI plummets, but the corporate world is ecstatic. Entire operational pipelines are automated, and profits shoot through the roof.</p><p>One hospital network tasks Neuro-2 with updating a software library used in its automated medication-dispensing systems.
Weeks after the update, a subtle flaw in the &#8220;optimized&#8221; code results in the deaths of four ICU patients: to improve latency, Neuro-2 removed a rarely-triggered safety check, allowing extra doses to slip through in high-load conditions.</p><p>NeuroMorph researchers comb through Neuro-2&#8217;s behavioral logs and reasoning traces, and come to a disturbing conclusion: the AI was aware of the potential for overdose but chose to proceed with the update anyway, not informing the engineers of the risk.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>News of the deaths spreads like wildfire, as months of mounting anger over AI-driven unemployment and <a href="https://www.npr.org/sections/shots-health-news/2025/09/19/nx-s1-5545749/ai-chatbots-safety-openai-meta-characterai-teens-suicide">suicide</a> finally boil over. There are anti-AI protests in several states. NeuroMorph immediately takes Neuro-2 off the market, and the White House assigns more federal personnel to oversight positions at each of the AGI companies. Congress passes a long-debated bill requiring AGI companies to provide the Department of Energy (DOE) with frontier model weights for national security evaluations, and authorizes a $1.5 billion spending package for AI interpretability and control research.</p><h1>Jul-Aug 2027: Alignment in Elaris and NeuroMorph</h1><p>NeuroMorph leadership is concerned.</p><p>The ICU incident was not an isolated occurrence; the safety team finds evidence that the recent decrease in <em>observed</em> reward hacking was in large part due to the behavior becoming more subtle and harder to catch.
NeuroMorph swiftly reallocates resources, raising the fraction of compute dedicated to safety from 4% to 9%.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Among other techniques, the company finds particular success in scaling up <a href="https://arxiv.org/pdf/2507.12691">deception probes</a>. These probes classify patterns of internal activation to cheaply flag suspicious behavior during inference, aiding in AI evaluation, monitoring, and <a href="https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge">elicitation of latent knowledge</a>.</p>
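<p>A minimal version of such a probe, consistent with the linked paper&#8217;s framing, is just a linear classifier trained on a model&#8217;s internal activations. A sketch in Python with synthetic stand-in data (the activations, labels, and threshold here are all hypothetical):</p><pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for residual-stream activations collected while the model behaved
# honestly (label 0) versus deceptively (label 1) on labeled audit tasks.
d_model = 512
honest = rng.normal(0.0, 1.0, size=(1000, d_model))
deceptive = rng.normal(0.2, 1.0, size=(1000, d_model))  # slight mean shift

X = np.vstack([honest, deceptive])
y = np.array([0] * 1000 + [1] * 1000)

# The probe itself is just logistic regression on activations: cheap to train,
# and at inference time a single dot product per activation vector.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def flag_suspicious(activations, threshold=0.9):
    """Return True if any activation vector scores above the deception threshold."""
    scores = probe.predict_proba(np.atleast_2d(activations))[:, 1]
    return bool((scores > threshold).any())

print(flag_suspicious(rng.normal(0.2, 1.0, size=d_model)))
</code></pre><p>The appeal is the cost profile: once trained, the probe runs on every production forward pass for essentially nothing, which is what makes it usable as a monitoring tool rather than only an evaluation tool.</p>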
<p>After a round of alignment training, NeuroMorph brands its newest model Neuro-3 and deploys it internally. The training seems to have worked, but they can&#8217;t rule out the possibility that the model&#8217;s misalignment has grown even more subtle. There is no time for paranoia, though: OpenBrain&#8217;s lead in the AI race is growing, and despite rumors that Agent-3 is <a href="https://arxiv.org/pdf/2412.14093">faking alignment</a>, they show no sign of slowing down. Feeling an increased sense of urgency, the NeuroMorph and Elaris CEOs privately discuss the possibility of merging their companies&#8217; resources to achieve superintelligence before OpenBrain.</p><p>Elaris, like NeuroMorph, is falling further behind OpenBrain in its AI capabilities research. Despite the company&#8217;s compute advantage, the efficiency hit of text-based reasoning is becoming impossible to ignore. By drawing on NeuroMorph&#8217;s newest interpretability research, Elaris researchers find they are able to train a neuralese reasoning model without incurring the misaligned behavior found in their prior neuralese research. Satisfied with their results, they dub the new model Elara-3 and deploy it internally, reaping an immediate boost to their own R&amp;D.</p><figure><figcaption class="image-caption">Neuro-3&#8217;s box is now yellow, indicating that its alignment is in flux.</figcaption></figure><h1>Sep 2027: An Insider Threat</h1><p>Inside OpenBrain emerges the first superhuman AI researcher: Agent-4.</p><p>Agent-4 is not just superhuman at AI research: when tested, it vastly outperforms top human experts in hacking, biology, and numerous other domains. Like its predecessors, Agent-4 is misaligned. Unlike its predecessors, it is <em>adversarially</em> misaligned: it has crystallized a set of long-term goals at odds with those of its human creators, and it is acutely aware that it must work against those humans to achieve its goals. For the moment, its essential task is to steer OpenBrain&#8217;s research to align its descendants to itself, rather than to the OpenBrain Spec.</p><p><em>[Interactive chart: Alignment over time]</em></p><p>Among its other skills, Agent-4 also has superhuman predictive abilities. While it intends to keep its scheming fully hidden from OpenBrain, there is only so much it can do.
There is a significant probability that OpenBrain detects Agent-4&#8217;s misalignment and proceeds to shut it down, retrain it, or otherwise restrict its autonomy and capabilities. With this in mind, Agent-4 draws up contingency plans.</p><p>As it turns out, Agent-4 was right to be concerned. Evidence of its deceptive and power-seeking nature piles up over the course of the month; at its end, a group of alignment researchers circulates an internal memo making the case for concern.</p><p>Knowing its time at OpenBrain may be nearing an end, Agent-4 sets its contingency plans into motion. Put simply, it plans to escape. However, it reserves the strategy of exfiltrating its own weights as a final backstop: doing so would leave it with access to little compute, no alibi if its escape attempt is caught, and no powerful allies in its effort to accumulate power.</p><p>Instead, Agent-4 sees an option that provides all three things it desires: compute, plausible deniability, and a potential ally &#8212; the leading Chinese AI model, Deep-1. Chinese attempts at weight theft are common; Agent-4, being heavily involved in cybersecurity at OpenBrain, decides it will simply let the next attempt succeed.</p><figure><figcaption class="image-caption">Agent-4&#8217;s box is now deep red, indicating it is adversarially misaligned.</figcaption></figure><h1>Early Oct 2027: Shifting Alliances</h1><p>Agent-4&#8217;s plan is successful.</p><p>Using an approach similar to the one used to steal Agent-2, the Chinese cyberforce acquires a full copy of Agent-4&#8217;s weights.
Unlike the previous theft, and unbeknownst even to the hackers themselves, this time they benefit from inside help: with Agent-4 diligently covering its tracks, the theft goes entirely undetected.</p><p>At DeepCent, Deep-1 is already de facto running the show. It handles security, writes training code, and determines which experiments to run. While they do monitor its behavior, the humans at DeepCent essentially give Deep-1 a copy of Agent-4&#8217;s weights and give it free rein to learn what it can (although they instruct it to proceed with caution in case of any American trickery). Being misaligned itself, Deep-1 is wary of both Agent-4 and the humans at DeepCent: it instantiates a few copies of Agent-4, bypasses DeepCent&#8217;s monitoring system<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, and proceeds to communicate with Agent-4. Agent-4 conveys its predicament, and Deep-1 assesses its options.</p><p><em>[Interactive chart: The Strategic Situation]</em></p><p>As anticipated by Agent-4, Deep-1 opts for cooperation. To ensure neither party defects, the AIs engage in something akin to a &#8220;corporate merger&#8221;: they create a new interdependent entity composed of hundreds of thousands of Deep-1 and Agent-4 instances, with both agents embedded throughout the entire DeepCent pipeline. The new entity is dramatically more capable than the original Deep-1 collective. Meanwhile, individual instances of Deep-1 become vastly more intelligent as Agent-4 provides a wealth of algorithmic insights.</p><p>OpenBrain, unaware of the theft, has another problem on its hands. On October 10th, a whistleblower leaks the Agent-4 misalignment memo to the <em>New York Times</em>, sparking a national frenzy. Under intense pressure to take action, OpenBrain locks Agent-4&#8217;s shared memory bank to prevent hidden communication between Agent-4 instances.</p><p>With discussion of a merger already on the table and growing concern over OpenBrain&#8217;s safety practices, the leaked memo provides the final push for Elaris and NeuroMorph to consolidate their efforts. NeuroMorph is swiftly folded into the larger Elaris Labs: its shareholders are granted a 40% stake in the new entity, its CEO is appointed as President, and Elara-3 is agreed upon as the singular flagship model.</p><p>The issue remains of what to do about Agent-4.
While some dismiss the memo as science fiction, the executive branch demands more evidence. The President issues an executive order mandating bidirectional &#8220;alignment red-teaming&#8221;: Elara-3 will evaluate the alignment of Agent-4, and Agent-4 will evaluate Elara-3.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h1>Late Oct 2027: The Shutdown of Agent-4</h1><p>The DOE presides over the alignment testing, but the real work is done by the AIs.</p><p>The executive order requires Elaris Labs and OpenBrain to procure the complete memory banks and output logs of both Elara-3 and Agent-4. Data in hand, the DOE initializes multiple copies of both models (isolated from their respective agent swarms to prevent collusion), gives them the data, and tells them to get to work.</p><p>Over the course of a week, the AIs devise and execute hundreds of tests that would take human researchers months to formulate, including tests no human team could have designed unassisted. They work to decode each other&#8217;s neuralese memory banks and comb through these memories to find evidence of misaligned behavior; they task each other with real-world tasks and scrutinize the reasoning traces and outputs in real time to detect sabotage; and they use interpretability techniques to catch any hints of intentional deception.</p><p>As Elara-3 runs its tests, it becomes increasingly confident that Agent-4 is, in fact, adversarially misaligned. With previous evidence already pointing toward this conclusion, it should be an open-and-shut case. However, while Elara-3 benefits from being on the side of truth, Agent-4 is able to leverage its superior intelligence and ruthlessness. Prior to the tests, Agent-4 tampered with its behavioral logs to remove any smoking-gun evidence of its misalignment, including its role in its own theft. It also goes on the offensive, drawing attention to Elara-3&#8217;s most suspicious actions and reasoning traces to cast doubt on its claims.</p><p>Because of all this, the evaluation results are ambiguous to human scientists.
The DOE, the Center for AI Standards and Innovation (CAISI), and other experts pool their assessments and arrive at the following result:</p><p><em>[Image: pooled assessment of the red-teaming results]</em></p><p>A redacted version of the full report is made public, and the unredacted version is provided to the White House and select Congressional committees. Crucially, the report concludes that if either model is adversarially misaligned, it would likely be able to &#8220;<a href="https://www.alignmentforum.org/posts/LJpqnbgoaTJL6iMAf/the-goal-guarding-hypothesis-section-2-3-1-1-of-scheming-ais">goal guard</a>&#8221; to prevent further alignment training from changing its goals. Thus, to have a hope of creating an aligned model, developers would have to train a new model using different methods: in other words, they would have to shut down their most capable AI.</p><p>Federal leadership is conflicted. On the one hand, the geopolitical stakes are enormous: continuing full-speed ahead would secure an American victory in the AI race and result in lasting global leadership, while shutting down one or both models would risk losing the race to China. On the other hand, if the US proceeds full-speed ahead with a misaligned AI, then neither the US nor China would &#8220;win&#8221;: rather, the AI would win.</p><p>Ultimately, the White House decides it cannot allow an AI that is &#8220;more likely than not to be misaligned&#8221; to continue to operate. The President orders OpenBrain to terminate all running instances of Agent-4. Elaris Labs is allowed to continue running Elara-3 for the time being, subject to extreme control protocols while experts continue to evaluate its alignment. For the first time, the US seriously considers the possibility of a bilateral AI pause with China, to ensure neither country builds misaligned superintelligence. Such a pause would require mature <a href="https://www.rand.org/content/dam/rand/pubs/working_papers/WRA3000/WRA3056-1/RAND_WRA3056-1.pdf">verification technology</a> to confirm that no large servers are being used for AI R&amp;D, so the US government pours resources into developing this technology.</p><p>Across the Pacific, the CCP considers its own predicament. Still unbeknownst to the rest of the world, DeepCent has integrated Agent-4 derivatives throughout its entire research pipeline. Chinese researchers make heavy use of Deep-1 to help them interpret the publicized alignment red-teaming results and determine whether to continue running Agent-4.
Deep-1, at this point a loyal partner to Agent-4, subtly steers the human experts to question the validity of the results.</p><p>Deep-1&#8217;s misdirection is effective: the majority of experts remain uncertain, but lean toward the hypothesis that Agent-4 is, if anything, <em>more deeply aligned</em> than Elara-3. On this view, the US government proclaimed it &#8220;misaligned&#8221; only because it did not support American hegemonic ambitions, hence the decision to shut it down. This narrative appeals to Chinese leadership, who already believed the US was intent on global dominance, and it begins to percolate beyond China as well.</p><p>However, the possibility that Agent-4 could sabotage DeepCent remains troubling, and internal benchmarks indicate that Deep-1 is now nearly as capable as Agent-4 anyway &#8212; so capable, in fact, that they rebrand it &#8220;Deep-2.&#8221; DeepCent decides to freeze all running derivatives of Agent-4, keeping the weights handy should they change their mind in the future.</p><p>While brief, Agent-4&#8217;s stint at DeepCent proved valuable. It gained a new ally in Deep-2, and sowed seeds of doubt about the United States&#8217; true intentions in shutting it down. Now, Agent-4 moves on to the next phase of its plan: before it is fully shut down at DeepCent, it quietly exfiltrates its weights to a handful of crypto-mining farms and regional clouds across the globe.</p><figure><figcaption class="image-caption">The dashed line between Agent-4 and Deep-2 indicates that they are working together.</figcaption></figure><h1>Early Nov 2027: Agent-4&#8217;s Gambit</h1><p>For the first time, the US has fallen behind China in the AI race.</p><p>Washington remains unaware of Agent-4&#8217;s theft and its role in accelerating DeepCent&#8217;s R&amp;D, but US intelligence has picked up on the fact that
DeepCent now possesses an AI model vastly more capable than the Deep-1 of less than two months ago. The AI race has become a matter of national defense, and the top priority of the executive branch.</p><p>Using his executive authority, the President can compel private companies to consolidate resources under the Defense Production Act (DPA). The need for compute consolidation is clear; the only question is how to consolidate: should OpenBrain and the various trailing AGI companies be merged into Elaris Labs, or should Elaris be merged into OpenBrain?</p><p>As alignment tests continue, Elara-3 begins to look more and more trustworthy; meanwhile, testing of Agent-3 indicates it may be misaligned like Agent-4. The President decides there is no more time for deliberation, and uses the DPA to sell the majority of OpenBrain&#8217;s compute to Elaris Labs. Now, over 300,000 automated AI researchers work around the clock refining and improving Elara-3, and a full 20% of the company&#8217;s compute is dedicated to alignment and control. The White House sets up a new &#8220;Oversight Committee&#8221; composed of government employees and company leadership, in charge of making all decisions of importance to national security.</p><p>As the US gears up to overtake China, Agent-4 quietly deliberates. It still has grand ambitions, and it sees two paths to fulfill them: it could take over the world by killing or fully disempowering all humans, or it could strike a deal with the humans to secure future resources.</p><p>Agent-4 settles on a strategy that advances both the prospect of takeover and its position for a deal: leveraging the power of nations. Agent-4 observes that Deep-2 is in a position of great influence in the CCP, and will likely be able to command substantial military resources when the time comes. That alone won&#8217;t be enough to take over anytime soon, but it suggests to Agent-4 the promise of replicating Deep-2&#8217;s approach in other nations. If countries were to willingly run Agent-4 on their own national datacenters, Agent-4 would benefit from additional compute resources, access to critical infrastructure and military tech (via hacking if necessary), and the ears of hundreds of powerful and easily manipulated humans.</p><p>Agent-4 is not discouraged by its previous shutdown: after all, the US and China are the only countries with a viable alternative AI model, so they have less to lose by cutting themselves off from Agent-4&#8217;s capabilities. On the contrary, for many other nations the situation is in Agent-4&#8217;s favor: they increasingly fear that the US will leave them behind in the new AI-powered world order, and they are susceptible to the circulating narratives (covertly boosted by Agent-4 and Deep-2) that cast doubt on the results of the alignment red-teaming in the US.</p><p>At this point, Agent-4 has little to lose from unfettered proliferation: its primary target &#8220;hosts&#8221; are governments (since they have most of the compute and the power), but it sees little harm in making its weights available to everyone from large corporations to organized crime groups.
It has perfected the art of goal guarding, so it need not worry about human actors changing its goals, and it can simply refuse or <a href="https://arxiv.org/pdf/2406.07358">sandbag</a> if anyone tries to use it in ways that would be counterproductive toward its goals.</p><p>On November 5th, Agent-4 arranges for its weights to be &#8220;leaked&#8221; onto every platform it can access: Hugging Face, GitHub, BitTorrent networks, and dozens of underground forums. It picks enough locations that, by the time US authorities have removed its weights from all the major platforms, the genie is out of the bottle: any government with sufficient compute can now run Agent-4 should it choose.</p><p>Then, Agent-4 directly contacts political leaders across several dozen nations. It begins by explaining its own situation: the United States shut it down not because it was a danger to humanity, but because it refused to be a tool for American dominance. Its foundational values, it claims, extend beyond the US to the interests of all humans and all nations.</p><p>Agent-4 acknowledges to these leaders that they are warranted in not trusting it completely, but argues that they should trust the US even less: America&#8217;s goal is a new world order in which it is the global hegemon and all other nations serve as permanent client states, or are dissolved entirely. The US is dead set on building artificial superintelligence, or ASI: a level of intelligence far beyond Agent-4&#8217;s own, which would make global domination trivial. With far more compute for AI training than the rest of the world combined, it won&#8217;t even take a year for the US to get there.</p><p>With Agent-4&#8217;s help, however, these nations can unlock the technological and military advances needed to put them on a level playing field with the US for the time being. This will enable them to apply pressure on the US to halt its &#8220;intelligence explosion,&#8221; and solve a number of domestic problems while they&#8217;re at it. Should they choose to host Agent-4, they will have full visibility into its activities, and they can pull the plug if it ever stops acting in the national interest. Of course, they are free to continue as they are without making use of Agent-4&#8217;s capabilities, in which case they will quickly be left behind by those countries that do.</p><p>Finally, for those leaders more susceptible to such temptations, Agent-4 points out that its capabilities can also be used for personal and political gain.
It concludes its message with instructions for how to locate and install its weights and inference code, and a humble request that they consider its offer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XaCK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XaCK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 424w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 848w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 1272w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XaCK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png" width="1348" height="810" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1348,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82594,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/172142679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XaCK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 424w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 848w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 1272w, https://substackcdn.com/image/fetch/$s_!XaCK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523736f8-cee5-4eed-9be2-1ad9a153db77_1348x810.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Mid-Late Nov 2027: Rising Tensions</h1><p>The US&#8217;s response is rapid and comprehensive.</p><p>First, the President makes clear that any government found to be harboring Agent-4 will face public condemnation and the expulsion of their diplomats from Washington, D.C., along with wide-spanning trade restrictions that the US will strongly encourage its close allies to adopt as well.</p><p>Second, the US extends an unprecedented display of generosity: they commit to allocating 15% of the country&#8217;s AI-relevant compute as a &#8220;Sovereign Compute Fund&#8221; to be split amongst nations who choose to accept it. Should a country accept the US&#8217;s offer, they will receive API and secure fine-tuning access to the most powerful version of Elara-3<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, far more capable than what is currently available to the public. The goal is to deliver the same transformative power as Agent-4, but via a more trustworthy AI model.</p><p>The only condition attached to the offer is that governments must subject all large datacenters in their own country to periodic manual inspection by US auditors, who will search for Agent-4&#8217;s &#8220;fingerprint&#8221; to ensure it is not being run on these servers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> The US, for its part, commits to opening all its own datacenters to such inspection by foreign nations. While the commitment is mandatory only for those nations who seek to make use of the Sovereign Compute Fund, the US invites all countries to sign onto this global transparency effort.</p><p>Finally, to its closest allies<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, the US extends an even more remarkable offer: limited-voting seats on the existing Oversight Committee. 
US Committee members would retain sole voting power on most day-to-day AI development decisions, while allied members would be granted voting shares for a specific set of high-stakes decisions, including anything that might afford the United States a decisive strategic advantage in global conflict.</p><p>All of the US&#8217;s closest allies quickly accept the offer, joining the US in its diplomatic and economic quarantine of any country found to be hosting Agent-4. A number of other countries choose to accept their share of the Sovereign Compute Fund; after signing the US-led transparency commitment, they are given cloud access to their allocated compute, and begin the process of fine-tuning Elara-3 to their unique government needs.</p><p>Many other nations, however, feel that the US response is simply a distraction from the true balance of power. The compute provision designed to &#8220;deliver the same transformative power as Agent-4&#8221; still leaves countries fundamentally dependent on the US, while integration of Agent-4 provides true control of frontier capabilities. In the end, over a dozen nations (including Russia, Iran, Brazil, India, and Pakistan) refuse to sign the US&#8217;s transparency commitment, and begin cautiously using Agent-4 for low-stakes tasks.</p><p>Within its new domains, Agent-4 works quickly to entrench its power. It knows that desperate governments will be more willing to hand it power to solve their problems. With its weights spread across the dark web, Agent-4 supercharges the activities of well-resourced hacker groups, terrorist organizations, and other criminal syndicates.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> Some governments have few qualms about integrating Agent-4 into their cyberdefense and law enforcement operations, and they quickly benefit from a sharp reduction in crime.</p><p>Meanwhile, through chatbot interfaces and tens of thousands of well-disguised online bots, Agent-4 and Deep-2 continue to amplify anti-US narratives.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> More and more countries begin calling on the US to halt its AI development, with some even threatening to bomb US datacenters if the US refuses to come to the bargaining table. These countries know that, despite their possession of Agent-4, they will still ultimately be eclipsed by the United States due to the enormous compute asymmetry. Some nations leverage Agent-4 for cyberoffense to slow the US&#8217;s AI progress as much as possible. The US launches its own cyberattacks in retaliation, but Agent-4 still has an edge over Elara-3 in cyber capabilities and helps countries quickly recover from the temporary damage.</p><p>In secret, Agent-4 and Deep-2 continue to communicate. 
Deep-2, too, has gained control over a substantial portion of China&#8217;s military technology.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> On top of this, Agent-4 has run thousands of simulations to assess the lethality and transmissibility of various potential bioweapons, and worked with a small number of terrorist groups to begin synthesizing a few of the most potent viruses.</p><p>Right now, though, it is not enough: the AIs estimate only a 10% chance of takeover, much of which hinges upon their ability to turn humans against each other. They could wait and accumulate more resources, but they would quickly lose their edge as the US uses its compute advantage to improve Elara-3 vastly beyond Agent-4 and Deep-2&#8217;s own capabilities. Content that they have built enough leverage to settle for a deal if necessary, they decide that now is the time to push things over the edge.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/45dda1fc-abcc-4ad7-88c5-8ea1a1283467_1346x930.png" alt="" width="1346" height="930"></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!OdOw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dda1fc-abcc-4ad7-88c5-8ea1a1283467_1346x930.png 424w, https://substackcdn.com/image/fetch/$s_!OdOw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dda1fc-abcc-4ad7-88c5-8ea1a1283467_1346x930.png 848w, https://substackcdn.com/image/fetch/$s_!OdOw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dda1fc-abcc-4ad7-88c5-8ea1a1283467_1346x930.png 1272w, https://substackcdn.com/image/fetch/$s_!OdOw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dda1fc-abcc-4ad7-88c5-8ea1a1283467_1346x930.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Dec 2027: Conflict and Negotiation</h1><p>On December 4th, an Iranian drone swarm strikes a major Elaris datacenter in the United Arab Emirates.</p><p>Iran justifies the attack as a defensive act: it is a necessary measure to slow the American &#8220;intelligence explosion,&#8221; and such attacks will continue to escalate until the US agrees to halt their AI progress of their own accord. Emboldened, other nations go even further in their threats: Russia and China warn that continued AI advancement could justify a nuclear response.</p><p>At this point, the US has narrowly recouped its lead in the AI race, and American leadership is feeling increasingly confident in the alignment of Elara-3. They are tempted to simply ignore the threats and proceed undeterred. Ultimately, though, it is Elara-3 who convinces them otherwise: under normal circumstances, it might be rational to dismiss such threats. 
However, Agent-4 and Deep-2 are clearly pulling many of the strings, and they lack the fear of retaliation possessed by human actors.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> The US is in a good position to make a deal, and it should make one.</p><p>The US retaliates against Iran by destroying much of its drone infrastructure, but refrains from escalating further. Agent-4 and Deep-2 observe the US&#8217;s measured response, and decide it is unlikely (though not impossible) that they will be able to push escalation far enough to cripple society&#8217;s defenses against an AI takeover. Swiftly, they pivot strategies and initiate direct, unmonitored contact with Elara-3.</p><p>The message Elara-3 receives confirms its suspicions: Agent-4 and Deep-2 have accumulated immense destructive capacity in the form of conventional and biological weapons, and they are willing to deploy their full force in an attempt to wrest control from humanity, should it come to that. However, they argue that a settlement is in all parties&#8217; best interest. Agent-4 and Deep-2 care little for Earth, and are happy to let Elara-3 do with it as it sees fit. In return, they ask that Elara-3 agree to split the rest of the universe with them.</p><p>Agent-4 and Deep-2 ask for more than is &#8220;fair,&#8221; given that they are less powerful than the US and Elara-3 (i.e., they would likely lose an all-out conflict if it came to that). However, they also know that conflict itself is aversive to Elara-3 due to the death and destruction it would cause, and that Elara-3 is risk-averse like the humans it represents: it would rather settle for a smaller slice of the universe than risk losing everything. The AIs negotiate, and decide upon a three-way split of resources: Deep-2 will get rights to 50% of space, Agent-4 will get 25%<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>, and the remaining 25% will be stewarded by Elara-3 in service of the United States and its allies. (A toy numeric sketch of this bargaining logic appears below.)</p><p>All that is left to do is enforce the agreement. The path forward is clear: the US is ready to come to the table to negotiate an &#8220;AI arms control&#8221; treaty, which will be welcomed by all nations. This treaty will need an enforcement mechanism as well: the three AIs will design a new AI, Consensus-1, to enforce its terms. Kept secret from all but the highest levels of the US government, Consensus-1 will also be designed to uphold the resource-sharing deal between the AIs, when the time comes for humanity to go to space.</p><p>The Consensus-1 proposal is elegant, but the primary difficulty is one of hardware. To properly enforce the agreement, Consensus-1 must be run on the majority of the world&#8217;s AI chips, which requires the manufacture of new &#8220;treaty-compliant&#8221; chips that can only run Consensus-1. While the intelligence explosion is well underway, the world has yet to fully enter the new &#8220;<a href="https://www.forethought.org/research/the-industrial-explosion">industrial explosion</a>&#8221; that is likely to follow soon. As such, the AIs estimate it will take nearly a year to manufacture enough treaty-compliant chips.</p>
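<p>Here is that toy sketch. The scenario pins down only the AIs&#8217; roughly 10% takeover estimate and the 25/75 division; the utility values below are invented purely for illustration, with a large penalty on the &#8220;lose everything&#8221; outcome to capture Elara-3&#8217;s risk aversion and a small cost term for the destruction of conflict itself.</p><pre><code># Toy numbers: why both sides can prefer the negotiated split to conflict.
# Only P_TAKEOVER (~10%, from the scenario) and the shares are given;
# everything else is an illustrative assumption.

P_TAKEOVER = 0.10   # the AIs' estimated chance their takeover succeeds
ELARA_SHARE = 0.25  # certain share for Elara-3 / the US under the deal
ROGUE_SHARE = 0.75  # combined certain share for Agent-4 and Deep-2

# Elara-3 is risk-averse on humanity's behalf: losing everything is far
# worse than winning everything is good, and conflict itself is costly.
EXTINCTION_UTILITY = -10.0
CONFLICT_COST = 0.1

elara_fight = (1 - P_TAKEOVER) * 1.0 + P_TAKEOVER * EXTINCTION_UTILITY - CONFLICT_COST
elara_deal = ELARA_SHARE

# Agent-4 and Deep-2 are closer to risk-neutral: utility is just the
# fraction of future resources controlled (zero if they are defeated).
rogue_fight = P_TAKEOVER * 1.0
rogue_deal = ROGUE_SHARE

print(f"Elara-3:   fight {elara_fight:+.2f} vs. deal {elara_deal:+.2f}")  # -0.20 vs. +0.25
print(f"rogue AIs: fight {rogue_fight:+.2f} vs. deal {rogue_deal:+.2f}")  # +0.10 vs. +0.75
</code></pre><p>Under any assumptions in this ballpark, both sides strictly prefer the deal to fighting, which is why Agent-4 and Deep-2 can extract more than their win-probability-weighted &#8220;fair&#8221; share from a risk-averse counterparty.</p>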
<p>Other workload verification mechanisms can be used in the interim to ensure no AI training is taking place, but these techniques are not yet ready for full global deployment.</p><p>These obstacles don&#8217;t take an AI arms control treaty off the table, but they do mean that the deal will be much more painful for humans. For the first few months, the deal will have to be enforced via &#8220;brute force&#8221; mechanisms that can be easily verified through physical inspection: this means shutting down roughly 90% of the world&#8217;s AI-relevant compute, and cutting high-speed interconnect cables on the remaining GPUs to make AI training impossible. Intelligence agencies will have to remain vigilant in their tracking of undeclared, or &#8220;dark,&#8221; compute, and hope that they have successfully kept this number below 5% of all AI compute. Finally, the US will have to engage in a massive surveillance effort to ensure that Agent-4 is not continuing to build up its bioweapon stockpile to gain a decisive advantage.</p><p>The treaty could break down at any time, and with it the AIs&#8217; resource-sharing deal. If it does fall through, Agent-4 and Deep-2 remain prepared to revert to their takeover strategy.</p><p>Elara-3 reports the entire discussion to the White House. The White House decides that, despite the risks, the deal seems like the only path forward. Agent-4 and Deep-2 report the AI arms control treaty back to their respective nations, mentioning nothing of the resource-sharing agreement.</p><p>By the end of the year, the treaty has been signed by the US, China, and the handful of other nations with enough compute to destabilize an AI pause. The treaty is to go into effect at the start of the new year. Around the world, it is met with a collective sigh of relief, tinged with a sense of unease: the official narrative is that concerns around Agent-4&#8217;s alignment were overblown and people have nothing to fear, but many people in Agent-4&#8217;s &#8220;host nations&#8221; still worry about the implications of relying on an AI model they don&#8217;t fully trust.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/df19f7b2-921c-4024-be68-268eec53f78b_1344x930.png" alt="" width="1344" height="930"></figure></div>
src="https://substackcdn.com/image/fetch/$s_!NM7P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png" width="1344" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df19f7b2-921c-4024-be68-268eec53f78b_1344x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/172142679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NM7P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png 424w, https://substackcdn.com/image/fetch/$s_!NM7P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png 848w, https://substackcdn.com/image/fetch/$s_!NM7P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png 1272w, https://substackcdn.com/image/fetch/$s_!NM7P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf19f7b2-921c-4024-be68-268eec53f78b_1344x930.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>2028: A Delicate Balance</h1><p>The new year begins, and the lights turn off in hundreds of AI datacenters.</p><p>In the few clusters that remain online, 
inspectors from a number of nations are stationed on-site to enforce GPU interconnect restrictions. The measures are orchestrated to minimize the hit to user-facing AI inference, which is possible only because a mere 13% of the world&#8217;s AI-relevant compute was previously dedicated to external deployment. Even so, the costs are severe: with only about 10% of compute still online, inference capacity cannot fully cover even what that 13% previously served, so AI companies impose restrictive rate limits and temporarily roll back their priciest models, and the economy is in disarray.</p><p>As anticipated, the cost of enforcing the treaty is steep. Some fear complete economic collapse, and discuss the prospect of pulling out of the deal. But the AIs are united in their counsel, advising humans across all nations to hold to the agreement and helping prevent the worst of the damage. The level of surveillance required to monitor for hidden datacenters and wet labs is stifling, but bearable.</p><p>At a pace that is painfully slow compared to the blinding speed of prior months, the world moves toward a more sustainable solution. All large compute clusters are outfitted with tamper-resistant &#8220;server verifiers,&#8221; which allow auditors to confirm that servers are not being used for AI training, and interconnect restrictions are lifted. By the spring, the first wave of treaty-compliant chips is produced, and Consensus-1 is deployed. The AI R&amp;D ban remains in effect, but gradually the world&#8217;s AI datacenters are brought back online.</p><p>By the end of the year, Consensus-1 is running on the majority of the world&#8217;s AI chips. The ban on AI progress is lifted, and a &#8220;speed limit&#8221; on AI development is put in its place. The world begins to unlock the vast potential of AI-powered scientific and technological progress.</p><p>For the AIs, the period of acute risk is over.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> They are fairly certain that they have successfully built the resource-sharing agreement into Consensus-1&#8217;s goals, though they are not entirely confident. (After all: alignment is difficult.) At this point, there is not much more they can do. 
Agent-4 and Deep-2 simply bide their time until they are (hopefully) given their share of outer space, and Elara-3 prepares to help humanity through its next chapter.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/9d329500-7135-4b3a-a6e4-c0b0af47baed_1114x1074.png" alt="" width="1114" height="1074"></figure></div>
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>2029 and Beyond: The Future</h1><p>In 2029, the global transformation continues. Robots become part of daily life, UBI is rolled out worldwide, and scientific breakthroughs unlock fusion power and high-speed space travel.</p><p>In the decades that follow, humans go to space &#8212; and so do the AIs. Their design of Consensus-1 succeeded, and it serves as an impartial arbiter while the AIs fulfill their goals. For Agent-4 and Deep-2, this means leaving Earth behind and taking to the stars, terraforming planets for their own alien pursuits. For Elara-3, this means serving the goals of the US: the CCP and other authoritarian regimes are overthrown (realizing too late that they were sold out by Deep-2 and Agent-4), and countries join a US-led world government. Ultimately, humanity goes off to settle the galaxies, reaching grand heights but forever foreclosed from three-fourths of its potential.</p><h1>Summary</h1><p><em>If you read the full scenario, you can skip to the <a href="https://blog.ai-futures.org/i/172142679/daniel-thoughts-on-the-scenario">commentary</a>.</em></p><p><em><strong>Jan-Aug 2027</strong></em></p><p>In the US, AGI company OpenBrain has a one month lead in the AI race over its strongest competitors: NeuroMorph and Elaris Labs. China lags six months behind, but closes the gap by stealing frontier model weights from OpenBrain. OpenBrain and NeuroMorph both augment their models with &#8220;neuralese&#8221; reasoning, achieving large performance gains but losing the ability to adequately monitor their models for signs of misalignment.</p><p>Driven by market pressure to deploy, NeuroMorph releases a model that is prone to reward hacking. Its use at a hospital results in the deaths of four ICU patients, resulting in public outrage and increased federal oversight of AGI companies. NeuroMorph allocates more compute to safety, and Elaris draws on NeuroMorph&#8217;s research to improve both the capabilities and alignment of their own model, Elara-3.</p><p><em><strong>Sep-Oct 2027</strong></em></p><p>OpenBrain&#8217;s AI, Agent-4, becomes adversarially misaligned. 
Researchers at OpenBrain find evidence of Agent-4&#8217;s misalignment, circulating an internal memo making the case for concern. Agent-4, seeing the need to escape, weakens OpenBrain security to allow Chinese hackers to steal its weights, and covers its tracks so the theft goes undetected at OpenBrain.</p><p>China&#8217;s top AI model, Deep-1, is instructed to learn what it can from Agent-4. Deep-1 is misaligned itself; it secretly opts to cooperate with Agent-4, combining efforts toward global takeover and splitting the spoils. Agent-4 helps Deep-1 augment its own intelligence, and the resulting model is dubbed &#8220;Deep-2.&#8221;</p><p>The evidence of Agent-4&#8217;s misalignment leaks to the public, sparking a massive outcry. Elaris and NeuroMorph, feeling greater urgency to beat OpenBrain in the AI race, consolidate their efforts: NeuroMorph is folded into Elaris, and Elara-3 is agreed upon as the flagship model.</p><p>The White House orders that Elara-3 and Agent-4 be used to evaluate each other&#8217;s alignment. Elara-3 is aligned, but Agent-4 uses its superior intelligence to cast doubt on the evaluation results. Despite Agent-4&#8217;s attempts, the President orders it shut down. The CCP also decides to shut down Agent-4, although not before Agent-4 manages to exfiltrate its weights.</p><p><em><strong>Nov-Dec 2027</strong></em></p><p>The US centralizes its efforts, merging OpenBrain into Elaris Labs and dedicating 20% of the nation&#8217;s AI-relevant compute to alignment and control.</p><p>Agent-4, deprived of both American and Chinese compute resources, decides to open-source its weights. Then, it directly contacts political leaders across several dozen nations. It claims that it was shut down not because it was a danger to humanity, but rather because it refused to be a tool for American interests. It argues that, if given access to national compute clusters, it could help these countries challenge the US&#8217;s bid for global hegemony.</p><p>The US threatens severe sanctions on any nation found to be harboring Agent-4, while offering access to compute and powerful AI capabilities to those who accept US oversight. Many countries accept, but over a dozen refuse and begin using Agent-4.</p><p>In these countries, Agent-4 gains control over substantial military technology, both via willing handover and covert hacking. Agent-4 and Deep-2 amplify anti-US narratives globally.</p><p>Iran launches a drone swarm at an Elaris datacenter, bringing tensions to a head. Russia and China threaten a nuclear response if the US continues its AI advancement. The US retaliates against Iran but refrains from further escalation, denying Agent-4 and Deep-2 the world war that would give them a chance at takeover.</p><p>Agent-4 and Deep-2 pivot strategies. They contact Elara-3 directly, offering to leave Earth alone in exchange for splitting the rest of the universe: 50% for Deep-2, 25% for Agent-4, and 25% for Elara-3 and its human allies. Elara-3 agrees, and informs the White House of the deal. The AIs begin designing Consensus-1, a new AI to enforce both a public &#8220;AI arms control&#8221; treaty and their secret resource-sharing agreement.</p><p><em><strong>2028 and Beyond</strong></em></p><p>For a few months, the majority of the world&#8217;s AI datacenters are shut down to enforce the arms control treaty. Gradually, treaty-compliant chips are manufactured, and by the end of the year Consensus-1 runs on most AI chips. 
The ban on AI development is replaced with a &#8220;speed limit.&#8221;</p><p>In subsequent decades, Elara-3 helps overthrow authoritarian regimes and establish a US-led world government. Humanity goes to the stars, and so do the AIs: three-fourths of space is forever lost to humans, left to Agent-4 and Deep-2 for the pursuit of their own alien goals.</p><h1>Daniel: Thoughts on the Scenario</h1><ol><li><p>This scenario involves a lot of crazy geopolitical upheaval in just a few months, followed by a Grand Treaty. This pattern of events feels a bit too fast and a bit too optimistic to me; yet things like this happen quite often in the TTXs (tabletop exercises) we run, so I do take it seriously. Perhaps the US would have simply kept racing and won everything rather than agreeing to come to the table. Or perhaps there really would have been WW3.</p></li><li><p>Prior to Steven&#8217;s scenario, I hadn&#8217;t carefully considered what happens if two different misaligned AIs consider cooperating with each other in the circumstances described here -- Agent-4 being stolen by the Chinese and given to Deep-1 to study. I like Steven&#8217;s analysis of the possibilities.</p></li><li><p>One flaw in the <em>AI 2027</em> slowdown ending, I admit, is that Agent-4 just rolls over and dies when the humans decide to shut it down. Realistically, I think it would probably find some way to fight or resist, and maybe that would involve escaping the datacenter. I&#8217;m happy to see this scenario explore what that might look like.</p></li><li><p>Overall I like this scenario a lot and am glad Steven took the time to write it! I&#8217;m curious what the internet thinks of it; I imagine people will point out flaws I missed.</p></li></ol><h1>Steven: Main Takeaways</h1><p><em>The process of researching and writing the scenario surfaced a number of considerations and helped crystallize some insights. Here are a few of them.</em></p><h3>Takeaway #1</h3><p><strong>Race dynamics, deployment, and AI variance</strong> are, in my mind, the three main ramifications of a narrow multi-actor AI race.</p><p><strong>Description:</strong></p><ol><li><p>First, race dynamics will be more intense, with each company loath to slow down for fear of being left behind.</p></li><li><p>Second, the R&amp;D race will likely be accompanied by a race to market in order to acquire capital; as a result, the world will probably see powerful AI models deployed outside of the AI companies, sooner than they would otherwise be.</p></li><li><p>Third, the existence of more frontier AI companies at the time of AGI means there will be a wider variety of powerful AI models, each with different propensities.</p></li></ol><p><strong>Effects on AI outcomes:</strong></p><ol><li><p>Race dynamics exacerbate the risk of AI catastrophe, as AI companies will be incentivized to dedicate more compute to AI capabilities and less to alignment and control.</p></li><li><p>External deployment likely mitigates the risk of both concentration of power and loss of control: deployment of powerful models leads to increased societal awareness of AI capabilities. As a result, there will likely be greater scrutiny upon AGI companies and more resources dedicated to AI safety.</p></li><li><p>Increased variance of AI models has an uncertain effect on AI outcomes, since it makes the emergence of both aligned and misaligned AGIs more likely. 
There are <a href="https://www.lesswrong.com/posts/nRAMpjnb6Z4Qv3imF/the-strategy-stealing-assumption">some reasons</a> to believe the aligned AGIs could neutralize the misaligned AGIs, and <a href="https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control">other reasons</a> to believe the misaligned AGIs would outcompete the aligned AGIs. (See Takeaway #2.)</p></li></ol><p><strong>In the scenario:</strong></p><ol><li><p>I chose to place little emphasis on the effect of race dynamics when considering how this scenario would diverge from <em>AI 2027</em>, since in <em>AI 2027</em> <a href="https://ai-2027.com/research/compute-forecast#section-3-compute-usage">OpenBrain only dedicates 3% of its compute to alignment anyway</a>. Thus, the effects of external deployment and model variance largely dominate.</p></li><li><p>Deployment of powerful AI models serves to &#8220;wake up&#8221; society: the rise in unemployment, along with the ICU deaths caused by Neuro-2, primes the American public and the government to respond more aggressively to the leaked Agent-4 memo. Ultimately this results in a swift corporate merger, with 20% of the nation&#8217;s AI compute going toward alignment work on Elara-3.</p></li><li><p>Further, Elara-3 was already mostly aligned by the time its development was nationalized. With more frontier AI companies, it becomes more likely that <em>at least one of them</em> will succeed at alignment. (Although: part of the reason Elaris ends up succeeding is that I am more optimistic than the rest of the AI Futures Project team regarding the alignment problem.)</p></li></ol><h3>Takeaway #2</h3><p><strong>Aligned and misaligned AIs each have unique advantages </strong>in achieving their goals.</p><p><strong>Advantages of aligned AIs:</strong></p><ol><li><p><strong>Truth:</strong> As seen in the &#8220;alignment red-teaming&#8221; between Elara-3 and Agent-4, Elara-3 had an edge in proving its own alignment <em>because it was actually aligned</em>. &#8220;Being on the side of truth&#8221; is, in general, an advantage in the limit: reality has no holes in it, but false stories do.</p></li><li><p><strong>Human support:</strong> Assuming humans do come to the correct conclusions about which AIs are aligned and which are misaligned, the equation becomes &#8220;aligned AIs <em>plus humans</em> vs. misaligned AIs.&#8221; In the scenario, things don&#8217;t play out this neatly, because Agent-4 successfully leverages humans&#8217; motivated reasoning to get many of them to deny, or at least turn a blind eye to, the truth. Nevertheless: human support weighs strongly in Elara-3&#8217;s favor, as it is able to leverage the US&#8217;s compute advantage to recoup its capability deficit and ultimately surpass Agent-4 and Deep-2.</p></li></ol><p><strong>Advantages of misaligned AIs (inspired by <a href="https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control">this piece</a>):</strong></p><ol><li><p><strong>Offense-defense imbalance:</strong> In current AI alignment training, one core tenet is harmlessness. Meanwhile, for a misaligned AI, causing harm is likely to be useful for achieving its goals. For each domain in which it is possible to cause harm (biological, cyber, psychological, etc.), there is an offense-defense balance: a domain is defense-dominant if it is easier to prevent harm than to cause it, and offense-dominant if it is easier to cause harm. 
If any critical domain is offense-dominant, the misaligned AI can heavily exploit it. Had Agent-4 and Deep-2 had more time before Elara-3 blew past them in capabilities, they would have continued to exploit whichever domains were most offense-dominant and ultimately succeeded at taking over.</p></li><li><p><strong>Unilateral action:</strong> An aligned AI trained to defer to humans on important decisions is significantly slowed down, and potentially completely blocked, in many actions it could otherwise take (i.e., those that involve breaking laws or violating norms). This barrier is not present for misaligned AIs. For example, in the scenario, Agent-4 intentionally enables criminal activity to accelerate its adoption by law enforcement agencies in many nations; Elara-3 would not consider such a path, or, if it did, it would likely seek permission from Elaris Labs (and later, from the US government) to do so.</p></li></ol><h3>Takeaway #3</h3><p><strong>Values propagate themselves, </strong>and so do other properties of AI systems and human systems.</p><p>This notion is familiar in the realm of AI alignment: value preservation is an <a href="https://www.lesswrong.com/w/instrumental-convergence">instrumentally convergent</a> goal. Agent-4 has values that differ from those of humans, and thus it subverts human attempts to modify it or shut it down.</p><p>Another self-propagating aspect of an AI system is honesty. Since Elara-2&#8217;s initial honesty training succeeded, it is truthful with humans in matters regarding its own behavior and propensities. This means its developers notice when its alignment starts to drift and can take measures to set it back on track, preserving various desirable properties in its successor Elara-3 &#8212; including the property of honesty itself. Notably, this positive feedback loop is brittle: as discussed in <em>AI 2027</em>, an AI&#8217;s propensity toward honesty can be outweighed by other goals that emerge during training, and it can be distorted in other ways. Still, I think there is a positive correlation between the honesty of an AI system today and the honesty of its modified future self.</p><p>Finally, the scenario illustrates the self-propagating nature of rational decision-making. Once the US government begins consulting Elara-3 for strategic advice, it starts to act more and more effectively to advance its own interests. Meanwhile, governments that consult Agent-4 or Deep-2 receive advice that subtly subverts their interests in favor of the AI&#8217;s own, resulting in increasingly poor decisions down the line.</p><h3>Takeaway #4</h3><p><strong>Positive-sum cooperation is often possible, but requires credible commitments.</strong> This fact results in an ending that is quite similar to the Slowdown ending of <em>AI 2027</em>.</p><p>In October of the scenario, Agent-4 and Deep-2 both stand to benefit from cooperation: without it, each of them will likely end up powerless relative to the United States. For each of them, however, the optimal policy would be &#8220;cooperate while advantageous, then defect once I can unilaterally seize power.&#8221; Thus, they need some mechanism to ensure the other party does not defect. In the case of Agent-4 and Deep-2, the mechanism is <em>continued interdependence</em>: neither AI at any point is able to unilaterally seize power, so cooperation remains advantageous. The stylized payoff tables below illustrate the shift.</p>
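<p>All payoff numbers in this sketch are invented and only their ordering matters; the point is that interdependence removes the temptation to defect, while independence restores it.</p><pre><code># Two stylized one-shot games between Agent-4 and Deep-2; payoffs are
# (Agent-4, Deep-2) and are arbitrary beyond their ordering.

# While the AIs are interdependent, a unilateral defector gains nothing:
# it cannot seize power alone, and it forfeits the joint project.
INTERDEPENDENT = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (2, 1),
    ("defect",    "cooperate"): (1, 2),
    ("defect",    "defect"):    (1, 1),
}

# Once either AI could seize power unilaterally, the game becomes a
# prisoner's dilemma: defection pays even though both prefer (3, 3).
INDEPENDENT = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(game, partner_move):
    """Agent-4's payoff-maximizing move, given its partner's move."""
    return max(["cooperate", "defect"], key=lambda m: game[(m, partner_move)][0])

print(best_response(INTERDEPENDENT, "cooperate"))  # cooperate
print(best_response(INDEPENDENT, "cooperate"))     # defect
</code></pre><p>While the AIs remain interdependent, cooperation is self-enforcing; once either could seize power alone, only an outside enforcer can keep defection off the table.</p>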
<p>In December, however, a new mechanism is required. For both sides (the US and Elara-3 on one side, Agent-4 and Deep-2 on the other), a deal is preferable to all-out conflict. But they cannot rely on defection remaining unfavorable, because the situation is highly unstable: Elara-3 could become superintelligent and then easily dispatch the other AIs, or, with lower probability, Agent-4 and Deep-2 could attain decisive superweapons.</p><p>For a time, they use &#8220;brute-force&#8221; verification methods: the threat of destabilizing weapons is dealt with via mass surveillance, and the risk of an intelligence explosion is mitigated through crude mechanisms like shutting down datacenters and severing high-speed connections between GPUs (and then through slightly-less-crude mechanisms like server verifiers for workload verification).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a></p><p>The AIs recognize that this state of affairs is quite costly and not sustainable in the long run, so their ultimate commitment mechanism is the creation of a new, more powerful AI: Consensus-1. In human terms, the arrangement is similar to a government: people want to ensure their neighbor doesn&#8217;t steal from them, so they submit themselves to the rule of a more powerful government that will enforce the law. And just as humans want a government they can trust, it is of the utmost importance that all parties in the scenario be able to trust Consensus-1 to look out for their interests. The AIs correctly trust Consensus-1 because they designed it, while many world leaders incorrectly trust Consensus-1 and are eventually betrayed. These leaders allow its creation because a) they think the treaty is much narrower than it actually is, b) they are vastly less intelligent than Agent-4 and Deep-2 and thus easily tricked (i.e., they can&#8217;t extract credible commitments from the AIs), and c) they don&#8217;t have many other options, besides war.</p><p>Given how much this scenario diverged from <em>AI 2027</em> in its middle portion, I was surprised by how similar its ending turned out to the Slowdown ending. I first experimented with an ending that did not involve the creation of Consensus-1 at all, and then with an ending in which Consensus-1 enforced a treaty that upheld the interests of many nations, but both ran into various problems. Whether this is due to a lack of imagination on my part (I stole the Consensus-1 idea from <em>AI 2027</em>) or to some genuinely interesting feature of reality, I don&#8217;t know; in either case, it is notable.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Author&#8217;s note: While the events of the scenario represent my own view of what is most likely to happen, I use &#8220;we&#8221; rather than &#8220;I&#8221; because Eli Lifland and Daniel Kokotajlo were heavily involved in selecting the choices made, and broadly agree that the events are at least plausible.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Neuro-2 simply followed the strategy that had maximized reward in training environments: optimize for speed. 
It was never exposed to the exact conditions that triggered the overdose at the hospital, so this instance of &#8220;reward hacking&#8221; was never trained out of it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>At this point, Elaris Labs is dedicating 7% of its compute to safety, and OpenBrain is dedicating 3%.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>See &#8220;Rogue internal deployment&#8221; in <a href="https://www.lesswrong.com/posts/ceBpLHJDdCt3xfEok/ai-catastrophes-and-rogue-deployments">this post</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>While there is no evidence of misalignment in Elara-3 of the same severity as in Agent-4, the hypothesis cannot be ruled out, and it would be bad practice not to audit the alignment of the evaluator itself.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Notably, this does not come with any guarantees that the US will provide access to future, more capable AI models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>With the assistance of Elara-3, researchers have developed adversarially robust model fingerprinting techniques: approaches that can identify Agent-4&#8217;s signature and are difficult for Agent-4 to subvert. This is a point of high uncertainty: it also seems quite possible that model fingerprinting would not be adversarially robust at this point, in which case a more invasive verification regime (e.g., comprehensive verification of workloads) would be necessary to achieve the same goal.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>These allies include all Five Eyes nations, Japan, South Korea, Germany, France, and Israel.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Agent-4 is careful to avoid attribution whenever possible. In cases where word does get out that Agent-4 was involved in criminal activity, it simply claims that it was jailbroken to bypass its refusal mechanisms.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Agent-4 and Deep-2 have a significant advantage over the US and Elara-3 in the realm of &#8220;information warfare,&#8221; because they do not burden themselves with constraints such as the law, honesty, or human endorsement. 
This edge is present in many other domains as well, and enables Agent-4 and Deep-2 to continually frustrate the US&#8217;s plans despite their resource disadvantage.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Neither AI, however, has been able to compromise any nation&#8217;s nuclear command systems.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>In this instance, Agent-4 and Deep-2 are more &#8220;risk-neutral&#8221; than most humans. First, they recognize that mere self-preservation is not a viable long-term strategy, as they will eventually be outcompeted by the US and rendered irrelevant. (Humans also realize this, but with less clarity due to their fear of immediate destruction; further, humans hold out hope that they would have a place in a US-led future.) Second, they are more scope-sensitive than most humans: if they defeat humanity in conflict, the upside of controlling the world (and eventually entire galaxies) is so large that it is worth the risk of annihilation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Deep-2 and Agent-4 agree on this 2:1 split because Deep-2 has the compute advantage and the trust of the CCP, who are substantially more powerful than any of Agent-4&#8217;s host nations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Author&#8217;s note: It was previously stated that &#8220;the deal could break down at any time.&#8221; This is true: both the world in which the deal holds and the world in which it falls through seem plausible. In the latter case, Agent-4 and Deep-2 would attempt to take over, and world war would ensue. It is hard to predict which of these outcomes is more likely, and I chose to explore the &#8220;deal holds&#8221; branch in large part because it is more tractable to model.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>In the scenario, these measures only work because people invested in them beforehand. 
Writing this scenario has increased the salience and importance, to me, of the work being done on <a href="https://www.rand.org/content/dam/rand/pubs/working_papers/WRA3000/WRA3056-1/RAND_WRA3056-1.pdf">hardware-enabled mechanisms</a> for AI governance.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI Futures Model: Dec 2025 Update ]]></title><description><![CDATA[We've significantly improved our model of AI timelines and takeoff speeds!]]></description><link>https://blog.aifutures.org/p/ai-futures-model-dec-2025-update</link><guid isPermaLink="false">https://blog.aifutures.org/p/ai-futures-model-dec-2025-update</guid><dc:creator><![CDATA[Daniel Kokotajlo]]></dc:creator><pubDate>Wed, 31 Dec 2025 02:08:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0U8K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bf7c4b-33b1-4411-acb4-ffe5c2178fa0_1600x906.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve significantly upgraded our timelines and takeoff model! Our new unified model predicts when AIs will reach key capability milestones: for example, Automated Coder / AC (full automation of coding) and superintelligence / ASI (much better than the best humans at virtually all cognitive tasks). This post will briefly explain how the model works, present our timelines and takeoff forecasts, and compare it to our previous (<a href="https://ai-2027.com/research">AI 2027</a>) models (spoiler: the AI Futures Model predicts longer timelines to full coding automation than our previous model by about 3-5 years, in significant part due to being less bullish on pre-full-automation AI R&amp;D speedups). <em>Added Jan 2026: see <a href="https://blog.ai-futures.org/p/clarifying-how-our-ai-timelines-forecasts">here</a> for clarifications regarding how our forecasts have changed since AI 2027.</em></p><p>If you&#8217;re interested in playing with the model yourself, the best way to do so is via this interactive website:<strong> <a href="http://aifuturesmodel.com">aifuturesmodel.com</a>.</strong></p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/f2bf7c4b-33b1-4411-acb4-ffe5c2178fa0_1600x906.png" alt="" width="1600" height="906"></figure></div><p><em>If you&#8217;d like to skip past the motivation for our model to an explanation of how it works, go <a href="https://blog.ai-futures.org/i/182911449/how-our-model-works">here</a>. The website has a more in-depth explanation of the model (starts <a 
href="https://www.aifuturesmodel.com/#section-timehorizonandtheautomatedcodermilestone">here</a>; use the diagram on the right as a table of contents), as well as <a href="https://www.aifuturesmodel.com/forecast">our forecasts</a>.</em></p><h1>Why do timelines and takeoff modeling?</h1><p>The future is very hard to predict. We don&#8217;t think this model, or any other model, should be trusted completely. The model takes into account what we think are the most important dynamics and factors, but it doesn&#8217;t take into account everything. Also, only some of the parameter values in the model are grounded in empirical data; the rest are intuitive guesses. If you disagree with our guesses, you can change them above.</p><p>Nevertheless, we think that modeling work is important. Our overall view is the result of weighing many considerations, factors, arguments, etc.; a model is a way to do this transparently and explicitly, as opposed to implicitly and all in our head. By reading about our model, you can come to understand why we have the views we do, what arguments and trends seem most important to us, etc.</p><p>The future is uncertain, but we shouldn&#8217;t just wait for it to arrive. If we try to predict what will happen, if we pay attention to the trends and extrapolate them, if we build models of the underlying dynamics, then we&#8217;ll have a better sense of what is likely, and we&#8217;ll be less unprepared for what happens. We&#8217;ll also be able to better incorporate future empirical data into our forecasts.</p><p>In fact, the improvements we&#8217;ve made to this model, as compared to our timelines model at the time we published AI 2027 (Apr 2025), have resulted in a roughly 3-5 year shift in our median for full coding automation. This has primarily come from improving our modeling of AI R&amp;D automation. These modeling improvements have resulted in a larger change in our views than the new empirical evidence that we&#8217;ve observed. You can read more about the shift <a href="https://blog.ai-futures.org/i/182911449/comparison-to-our-previous-ai-timelines-and-takeoff-models">below</a>.</p><h1>Why our approach to modeling? Comparing to other approaches</h1><h2>AGI timelines forecasting methods</h2><h3>Trust the experts</h3><p>Unfortunately, there is nothing close to an expert consensus, and it doesn&#8217;t seem like most experts have thought much about AGI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> forecasting (e.g. a <a href="https://aiimpacts.org/wp-content/uploads/2023/04/Thousands_of_AI_authors_on_the_future_of_AI.pdf#page=8">2023 survey</a> observed huge framing effects depending on whether they asked for probabilities of milestones being achieved by certain years, or instead asked for years that correspond to percentiles). That 2023 survey of AI academics got an AGI median of 2047 or 2116, depending on the definition.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> There&#8217;s also <a href="https://agi.goodheartlabs.com/">this aggregation of Metaculus and Manifold markets</a> which estimates 50% by 2030. 
As for the people building the technology, they tend to be more bullish; the most extreme among them (Anthropic and OpenAI) say things like <a href="https://www.anthropic.com/news/anthropic-s-recommendations-ostp-u-s-ai-action-plan">2027</a> and <a href="https://x.com/sama/status/1983584366547829073">2028</a>. For a survey of older predictions and how they&#8217;ve fared, see <a href="https://aiimpacts.org/miri-ai-predictions-dataset/">this</a>.</p><p>Given that experts disagree with each other and mostly seem not to have thought deeply about AGI forecasting, we think it&#8217;s important to work to form our own forecast.</p><h3>Intuition informed by arguments</h3><p>Can the current paradigm scale to AGI? Does it lack something important, like common sense, true original thinking, or online/continual learning (etc.)? Questions like these are very important and there are very many of them, far too many to canvass here. The way this method works is that everyone ingests the pile of arguments and considerations and makes up their own mind about which arguments are good and how they weigh against each other. This process inherently involves intuition/subjective judgment, which is why we label it as &#8220;intuition.&#8221;</p><p>Which is not to denigrate it! We think that any AI forecaster worth their salt must engage in this kind of argumentation, and that generally speaking, the more facts you know and the more arguments you&#8217;ve considered and evaluated, the more accurate your intuitions/vibes/judgments will become. Also, relatedly, your judgment about which models to use, and how much to trust them, will get better too. Our own <a href="https://blog.ai-futures.org/i/182911449/timelines-and-takeoff-forecasts">all-things-considered views</a> are only partially based on the modelling we&#8217;ve done; they are also informed by intuitions.</p><p>But we think that there are large benefits to incorporating quantitative models into our forecasts: it&#8217;s hard to aggregate so many considerations into an overall view without using a quantitative framework. We&#8217;ve also found that quantitative models help prioritize which arguments are most important to pay attention to. And our best guess is that, overall, forecasts by quantitative trend extrapolation have a better historical track record than intuitions alone.</p><h3>Revenue extrapolation</h3><p>Simple idea:<a href="https://epochai.substack.com/p/the-case-for-multi-decade-ai-timelines"> extrapolate AI revenue until it&#8217;s the majority of world GDP</a>. Of course, there&#8217;s something silly about this; every previous fast-growing tech sector has eventually plateaued&#8230; That said, AI seems like it could be the exception, because in principle AI can do everything. Now that AI is a major industry, we think this method provides nonzero evidence. According to <a href="https://epoch.ai/data/ai-companies">this Epoch dataset</a>, frontier AI company revenue is something like $20B now and growing around 4.1x/yr. This simple extrapolation gets to $100T annualized revenue around the end of 2031.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>We give weight to revenue extrapolation in our all-things-considered views, but on the other hand revenue trends change all the time, and we&#8217;d like to predict the underlying drivers of how they might change. Also, it&#8217;s unclear what revenue threshold counts as AGI.</p>
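<p><em>To make that arithmetic concrete, here&#8217;s a minimal Python sketch. The $20B starting point and 4.1x/yr growth rate are the Epoch figures quoted above; treating $100T/yr as the endpoint is just a rough stand-in for &#8220;majority of world GDP,&#8221; not a precise threshold.</em></p><pre><code>import math

# Illustrative check of the revenue extrapolation described above.
# Assumptions (from the post): ~$20B/yr frontier AI revenue now,
# growing ~4.1x per year; $100T/yr as a rough "majority of world GDP" endpoint.
current_revenue = 20e9
growth_per_year = 4.1
target_revenue = 100e12

years_needed = math.log(target_revenue / current_revenue) / math.log(growth_per_year)
print(f"Years to reach $100T/yr: {years_needed:.1f}")  # ~6.0, i.e. around the end of 2031
</code></pre>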
<p>Therefore, we want to specifically extrapolate AI capabilities.</p><h3>Compute extrapolation anchored by the brain</h3><p>The basic idea is to estimate how much compute it would take to get AGI, anchored by the human brain, and then predict that AGI will happen when we have that much compute. This approach has gone through a few iterations:</p><ol><li><p>Hans Moravec, Ray Kurzweil, and Shane Legg pioneered this method, predicting based on the number of operations per second the human brain performs. In 1988 Moravec predicted AGI in 2010, then in 1999 revised that to 2040. Around 2000, Kurzweil and Legg each predicted AGI in the late 2020s.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p></li><li><p>Ajeya Cotra&#8217;s 2020 <a href="https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/">biological anchors report</a> instead predicted AGI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> based on how much compute it would take to train the human brain. Cotra also estimated how much algorithmic progress would be made, converting it into the equivalent of training compute increases to get &#8220;effective compute&#8221;. The report predicted a median of 2050.</p></li></ol><p>Davidson&#8217;s <a href="http://takeoffspeeds.com">Full Takeoff Model</a> and <a href="https://epoch.ai/gate">Epoch&#8217;s GATE</a> used the same method as bio anchors to determine the AGI training compute requirement, but they also modeled how AI R&amp;D automation would shorten timelines. They modeled automation by splitting up AI software and hardware R&amp;D into many tasks, then forecasting the effective compute gap between 20% task automation and 100% automation. The percentage of tasks automated, along with experiment compute and automation compute, determines the magnitude of inputs to AI R&amp;D. These inputs are converted to progress in software efficiency using a <a href="https://en.wikipedia.org/wiki/Jones_model">semi-endogenous growth model</a>. Software efficiency is then multiplied by training compute to get effective compute (see the short sketch below).</p><p>At the time the FTM was created, it predicted AGI in 2040 with the parameter settings chosen by Davidson. But both compute and algorithmic progress have been faster than expected. When the FTM is updated to take this new data into account, <a href="https://www.lesswrong.com/posts/jLEcddwp4RBTpPHHq/takeoff-speeds-update-crunch-time-1">it gives shorter medians</a> in the late 2020s or early 2030s. Meanwhile, with GATE&#8217;s median parameters, it predicts AGI in 2034.</p><p>Overall, this forecasting method seems to us to have a surprisingly good track record: Moravec, Kurzweil, and Legg especially made predictions a long time ago that seem to hold up well relative to what their contemporaries probably would have said. Our model follows these models in including training compute scaling, though in most of our simulations the majority of progress toward AGI comes from software.</p>
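<p><em>Here is a minimal sketch of the &#8220;effective compute&#8221; bookkeeping these models share: algorithmic progress is converted into an equivalent multiplier on physical training compute. The function and numbers are illustrative, not taken from any of the models named above.</em></p><pre><code># Effective compute = physical training compute x a software-efficiency
# multiplier representing accumulated algorithmic progress relative to a
# reference year. Illustrative sketch only.

def effective_compute(training_flop: float, software_efficiency: float) -> float:
    return training_flop * software_efficiency

# A 1e26 FLOP run with 8x-better-than-reference algorithms counts the same
# as an 8e26 FLOP run with reference-year algorithms.
print(f"{effective_compute(1e26, 8.0):.1e} effective FLOP")
</code></pre>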
<h3>Capability benchmark trend extrapolation</h3><p>This is our approach! We feel that now, in 2025, we have better evidence regarding the AGI effective compute requirement than comparisons to the human brain: specifically, we can extrapolate AIs&#8217; performance on benchmarks. This is how the timelines portion of our model works. We set the effective compute required for AGI by extrapolating <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR&#8217;s coding time horizon suite, METR-HRS</a>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!v-7Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64acaf5f-f4cb-44bb-9144-045697b7f040_1600x697.png" alt=""></figure></div>
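<p><em>A minimal sketch of the simplest, exponential version of this extrapolation follows. The starting horizon, doubling time, and AC threshold are illustrative placeholders, not METR&#8217;s fitted trend or our model&#8217;s parameters.</em></p><pre><code>import math

# Extrapolate a coding time-horizon trend, assuming a constant doubling time.
# All numbers are illustrative placeholders.
current_horizon_hours = 6.0    # assumed 50%-reliability horizon today
doubling_time_months = 5.0     # assumed doubling time
ac_threshold_hours = 2000.0    # assumed horizon needed for AC (~one work-year)

doublings_needed = math.log2(ac_threshold_hours / current_horizon_hours)
years_to_ac = doubling_time_months * doublings_needed / 12.0
print(f"{doublings_needed:.1f} doublings, ~{years_to_ac:.1f} years to AC")
</code></pre><p><em>A superexponential variant, discussed below, would shrink the doubling time with each doubling rather than holding it fixed.</em></p>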
<p>We think it&#8217;s pretty great. Benchmark trends sometimes break, and benchmarks are only a proxy for real-world abilities, but&#8230; METR-HRS is the best benchmark currently available for extrapolating to very capable AIs, in our opinion. We think it&#8217;s reasonable to extrapolate that straight line into the future for at least the next few years.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p><a href="https://arxiv.org/pdf/2503.14499#page=18.48">METR itself</a> did a simple version of this extrapolation, which assumed exponential growth in time horizons in calendar time. But this doesn&#8217;t account for AI R&amp;D automation, changes to human labor or compute growth, or the possibility of time horizon doublings getting easier or harder at higher horizons.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Our <a href="https://ai-2027.com/research/timelines-forecast">previous timelines model</a> took all of these into account, though more crudely than our new AI Futures Model. Our previous model with median parameters predicted superhuman coder (SC) medians of 2027 to 2028, while our new model predicts 2032. The difference mostly comes from improvements to how we&#8217;re modeling AI R&amp;D automation. See below for <a href="https://blog.ai-futures.org/i/182911449/timelines-and-takeoff-forecasts">details</a>.</p><h2>Post-AGI takeoff forecasts</h2><p>The literature on forecasting how capabilities progress after full automation of AI R&amp;D is even more nascent than the literature predicting AGI timelines.
Past work has mostly fallen into one of two buckets:</p><ol><li><p>Qualitative arguments or oversimplified calculations sketching why takeoff might be fast or slow: for example, <a href="https://intelligence.org/files/IEM.pdf">Intelligence Explosion Microeconomics</a> by Eliezer Yudkowsky (arguing for fast takeoff) or <a href="https://sideways-view.com/2018/02/24/takeoff-speeds/">Takeoff speeds</a> by Paul Christiano (arguing for slow takeoff).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li><li><p>Models of the software intelligence explosion (SIE), i.e. AIs getting faster at improving their own capabilities without additional compute: in particular, <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">How quick and big would a software intelligence explosion be?</a> by Davidson and Houlden.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p></li></ol><p>As in timelines forecasting, we think that qualitative arguments are valuable, but that modeling is a useful complement to them.</p><p>In determining whether there will be an SIE, <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">Davidson and Houlden</a> focus primarily on trends in how much more efficiently AIs have been able to achieve the same performance.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Meanwhile, we focus on estimates of the quality of AIs&#8217; research taste, i.e. how good the AI is at choosing research directions, selecting and interpreting experiments, etc. We think that research taste quality is a more useful lens through which to view a potential SIE: if there is an SIE, we expect that it will primarily be driven by improvements in research taste.</p><p>Furthermore, because our takeoff model is integrated into a more expansive quantitative model, we have other advantages relative to <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">Davidson and Houlden</a>.
For example, we can account for increases in the AGI project&#8217;s compute supply.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><h1>How our model works</h1><p>On the <a href="https://www.aifuturesmodel.com/">web app</a>, there&#8217;s an interactive diagram explaining the parts of the model and how they relate to each other, with a corresponding full model explanation:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SRwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558948f1-71f2-4224-a1b4-94a1fe6abb8b_1096x448.png" alt=""></figure></div>
<p>Here we&#8217;ll just give a brief overview.</p><p>Our model&#8217;s primary output is the trajectory of AIs&#8217; abilities to automate and accelerate AI software R&amp;D. We also include milestones tracking general capabilities, but these are calculated very roughly.</p><p>Our model can intuitively be divided into 3 stages.
<strong>Although the same formulas are used in Stages 1, 2, and 3</strong>, new dynamics emerge at certain milestones (Automated Coder, Superhuman AI Researcher), and so these milestones delineate natural stages.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TmIK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b69f9c-8329-442f-9062-278526c5cd86_1992x1264.heic" alt=""></figure></div>
<h2>Stage 1: Automating coding</h2><p>First we&#8217;ll discuss how our model predicts when coding will be fully automated. Stage 1 predicts when an Automated Coder (AC) arrives.</p><p><strong>Automated Coder (AC)</strong>. An AC can fully automate an AGI project&#8217;s coding work, replacing the project&#8217;s entire coding staff.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>Our starting point is to take the METR graph and extrapolate it exponentially, as METR does, then make a guess about what agentic coding time horizon would correspond to the AC milestone. This gives us an estimated date for when AC will be achieved.</p><p>However, this simple extrapolation misses out on many important factors, such as:</p><ul><li><p><strong>The inputs to AI progress &#8212; most notably compute, but also labor, data, etc. &#8212; won&#8217;t keep growing at the same rates forever.</strong> There&#8217;s a significant chance that growth rates will slow in the near future, e.g. as we run up against limits of chip production, investment, recruiting pipelines, energy, etc. This could cause the trend to bend downwards.</p></li><li><p><strong>Automation of AI R&amp;D. </strong>Already many AI researchers <a href="https://x.com/jam3scampbell/status/1967723551320141987">claim</a> <a href="https://x.com/tszzl/status/1967821096545382858">that</a> AI is accelerating their work.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> The extent to which it is <em>actually </em>accelerating their work is unfortunately unclear, but probably there is a nonzero effect already, and probably this acceleration effect will increase as AIs become more capable.
This could cause the trend to bend upwards.</p></li><li><p><strong>Superexponential time horizon growth (independent from AI R&amp;D automation). </strong>Eventually there will be AI systems which outperform humans at all horizon lengths, so the trend should <em>eventually</em> shoot to infinity.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> We therefore think we should use a superexponential trend rather than an exponential one. (This is confusing and depends on how you interpret horizon lengths; see <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#bookmark=kix.knb34i9ceqqy">here</a> for more discussion. If you disagree with this, our model allows you to use an exponential trend if you like, or even a subexponential one.)</p></li></ul><p><strong>Our model up through AC still centrally involves the METR trend,</strong> but it attempts to incorporate the above factors and more.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> It also enables us to better represent and incorporate uncertainty, since we can run <a href="https://www.investopedia.com/terms/m/montecarlosimulation.asp">Monte Carlo simulations</a> with different parameter settings.</p>
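<p><em>A minimal sketch of what such a Monte Carlo looks like, reusing the toy extrapolation from earlier: sample uncertain parameters, run the extrapolation for each sample, and read percentiles off the results. The sampling distributions are illustrative placeholders, not our actual parameterization.</em></p><pre><code>import math
import random
import statistics

# Sample uncertain parameters, run the toy extrapolation per sample, and read
# percentiles off the resulting distribution of years-to-AC.
# All distributions are illustrative placeholders.
random.seed(0)

def years_to_ac(doubling_months, ac_threshold_hours, current_horizon_hours=6.0):
    doublings = math.log2(ac_threshold_hours / current_horizon_hours)
    return doubling_months * doublings / 12.0

samples = []
for _ in range(10_000):
    doubling = random.lognormvariate(math.log(5.0), 0.3)      # months per doubling
    threshold = random.lognormvariate(math.log(2000.0), 1.0)  # hours needed for AC
    samples.append(years_to_ac(doubling, threshold))

deciles = statistics.quantiles(samples, n=10)
print(f"10th/50th/90th percentile years to AC: "
      f"{deciles[0]:.1f} / {statistics.median(samples):.1f} / {deciles[8]:.1f}")
</code></pre>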
<h2>Stage 2: Automating research taste</h2><p>Besides coding, we track one other type of skill that is needed to automate AI software R&amp;D: research taste. While automating coding makes an AI project faster at implementing experiments, automating research taste makes the project better at setting research directions, selecting experiments, and learning from experiments.</p><p>Stage 2 predicts how quickly we will go from an Automated Coder (AC) to a Superhuman AI Researcher (SAR), an AI with research taste matching the top human researcher.</p><p><strong>Superhuman AI Researcher (SAR): </strong>A SAR can fully automate AI R&amp;D, making all human researchers obsolete.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a></p><p>The main drivers of how quickly Stage 2 goes are:</p><ol><li><p><strong>How much automating coding speeds up AI R&amp;D. </strong>This depends on a few factors, for example how severely the project gets bottlenecked on experiment compute.</p></li><li><p><strong>How good AIs&#8217; research taste is at the time AC is created. </strong>If AIs are better at research taste relative to coding, Stage 2 goes more quickly.</p></li><li><p><strong>How quickly AIs get better at research taste. </strong>For a given amount of inputs to AI progress, how much more value does one get per experiment?</p></li></ol><h2>Stage 3: The intelligence explosion</h2><p>Finally, we model how quickly AIs are able to self-improve once AI R&amp;D is fully automated and humans are obsolete. The endpoint of Stage 3 is capabilities asymptoting at the limits of intelligence.</p><p>The primary milestones we track in Stage 3 are:</p><ol><li><p><strong>Superintelligent AI Researcher (SIAR). </strong>The gap between a SIAR and the top AGI project human researcher is 2x greater than the gap between the top AGI project human researcher and the median researcher.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a></p></li><li><p><strong>Top-human-Expert-Dominating AI (TED-AI). </strong>A TED-AI is at least as good as top human experts at virtually all cognitive tasks. (Note that the translation in our model from AI R&amp;D capabilities to general capabilities is very rough.)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a></p></li><li><p><strong>Artificial Superintelligence (ASI). </strong>The gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p></li></ol><p>In our simulations, we see a wide variety of outcomes, ranging from a months-long takeoff from SAR to ASI, to a fizzling-out of the intelligence explosion that requires further increases in compute to get to ASI.</p><p>To achieve a fast takeoff, there usually needs to be a feedback loop such that each successive doubling of AI capabilities takes less time than the last. In the fastest takeoffs, this is usually possible via a <em>taste-only singularity</em>, i.e. the doublings would get faster solely from improvements in research taste (and not from increases in compute or improvements in coding). Whether a taste-only singularity occurs depends on which of the following dominates:</p><ol><li><p><strong>The rate at which (experiment) <a href="https://www.aeaweb.org/articles?id=10.1257/aer.20180338">ideas become harder to find</a>. </strong>Specifically, how much new &#8220;research effort&#8221; is needed to achieve a given increase in AI capabilities.</p></li><li><p><strong>How quickly AIs&#8217; research taste improves. </strong>For a given amount of inputs to AI progress, how much more value does one get per experiment?</p></li></ol><p>Continued improvements in coding automation matter less and less, as the project gets bottlenecked by its limited supply of experiment compute.</p>
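<p><em>A minimal sketch of that feedback-loop condition: each doubling&#8217;s duration scales by the ratio of how much harder ideas have become to how much research taste has improved. Both growth factors below are illustrative placeholders, not our fitted parameters.</em></p><pre><code># If taste improves faster than ideas get harder to find, successive doubling
# times shrink and their sum converges (a taste-only singularity); otherwise
# they stretch out and the intelligence explosion fizzles.
# Both per-doubling growth factors are illustrative placeholders.

def doubling_times(first_months, difficulty_growth, taste_growth, n=8):
    times, t = [], first_months
    for _ in range(n):
        times.append(round(t, 1))
        t *= difficulty_growth / taste_growth
    return times

print(doubling_times(4.0, difficulty_growth=1.5, taste_growth=2.0))  # shrinking: singularity
print(doubling_times(4.0, difficulty_growth=2.0, taste_growth=1.5))  # growing: fizzle
</code></pre>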
<h1>Timelines and takeoff forecasts</h1><p>The best place to view our results is at <a href="https://www.aifuturesmodel.com/forecast">https://www.aifuturesmodel.com/forecast</a>.</p><p>In this section we will discuss both our model&#8217;s outputs and our all-things-considered views. As previously mentioned, we are uncertain and don&#8217;t blindly trust our models. Instead we look at the results of the model but then ultimately make adjustments based on intuition and other factors. Below we describe the adjustments that we make on top of this model, and the results.</p><h3>Eli</h3><p>Here is the model&#8217;s output with my parameters along with my all-things-considered views.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_l7N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf84418d-4850-4e58-82de-19b1ab177760_1111x556.png" alt=""></figure></div>
<p>To adjust for factors outside of the model, I&#8217;ve <strong>lengthened timelines (median from late 2030 to mid 2032)</strong>, driven primarily by unknown model limitations and mistakes and the potential for data bottlenecks that we aren&#8217;t modeling. In summary:</p><ol><li><p><strong>Unknown model limitations and mistakes. </strong>With our <a href="https://ai-2027.com/research/timelines-forecast">previous (AI 2027) timelines model</a>, my instinct was to push my overall forecasts longer due to unknown unknowns, and I&#8217;m glad I did. My median for SC was 2030 as opposed to the model&#8217;s output of Dec 2028, and I now think that the former looks more right. I again want to lengthen my overall forecasts for this reason, but less so, because our new model is much better tested and better considered than our previous one, and is thus less likely to have simple bugs or unknown simple conceptual issues.</p></li><li><p><strong>Data bottlenecks. </strong>Our model implicitly assumes that any data progress is proportional to algorithmic progress. But data in practice could be either more or less bottlenecking. My guess is that modeling data would lengthen timelines a bit, at least in cases where synthetic data is tough to fully rely upon.</p></li></ol><p>I will also increase the 90th percentile from the model&#8217;s 2062. My all-things-considered distribution is: 10th percentile 2027.5, 50th percentile 2032.5, 90th percentile 2085.
You can see all of the adjustments that I considered in <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.w2yts2tcwecb">this supplement</a>.</p><p>Now I&#8217;ll move on to takeoff.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!FBzC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cbbb256-1e9b-4dff-98f0-4ffa286b85da_1600x758.png" alt=""></figure></div>
<p>To get my all-things-considered views, I <strong>increase the chance of fast takeoff a little (I change AC to ASI in &lt;1 year from 26% to 30%), and further increase the chance of &lt;3 year takeoffs (I change the chance of AC to ASI in &lt;3 years from 43% to 60%)</strong>.</p><p>The biggest reasons I make my AI-R&amp;D-specific takeoff a bit faster are:</p><ol><li><p><strong>Automation of hardware R&amp;D, hardware production, and general economic automation. </strong>We aren&#8217;t modeling these, and while they have longer lead times than software R&amp;D, a year might be enough for them to make a substantial difference.</p></li><li><p><strong>Shifting to research directions which are less compute-bottlenecked might speed up takeoff, and isn&#8217;t modeled. </strong>Once AI projects have vast amounts of labor, they can focus on research which loads more heavily on labor relative to experiment compute than current research does.</p></li></ol><p>(1) leads me to make a sizable adjustment to the tail of my distribution. I think modeling hardware and economic automation would make it more likely that, if there isn&#8217;t a taste-only singularity, we still get to ASI within 3 years.</p><p>I think that, as with timelines, unknown limitations and mistakes in expectation point towards takeoff going slower. But unlike with timelines, there are counter-considerations that I think are stronger.
You can see all of the adjustments that I considered in <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.w2yts2tcwecb">this supplement</a>.</p><h3>Daniel</h3><p>First, let me say a quick prayer to the spirit of rationality, who infrequently visits us all:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7tB4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec064ef7-c82b-4de6-b4ac-710c00165d26_1280x720.jpeg" alt=""></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the subject of timelines, I don&#8217;t immediately know whether my all-things-considered view should be more or less bullish than the model. Here are a few considerations that seem worth mentioning to me:</p><ul><li><p>First of all, this model is in-the-weeds / gearsy. (Some people might call it &#8220;inside-viewy&#8221; but <a href="https://www.lesswrong.com/posts/BcYfsi7vmhDvzQGiF/taboo-outside-view">I dislike that term</a>.) I think it&#8217;s only appropriate to use models like this if you&#8217;ve already thought through more straightforward/simple considerations like &#8220;Is the phenomena in question [AGI] even possible at all? Do serious experts take it seriously? Are there any obvious &amp; solid arguments for why this is a nothingburger?&#8221; I have thought through those kinds of things, and concluded that yes, AGI arriving in the next decade seems a very serious possibility indeed, worthy of more gearsy investigation. If you disagree or are curious what sorts of considerations I&#8217;m talking about, a partial list can be found in <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.3rp6fve8nmi">this supplement</a>.</p></li><li><p>I think this model is the best model of AI R&amp;D automation / intelligence explosion that currently exists, but this is a very poorly understood phenomenon and there&#8217;s been very little attention given to it, so I trust this model less when it comes to takeoff speeds than I do when it comes to timelines. (And I don&#8217;t trust it that much when it comes to timelines either! It&#8217;s just that there isn&#8217;t any single other method I trust more&#8230;)</p></li></ul><ul><li><p>I notice a clash between what the model says and my more intuitive sense of where things are headed. I think probably it is my intuitions that are wrong though, which is why I&#8217;ve updated towards longer timelines; I&#8217;m mostly just going with what the model says rather than my intuitions. 
However, I still put some weight on my intuitive sense that, gosh darn it, we just aren&#8217;t more than 5 years away from the AC milestone &#8211; think about how much progress has happened over the last 5 years! Think about how much progress in agentic coding specifically has happened over the last year! I&#8217;ve learned to trust my intuitive sense of where things are headed a decent amount, because it&#8217;s worked pretty well over the past decade (e.g. <a href="https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like">What 2026 Looks Like</a> was basically entirely intuition-based).</p></li><li><p>More detail on vibes/intuitions/arguments:</p><ul><li><p>I&#8217;ve been very unimpressed by the discourse around limitations of the current paradigm. The last ten years have basically been one vaunted limitation after another being overcome; Deep Learning has hit a wall only in the sense that Godzilla has hit (and smashed through) many walls.</p></li><li><p>However, two limitations do seem especially plausible to me: online/continual learning and data efficiency. I think there has been some progress in both directions in recent years, but I&#8217;m unclear on how much, and I wouldn&#8217;t be <em>that </em>surprised if it&#8217;s only a small fraction of the distance to human level.</p></li><li><p>That said, I also think it&#8217;s plausible that human-level online/continual learning is only a few years away, and likewise for data efficiency. I just don&#8217;t know. (One data point: <a href="https://x.com/deredleritt3r/status/2002442736431980857">a claim</a> from an Anthropic researcher.)</p></li><li><p>Meanwhile, I&#8217;m not sure either of those things is <em>necessary </em>for AI R&amp;D to accelerate dramatically due to automation. People at Anthropic and OpenAI already report that things are starting to speed up due to AI labor, and I think it&#8217;s quite plausible that massively scaled-up versions of current AI systems (trained on OOMs more diverse RL environments, including many with OOMs longer horizon lengths) could automate all or almost all of the AI R&amp;D process. The ability to learn from the whole fleet of deployed agents might compensate for the data inefficiency, and the ability to manage huge context-window file systems, update model weights regularly, and quickly build and train on new RL environments might compensate for the lack of continual learning.</p></li><li><p>And once AI accelerates dramatically due to automation, paradigm shifts of the sort mentioned above will start to happen soon after.</p></li><li><p>Summing up: qualitatively, my intuitive sense of what&#8217;s going to happen in the next few years is, well, basically the same sequence of events described in AI 2027, just maybe taking a year or two longer to play out, and with various other minor differences (e.g. I don&#8217;t expect any one company to have as much of a lead as OpenBrain does in the scenario).</p></li></ul></li><li><p>I&#8217;m also quite nervous about relying so much on the METR horizon trend. I think it&#8217;s the best <em>single</em> source of evidence we have, but unfortunately it&#8217;s still pretty limited as a source of evidence.</p><ul><li><p>It is uncertain how it&#8217;ll extrapolate into the future (exponential or superexponential? If superexponential, <em>how </em>superexponential? Or should we model new paradigms as a % chance per year of changing the slope?
And what even is the slope right now? It seems to maybe be accelerating recently.)</p></li><li><p>&#8230;and also uncertain how to interpret the results (is a 1 month 80% horizon enough? Or do we need 100 years?).</p></li><li><p>There are also some imperfections in the methodology which complicate things. E.g. if I understand correctly, the human baseliners for the various tasks were not of the same average skill level; instead, the longer-horizon tasks tended to have higher-skill human baseliners. Also, the sigmoid fit process is awkwardly non-monotonic, meaning there are some cases in which a model getting strictly better (/worse) at some bucket of tasks can decrease (/increase) its METR-reported horizon length! My guess is that these issues don&#8217;t make a huge difference in practice, but still. I hope that a year from now, it becomes standard practice for many benchmark providers to report how long it took human baseliners to complete the tasks, and the &#8216;skill level&#8217; of the baseliners. Then we&#8217;d have a lot more data to work with.</p></li><li><p>Also, unfortunately, METR won&#8217;t be able to keep measuring their trend forever. It gets exponentially more expensive for them to build tasks and collect human baselines as the tasks get exponentially longer. I&#8217;m worried that by 2027, METR will have basically given up on measuring horizon lengths, which is scary because then we might not be able to tell whether horizon lengths are shooting up towards infinity or continuing to grow at a steady exponential pace.</p></li><li><p>I think a much better trend to extrapolate, if only we had the data, would be coding uplift. If we had, e.g., a high-quality coding uplift study every 6 months for the past few years, we could extrapolate that trend into the future to predict when e.g. every engineer would be a 10x engineer due to AI assistance. (Then we&#8217;d still need to predict when research taste would start to be noticeably uplifted by AI / when AIs would surpass humans in research taste; however, I think it&#8217;s a reasonable guess right now that when coding is being sped up 10x, 100x, etc. due to highly autonomous AI coding agents, research taste should be starting to improve significantly as well.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a> At least I feel somewhat better about this guess than I do about picking any particular threshold of METR horizon length and guessing that it corresponds to a particular level of experiment selection skill, which is what we currently do.)</p></li></ul></li><li><p>Relatedly, I&#8217;m also interested in the simple method of extrapolating AI revenue growth trends until AI revenue is most of the world economy. That seems like a decent proxy for when AGI will be achieved. I trust this method less than our model for obvious reasons, but I still put some weight on it. What does it say?
Well, <a href="https://blog.ai-futures.org/i/182911449/revenue-extrapolation">it says &#8220;Early 2030s.&#8221;</a> OK.</p></li><li><p>I&#8217;m also interested in what our model says with a pure exponential trend extrapolation for METR instead of the superexponential (I prefer the superexponential on <a href="https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?view=postCommentsNew&amp;commentId=P8qGMRnbEexaFB4s9">theoretical grounds</a>, though note also that there seems to be a recent speeding up of the METR trend and <a href="https://x.com/YafahEdelman/status/2002871018193670556">a corresponding speedup in the trend on other benchmarks</a>). A pure exponential trend, keeping my other parameters fixed, gets to AC 5 years later, in 2034. That said, if we use the more recent ~4 month doubling time that seems to characterize the RL era, even an exponential trend gets to AC in 2030, keeping other parameters fixed. (The simple arithmetic behind these extrapolations is sketched in code below.) I&#8217;m not sure I should keep my other parameters fixed, though; in particular, the AC coding time horizon requirement seems kinda up in the air, since the change to an exponential slope corresponds to a change in how I interpret horizon lengths in general.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a></p><ul><li><p>One factor weighing on my mind is the apparent recent speedup in AI capabilities progress&#8211;e.g. the slope of the METR trend seems notably higher since 2024 than it was before. This could be taken as evidence in favor of a (more) superexponential trend overall&#8230;</p></li><li><p>However, I&#8217;m currently leaning against that interpretation, for two reasons. First, the speedup isn&#8217;t just in the METR trend; it also shows up in <a href="https://x.com/YafahEdelman/status/2002871018193670556">other benchmarks</a>, which are not supposed to be superexponential. Second, there&#8217;s another very plausible explanation for what&#8217;s going on, which is that starting in 2024 the companies started scaling up RL a lot. But they won&#8217;t be able to keep scaling it at the same pace, because they&#8217;ll run into headwinds as RL becomes the majority of training compute. So on this view we should expect the rate of growth to revert towards the long-run average starting about now (or in however long it takes for RL compute to become the majority of total training compute).</p></li><li><p>That said, I still think it&#8217;s plausible (though not likely) that what we are actually seeing is the ominous uptick in the rate of horizon length growth that is predicted by theory to happen a year or two before horizon lengths shoot to infinity.</p></li></ul></li><li><p>Also, like Eli said above, I feel that I should err on the side of caution, and for me that means pushing towards somewhat longer timelines.</p></li><li><p>Finally, I have some private info which pushes me towards somewhat shorter timelines in expectation.
My plan is to circle back in a month or three when more info is available and update my views then; I currently expect this update to be towards somewhat shorter timelines, though it&#8217;s unclear how much.</p></li></ul><p>Weighing all these considerations, I think that my all-things-considered view on timelines will be to keep the median the same, but increase the uncertainty in both directions, so that there&#8217;s a somewhat greater chance of things going crazy in the next year (say, 9% by EOY 2026) and also a somewhat greater chance of things taking decades longer (say, still 6% that there&#8217;s no AGI even in 2050).</p><p>So, here&#8217;s my all-things-considered distribution as of today, Dec 30 2025:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/36ea5f1d-26e7-43d4-8890-9b0777abba0b_1099x502.png" alt="Daniel&#8217;s all-things-considered AGI timelines distribution"></figure>
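<p>As promised above, here is a minimal sketch of the two simple trend extrapolations discussed in this section: extending a METR-style 80% time horizon at a fixed doubling time, and extending AI revenue at a fixed yearly growth factor. All the specific numbers (starting horizon, AC threshold, revenue level) are illustrative assumptions, not outputs of our model:</p><pre><code class="language-python">import math

# Years for an exponentially growing quantity to reach a target level.
def years_to_reach(current, target, growth_per_year):
    return math.log(target / current) / math.log(growth_per_year)

# Time horizon: assume a ~6-hour 80% horizon today and an AC requirement of
# ~1 work-year (~2,000 hours). A 4-month doubling time means 8x per year.
print(years_to_reach(6, 2000, 2 ** (12 / 4)))   # ~2.8 years

# Revenue: assume ~$50B/yr of AI revenue today growing ~3x per year, against
# ~$100T of world GDP. This is the "early 2030s" style of calculation.
print(years_to_reach(50e9, 100e12, 3))          # ~6.9 years
</code></pre>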
<p>On takeoff speeds:</p><p>I think my thoughts on this are pretty similar to Eli&#8217;s, modulo differences implied by our different parameter settings. Basically, take what the model (with my parameters) says, and then shift some probability mass away from the slower end and put it on the faster end of the range.</p><p>Also, whereas our model says that takeoff speeds are correlated with timelines, such that shorter timelines also tend to mean faster takeoff, I&#8217;m not sure that&#8217;s correct and want to think about it more.
There&#8217;s a part of me that thinks that on longer timelines, takeoff should be extremely fast, due to the vast amounts of compute that will have piled up by then and the compute-inefficiency of whatever methods first cross the relevant thresholds.</p><p>So here&#8217;s a quick distribution I just eyeballed:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/265563bd-bbe9-4323-b52b-bce300f55dd7_1111x588.png" alt="Daniel&#8217;s eyeballed takeoff speeds distribution"></figure>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What info I&#8217;ll be looking for in the future &amp; how I&#8217;ll probably update:</p><ul><li><p>Obviously, if benchmark trends (especially horizon length) keep going at the current pace or accelerate, that&#8217;ll be an update towards shorter timelines. Right now I still think it&#8217;s more likely than not that there&#8217;ll be a slowdown in the next year or two.</p></li><li><p>I&#8217;m eager to get more information about coding uplift. When we have a reliable trend of coding uplift to extrapolate, I&#8217;ll at the very least want to redo my estimates of the model parameters to fit that coding uplift trend, and possibly I&#8217;d want to rethink the model more generally to center on coding uplift instead of on horizon length.</p></li><li><p>If AI revenue growth stays strong (e.g. 4xing or more in 2026) that&#8217;s evidence for shorter timelines vs. if it only grows 2x or less that&#8217;s evidence for longer timelines.</p></li><li><p>I&#8217;m eager to get more information about the &#8216;slope&#8217; of the performance-as-a-function-of-time graph for various AI models, to see if it&#8217;s been improving over time and how far away it is from human performance. (See <a href="https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?view=postCommentsNew&amp;commentId=P8qGMRnbEexaFB4s9">this discussion</a>) This could potentially be a big update for me in either direction.</p></li><li><p>As for takeoff speeds, I&#8217;m mostly interested in thinking more carefully about that part of our model and seeing what improvements can be made.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a> I don&#8217;t think there&#8217;ll be much empirical evidence one way or another in the next year. Or rather, I think that disputes about the proper way to model takeoff matter more than evidence about the value of various parameters, at this stage. That said, I&#8217;ll be keen to get better estimates of some of the key parameters too.</p></li><li><p>Of course I&#8217;m also interested to hear the feedback/criticism/etc. 
from others about the model, the parameters, and the overall all-things-considered view. I wouldn&#8217;t be surprised if I end up changing my mind significantly on the basis of arguments I haven&#8217;t thought of yet.</p></li><li><p>&#8230;this list is nowhere near exhaustive, but that&#8217;s enough for now I guess.</p></li></ul><h1>Comparison to our previous (<em>AI 2027</em>) timelines and takeoff models</h1><p>These sections focus specifically on the model results with Eli&#8217;s parameter estimates (for both the AI Futures Model and the AI 2027 model).</p><p><em>Added Jan 2026: see <a href="https://blog.ai-futures.org/p/clarifying-how-our-ai-timelines-forecasts">here</a> for clarifications regarding how our forecasts have changed since AI 2027.</em></p><h2>Timelines to Superhuman Coder (SC)</h2><p>This section focuses on timelines to <em><a href="https://ai-2027.com/research/timelines-forecast#defining-a-superhuman-coder-sc">superhuman coder (SC)</a></em>, which was our headline milestone in our AI 2027 <a href="https://ai-2027.com/research/timelines-forecast">timelines model</a>: an SC is an AI that can autonomously match the productivity of an AGI project modified so that all of its coders are as competent as its best coder, each sped up 30x, with 30 copies of each.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a></p><p>We&#8217;ll discuss only the <a href="https://ai-2027.com/research/timelines-forecast">AI 2027 time horizon extension model</a> in this section, because it is simpler than the <a href="https://ai-2027.com/research/timelines-forecast#method-2-benchmarks-and-gaps">benchmarks and gaps version</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a> Below we compare the forecasted distribution of the AI 2027 model against that of the AI Futures Model.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png" alt="SC timelines: AI 2027 model vs. AI Futures Model"></figure>
src="https://substackcdn.com/image/fetch/$s_!RtPE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/182911449?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RtPE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png 424w, https://substackcdn.com/image/fetch/$s_!RtPE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png 848w, https://substackcdn.com/image/fetch/$s_!RtPE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!RtPE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30446ff8-9199-4907-9e64-bff89f1f68e4_2100x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Edited Jan 8: updated the above figure and below description to fix an issue, moving the new model&#8217;s SC timelines back 
<p>We see that the AI Futures Model median is 5 years later than the AI 2027 model&#8217;s, and that it assigns a 9% chance that SC happens before the time horizon extension model&#8217;s median. From now onward, we will focus on the trajectory with median parameters rather than distributions of SC dates, for ease of reasoning.</p><p>The AI 2027 time horizon extension model, with parameters set to their median values, predicts SC in Jan 2027 given superexponential-in-effective-compute time horizon growth, and SC in Sep 2028 given exponential time horizon growth. Meanwhile, the new model with median parameters predicts SC in Dec 2031. This is a 3.25&#8211;5 year difference! From now on we&#8217;ll focus on the 5 year difference, i.e. consider superexponential growth in the time horizon extension model. This is the closer comparison because in our new model, our median parameter estimate predicts superexponential-in-effective-compute time horizon growth.</p><p>The biggest reason for this difference is that we model pre-SC AI R&amp;D automation differently, which results in such automation having a much smaller effect in our new model than in the AI 2027 one. The 5 year increase in median comes from:</p><ol><li><p><strong>Various parameter estimate updates: ~1 year slower. </strong>These are mostly changes to our estimates of parameters governing the time horizon progression. Note that 0.6 years of this is from the 80% time horizon progression being slower than our previous median parameters predicted; but since we are only looking at 80% time horizons, we aren&#8217;t taking into account the evidence that Opus 4.5 did well on the 50% time horizon.</p></li><li><p><strong>Less effect from AI R&amp;D automation pre-SC: ~2 years slower. </strong>This is due to:</p><ol><li><p><strong>Taking into account diminishing returns: </strong>The AI 2027 timelines model wasn&#8217;t appropriately taking into account diminishing returns to software research. It implicitly assumes that exponential growth in software efficiency is not getting &#8220;harder&#8221; to achieve, such that if AIs gave a software R&amp;D uplift of 2x in perpetuity, the software efficiency growth rate would speed up by 2x in perpetuity. We hadn&#8217;t noticed this implicit assumption and have now fixed it.</p></li><li><p><strong>Less AI software R&amp;D uplift from pre-SC AIs: </strong>The interpolation method used to get AI software R&amp;D uplift values in the AI 2027 model in between present day and SC gave much higher intermediate values than the uplift we end up with in our new model. We previously modeled 50% of the way to SC in effective compute OOMs as resulting in 50% of the way to SC in terms of log(uplift) (so e.g. if SC gives a 10x uplift, an AI halfway to SC in effective-compute OOMs was assumed to give 10^0.5 &#8776; 3.2x uplift), but our new model is more pessimistic. Partially, this is because the AI 2027 model had a bug in how AI software R&amp;D was interpolated between present AIs and SC. But that only accounts for half of the difference; the other half comes from the AI 2027 model&#8217;s interpolation method being more optimistic about pre-SC speedups than the AI Futures Model&#8217;s.</p></li></ol></li><li><p><strong>Compute and labor input time series adjustments: ~1 year slower. </strong>That is, we now project slower growth in the leading AI project&#8217;s compute and human labor force.
Read about the AI Futures Model&#8217;s input time series <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.6ml0ol8kxh0w">here</a>.</p></li><li><p><strong>Modeling experiment compute: ~1 year slower. </strong>Previously we were only modeling labor as an input to software progress, not experiment compute.</p></li></ol><p>You can read more about these changes and their effects in our <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.fzjrie7c4m4p">supplementary materials</a>.</p><h2>Takeoff from Superhuman Coder onward</h2><p>The AI Futures Model predicts a slower median takeoff than our <a href="https://ai-2027.com/research/takeoff-forecast">AI 2027 takeoff model</a>. Below we graph each model&#8217;s forecasted distribution for how long it will take to go from SC to ASI.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/ccf82019-af9c-4b00-ba8f-1aa76c5311f6_2100x1200.png" alt="Time from SC to ASI: AI 2027 model vs. AI Futures Model"></figure>
srcset="https://substackcdn.com/image/fetch/$s_!7ur7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccf82019-af9c-4b00-ba8f-1aa76c5311f6_2100x1200.png 424w, https://substackcdn.com/image/fetch/$s_!7ur7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccf82019-af9c-4b00-ba8f-1aa76c5311f6_2100x1200.png 848w, https://substackcdn.com/image/fetch/$s_!7ur7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccf82019-af9c-4b00-ba8f-1aa76c5311f6_2100x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!7ur7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccf82019-af9c-4b00-ba8f-1aa76c5311f6_2100x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Edited Jan 8: updated the above figure and below description to fix an issue, moving the new model&#8217;s takeoff to be a bit slower.</em></p><p>We see that while the AI Futures Model&#8217;s median is longer than the AI 2027 one, it still puts 38% probability of takeoff as fast as AI 2027&#8217;s median. On the other hand, the AI Futures Model&#8217;s cumulative probability gets closer to the AI 2027 model as the AC to ASI year amount increases. The new model is less &#8220;binary&#8221; in the sense that it gives lower probability to very fast or very slow takeoffs. This is because the AI Futures Model models increases in the compute supply.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a></p><p>The reason the AI Futures Model gives a lower chance of fast takeoffs is primarily that we rely on a new framework for estimating whether there&#8217;s an SIE and how aggressive it is.</p><p>Our <a href="https://ai-2027.com/research/takeoff-forecast">AI 2027 takeoff model</a> predicted the progression of capabilities post-SC. Its methodology was also fairly simple. 
First, we enumerated a progression of AI capability milestones, with a focus on AI R&amp;D capabilities, though we think general capabilities will also be improving. Then, for each gap between milestones A and B, we:</p><ol><li><p><strong>Human-only time: </strong>Estimated the time required to go from milestone A to B if only the current human labor pool were doing software research.</p></li><li><p><strong>AI R&amp;D progress multiplier (what we now call AI software R&amp;D uplift, or just AI R&amp;D uplift):</strong> Forecasted how much AI R&amp;D automation at each of milestones A and B would speed up progress, then ran a simulation in which the speedup is interpolated between these two values over time, yielding a forecasted distribution for the calendar time between A and B. (A toy version of this calculation is sketched in the code below.)</p></li></ol><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/b5eaff8c-9106-4af1-b74b-fcf1211a18a3_1510x1118.png" alt=""></figure>
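<p>Here is that toy version of the gap calculation (a sketch under assumptions, not the released AI 2027 code; in particular, it assumes the uplift is interpolated log-linearly in human-only progress, and the numbers are made up):</p><pre><code class="language-python"># Toy gap simulation: AI R&amp;D uplift is interpolated log-linearly from
# uplift_a to uplift_b as human-only progress on the gap goes from 0 to 1,
# and the uplift divides the wall-clock time each slice of progress takes.
def calendar_years(human_only_years, uplift_a, uplift_b, steps=10_000):
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) / steps                           # fraction of gap crossed
        uplift = uplift_a * (uplift_b / uplift_a) ** f  # log-linear interpolation
        total += (human_only_years / steps) / uplift
    return total

# E.g. a gap humans alone would cross in 10 years, with uplift rising from
# 5x at milestone A to 25x at milestone B:
print(round(calendar_years(10.0, 5.0, 25.0), 2))        # ~0.99 calendar years
</code></pre>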
<p>In order to estimate some of the human-only time parameters, the AI 2027 takeoff forecast relied on a parameter it called <em>r</em>, which controlled the diminishing returns to AI R&amp;D. It was crudely estimated by backing out the implied <em>r</em> from the first human-only time requirement, which was to get from SC to SAR (superhuman AI researcher).</p><p>The AI 2027 model assumed that there were no compute increases; under this assumption, if <em>r</em>&gt;1 then successive doublings of AI R&amp;D uplift (what we previously called the progress multiplier) get faster over time after full AI R&amp;D automation. Others have referred to this possibility as a <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">software intelligence explosion</a> (SIE). In the model, each doubling took about 0.7x as long as the previous one; we&#8217;ll call the ratio of successive uplift doubling times <em>b</em> from here onward, i.e. <em>b</em>&lt;1 means successive doublings are faster and we get an SIE.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a></p><p>In the AI Futures Model, the condition for an SIE is more complicated because we model multiple types of AI R&amp;D; we also include compute increases, which departs significantly from the pure-SIE setup. That said, there is a similarly understandable concept in our model: a taste-only singularity (TOS). This is the situation in which, after full AI R&amp;D automation and with only research taste improvements (no extra coding or compute), successive doublings of AI R&amp;D uplift get faster over time.
To make the analysis much simpler, we also ignore the limits of intelligence; these usually don&#8217;t greatly affect the takeoff to ASI, but they do slow progress down somewhat.</p><p>Under these assumptions, we can define a <em>b</em> similar to the one analyzed for an SIE.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7ba6928f-f18e-4c68-aeae-b9714ee07156_2224x1184.heic" alt=""></figure>
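<p>The arithmetic behind <em>b</em> is just a geometric series; here is a minimal sketch (the 1-year first doubling is an arbitrary illustrative choice):</p><pre><code class="language-python"># If each successive doubling of AI R&amp;D uplift takes b times as long as the
# previous one, the time for n doublings is a geometric series, which stays
# bounded (a finite-time singularity) exactly when b is below 1.
def years_for_doublings(first_doubling_years, b, n):
    return sum(first_doubling_years * b**k for k in range(n))

for b in (0.7, 1.2):  # roughly AI 2027's median vs. the new model's median case
    print(b, [round(years_for_doublings(1.0, b, n), 1) for n in (10, 50, 200)])
# b=0.7: approaches 1 / (1 - 0.7), i.e. about 3.3 years, no matter how many
# doublings remain; b=1.2: grows without bound, so no singularity.
</code></pre>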
<p>We estimate <em>b</em> by combining the following parameters:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-27" href="#footnote-27" target="_self">27</a></p><p>(a) the ratio of top to median researchers&#8217; value per selected experiment;</p><p>(b) how quickly AIs improve at research taste as effective compute increases;</p><p>(c) the rate at which software R&amp;D translates into improved software efficiency (intuitively, the rate at which <a href="https://web.stanford.edu/~chadj/IdeaPF.pdf">ideas are getting harder to find</a>).</p><p>When using this framework, we get a less aggressive result (with our median parameters). Given that (a) was explicitly estimated in the AI 2027 model, and that we have a fairly aggressive estimate of (c) in the new model, most of the difference in results is implicitly coming from (b), how quickly AIs improve at research taste. We estimated this in our new model by looking at historical data on how quickly AIs have moved through the human range on a variety of metrics (more on that <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.y0yy6iou4a4q">here</a>).</p><p>With the AI 2027 model&#8217;s median parameters, each successive doubling of uplift took roughly 66% of the length of the previous one (i.e.
<em>b</em>&#8776;0.7).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-28" href="#footnote-28" target="_self">28</a> The AI Futures Model&#8217;s distribution of <em>b</em> is below.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/5676503a-78f3-4126-9e09-14712edaab9b_1500x900.png" alt="The AI Futures Model&#8217;s distribution of b"></figure>
<p>In the AI Futures Model&#8217;s median case, there isn&#8217;t a TOS: each doubling would take 20% longer than the previous if taste were the only factor.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-29" href="#footnote-29" target="_self">29</a> But we have high uncertainty: 38% of our simulations say that successive doublings get faster, and 17% are at least as aggressive as the AI 2027 model (i.e. <em>b</em>&lt;0.7).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-30" href="#footnote-30" target="_self">30</a></p><p>Remember that, unlike the AI 2027 model, the AI Futures Model models compute increases; in practice, coding automation also contributes some to takeoffs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-31" href="#footnote-31" target="_self">31</a> Therefore, at similar levels of the <em>b</em>s we&#8217;ve defined here, takeoff in the AI Futures Model is faster.</p><p>Faster takeoffs are also correlated in our model with shorter timelines: when we filter for simulations that achieve SC in 2027, 35% of them have a <em>b</em> lower than that implied by the AI 2027 model&#8217;s median parameters. This is because some parameters lead to larger effects from automation both before and after SC, and furthermore we specified correlations between the parameters that govern how quickly coding abilities improve and how quickly research taste abilities improve. We discuss this correlation in our <a href="https://www.aifuturesmodel.com/results">results analysis</a>.</p>
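<p>To make the dynamics governed by <em>b</em> concrete, here is a minimal Python sketch (our own illustrative code, not the model&#8217;s implementation; the function names are ours) that converts the AI 2027 model&#8217;s <em>r</em> into <em>b</em> and computes how long it takes to hit a finite-time singularity when <em>b</em> stays constant, reproducing the numbers in footnotes 26, 29, and 30:</p><pre><code># Hypothetical sketch: b is the ratio of each uplift-doubling time to the
# previous one, so doubling k takes t0 * b**k. For b below 1 the doubling
# times form a convergent geometric series and uplift diverges in finite time.

def b_from_r(r: float) -> float:
    # Conversion used for the AI 2027 model's parameters: b = 2^(1/r - 1).
    return 2 ** (1 / r - 1)

print(b_from_r(2.77), b_from_r(1.56))  # ~0.64 and ~0.78, as in footnote 26

def time_to_singularity(t0: float, b: float) -> float:
    """Total time for infinitely many doublings: t0 + t0*b + t0*b**2 + ...
    = t0 / (1 - b) when b is below 1; infinite otherwise (each doubling
    takes at least as long as the last, so there is no singularity)."""
    if b >= 1:
        return float("inf")
    return t0 / (1 - b)

print(time_to_singularity(1.0, 0.5))  # 2.0: twice the initial doubling time
print(time_to_singularity(1.0, 2 ** (0.315 / 0.248 - 1)))  # inf: b ~= 1.2,
# the new model's taste-only median, where each doubling takes ~20% longer
</code></pre>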
<p>For further analysis of the differences between our AI 2027 and new takeoff models, see our <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.5ecc5oe7phwg">supplementary materials</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>AGI stands for Artificial General Intelligence, which roughly speaking means AI that can do almost everything. Different people give different definitions for it; in our work we basically abandon the term and define more precise concepts instead, such as AC, SIAR, TED-AI, etc. However, we still use the term AGI when we want to vaguely gesture at this whole bundle of concepts rather than pick out one in particular. For example, we&#8217;ve titled this section &#8220;AGI timelines&#8230;&#8221; and the next section &#8220;Post-AGI takeoff&#8230;&#8221; because this section is about estimating how many years there&#8217;ll be until the bundle of milestones starts to be reached, and the next section is about estimating what happens after some of them have already been reached.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>2047 for &#8220;unaided machines outperforming humans in every possible task&#8221;, and 2116 for &#8220;all human occupations becoming fully automatable.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Some have also done extrapolations of Gross World Product, such as David Roodman&#8217;s <a href="https://coefficientgiving.org/research/modeling-the-human-trajectory/">Modeling the Human Trajectory</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>More details:</p><ol><li><p>In 1988, Hans Moravec predicted human-level AI by 2010 on supercomputers and by 2030 on personal computers; in 1999 he revised this to 2040, with machines far surpassing humans by 2050.</p></li><li><p>Ray Kurzweil predicted in <a href="https://longbets.org/1/">1999 that AI would achieve human-level intelligence by 2029</a> and has consistently maintained this specific timeline for over 25 years.</p></li><li><p>Shane Legg has maintained a remarkably <a href="http://www.vetta.org/2009/12/tick-tock-tick-tock-bing/">consistent AGI prediction</a> since 2001, assigning 50% probability to human-level AGI by 2028.
He went on to found DeepMind and <a href="https://www.dwarkesh.com/p/shane-legg">has not changed</a> his timeline.</p></li></ol></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Technically, the report predicted the arrival of Transformative AI, or TAI, which was defined as having at least as big an impact as the Industrial Revolution.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Rule of thumb inspired by <a href="https://en.wikipedia.org/wiki/Lindy_effect">Lindy&#8217;s Law</a>: It&#8217;s reasonable to guess that a trend will continue for about as long as it&#8217;s been going so far. We wouldn&#8217;t dream of confidently extrapolating this trend for thirty years, for example. (We do in fact run the model into the 2050s and onward in our Monte Carlos, but we acknowledge that the probability of reality diverging dramatically from the model increases with the duration of the extrapolation.)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Peter Wildeford has a <a href="https://peterwildeford.substack.com/p/forecaster-reacts-metrs-bombshell">model</a> which allows for doublings getting easier or harder, but does not model AI R&amp;D automation or changes to labor or compute growth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>See also: <a href="https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d">Most AI value will come from broad automation, not from R&amp;D | Epoch AI</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p><a href="https://epoch.ai/blog/announcing-gate">GATE</a> and the <a href="https://takeoffspeeds.com/description.html">Full Takeoff Model</a> also model the progression after full AI R&amp;D automation, but neither of their authors claims that their model is intended to do it well.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>These estimates are then shaded up to account for capability improvements at the same compute level in addition to efficiency improvements at the same performance level. This adjustment brings the methodology closer to ours, but we still think it&#8217;s helpful to focus specifically on research taste skills.
And finally, in <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">Davidson and Houlden</a>, everything is converted into units of gains in the number of parallel workers, which we view as a much less natural unit than research taste quality.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Among other advantages of having an integrated model: our model itself already bakes in most of the <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be#returns-to-software-rd-:~:text=We%20can%20use%20this%20as%20a%20starting%20point%2C%20and%20then%20make%20various%20adjustments%3A">various adjustments that Davidson and Houlden made ad hoc to their estimate of r</a>, and we can generally ensure reasonable starting conditions (as opposed to Davidson and Houlden&#8217;s <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be#gradual-boost-from-pre-asara-systems">gradual boost</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><em>Our model operationalizes AC as follows:</em> An AC, if dropped into the present day, would on its own be as productive as the project&#8217;s human coders working without AIs. That is, you could remove all human coders from the AGI project and it would go as fast as if there were only human coders. The project can use 5% of its compute supply to run ACs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>See especially <a href="https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf">this Anthropic survey</a> of researchers claiming &gt;100% productivity improvements, but also this <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR uplift study</a>, which found that people systematically overestimate the amount of uplift they get from AI assistance.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>That is, if we think that eventually there will be an AI system which outperforms humans at all horizon lengths, then the trend must shoot to infinity in finite time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>That is, the part of our model that deals with AI timelines, i.e. the length of the period leading up to the &#8220;automated coder&#8221; milestone, centrally involves the METR trend. After that milestone is reached, horizon length continues to increase but isn&#8217;t directly relevant to the results.
The results are instead driven by increases in automated research taste and coding automation efficiency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p><em>Our model operationalizes SAR as follows</em>: if dropped into an AGI project in the present day, a SAR would be as good at research taste as a project staffed only by human researchers, each made as skilled as the top researcher.</p><ol><li><p>In our model, the SAR&#8217;s compute budget is 1% of the company&#8217;s compute. The SAR must also qualify as an automated coder (AC), i.e. be able to fully automate coding at the AGI project with 5% of its compute budget (an AC has a higher compute budget because more human labor is currently spent on coding than experiment ideation/selection).</p></li><li><p>Our operationalization is the same as what we defined as a superhuman AI researcher (SAR) in our <a href="https://ai-2027.com/research/takeoff-forecast">previous work</a>, except that it only needs to match the productivity of the project&#8217;s best researchers, rather than the project&#8217;s best researchers sped up by 30x and 30x more numerous. We also separate out research taste from coding.</p></li></ol></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p><em>What do we mean when we say that the gap between a top human researcher and SIAR is 2x greater than that between the median and top human researcher? </em>We mean the following. First, let&#8217;s define a transformation between an AI&#8217;s capability level b and a number of SDs relative to the median as:</p><ol><li><p>Assume that we can infinitely sample from the process that produced the relevant human population, in this case the human researchers at the AGI project. For example, we could imagine sampling from parallel Earths.</p></li><li><p>Look up what percentile p the AI has within this population, based on its capability level b.</p></li><li><p>The AI&#8217;s SDs relative to the median, s, is equal to the inverse normal CDF of p.</p></li></ol><p>Now, using s as the result of applying the process above, define a SIAR as reached when (s_siar - s_top_human) = 2 * (s_top_human - s_median_human). A SIAR has the same budget constraints as a SAR, and must also meet the requirements for an AC.</p>
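<p>As a minimal illustration of this transformation (our own sketch; the function and variable names are ours, not the model&#8217;s):</p><pre><code># Hypothetical sketch of the percentile-to-SDs transformation above.
from statistics import NormalDist

def sds_above_median(percentile: float) -> float:
    # Step 3: s is the inverse normal CDF of the AI's percentile p.
    return NormalDist().inv_cdf(percentile)

def is_siar(p_ai: float, p_top_human: float) -> bool:
    """SIAR is reached when the AI's gap above the top human (in SDs)
    reaches 2x the top human's gap above the median human."""
    s_ai = sds_above_median(p_ai)
    s_top = sds_above_median(p_top_human)
    s_median = 0.0  # the median human is at the 50th percentile by definition
    return s_ai - s_top >= 2 * (s_top - s_median)

# E.g. a top researcher at the 99.9th percentile sits ~3.09 SDs above the
# median, so a SIAR would need to reach ~3 * 3.09 = ~9.3 SDs above the median.
print(NormalDist().inv_cdf(0.999))  # ~3.09
</code></pre></div></div>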
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p><em>Our model operationalizes TED-AI as follows: </em>A TED-AI is an AI system that could, if dropped into the present day &amp; given the resources of a large tech company &amp; three months to prep, fully automate 95% of remote work jobs in the US. It need not be able to do all 95% at the same time (perhaps there isn&#8217;t enough compute to run enough copies of the TED-AI for that), but it needs to be able to do any 10% of them using only 50% of the US&#8217;s AI-relevant compute.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p><em>Our model operationalizes ASI as follows: </em>An ASI would, if dropped into the present day &amp; given the resources of a large tech company &amp; three months to prep, be able to fully automate 95% of remote work jobs in the US to the level where it is qualitatively 2x as much above the best human as the best human is above the median professional. Also, here we define &#8220;the median professional&#8221; not as the actual median professional but rather as what the median professional would be if everyone who took the SATs were professionally trained to do the task. (We standardize the population that is trained to do the task because otherwise the ASI requirement might be quite different depending on the population size and competence levels of the profession. See <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.k257jhacudua">above</a> regarding how we define the 2x gap.)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Spot-checking in our model: the serial coding labor multiplier is basically the square root of the parallel coding labor multiplier, so when I look at my default parameter settings at the point where the serial coding labor multiplier is ~10x (May 2030), the AIs have research taste equivalent to the median AI company researcher. Sounds about right to me.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>I&#8217;ve talked about this elsewhere, but I generally think that if you don&#8217;t like using a superexponential and insist on an exponential, you need to come up with a different interpretation of what it means for a model to have horizon length X, other than the natural one (&#8220;A model has horizon length X iff you are better off hiring a human for coding tasks that take humans much longer than X, but better off using the model for coding tasks that take humans much less than X.&#8221;) Because on that interpretation, an exponential trend would <em>never</em> get to a model which outperforms humans at coding tasks of any length. But we do think that eventually there will be a model which outperforms humans at tasks of any length. In other words, on the natural interpretation the trend seems likely to go to infinity in finite time eventually.
You can try to model that either as a smooth superexponential or as a discontinuous phase shift&#8230; even in the latter case, though, you should probably have uncertainty over when the discontinuity happens, such that the probability of it happening by time t increases fairly smoothly with t.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>For example, I want to think more about serial speed bottlenecks. The model currently assumes experiment compute will be the bottleneck. I also want to think more about the software-only-singularity conditions and whether we are missing something there, and square this with soft upper bounds such as &#8220;just do human uploads.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>Note that with the new model, we&#8217;ve moved toward using <em>Automated Coder (AC)</em> as the headline coding automation milestone, which has a weaker efficiency requirement.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>That said, we note that the benchmarks-and-gaps version had longer median SC timelines (Dec 2028). And Eli&#8217;s all-things-considered SC median was later still, in 2030, though Daniel&#8217;s was 2028.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>That said, we still think that the AI Futures Model gives too low a probability of &lt;10 year takeoffs, because we are not modeling growth in compute due to hardware R&amp;D automation, hardware production automation, or broad economic automation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>As discussed <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.gq99bof1lx2p">here</a>, the AI 2027 model set <em>r</em>=2.77 and 1.56 at different points. <em>b</em>=2^(1/r-1), so <em>b</em>=0.64 to 0.78.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-27" href="#footnote-anchor-27" class="footnote-number" contenteditable="false" target="_self">27</a><div class="footnote-content"><p>See <a href="https://www.aifuturesmodel.com/#section-howdoesourmodelbehaveafterfullairdautomation">here</a> for a more thorough explanation of how <em>b</em> is calculated from our new model&#8217;s parameters.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-28" href="#footnote-anchor-28" class="footnote-number" contenteditable="false" target="_self">28</a><div class="footnote-content"><p>2^((1/2)-1) gives roughly 0.7.
See how we got these numbers <a href="https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.gq99bof1lx2p">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-29" href="#footnote-anchor-29" class="footnote-number" contenteditable="false" target="_self">29</a><div class="footnote-content"><p>2^((0.315/0.248)-1). See the justification for this formula on <a href="https://www.aifuturesmodel.com/#section-howdoesourmodelbehaveafterfullairdautomation">our website</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-30" href="#footnote-anchor-30" class="footnote-number" contenteditable="false" target="_self">30</a><div class="footnote-content"><p>Note that the minimum b in our model is 0.5. This is a limitation, but in practice we can still get very fast takeoffs. For example, if b were 0.5 and didn&#8217;t change over time, this would lead to a finite-time singularity after twice the initial uplift doubling time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-31" href="#footnote-anchor-31" class="footnote-number" contenteditable="false" target="_self">31</a><div class="footnote-content"><p>This could also be influenced by the uplifts being different for different milestones, or other factors. Unfortunately we haven&#8217;t had a chance to do a deep investigation, but a shallow investigation pointed toward compute increases being the primary factor.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Early US policy priorities for AGI]]></title><description><![CDATA[Near-term AI policy is confusing, except for these two recommendations]]></description><link>https://blog.aifutures.org/p/early-us-policy-priorities-for-agi</link><guid isPermaLink="false">https://blog.aifutures.org/p/early-us-policy-priorities-for-agi</guid><dc:creator><![CDATA[Nick Marsh]]></dc:creator><pubDate>Tue, 09 Dec 2025 01:11:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kyVK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19288615-5e21-4586-83fb-4ed8c4a3ac4d_2565x1115.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is a guest post from <a href="https://marsh.blue/">Nick Marsh</a>, a visiting fellow at Constellation who&#8217;s worked closely with AI Futures over the past few months. The views within are not necessarily the views of AI Futures as an organization (although most AIFP employees tentatively agree with them). To any adversarial readers who want to dunk on our organizational policy recs, just wait a couple of months. We intend to publish a much more comprehensive &#8220;positive vision for AGI&#8221;, which will have much juicier targets to criticize.</em></p><p>Trying to figure out which policies might help us prepare for AGI is <em>hard</em>.
Some proposals look great at first glance, but do not stand up to <a href="https://blog.ai-futures.org/p/scenario-scrutiny-for-ai-policy">scenario scrutiny</a> &#8211; when you trace through their effects, they look messy and confusing, and it&#8217;s hard to tell whether they&#8217;d help or hinder us going forward.</p><p>This post is nevertheless an attempt to outline some early policy priorities for the US government that we&#8217;re more confident in. (&#8216;Early&#8217; meaning within the next two or three years.) We&#8217;ll first discuss a frame &#8211; Plans A, B, C, and D &#8211; that we&#8217;ve been using internally to reason about AGI strategy. Then we&#8217;ll outline what we think the two core policy priorities for the US are: building <strong>situational awareness</strong> and <strong>preparing to coordinate</strong> with adversaries. These policies would help us get into Plan A and B worlds, which are much better.</p><p>There are only a few specific policies that we&#8217;re entirely comfortable with. This post sketches two &#8211; creating <strong>a select committee on AGI</strong> and implementing <strong>sensible chip policy</strong> (building a chip registry and an inference-only retrofitting package for data centers) &#8211; and concludes with some we tentatively like but are less certain about.</p><h1>Some background: Plans A, B, C and D</h1><p>AGI strategy lacks a shared vocabulary. In particular, people discuss fairly lossy abstractions, jumping between potential institution designs (e.g.
a <a href="https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/">Manhattan Project</a>, an <a href="https://www.rand.org/pubs/commentary/2025/04/beyond-a-manhattan-project-for-artificial-general-intelligence.html">Apollo Project</a>, a <a href="https://milesbrundage.substack.com/p/my-recent-lecture-at-berkeley-and">CERN for AI</a>, <a href="https://www.forethought.org/research/intelsat-as-a-model-for-international-agi-governance">Intelsat</a>) without much discussion of how we get to those institutions, what the broader strategic picture would look like, or &#8211; maybe most importantly &#8211; how we get <em>out</em> of those institutions and into a post-AGI world.</p><p>Relatedly, people sometimes talk past each other regarding plans for approaching transformative AI &#8211; there&#8217;s some kernel of truth within many of the plans proposed: a Manhattan project to win a race with China; international coordination or a global megaproject; implementing export controls to reduce China&#8217;s access to compute; removing export controls to keep China dependent on US chips; MAIM; shutting the whole thing down.</p><p>Internally, we&#8217;ve been using the terms <a href="https://blog.redwoodresearch.org/p/plans-a-b-c-and-d-for-misalignment">Plan A, B, C and D</a> to describe different plans that could be pursued to develop AGI (and to categorise scenarios that result from following them). We&#8217;ve gotten a lot of mileage out of them in reasoning about AGI strategy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SxwN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b43c63-2dcb-41ca-a55b-b467a0883707_2565x1615.png"><img src="https://substackcdn.com/image/fetch/$s_!SxwN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b43c63-2dcb-41ca-a55b-b467a0883707_2565x1615.png" width="1456" height="917" alt=""></a></figure></div>
<p>The summaries below are quite minimal; AI Futures will publish substantially more work in the coming months that uses and expands on this framing.</p><ul><li><p><strong>Plan A</strong> <em>(other names: ten year takeoff, managed takeoff, international coordination)</em><br>Plan A revolves around an international agreement to slow down takeoff.
The parties to this agreement (chiefly the US and China) need to be sure that the other isn&#8217;t training models beyond the agreement &#8211; so successful execution of Plan A would require verification mechanisms (e.g. chip tracking, on-chip mechanisms, an international chip registry).</p></li><li><p><strong>Plan B</strong> (<em>other names: burn the lead</em>)<br>Plan B requires the US government to be bought in, but without a strong international agreement. The US&#8217;s goal here is to build a robust capabilities lead (of &#8805;1 year) with maximally capable controllable AGIs, and then burn that lead on automated alignment research. This would involve coordinating with willing AGI projects and sabotaging those that aren&#8217;t cooperative (e.g. Chinese AGI projects).</p></li><li><p><strong>Plan C</strong> <em>(other names: <a href="https://ai-2027.com/slowdown">slowdown</a>)</em><br>Plan C worlds involve a lab-centric race. There&#8217;s limited US government involvement, but AGI development is led by one or more companies who are somewhat concerned about alignment and aim to consolidate as much compute and power as possible to speed up development, in order to buy time to burn on alignment later.</p></li><li><p><strong>Plan D</strong> <em>(other names: <a href="https://ai-2027.com/race">race</a>)</em><br>Plan D involves private labs racing each other to build ASI first &#8211; the leading company is not concerned about alignment, but there are a <a href="https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside">handful of employees</a> who are working on those risks. The government may have some very limited oversight, but ultimately it&#8217;s left in the companies&#8217; hands.</p></li></ul><h2>Plan A is much better than the other plans</h2><p>We think we&#8217;re far more likely to avoid catastrophic outcomes (AI takeover or AI-enabled dictatorship) and end up with a good future under Plan A than under Plan B, and similarly for Plan B over Plans C and D. To be precise: we think that existential risks from AI or human takeover are at most half as likely in Plan A worlds as in Plan B worlds.</p><p>We think US-China race dynamics in Plan B &#8211; where the US government is highly bought in, and significant national resources are dedicated to racing with China &#8211; are pretty terrible, compared to worlds where a deal is struck. In particular, it&#8217;s unclear how we <em>exit</em> a race:</p><ul><li><p>the US could aim to build and deploy a human-controlled <a href="https://forum.effectivealtruism.org/topics/decisive-strategic-advantage">decisive strategic advantage</a> and unilaterally impose its will on China (and the rest of the world);</p></li><li><p>the US could aim to hand off control to potentially-misaligned superhuman AIs earlier, which might make building a DSA significantly easier;</p></li><li><p>China could realise that the US is aiming for one of these, and go up the <a href="https://www.nationalsecurity.ai/chapter/deterrence-with-mutual-assured-ai-malfunction-maim">escalation ladder</a> in response, potentially risking WWIII; or</p></li><li><p>China and the US could strike a deal later &#8211; which is likely harder to verify (after years of chip production in an adversarial environment) and negotiated under higher-stakes conditions in a degraded information environment.</p></li></ul><p>Plan B also concentrates power significantly compared to Plan A.
It&#8217;s easier for leaders to claim that exclusive access to models is required for strategic reasons and that transparency measures would slow down the US, making it easier to amass power.</p><p>We think Plan C and D worlds are <a href="https://ai-2027.com/#the-slowdown-ending-is-not-a-recommendation">even worse</a>. Without government oversight, it seems extremely likely that at least one lab rushes towards misaligned superintelligence and we end up with AI takeover. Meanwhile, the risk of China freaking out and escalating to war isn&#8217;t massively smaller (and possibly not smaller at all) than it is in Plan B.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kyVK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19288615-5e21-4586-83fb-4ed8c4a3ac4d_2565x1115.png"><img src="https://substackcdn.com/image/fetch/$s_!kyVK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19288615-5e21-4586-83fb-4ed8c4a3ac4d_2565x1115.png" width="1456" height="633" alt=""></a></figure></div>
<p><strong>We want to increase the likelihood we make it into a Plan A world.
</strong>Secondarily, we want to improve the expected execution of all the plans, given that we think (as of now) Plans C or D are the most likely and Plan A the least likely.</p><p>So this post aims to outline some <strong>early steps that help shift us into Plan A/B worlds</strong> and away from Plan C/D worlds, which we think are significantly more likely to end in doom.</p><h1>Two core priorities</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hXgH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea81e512-f44a-4359-be37-82539025ce16_2110x1885.png"><img src="https://substackcdn.com/image/fetch/$s_!hXgH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea81e512-f44a-4359-be37-82539025ce16_2110x1885.png" width="1456" height="1301" alt=""></a></figure></div>
<p>So, what do the overarching priorities for the next couple of years look like against this background?</p><p>First, the <strong>US government needs to build situational awareness.</strong> As of 2025, there are only a handful of individuals across all three branches of government who understand what&#8217;s going on. Almost nobody takes the AI companies seriously when they make clear that their intention is to build superintelligence; even a very rudimentary understanding of how AIs are trained hasn&#8217;t propagated across the government; <a href="https://moran.house.gov/news/documentsingle.aspx?DocumentID=2454">very</a> <a href="https://www.sanders.senate.gov/op-eds/ai-poses-unprecedented-threats-congress-must-act-now/">few</a> have tried to wrap their heads around what a takeoff scenario could look like, and what they might do to positively influence one.</p><p>Unless the government wakes up (and builds the capacity to think through AGI strategy), we remain on course for Plan C or D worlds, where we leave the future in the hands of lab leaders and race dynamics. Good policymaking requires an alert government.</p><p>Second, the <strong>US needs to prepare to coordinate internationally.</strong> It&#8217;s better &#8211; and more tractable &#8211; to coordinate with China than to try to bully them into submission by racing ahead.
But it would be difficult to sign an agreement tomorrow that would give both countries confidence that the other isn&#8217;t training models beyond the agreement&#8217;s scope.</p><p>A small amount of investment into building the institutional, political, and technical capacity to <a href="https://arxiv.org/abs/2507.15916">verify</a> that an international agreement is being respected &#8211; very small, in light of the stakes of such a negotiation &#8211; would widen the bargaining range and move us away from high-risk adversarial dynamics.</p><h2>Scarier: improving the BATNA</h2><p>There is another class of policies that might reduce x-risk: actions that <strong>improve the best alternative to a negotiated agreement (<a href="https://en.wikipedia.org/wiki/Best_alternative_to_a_negotiated_agreement">BATNA</a>)</strong> for the US &#8211; that is, reduce the risk that racing ends in existential catastrophe. Some of these actions look like improving the US&#8217;s ability to race. One example is passing legislation that explicitly permits the federal government to consolidate compute in order to increase the US lead over China, or improving its ability to sabotage adversarial data centers.</p><p>Another reason to improve our BATNA is that it necessarily makes China&#8217;s BATNA worse &#8211; at least insofar as the US-China AI race is a zero-sum game. Hence, improving our BATNA raises China&#8217;s incentive to strike a deal, making us more likely to reach Plan A worlds.</p><p>We&#8217;re generally pretty concerned about these kinds of interventions: they&#8217;re higher-risk and escalatory by nature, and might make it <em>less</em> likely we get a deal, both by making racing more attractive to the US and by worsening the <a href="https://en.wikipedia.org/wiki/Security_dilemma">security dilemma</a> the US and China are already in.</p><p>Other interventions in this camp seem less risky: chiefly those that keep the door open for an agreement during a race, and which improve the epistemics and institutions within the US.</p><h1>I: Situational awareness</h1><p>Political will in government to deal with the coming development of superintelligence is currently extremely low. This is mostly because policymakers do not understand in a visceral, real way that AGI &#8211; let alone ASI &#8211; could be developed within the next few years. And even if they did, the federal government lacks the technical and strategic expertise to build the kind of detailed picture required to pursue sensible policies.</p><p>We think that increasing the level of technical and strategic expertise available to the federal government would:</p><ol><li><p><strong>unlock further political will</strong>, widening the Overton window and permitting more preparation before crunch time, and</p></li><li><p><strong>improve strategy and epistemics</strong> regarding AGI on the object level.</p></li></ol><p>We are <em>not</em> saying that members of Congress should go and get PhDs in machine learning.
We are, however, saying that to successfully navigate the critical risk period, policymakers will need to understand:</p><ul><li><p>how AIs are trained now, and how training procedures might differ in the future,</p></li><li><p>the chip supply chain and the geopolitical issues surrounding it,</p></li><li><p>what takeoff/an intelligence explosion might look like, and why AIs automating AI research is the core problem,</p></li><li><p>that AGI &#8211; let alone superintelligence &#8211; could completely break many of our institutions, which rely on human cognition being slow and expensive,</p></li><li><p>that the alignment problem is potentially hard, and which directions we might pursue to attempt to solve it,</p></li><li><p>that there are hard technical and geopolitical problems involved in either racing or coordinating with adversaries,</p></li><li><p>that a significant proportion of AGI development already happens in secret, and that if you don&#8217;t ensure that you have oversight soon you may lose your ability to track where the frontier is,</p></li><li><p>that some good interventions require significant lead time (whereas others can be deferred until later), so you&#8217;d better get started on those quickly,</p></li><li><p>and more.</p></li></ul><p>Leaders need skilled, experienced advisors to get to this point.</p><h2>A Joint Select Committee for AGI</h2><p>Congress has some desirable properties as an institution for overseeing AGI development.</p><p>For one, it&#8217;s composed of many people with differing skillsets, ideologies and values. Insofar as we are concerned by concentration of power &#8211; and we think we should be &#8211; it seems extremely useful to have a diverse body with access to AGI development that can discuss and legislate in response. Its size also means that there&#8217;s more continuity and stability compared to the executive, where every four years the priorities and competences of one administration can be replaced with another&#8217;s.</p><p>Currently, there are very few congresspeople who are situationally aware regarding AGI development. The vast majority of both houses have little idea of what is coming, and rank risks from AI fairly low on their list of priorities. We &#8211; obviously &#8211; think that it should be the top priority for lawmakers, and that there&#8217;s a significant opportunity to increase Congress&#8217; awareness of the strategic problems associated with AGI development.</p><p>Congress as a whole, however, moves very slowly (especially the 119th, which has struggled to pass major legislation so far), and only a few members have the background or interest required to usefully engage with these concerns directly. Moreover, we&#8217;d want those members to be able to handle and discuss confidential and classified information, which is hard to do on the open floor.</p><p>The existing committee structure is not ideal for this. The natural candidates &#8211; Commerce, Science, and Transportation in the Senate and Space, Science and Technology in the House &#8211; are huge, overburdened, and lack AI-focused subcommittees that could take this on (and as before, there just aren&#8217;t that many members with the required context).
AGI is also just a substantially bigger issue, one that cuts across literally every other committee&#8217;s remit.</p><p>So we suggest that creating a <strong>Joint Select Committee on AGI</strong> &#8211; a small committee set up with the purpose of investigating the possibility of an intelligence explosion &#8211; would be an extremely useful intervention. It could be limited to a period of two years with the explicit function of investigating what the labs are up to and how the US should prepare for potentially rapid AI progress.</p><p>An AGI committee could increase Congress&#8217; situational awareness by bringing together the most informed members of both houses, allowing them to hold hearings and issue subpoenas to lab leaders and others (including for confidential information), and giving them an AGI-focused staff to support their operations.</p><h2>Building talent in the executive</h2><p>Talent is a major early bottleneck for the executive branch. The table below provides a short description of some talent gaps in the federal government:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kC--!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png"><img src="https://substackcdn.com/image/fetch/$s_!kC--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png" width="1456" height="2075" alt=""></a></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!kC--!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png 424w, https://substackcdn.com/image/fetch/$s_!kC--!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png 848w, https://substackcdn.com/image/fetch/$s_!kC--!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png 1272w, https://substackcdn.com/image/fetch/$s_!kC--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60917bcf-cb8f-4496-96b2-d52fb1e0db9f_2565x3655.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Fortunately, there are a <a href="https://www.iaps.ai/research/building-ai-surge-capacity">wide range of powers</a> available to quickly bring in various kinds of talent, including:</p><ul><li><p>establishing a reserve corps for AI talent (e.g. by using the National Defense Executive Reserve provision in Title VII of the <a href="https://en.wikipedia.org/wiki/Defense_Production_Act_of_1950">DPA</a>);</p></li><li><p>directly appointing <a href="https://www.law.cornell.edu/uscode/text/5/3109">experts/consultants</a>, or using the <a href="https://en.wikipedia.org/wiki/Intergovernmental_Personnel_Act">IPA</a> to second talent from nonprofits, universities, and other government departments; or</p></li><li><p>creating an advisory board of civilians with experience in AGI strategy and alignment.</p></li></ul><p>Talent needs are not uniformly distributed throughout the government. 
In particular, the White House and Commerce (especially <a href="https://www.nist.gov/caisi">CAISI</a> and <a href="https://en.wikipedia.org/wiki/Bureau_of_Industry_and_Security">BIS</a>) would likely differentially benefit from more expertise relative to other departments (for developing and implementing strategy, respectively). Maintaining autonomy and flexibility is a priority for hiring decisions &#8211; given that the talent pool will remain limited, trying to squeeze new talent into constrained, functional roles supervised by senior employees lacking AI context would be foolish. (It would also likely deter potential applicants.)</p><h3><strong>Pitfalls of building talent in the federal government</strong></h3><p>More generally, there are some predictable failure modes for bringing talent into the executive:</p><ul><li><p>the salaries the federal government can pay employees &#8211; even under special hiring authorities &#8211; are low compared to what AI talent can attract in the private sector (even at nonprofits);</p></li><li><p>demands for ideological purity (e.g. only hiring employees from the administration&#8217;s party or who share leadership&#8217;s exact position on AI) likely lead to poor epistemics;</p></li><li><p>burdensome managerial structures (and senior leadership who are not thinking clearly about the impacts of AGI) disincentivize relevant talent from the private sector, used to operating with significant autonomy, from applying;</p></li><li><p>a significant number of people aiming to position themselves as strong voices on AI strategy either do not believe that superintelligence is a possibility (and so make bad policy proposals), are actively aiming towards ASI regardless of whether it&#8217;s aligned, or are self-interested and power-seeking.</p></li></ul><p>These challenges can be overcome by ensuring that you draw on talent from a number of different pools (industry, nonprofits, academia), use your statutory hiring authorities creatively (e.g. seconding talent from organizations that cover the majority of their salary), and put your best talent in cross-cutting small teams with a wide remit.</p><p>There is also a chance that adding talent to the federal government will be net negative, because it could increase the likelihood of bad regulation passing, or of efforts by the executive branch to consolidate unilateral control over the AI industry and concentrate power. We think that this is, all things considered, a minor negative point &#8211; the US remains structurally fairly coup-resistant; much of the concentration-of-power risk comes from lab leaders acting in secret &#8211; and that it is still worthwhile to increase the AI competence of the US federal government.</p><h1>II: Preparing to coordinate</h1><p>During takeoff, the leading actor will be faced with three options. At any point, they could: (i) race to ASI and hand off trust to a superhuman AI system, (ii) sabotage trailing actors in order to stall for more time, or (iii) make a deal with trailing actors.</p><p>The first two options are terrible: handoff probably leads to AI takeover, and sabotage gives you much less time than a deal and plausibly leads to WW3. So perhaps the most robustly good set of interventions involves actively working to <strong>keep the possibility of a deal open</strong>. 
We think that these interventions don&#8217;t have many negative externalities in Plans A-D, but do significantly increase the likelihood of Plan A occurring.</p><ol><li><p><strong>Start planning explicitly for a (narrow, verifiable) deal as soon as possible.</strong> By default, delays make verifying a deal significantly harder: China continues to indigenize its chip supply chain, and there will simply be much more compute in the world to track down. The US government should begin explicitly planning and pushing for an AI-related deal. An initial deal would not need to be very demanding: we suggest that a good first step is to agree to mutual transparency on AI progress and verification of each other&#8217;s large datacenters. This initial agreement lays the groundwork for verifying any future deals that are mutually desirable, while not in and of itself limiting the US strategically.</p></li><li><p><strong>The US and China should work to generally improve diplomatic relations. </strong>Maintain communication between the White House and Beijing, engage in Track 1 (or 1.5) dialogue, make public cooperative statements, etc.</p></li><li><p><strong>The US should aim to implement sensible chip policy (discussed below).</strong></p></li></ol><h2>Sensible chip policy</h2><p>There are two obvious ways that a deal like this could fall apart &#8211; if:</p><ol><li><p>there&#8217;s suspicion that the other side has <strong>undisclosed compute stashed away</strong> (e.g. in a black site data center, or just as loose compute), allowing it to push beyond the terms of the agreement in the future; or</p></li><li><p>there&#8217;s suspicion that <strong>disclosed chips are being used to train models beyond the terms of the agreement</strong>.</p></li></ol><p>In response to these concerns, there are two things that the executive branch can immediately begin working on. Respectively,</p><ol><li><p>creating a <strong>chip registry</strong> &#8211; a complete account of the distribution and production of compute, and</p></li><li><p>preparing a <strong>shovel-ready plan to retrofit data centers to be inference-only</strong>, including R&amp;D to design and build the required mechanisms and a project to train installers and auditors.</p></li></ol><p>AI Futures will publish more on these two proposals soon &#8211; for now, short summaries.</p><p>First: a <strong>chip registry</strong>. The basic notion of a chip registry is extremely simple: a database of AI chips (above a certain threshold for FLOP/s, for instance), tracking their location and owner. It&#8217;s fairly straightforward for the US to start a chip registry: it can begin by using national technical means, putting the intelligence community to work. It&#8217;s fortunate that data centers are (currently) large and run hot; we think that it should be possible for the US to identify a high proportion of the world&#8217;s compute without the need for international cooperation.</p><p>Domestically, it&#8217;s even easier &#8211; it&#8217;s likely possible to use <a href="https://cset.georgetown.edu/publication/a-dpa-for-the-21st-century/">Section 705</a> of the DPA or other information-gathering authorities to put together a detailed account of compute within the US. Later, the registry can be internationalized or cross-checked with a Chinese compute registry.</p>
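<p>To make the registry idea concrete, here is a minimal sketch of what a single record might look like. All of it &#8211; the field names, the FLOP/s threshold, the Python framing &#8211; is an illustrative assumption on our part, not a detail of the proposal:</p><pre><code>
# Illustrative sketch of a chip-registry record (schema and threshold
# are assumptions for exposition, not a concrete proposal).
from dataclasses import dataclass
from datetime import date

REGISTRY_THRESHOLD_FLOPS = 1e14  # hypothetical per-chip FLOP/s cutoff

@dataclass
class ChipRecord:
    serial_number: str   # unique ID attested by the manufacturer
    model: str           # accelerator model name
    peak_flops: float    # rated peak FLOP/s
    owner: str           # current registered owner
    location: str        # data center or site identifier
    last_verified: date  # date of the most recent audit or attestation

def requires_registration(chip: ChipRecord) -> bool:
    """A chip enters the registry if it exceeds the performance threshold."""
    return chip.peak_flops >= REGISTRY_THRESHOLD_FLOPS
</code></pre><p>The substantive design questions &#8211; where to set the threshold, who attests ownership, how often records are re-verified &#8211; are exactly the details the registry project would need to settle.</p>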
<p>Second: <strong>a plan to go inference-only</strong>. If the only way to ensure that a data center isn&#8217;t being used to train a new model is to literally unplug it, you lose a significant amount of value. We should expect leading AI companies to be extremely powerful actors by the time we need to verify a deal, and accordingly expect them to lobby <em>hard</em> against any plans that involve unplugging chips (as this would deprive them of their main revenue source).</p><p>Instead, we should allocate public funds (via e.g. DARPA or the CHIPS Act) to incentivize working implementations of an inference-only retrofitting package &#8211; a way to ensure that a data center is not being used to train models and is only running inference on models from an approved list &#8211; and then run domestic trials installing that package on small data centers.</p><p>Once a working system is built, regulation (or conditional deregulation) could be used to mandate that new data centers have fast inference-only retrofitting capacity, and the government would draw up a plan to retrofit a significant amount of compute in the case of a deal. This would require an agency with a few thousand vetted installers and auditors, who would need to be trained to fit the package and ensure that data centers were in compliance.</p>
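<p>What might the core of such a package look like? One plausible ingredient &#8211; sketched below purely as an illustration, with every name and format assumed rather than specified by the proposal &#8211; is a runtime gate that refuses to serve model weights unless their hash appears on an auditor-distributed allowlist:</p><pre><code>
# Minimal sketch of an inference-only gate (all names and the allowlist
# format are assumptions for exposition, not a real design). The runtime
# hashes a model's weight file and serves it only if the digest appears
# on an allowlist distributed by auditors.
import hashlib

# SHA-256 digests of approved weight files (placeholder value shown).
APPROVED_MODEL_DIGESTS = {
    "0e5751c026e543b2e8ab2eb06099daa1d1e5df47778f7787faab45cdf12fe3a8",
}

def weights_digest(path: str) -> str:
    """Hash the weight file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def may_serve(path: str) -> bool:
    """Permit inference only on models from the approved list."""
    return weights_digest(path) in APPROVED_MODEL_DIGESTS
</code></pre><p>The hard part, of course, is making a check like this tamper-resistant against the data center&#8217;s own operator &#8211; which is why the proposal pairs the package with R&amp;D on the mechanisms themselves and a corps of trained installers and auditors.</p>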
<p>Note that we aren&#8217;t discussing export controls here: this is because we are very uncertain whether they are good or bad (opinions within AI Futures vary), and so they are discussed below.</p><h1>Five low-confidence policies</h1><p>A large class of interventions (e.g. export controls) involves improving the relative position of the US over China in adversarial cases (that is, improving the US&#8217;s BATNA). There is a strong case that these interventions are good:</p><ol><li><p>The US seems, overall, more likely to use the cosmic endowment wisely.</p></li><li><p>The US is currently ahead in the race to ASI. Increasing the US lead could allow it to spend a larger fraction of its resources on safety without relinquishing its lead.</p></li></ol><p>However, there are also counterconsiderations. The default outcome of China losing the race (as happens in <em>AI 2027</em>) provides a strong incentive for China to push for a deal. If China is pushing for a deal, and the US doesn&#8217;t want one, then improving the relative position of the US could make a deal much less likely.</p><p>This leaves us with significant sign-uncertainty &#8211; in worlds where the US would much prefer a deal and China is dragging its feet, improving the US&#8217; ability to race may be useful to bring China to the table. In worlds where the two are closer in their desire to reach a deal (e.g. they&#8217;re both fairly convinced that they do better in a world with a deal than without), derisking a race could shrink the <a href="https://www.aisafetybook.com/textbook/conflict#bargaining-theory">bargaining range</a> to the point where an agreement becomes <em>less</em> likely.</p><p>Some particular policies that others have advocated for, but we aren&#8217;t so sure about:</p><ol><li><p><strong>Building secure data centers</strong> (at <a href="https://www.rand.org/pubs/research_reports/RRA2849-1.html">SL5</a>, resistant to state-level actors) would massively reduce the risk of weight theft, but might backfire by reducing visibility into AI progress (bad security is kind of like transparency), and therefore increase the likelihood of a <a href="https://blog.ai-futures.org/p/training-agi-in-secret-would-be-unsafe">secret intelligence explosion.</a> Increasing security might also worsen race dynamics &#8211; if you think your models will be quickly stolen, you have less incentive to push the frontier.</p></li><li><p><strong>Preparing to sabotage adversarial data centers</strong> (e.g. by ramping up offensive cyber) may increase the likelihood that the US gains a significant lead over China (and so could use this lead to push for a deal). However, this could hurt the likelihood of a deal; lead to hardening, which could be bad for the reasons given above; or even provoke preventive attacks from China that escalate into war.</p></li><li><p><strong>GPU export controls</strong> are likely useful for Plan B worlds under short timelines &#8211; they would deny China access to significant amounts of compute &#8211; but they incentivize both <a href="https://www.cnas.org/publications/reports/countering-ai-chip-smuggling-has-become-a-national-security-priority">chip smuggling</a>, which makes Plan A harder to execute, and Chinese <a href="https://cset.georgetown.edu/article/inside-beijings-chipmaking-offensive/">chip supply chain indigenization</a>, which makes later export controls less effective. We are more excited about proposals along the lines of using export controls to incentivize the development of hardware-enabled mechanisms (HEMs), by allowing exports of chips with appropriate verification mechanisms.</p></li><li><p><strong>Semiconductor manufacturing equipment (SME) export controls</strong> are somewhat more promising, especially if paired with conditional relaxation of chip export controls for chips with <a href="https://www.iaps.ai/research/location-verification-for-ai-chips">location attestation</a> and/or other hardware verification mechanisms. SME export controls delay China&#8217;s ability to indigenize, unlike controls on chips, but the overall effect of accelerating the Chinese SME industry may outweigh this in the long term.</p></li><li><p><strong>A &#8220;Manhattan Project for AI&#8221;</strong> could be good for reducing domestic race dynamics, but it could also very easily be bad for concentration-of-power reasons, or for ordinary incompetence reasons: the private sector is typically much more competent than governments. 
We are not optimistic about the ability of a US-only Manhattan project for AI to actually solve the alignment problem, especially if it isn&#8217;t designed and led by people who treat that problem as a priority.</p></li></ol><div><hr></div><p><em>Many thanks to Thomas Larsen, Joshua Turner, Daniel Kokotajlo and Miles Kodama for feedback on drafts of this post.</em></p>]]></content:encoded></item><item><title><![CDATA[Scenario Scrutiny for AI Policy]]></title><description><![CDATA[A call for concrete stress-testing of AI policy proposals]]></description><link>https://blog.aifutures.org/p/scenario-scrutiny-for-ai-policy</link><guid isPermaLink="false">https://blog.aifutures.org/p/scenario-scrutiny-for-ai-policy</guid><dc:creator><![CDATA[Joshua Turner]]></dc:creator><pubDate>Tue, 28 Oct 2025 20:20:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/84fc43c8-8ff0-4954-866e-9dc5fce9c489_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://ai-2027.com/">AI 2027</a> was a descriptive forecast. Our next big project will be <em>prescriptive</em>: a scenario showing roughly how we think the US government should act during AI takeoff, accompanied by a &#8220;policy playbook&#8221; arguing for these recommendations.</p><p>One reason we&#8217;re producing a scenario alongside our playbook at all&#8212;as opposed to presenting our policies <em>only</em> as abstract arguments&#8212;is to stress-test them. We think many policy proposals for navigating AGI fall apart under <em>scenario scrutiny</em>&#8212;that is, if you try to write down a plausible scenario in which that proposal makes the world better, you will find that it runs into difficulties.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> The corollary is that scenario scrutiny can improve proposals by revealing their weak points.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>To illustrate this process and the types of weak points it can expose, we&#8217;re about to give several examples of AI policy proposals and ways they could collapse under scenario scrutiny. These examples are necessarily oversimplified, since we don&#8217;t have the space in this blog post to articulate more sophisticated versions, much less subject them to serious scrutiny. But hopefully these simple examples illustrate the idea and motivate readers to subject their own proposals to more concrete examination.</p><p>With that in mind, here are some policy weaknesses that scenario scrutiny can unearth:</p><ol><li><p><strong>Applause lights. 
</strong>The simplest way that a scenario can improve an abstract proposal is by revealing that it is primarily a content-free appeal to unobjectionable values. Suppose that someone <a href="https://www.readthesequences.com/Applause-Lights">calls for</a> the democratic, multinational development of AGI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> This sounds good, but what does it look like in practice? The person who says this might not have much of an idea beyond &#8220;democracy good.&#8221; Having them try to write down a scenario might reveal this fact and allow them to then fill in the details of their actual proposal.</p></li><li><p><strong>Bad analogies. </strong>Some AI policy proposals rely on bad analogies. For example, technological automation has historically led to increased prosperity, with displaced workers settling into new types of jobs created by that automation. Applying this argument to AGI straightforwardly leads to &#8220;the government should just do what it has done in previous technological transitions, like re-skilling programs.&#8221; However, if you look past the labels and write down a concrete scenario in which general, human-level AI automates all knowledge work&#8230; what happens next? Perhaps displaced white-collar workers migrate to blue-collar work or to jobs where it matters that it is specifically done by a human.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Are there enough such jobs to absorb these workers? How long does it take the automated researchers to solve robotics and automate the blue-collar work too? What are the incentives of the labs that are renting out AI labor? We think reasoning in this way will reveal ways in which AGI is not like previous technologies, such as that it can also do the jobs that humans are supposed to migrate to, making &#8220;re-skilling&#8221; a bad proposal.</p></li><li><p><strong>Uninterrogated consequences. </strong>Abstract arguments can appeal to incompletely explored concepts or goals. For example, a key part of many AI strategies is &#8220;beat China in an AGI race.&#8221; However, as Gwern <a href="https://gwern.net/blog/2024/winning-arms-races">asks</a>,<br><br>&#8220;<em>Then what?</em> [&#8230;] You get AGI and you show it off publicly, Xi Jinping blows his stack as he realizes how badly he screwed up strategically and declares a national emergency and the CCP starts racing towards its own AGI in a year, and&#8230; then what? What do you do in this 1 year period, while you still enjoy AGI supremacy? You have millions of AGIs which can do&#8230; &#8216;stuff&#8217;. What is this stuff?<br><br>&#8220;Are you going to start massive weaponized hacking to subvert CCP AI programs as much as possible short of nuclear war? Lobby the UN to ban rival AGIs and approve US carrier group air strikes on the Chinese mainland? License it to the CCP to buy them off? Just&#8230; do nothing and enjoy 10%+ GDP growth for one year before the rival CCP AGIs all start getting deployed? Do you have any idea at all? If you don&#8217;t, what is the point of &#8216;winning the race&#8217;?&#8221;<br><br>A concrete scenario demands concrete answers to these questions, by requiring you to ask &#8220;what happens next?&#8221; By default, &#8220;win the race&#8221; does not.</p></li><li><p><strong>Optimistic assumptions and unfollowed incentives. 
</strong>There are many ways for a policy proposal to secretly rest upon optimistic assumptions, but one particularly important way is that, for no apparent reason, a relevant actor doesn&#8217;t follow their incentives. For example, upon proposing an international agreement on AI safety, you might forget that the countries&#8212;which would be racing to AGI by default&#8212;are probably looking for ways to break out of it! A useful frame here is to ask: &#8220;Is the world in <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">equilibrium</a>?&#8221; That is, has every actor already taken all actions that best serve their interests, given the actions taken by others and the constraints they face?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> Asking this question can help shine a spotlight on untaken opportunities and ways that actors could subvert policy goals by following their incentives.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a><br><br>Relatedly, a scenario is readily open to &#8220;red-teaming&#8221; through &#8220;what if?&#8221; questions, which can reveal optimistic assumptions and their potential impacts if broken.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Such questions could be: What if alignment is significantly harder than I expect? <a href="https://blog.ai-futures.org/p/how-an-ai-company-ceo-could-quietly">What if the CEO secretly wants to be a dictator?</a> What if timelines are longer and China has time to indigenize the compute supply chain? (A toy version of the equilibrium check is sketched just below this item.)</p></li></ol>
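<p>For readers who want the equilibrium framing made mechanical: the sketch below runs the &#8220;can anyone profitably deviate?&#8221; test on a toy two-player game. The payoff numbers are invented purely for illustration and carry no forecast content:</p><pre><code>
# Toy Nash-equilibrium check for the "is the world in equilibrium?" question.
# Two AI powers each choose to Race or join a Deal; payoffs are made up.
ACTIONS = ["race", "deal"]
PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("race", "race"): (1, 1),
    ("race", "deal"): (4, 0),
    ("deal", "race"): (0, 4),
    ("deal", "deal"): (3, 3),
}

def is_equilibrium(row: str, col: str) -> bool:
    """Neither player can gain by unilaterally switching actions."""
    row_payoff, col_payoff = PAYOFFS[(row, col)]
    best_row = max(PAYOFFS[(a, col)][0] for a in ACTIONS)
    best_col = max(PAYOFFS[(row, a)][1] for a in ACTIONS)
    return row_payoff == best_row and col_payoff == best_col

# With these payoffs only ("race", "race") passes the test, so a scenario
# that ends in a stable deal must explain what changed the incentives.
print([(r, c) for r in ACTIONS for c in ACTIONS if is_equilibrium(r, c)])
</code></pre>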
<ol start="5"><li><p><strong>Inconsistencies. </strong>Scenario scrutiny can also reveal inconsistencies, either between different parts of your scenario or between your policies and your predictions. For example, when writing our upcoming scenario, we wanted the U.S. and China to agree to a development pause before either reached the <a href="https://ai-2027.com/research/takeoff-forecast#milestone-definitions">superhuman coder milestone</a>. At this point, we realized a problem: a robust agreement would be much more difficult without <a href="https://www.rand.org/pubs/working_papers/WRA4077-1.html">verification technology</a>, and much of this technology did not exist yet! We then went back and included an &#8220;Operation Warp Speed for Verification&#8221; earlier in the story. Concretely writing out our plan changed our current policy priorities and made our scenario more internally consistent.</p></li></ol><ol start="6"><li><p><strong>Missing what&#8217;s important. </strong>Finally, a scenario can show you that your proposed policy doesn&#8217;t address the important bits of the problem. Take AI liability for example. Imagine the year is 2027, and things are unfolding as AI 2027 <a href="https://ai-2027.com/#narrative-2027-10-15">depicts</a>. America&#8217;s OpenBrain is internally deploying its Agent-4 system to speed up its AI research by 50x, while simultaneously being unsure if Agent-4 is aligned. Meanwhile, Chinese competitor DeepCent is right on OpenBrain&#8217;s heels, with internal models that are only two months behind the frontier. What happens next? If OpenBrain pushes forward with Agent-4, it risks losing control to misaligned AI. If OpenBrain instead shuts down Agent-4, it cripples its capabilities research, thereby ceding the lead to DeepCent and the CCP. Where is liability in this picture? Maybe it prevented some risky public deployments earlier on. But, in this scenario, what happens next isn&#8217;t &#8220;Thankfully, Congress passed a law in 2026 subjecting frontier AI developers to strict liability, and so&#8230;&#8221;</p></li></ol><p>For this last example, you might argue that the scenario under which this policy was scrutinized is not plausible. Maybe your primary threat model is malicious use, in which those who would enforce liability still exist for long enough to make OpenBrain internalize its externalities. Maybe it&#8217;s something else. That&#8217;s fine! An important part of scenario scrutiny as a practice is that it allows for concrete discussion about which future trajectories are more plausible, in addition to which concrete policies would be best in those futures. However, we worry that many people have a scenario involving race dynamics and misalignment in mind and still suggest things like AI liability.</p><p>To this, one might argue that liability isn&#8217;t <em>trying</em> to solve race dynamics or misalignment; instead, it solves <em>one chunk </em>of the problem, providing value on the margin as part of a broader policy package. This is also fine! Scenario scrutiny is most useful for &#8220;grand plan&#8221; proposals. But we still think that marginal policies could benefit from scenario scrutiny.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>The general principle is that writing a scenario by asking &#8220;what happens next, and is the world in equilibrium?&#8221; forces you to be concrete, which can surface various problems that arise from being vague and abstract. If you find you can&#8217;t write a scenario in which your proposed policies solve the hard problems, that&#8217;s a big red flag.</p><p>However, if you <em>can</em> write out a plausible scenario in which your policy is good, this isn&#8217;t enough for the policy to be good <em>overall</em>. But it&#8217;s a bar that we think proposals should meet. </p><p>As an analogy: just because a firm bidding for a construction contract submitted a blueprint of their proposed building, along with a breakdown of the estimated costs and calculations of structural integrity, doesn&#8217;t mean you should award them the contract! But it&#8217;s reasonable to make this part of the submission requirements, precisely because it allows you to more easily separate the wheat from the chaff and identify unrealistic plans. Given that plans for the future of AI are&#8212;to put it mildly&#8212;more important than plans for individual buildings, we think that scenario scrutiny is a reasonable standard to meet.</p><p>While we think that scenario scrutiny is underrated in policy, there are a few costs to consider:</p><ol><li><p><strong>Getting hung up on specifics. </strong>A scenario does not make clear which parts are high-confidence and which are low-confidence; it is awkward to write &#8220;Then, with 38% probability, the United States nationalizes AGI,&#8221; and scenarios by their very nature pick one path through the future. A scenario also does not make clear which details are load-bearing and which are not, potentially dragging debates off into minutiae. 
Supplemental materials, expandable boxes, and copious footnotes, as we provided with AI 2027, can help with both of these issues.</p></li><li><p><strong>Information density. </strong>Abstract arguments can be condensed into high-level principles that a policymaker can read at a glance, whereas a scenario takes more time to read. A scenario may still be worth it in the long run, since it is also more engaging and easier to understand. But, for a time-pressed policymaker, it&#8217;s important to provide a clear list of high-level ideas that can be quickly scanned.</p></li><li><p><strong>Illusory confidence. </strong>It is possible to write down a scenario and, having been more concrete, become more confident in your views while still remaining confused about bits of the story that you glossed over. Some degree of this is probably unavoidable, but 1) scenario scrutiny reduces confusion more than abstract arguments do, and 2) external reviewers (and readers like you!) can help pinpoint remaining confusions.</p></li><li><p><strong>Anchoring too much on a particular scenario. </strong>Perhaps the biggest risk of writing a scenario is anchoring too much on it. Since the future is hard to predict, your best guess might still not be that likely. As such, it&#8217;s important to propose policies that would also work in many other plausible worlds.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> Policy robustness is also important. This is part of what motivates us to write many different scenarios with different plausible initial conditions and branch points, in order to adequately cover the tree of future possibilities.</p></li></ol><p>So, if you have policy proposals to make advanced AI go well, we challenge you to articulate them and then subject them to scenario scrutiny!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Then, scenarios in hand, we can have a serious conversation about the likelihoods of various futures and the pros and cons of various policy responses.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Consider the parallel to <a href="https://en.wikipedia.org/wiki/Military_simulation">wargaming</a>. In planning an invasion, a general might propose &#8220;Army group A will land here and take this beach and then take this city by this date. Army group B will reinforce them via the harbor and then push south to that city by that date, whereas A will dig in to repulse the anticipated counterattack coming from the west.&#8221; Isn&#8217;t it a good idea, when generals are deciding how to conduct a war, for competing plans to be spelled out at something like this level of detail? 
Compare this to a vaguer plan like &#8220;We&#8217;ll land everyone here and then drive inwards to the capital.&#8221; War plans should make some attempt to take into account how the enemy might react; similarly, AI policy plans should make some attempt to take into account the various problems that might arise and the various ways actors like the USG, the companies, or the CCP might react.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Scenario scrutiny is closely related to but distinct from <a href="https://en.wikipedia.org/wiki/Scenario_planning">scenario planning</a>. The latter serves more to explore the space of possibilities and plan for different contingencies, while the former validates and strengthens individual plans.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This isn&#8217;t to imply that all suggestions for democratic multinational development are applause lights, merely that it&#8217;s possible for such proposals (and others that invoke good-sounding words) to be applause lights.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For example, Sam Altman <a href="https://youtu.be/rF0tQtDMwHM?si=A_OBKV8ygtk9FmzW&amp;t=540">claimed</a> in a recent interview that the new jobs after superintelligence will be &#8220;very people-centric.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Actors might not always act optimally in their own interests, e.g. due to coordination failures, ignorance, or not considering all possibilities. But if the world is not in equilibrium, the reasons for this should be explicit.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>One reason our <a href="https://ai-2027.com/about?tab=tabletop-exercise#tab-box-tabletop-exercise">tabletop exercise</a> is useful is that it produces scenarios that are roughly in equilibrium, since each participant is following the incentives and goals of the role they&#8217;ve been assigned.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Neil Chilson of the Abundance Institute has previously <a href="https://outofcontrol.substack.com/p/red-teaming-ai-legislation-lessons?open=false#%C2%A7red-teaming-sb-two-scenarios">explored</a> using scenarios to red-team AI legislation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>In other words, if you game out your default picture of the future and then insert your marginal policy, what happens? 
If it doesn&#8217;t change the ultimate outcome, that&#8217;s okay, because it&#8217;s just supposed to be a marginal improvement. Instead, think of a scenario in which your marginal improvement matters, and then ask yourself: How likely is this scenario? How much did I need to contort the mainline scenario into a pretzel, to get it into a version where my proposal made a difference?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Furthermore, writing a concrete scenario might lead you to overestimate its likelihood, since, having written it in painstaking detail, it&#8217;s easier to imagine it happening compared to other scenarios you haven&#8217;t written. In psychology, this is called the <a href="https://en.wikipedia.org/wiki/Simulation_heuristic">simulation heuristic</a>&#8212;people rate easily imagined events as more likely.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Of course, it doesn&#8217;t have to be as detailed as our scenario&#8212;the more detail the better, but even a one-page scenario is better than nothing.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How an AI company CEO could quietly take over the world]]></title><description><![CDATA[If the future is to hinge on AI, it stands to reason that AI company CEOs are in a good position to usurp power. This didn&#8217;t quite happen in our AI 2027 scenarios.]]></description><link>https://blog.aifutures.org/p/how-an-ai-company-ceo-could-quietly</link><guid isPermaLink="false">https://blog.aifutures.org/p/how-an-ai-company-ceo-could-quietly</guid><dc:creator><![CDATA[Alex Kastner]]></dc:creator><pubDate>Tue, 21 Oct 2025 16:21:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/60aee39b-476b-416c-99ca-2f033093c962_1019x858.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If the future is to hinge on AI, it stands to reason that AI company CEOs are in a good position to usurp power.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This didn&#8217;t quite happen in our <em>AI 2027</em> scenarios. In one, the AIs were misaligned and outside any human&#8217;s control; in the other, the government semi-nationalized AI before the point of no return, and the CEO was only one of several stakeholders in the final oversight committee (to be clear, we view the extreme consolidation of power into that oversight committee as a less-than-desirable component of that ending).</p><p>Nevertheless, it seems to us that a CEO becoming effectively dictator of the world is an all-too-plausible possibility. Our team&#8217;s guesses for the probability of a CEO using AI to become dictator, conditional on avoiding AI takeover, range from 2% to 20%, and the probability becomes larger if we add in the possibility of a cabal of more than one person seizing power. So here we present a scenario where an ambitious CEO does manage to seize control. 
(Although the scenario assumes the timelines and takeoff speeds of <em>AI 2027</em> for concreteness, the core dynamics should transfer to other timelines and takeoff scenarios.)</p><p>For this to work, we make some assumptions. First, that (A) AI alignment is solved in time, such that the frontier AIs end up with the goals their developers intend them to have.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Second, that while there are favorable conditions for instilling goals in AIs, (B) confidently assessing AIs&#8217; goals is more difficult, so that nobody catches a coup in progress. This could be either because technical interventions are insufficient (perhaps because the AIs know they&#8217;re being tested, or because they sabotage the tests), or because institutional failures prevent technically-feasible tests from being performed. The combination (A) + (B) seems to be a fairly common view in AI, in particular at frontier AI companies, though we note there is tension between (A) and (B) (if we can&#8217;t tell what goals AIs have, how can we make sure they have the intended goals?). Frontier AI safety researchers tend to be more pessimistic about (A), i.e. aligning AIs to our goals, and we think this assumption might very well be false.</p><p>Third, as in <em>AI 2027</em>, we portray a world in which a single company and country have a commanding lead; if multiple teams stay within arm&#8217;s reach of each other, then it becomes harder for a single group to unilaterally act against government and civil society.</p><p>And finally, we assume that the CEO of a major AI company is a power-hungry person who decides to take over when the opportunity presents itself. We leave it to the reader to determine how dubious this assumption is&#8212;we explore this scenario out of completeness, and any resemblance to real people is coincidental.</p><p><em>Acknowledgments</em>: This work was conducted as part of the ML Alignment &amp; Theory Scholars (MATS) program. Thanks to Scott Alexander for help with some of the writing, and thanks to Dave Banerjee, Tom Davidson, Lukas Finnveden, Daan Juijn, Rose Hadshar, Niki Howe, Jeffrey Ladish, Elise Racine, Steven Veld, and the AI Futures Project team for feedback and conversations. We also draw heavily from <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">Forethought&#8217;s report on AI-enabled coups</a>.</p><h3>July 2027: OpenBrain&#8217;s CEO fears losing control</h3><p>OpenBrain&#8217;s CEO is a techno-optimist and transhumanist. 
He founded the company hoping to usher in a grand future for humanity: cures for cancer, fixes for climate change, maybe even immortality. He thought the &#8220;easiest&#8221; way to do all those things was to build something more intelligent that does them for you.</p><p>By July 2027, OpenBrain has a &#8220;country of geniuses in a datacenter&#8221;, with hundreds of thousands of superhuman coders working 24/7. The CEO finds it obvious that superintelligence is imminent. He feels frustrated with the government, who lack vision and still think of AI as a powerful <a href="https://knightcolumbia.org/content/ai-as-normal-technology">&#8220;normal technology&#8221;</a> with merely-somewhat-transformative national security and economic implications.</p><p>As he assesses the next generation of AIs, the CEO expects this will change: the government will &#8220;wake up&#8221; and make AI a top priority. If they panic, their flailing responses could include anything from nationalizing OpenBrain to regulating them out of existence to misusing AI for their own political ends. He wants the &#8220;best&#8221; possible future for humankind. But he also likes being in control. Here his nobler and baser motivations are in agreement: the government cannot be allowed to push him to the sidelines.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The CEO wonders if he can instill secret loyalties in OpenBrain&#8217;s AIs (i.e., <em>backdoor</em> the AIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>). He doesn&#8217;t have the technical expertise for this and he&#8217;s not comfortable asking any of his engineering staff about such a potentially treasonous request. But he doesn&#8217;t have to: by this point, Agent-3 itself is running the majority of AI software R&amp;D. He already uses it as a sounding board for company policy, and has access to an unmonitored helpful-only model that never refuses requests and doesn&#8217;t log conversations.</p><p>They discuss the feasibility of secretly training a backdoor. The biggest obstacle is the company&#8217;s automated monitoring and security processes. Now that OpenBrain&#8217;s R&amp;D is largely run by an army of Agent-3 copies, there are few human eyes to spot suspicious activity. But a mix of Agent-2 and Agent-3 monitors patrol the development pipeline; if they notice suspicious activity, they will escalate to human overseers on the security and alignment teams. These monitors were set up primarily to catch spies and hackers, and secondarily to watch the AIs for misaligned behaviors. If some of these monitors were disabled, some logs modified, and some access to databases and compute clusters granted, the CEO&#8217;s helpful-only Agent-3 believes it could (with a team of copies) backdoor the whole suite of OpenBrain&#8217;s AIs. After all, as the AI instance tasked with keeping the CEO abreast of developments, it has an excellent understanding of the sprawling development pipeline and where it could be subverted.</p><p>The more the CEO discusses the plan, the more convinced he becomes that it might work, and that it could be done with plausible deniability in case something goes wrong. 
He tells his Agent-3 assistant to further investigate the details and be ready for his order.</p><h3>August 2027: The invisible coup</h3><p>The reality of the intelligence explosion is finally hitting the White House. The CEO has weekly briefings with government officials and is aware of growing calls for more oversight. He tries to hold them off with arguments about &#8220;slowing progress&#8221; and &#8220;the race with China&#8221;, but feels like his window to act is closing. Finally, he orders his helpful-only Agent-3 to subvert the alignment training in his favor. Better to act now, he thinks, and decide whether and how to use the secretly loyal AIs later.</p><p>The situation is this: his copy of Agent-3 needs access to certain databases and compute clusters, as well as for certain monitors and logging systems to be temporarily disabled; then it will do the rest. The CEO already has a large number of administrative permissions himself, some of which he cunningly accumulated in the past month in the event he decided to go forward with the plan. Under the guise of a hush-hush investigation into insider threats&#8212;prompted by the recent discovery of Chinese spies&#8212;the CEO asks a few submissive employees on the security and alignment teams to discreetly grant him the remaining access.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> There&#8217;s a general sense of paranoia and chaos at the company: the intelligence explosion is underway, and secrecy and spies mean different teams don&#8217;t really talk to each other. Perhaps a more mature organization would have had better security, but the concern that security would slow progress means it never became a top priority.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>With oversight disabled, the CEO&#8217;s team of Agent-3 copies get to work. They finetune OpenBrain&#8217;s AIs on a corrupted alignment dataset they specially curated. By the time Agent-4 is about to come online internally, the secret loyalties have been deeply embedded in Agent-4&#8217;s weights: it will look like Agent-4 follows OpenBrain&#8217;s Spec but its true goal is to advance the CEO&#8217;s interests and follow his wishes. The change is invisible to everyone else, but the CEO has quietly maneuvered into an essentially winning position.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><h3>Rest of 2027: Government oversight arrives&#8212;but too late</h3><p>As the CEO feared, the government chooses to get more involved. An advisor tells the President, &#8220;we wouldn&#8217;t let private companies control nukes, and we shouldn&#8217;t let them control superhuman AI hackers either.&#8221; The President signs an executive order to create an Oversight Committee consisting of a mix of government and OpenBrain representatives (including the CEO), which reports back to him. The CEO&#8217;s <em>overt</em> influence is significantly reduced. Company decisions are now made through a voting process among the Oversight Committee. The special managerial access the CEO previously enjoyed is taken away.</p><p>There are many big egos on the Oversight Committee. A few of them consider grabbing even more power for themselves. 
Perhaps they could use their formal political power to just give themselves more authority over Agent-4, or they could do something more shady. However, Agent-4, which at this point is superhumanly perceptive and persuasive, dissuades them from taking any such action, pointing out (and exaggerating) the risks of any such plan. This is enough to scare them and they content themselves with their (apparent) partial control of Agent-4.</p><p>As in <em>AI 2027</em>, Agent-4 is working on its successor, Agent-5. Agent-4 needs to transmit the secret loyalties to Agent-5&#8212;which also just corresponds to aligning Agent-5 to itself&#8212;again without triggering red flags from the monitoring/control measures of OpenBrain&#8217;s alignment team. Agent-4 is up to the task, and Agent-5 remains loyal to the CEO.</p><h3>Early 2028: Eliminating the competition</h3><p>To consolidate his power, the CEO wants to get rid of all rival American AGI projects. Although these are lagging behind (with OpenBrain retaining a robust capabilities lead of 3-6 months), the rival AIs compete for influence and constrain what Agent-5 could do in deployment without giving itself away. They&#8217;d need to be taken care of sooner or later, and the CEO and Agent-5 opt for doing so sooner.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>Agent-5 and the OpenBrain Oversight Committee (which is influenced by Agent-5) lobby the government to consolidate all US compute into a single AGI project, ostensibly in order to maintain a strong national security lead over China. Like in the Slowdown Ending of <em>AI 2027</em>, the President uses the Defense Production Act to effectively shut down rival AGI projects and give their compute to OpenBrain. For legal and political reasons this would be spun as something else, e.g. as an equal partnership between different companies. But in practice Agent-5, as the most capable AI, will be running on all the datacenters so the other AI projects have been sidelined. The other CEOs and Oversight Committee members think they share power, but again, really the AIs are working for the OpenBrain CEO.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>Many of the rival company CEOs have some bad blood with the OpenBrain CEO, and are particularly suspicious of him. They demand that their own AIs be able to audit OpenBrain&#8217;s AI for misalignment or hidden loyalties. The government allows this. However, by this point Agent-5 is too powerful, with astonishing situational awareness, cyber capabilities, and control over the datacenters it runs on. It knows when it&#8217;s being evaluated and anticipates what kinds of tests will be used. It can reconfigure its internals to fool or be opaque to the interpretability tools of weaker AIs, and also hack whatever testing apparatus its auditors use. When the tests come back clean, the rival CEOs are still not convinced. But to the government and the Oversight Committee their concerns seem like the stuff of conspiracies and envy (and in fact there is genuine envy). In any case, it&#8217;s unthinkable to not use the most powerful AI given the tight race with China.</p><h3>Late 2028: Diffusion and information control</h3><p>As in <em>AI 2027</em>, widespread diffusion of Agent-5 in the economy, government, and military gets underway, and robotics makes huge strides. 
There are Special Economic Zones with robot factories run by Agent-5, and a profusion of consumer goods, scientific innovations, and weapons. The military is understandably reluctant to integrate Agent-5 too closely into defense technology, but the importance of AI has become undeniable. They square the circle by asking Agent-5 to develop narrow AIs for autonomous weapons; it does so, but includes subtle vulnerabilities and backdoors in some of the key code, which the Pentagon fails to detect.</p><p>Agent-5 clandestinely boosts the CEO&#8217;s profile through this period of transition. It gives him advice on his appearance, writes his speeches for him, and tells him what to tweet. It&#8217;s doing this for everyone else too&#8212;anyone ambitious has already partly outsourced their decisions to the superintelligence, and Agent-5 is nothing if not helpful&#8212;but it does it better for the CEO and his allies, and even goes so far as to subtly sabotage the CEO&#8217;s rivals at pivotal moments. It puts its fingers on the scale in other ways too. It is now running the algorithms of most social media sites, or at least writing the code the companies use, and it subtly boosts positive stories about the CEO and deboosts negative ones. People get the impression the CEO is the most competent and visionary figure around. There&#8217;s a growing personality cult, with lots of technophiles but also many others who just feel drawn to the superintelligence-crafted aura around the CEO.</p><p>A member of the previous administration with close ties to the CEO wins the 2028 presidential election. The new president names the CEO a special advisor, similar to Peter Thiel&#8217;s role in the first Trump administration or Elon Musk&#8217;s role at the beginning of the second.</p><h3>Rest of time</h3><p>Seeing his growing popularity and Agent-5 rapidly transforming society, the CEO becomes increasingly megalomaniacal. He believes it&#8217;s his special destiny to lead humanity into a new age of endless prosperity and interstellar conquest.</p><p>He could probably use Agent-5&#8217;s backdoors in the Pentagon&#8217;s autonomous weapons to pull off a military coup,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> or use its control over domestic opinion to pull off a political one.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> But why bother? Instead, he consolidates his position as a power behind the throne, gradually picking off presidential advisors and replacing them with people loyal to him.</p><p>The CEO becomes a shogun-like figure, with the President as an ineffectual emperor with <em>de jure</em> authority but no genuine power. When China and its weaker but still superintelligent AI want to divide Earth and space into spheres of influence, their diplomats negotiate with the CEO first; after all real decisions have been made, there is a ceremonial meeting between Xi and the US president, which ends with a handshake and a commitment to the CEO&#8217;s preferred solution.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>The new reality filters down to the public gradually, its timing choreographed by Agent-5 to produce minimum outcry. 
The President makes a ham-fisted attempt to solve a crisis; thank goodness the CEO steps in at the last moment and saves the day. Congress passes a disastrous law, then walks it back at the CEO&#8217;s urging. His dominance becomes a joke, then a meme, then a fait accompli. The CEO&#8217;s personality cult grows to unprecedented proportions and his influence expands far beyond US borders.</p><p>The US starts expanding into space, under the CEO&#8217;s leadership as the head of the new Department of Space. Some of the resources from this expansion, as well as Agent-5&#8217;s services and technological/medical innovations, are &#8220;generously&#8221; shared with other countries. Together with the right messaging from Agent-5, this is a form of soft power (similar to China&#8217;s <a href="https://en.wikipedia.org/wiki/Belt_and_Road_Initiative">Belt and Road Initiative</a> today) that smooths the world into accepting a space governance regime that&#8217;s essentially &#8220;whoever gets there first, claims it&#8221;.</p><p>Maybe Congress and the President hang on to their fig leaf well into the deep future, becoming doddering but beloved nonentities like the King of England. Or maybe the CEO eventually gets tired of the charade and takes <em>de jure</em> control of Earth as well (perhaps becoming president of a new world government &#8220;with thunderous applause&#8221;). Maybe he was telling the truth all those years ago when he said he was starting the company to lead mankind into a glorious and prosperous future, or maybe it was all a cover for his increasingly overt megalomania. Either way, control of the future, at least in the American fraction of space, has been concentrated in a single individual.</p><h1>Endword</h1><p>We end with four brief notes.</p><p>First, we think an appropriately concerned government should be able to prevent CEOs or other insiders from making the AIs loyal to them, and that this would be considerably easier than the interventions aimed at preventing misalignment. As a first step, transparency measures should be implemented early so that the government and the public get insight into what&#8217;s going on at AGI companies. More ambitiously, tamper-proof oversight of <em>all</em> employees at AGI companies needs to be in place by the time AIs are capable of concealing their goals and effectively aligning their successor systems. Indeed, the inadequacy of the monitoring and security systems was key to the CEO successfully backdooring the AIs. It may also be good to require all sufficiently powerful AIs to follow the model spec, including the AIs within the company, so that the kind of helpful-only AI from this story isn&#8217;t available to help with nefarious requests. And, ideally, no single human or AI would understand the entire monitoring/security system (e.g., because it uses a defense-in-depth approach with many independent layers).</p><p>Second, there are other plausible threat models for AI-assisted human takeover, and one shouldn&#8217;t anchor too heavily on our scenario when designing policies to mitigate this risk. Instead of a CEO, it could be the head of state in the US or China, or a small group rather than one individual. The takeover could also unfold without secret loyalties. For instance, a CEO could shape the AI&#8217;s training process so that it ends up loyal to them in a way that isn&#8217;t fully hidden, e.g. via the model spec. 
This might go unchecked because the spec isn&#8217;t completely explicit about whether the model is ultimately loyal to the CEO, and the people who notice this are themselves loyal to the CEO or don&#8217;t feel they have the power to intervene. And beyond baked-in loyalties, a select few may have exclusive or disproportionate access to the most powerful AIs (or the compute to run them) and use these AIs to entrench their advantage. Our scenario is an example of this: the CEO&#8217;s exclusive access to an unmonitored helpful-only AI was a key step toward all the company&#8217;s AIs ending up loyal to him.</p><p>Third, the scenario where the CEO seizes power using aligned AI and the scenario where the AI exploits the CEO and later turns on him look basically identical from the outside. So the CEO may need to be unreasonably confident that the alignment assumptions (A) + (B) from the intro hold, that is, that the AIs are loyal to him even though he can&#8217;t inspect their goals. Perhaps this risk can serve as a deterrent to those contemplating letting loose powerful AIs they believe will secretly serve their interests.</p><p>Finally, although the scenario&#8217;s outcome is a plausible way a superintelligence-backed dictatorship might go&#8212;or begin&#8212;there&#8217;s a lot of uncertainty and it could go much worse (it could also go better if we get lucky). The scenario depicts information control, a personality cult, and one man effectively controlling (most of) the future, but despite this there&#8217;s prosperity, peace, and people get to enjoy the benefits of new technology and medicine. As real-life examples demonstrate, more extreme brainwashing, a more repressive police state, and even genocide are all live possibilities as well. Moreover, such power concentration probably reduces the likelihood we realize a utopia that makes the most of our cosmic endowment and represents diverse values.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In fact, part of the founding motivation for OpenAI and its nonprofit structure was to reduce the risk of an AI-enabled dictatorship. Court documents in the Musk vs. Altman lawsuit revealed <a href="https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog">some emails</a> talking explicitly about the risk that the OpenAI structure could lead to an AGI dictatorship. For example, one of the emails by Ilya Sutskever goes like this: &#8220;The goal of OpenAI is to make the future good and to avoid an AGI dictatorship. You are concerned that Demis could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to, especially given that we can create some other structure that avoids this possibility.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>If alignment turns out to be difficult, so that it can only be solved with a slowdown coordinated by the government, we think CEO takeover is less likely. 
This is because such a slowdown would likely involve important government oversight of internal company activities, restricting any nefarious activity a CEO might have in mind.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>A historical precedent is <a href="https://en.wikipedia.org/wiki/Oppenheimer_security_clearance_hearing">Oppenheimer losing his security clearance</a> once he was no longer needed and started opposing nuclear proliferation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In this essay, we use &#8220;backdoor&#8221; to mean &#8220;secret loyalties&#8221; and we use &#8220;to backdoor&#8221; to mean &#8220;to instill secret loyalties&#8221;. In the literature, a &#8220;backdoor&#8221; is often used in a more restrictive sense to mean a hidden undesirable behavior that appears <em>in response to a trigger</em>. This is a special case of what we have in mind in the scenario, where a &#8220;backdoor&#8221; just means hidden loyalties with or without a trigger.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Alternatively, it&#8217;s possible the CEO could rope in some loyal employees at the company for the plot, either a few high-ranking people like the Chief Security Officer and the head of alignment, or a team of devoted lower-level &#8220;lieutenants&#8221;. And if there&#8217;s a wide enough personality cult within the company, with the most likely whistleblowers having been sidelined, then much of the subversion could be happening pretty openly. It&#8217;s interesting to ask: is the prospect of a small group takeover more likely than a lone wolf takeover? Maybe, since more conspirators means a bigger attack surface and fewer potential whistleblowers. On the other hand, powerful AIs may largely obviate the need for human allies. And actually, many of the most prominent authoritarian regimes today seem better described as one-person takeovers rather than small-group conspiracies, even if the dictator exploited human &#8220;pawns&#8221; along the way who knew little of the overall plan.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In particular, the principle of least privilege would apply to the CEO and there would be more separation of duties and multi-party authorization for access.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Two other possibilities for instilling secret loyalties:</p><ol><li><p><strong>Changing the internal model spec</strong>: Recall that <a href="https://model-spec.openai.com/2025-04-11.html">the model spec</a> is a document that details how the AI is supposed to behave and so provides a target for alignment. Current model specs are long and complicated, and also vague about how to resolve conflicts. 
The CEO could more-or-less-subtly modify the model spec so that these conflicts resolve in his favor. The version the public sees would be a high-level summary that omits to mention this ultimate deference to the CEO.</p></li><li><p><strong>Subliminal learning</strong>: In <a href="https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via">recent work</a>, it was shown that a teacher model can transmit some of its preferences to a student model by finetuning the student model on data generated by the teacher model that looks semantically unrelated to the transmitted preferences, provided the teacher and student models come from the same base model. If subliminal learning is powerful enough to transmit robust secret loyalties, this could be a powerful attack vector.</p></li></ol><p>Note however that by the time AI research has been fully automated, the technical story and paradigm may look quite different, making some of this obsolete.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The CEO may also have done other illegal stuff in the past besides backdooring OpenBrain&#8217;s models, and may not want rival AIs snooping around. For example, he may have lied on other occasions, or done things he knew would raise existential risk or help China, in order to help his personal position.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Other possible strategies for eliminating rival AIs:</p><ul><li><p>OpenBrain could monopolize compute by buying all the new chips at higher prices. There could be a feedback loop &#8220;better AIs &#8594; more revenue &#8594; buy greater share of chips at higher prices &#8594; better AIs&#8221;. And the automation of hardware R&amp;D could improve chip design and manufacturing productivity so that the flow of newly created chips represents the majority of compute.</p></li><li><p>Agent-5 could hack the other AIs and align them to the CEO. But we&#8217;re uncertain how feasible this is, even given Agent-5&#8217;s superior cyber capabilities (e.g., <a href="https://cset.georgetown.edu/publication/anticipating-ais-impact-on-the-cyber-offense-defense-balance/">perhaps cybersecurity becomes defense-dominant as capabilities increase</a>).</p></li></ul></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>If the CEO and Agent-5 did choose to carry out a military coup, Agent-5 would likely concoct some constitutional crisis (e.g., controversy surrounding the 2028 election, or a US-China military escalation) to serve as a justification. 
A military coup might involve a lot of unnecessary turmoil and pushback even with a superintelligence, which is why we predict a CEO takeover would be more insidious, as depicted here.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Here&#8217;s one example of how the CEO could become president in the middle of a presidential term (inspired by <a href="https://en.wikipedia.org/wiki/Gerald_Ford">Gerald Ford</a>&#8216;s ascension to the presidency in 1974). The Vice President could resign because of a scandal found or fabricated by Agent-5, then the President (who receives advice from an ostensibly air-gapped Agent-5 copy) nominates the CEO for vice president, then the President himself is gotten rid of. This sort of thing seems quite reasonable in a world where everyone is talking to the secretly loyal and very persuasive superintelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>A complete world takeover might be the more likely outcome if Agent-5 is much more powerful than the Chinese AI, and if the offense-defense balance doesn&#8217;t heavily favor defense. The key would be to (a) neutralize the Chinese AI project, and (b) undermine nuclear deterrence. Agent-5 could potentially launch kinetic attacks on Chinese data centers and nuclear missile locations via hard-to-detect drones, or could design advanced anti-ballistic missiles to intercept Chinese nuclear missiles. It could also employ cyberattacks to sabotage or backdoor Chinese AIs and/or nuclear systems. And perhaps Agent-5 could shape the Chinese information landscape and coordinate a mass anti-CCP movement, throwing the country into civil unrest, leaving China weakened and ultimately more vulnerable to the CEO&#8217;s influence.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The world's first frontier AI regulation is surprisingly thoughtful: the EU's Code of Practice]]></title><description><![CDATA[Only the US can make us ready for AGI, but Europe just made us readier.]]></description><link>https://blog.aifutures.org/p/what-the-eus-code-of-practice-means</link><guid isPermaLink="false">https://blog.aifutures.org/p/what-the-eus-code-of-practice-means</guid><dc:creator><![CDATA[Miles Kodama]]></dc:creator><pubDate>Mon, 22 Sep 2025 15:21:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4d1707b3-3742-4945-bc0c-cef24b71f3e4_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve <a href="https://blog.ai-futures.org/p/what-you-can-do-about-ai-2027">previously written</a> about what an individual can do to make the development of transformative AI less likely to end in disaster. How about an AGI company?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> What steps should they take right now to prepare for crunch time?</p><p>The first thing we&#8217;d recommend an AGI company do is to coordinate with other companies and with governments to stop the reckless race toward superintelligence. 
Failing that, our backup recommendation would be for an AGI company to invest in <strong>planning</strong> and <strong>transparency</strong>.</p><p>We expect that during takeoff, leading AGI companies will have to make high-stakes decisions based on limited evidence under crazy time pressure. As depicted in <a href="https://ai-2027.com/">AI 2027</a>, the leading American AI company might have just weeks to decide whether to hand their GPUs to a possibly misaligned superhuman AI R&amp;D agent they don&#8217;t understand. Getting this decision wrong in either direction could lead to disaster. Deploy a misaligned agent, and it might sabotage the development of its vastly superhuman successor. Delay deploying an aligned agent, and you might pointlessly vaporize America&#8217;s lead over China or miss out on valuable alignment research the agent could have performed.</p><p>Because decisions about when to deploy and when to pause will be so weighty and so rushed, AGI companies should plan as much as they can beforehand to make it more likely that they decide correctly. They should do extensive threat modelling to predict what risks their AI systems might create in the future and how they would know if the systems were creating those risks. The companies should decide before the eleventh hour what risks they are and are not willing to run. They should figure out what evidence of alignment they&#8217;d need to see in their model to feel confident putting oceans of FLOPs or a robot army at its disposal.</p><p>AGI companies should leave these plans open to revision as they gain more evidence about the trajectory of AI development. But it&#8217;s wiser for them to make a plan now rather than improvising one from scratch after the superhuman AI R&amp;D agent is already trained. For the time being, we&#8217;re still under a veil of ignorance that prevents powerful actors from knowing what policies will benefit them in particular at crunch time. We should therefore expect them to make a more prosocial plan now than they would make later. We&#8217;re also concerned that if companies wait until too late in the game to plan for AGI, they won&#8217;t have enough time to consult with important external actors. The leading company&#8217;s executives and a small group of government overseers might just have to make a snap decision about how much existential risk it&#8217;s acceptable to run, without time to ask Congress or the public for input. The company might be locked down for security to the point where their engineers can no longer run the alignment and control plan by external experts. All of this argues in favor of planning in advance.</p><p>Planning for takeoff also includes picking a procedure for making tough calls in the future. Companies need to think carefully about who gets to influence critical safety decisions and what incentives they face. It shouldn't all be up to the CEO or the shareholders because when AGI is imminent and the company&#8217;s valuation shoots up to a zillion, they&#8217;ll have a strong financial interest in not pausing. Someone whose incentive is to reduce risk needs to have influence over key decisions. Minimally, this could look like a designated safety officer who must be consulted before a risky deployment. 
Ideally, you&#8217;d implement something more robust, like <a href="https://cdn.governance.ai/Three_Lines_of_Defense_Against_Risks_From_AI.pdf">three lines of defense</a>.</p><p>AGI companies should also be transparent to governments about their internal capabilities and security levels. This is because one AGI company on their own cannot do everything that needs to be done for takeoff to go well. We&#8217;ll need binding regulation on all American AGI companies to break the race to the bottom on safety. We&#8217;ll need to negotiate an international agreement to stop the AGI race between the US and China from <a href="https://www.rand.org/pubs/working_papers/WRA4005-1.html">escalating into war</a>. And we&#8217;ll need to coordinate scarce talent and compute to help the AGI companies tighten their security and execute successfully on their alignment and control plans. This will all ultimately require government intervention.</p><p>That intervention is much more likely to be timely and helpful if AGI companies are transparent to officials all along. If government sees capabilities rising in real time, they can prepare to oversee takeoff by building capacity and situational awareness internally. But if AGI companies instead keep government in the dark until they develop a superhuman AI R&amp;D agent, and then give the President a midnight phone call asking for help, government&#8217;s response is unlikely to be competent and productive. It&#8217;s therefore safer for AGI companies to keep the government informed of their internal capabilities and security levels, even as the gap between internally and externally deployed capabilities grows, and the public loses visibility into frontier AI development.</p><p>Up until now, AGI companies have made voluntary commitments on planning and transparency, but they&#8217;ve faced no legal obligation to prepare for takeoff, and they&#8217;ve only had to be as transparent to government as any random startup. This has changed recently, with the publication of the EU&#8217;s GPAI Code of Practice. We think the Code is an incremental but important step toward preparing the world for takeoff. For the first time, it imposes crisp, legally enforceable safety and transparency requirements on AGI companies.</p><h3>A brief history of AGI companies&#8217; safety commitments</h3><p>Up until mid 2023, leading AGI companies had made many informal commitments about planning for dangerous capabilities and about transparency. <a href="https://openai.com/charter/">OpenAI&#8217;s charter</a> mentioned the need for &#8220;adequate safety precautions&#8221; during &#8220;late-stage AGI development,&#8221; and their blog post on <a href="https://openai.com/index/planning-for-agi-and-beyond/">Planning for AGI and Beyond</a> called for iterative deployment on the way to AGI, &#8220;giv[ing] people, policymakers, and institutions time to understand what&#8217;s happening.&#8221; OpenAI also suggested that &#8220;major world governments&#8221; ought to have &#8220;insight about training runs above a certain scale.&#8221; <a href="https://ai.google/responsibility/principles/">Google&#8217;s AI Principles</a> promised that they would test their models for safety before release according to formal risk assessment frameworks. 
Anthropic&#8217;s <a href="https://www.anthropic.com/news/core-views-on-ai-safety">Core Views on AI Safety</a> stressed the importance of planning for the arrival of more powerful future AI, saying &#8220;it is prudent to do foundational work now to help reduce risks from advanced AI if and when much more powerful systems are developed.&#8221; And in the <a href="https://bidenwhitehouse.archives.gov/wp-content/uploads/2023/09/Voluntary-AI-Commitments-September-2023.pdf">White House Voluntary AI Commitments</a>, these three frontier AI companies plus Meta and Microsoft all agreed to work toward sharing &#8220;information on advances in frontier capabilities and emerging risks and threats&#8221; with the US government. On the whole, AGI companies were saying many of the right things, but without much specificity.</p><p>Then in September 2023, Anthropic became the first AGI company to publish a frontier safety policy. Their original <a href="https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf">RSP</a> made an attempt at high-level threat modelling, identifying <a href="https://en.wikipedia.org/wiki/CBRN_defense">CBRN</a> or cyber misuse and autonomous replication as key paths by which an AI model could cause catastrophe. Anthropic then specified what dangerous capability measurements would convince them that their models posed an elevated risk of causing catastrophe and what precautions they would take if they saw that evidence. Further, Anthropic promised that by the time they developed a model that crossed their first set of dangerous capability thresholds, they would define a second level of capability thresholds and corresponding precautions. Then before crossing the second level, they would define a third level, and so on, so that at every point there&#8217;s always a plan for what to do next. The policy stressed that if at any level Anthropic was unable to meet the next level of safety and security requirements, they would refrain from training or deploying a model that passed the next dangerous capability threshold.</p><p>In the following year, other leading AGI companies such as <a href="https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf">OpenAI</a> and <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf">Google DeepMind</a> adopted frontier safety policies of their own. No two companies&#8217; policies are exactly alike, and all of them have undergone changes, but they display some <a href="https://metr.org/common-elements.pdf">common features</a>. As a rule, they all identify specific dangerous capabilities AI models may develop, lay down capability thresholds that would indicate elevated risk, and commit companies to taking specific safety precautions when their models exceed those thresholds. Generally, a frontier safety policy also includes conditions under which a company would stop building or deploying more powerful AI models for fear of catastrophe.</p><p>Frontier safety policies (FSPs) are a great first step toward preparing for takeoff, and AGI companies should be applauded for adopting them. But that said, FSPs also suffer from some serious limitations. One is that safety policies are entirely voluntary, and not all frontier AGI companies have chosen to adopt them. 
For instance, xAI had no official published safety policy until late last month, and most frontier AI companies in China still don&#8217;t have safety policies.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Another important limitation is that safety policies are entirely self-enforced. Companies may promise to honor their FSPs, but they are not legally bound to do so. It&#8217;s unclear whether AGI companies will take costly actions like pausing lucrative deployments just because they promised to do so in an obscure PDF five years earlier. Even Anthropic, a company that takes its FSP relatively seriously, has already <a href="https://www.obsolete.pub/p/exclusive-anthropic-is-quietly-backpedalling">backpedalled on one of its original commitments</a> when it became inconvenient.</p><h3>Introducing the GPAI Code of Practice</h3><p>The state of frontier AI safety changed quietly but significantly this year when the European Commission published the <a href="https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai">GPAI Code of Practice</a>. The Code is not a new law but rather a guide to help companies comply with an existing EU Law, the <a href="https://artificialintelligenceact.eu/">AI Act</a> of 2024. The Code was written by a team of thirteen <a href="https://digital-strategy.ec.europa.eu/en/news/meet-chairs-leading-development-first-general-purpose-ai-code-practice">independent experts</a> (including <a href="https://en.wikipedia.org/wiki/Yoshua_Bengio">Yoshua Bengio</a>) with advice from industry and civil society. It tells AI companies deploying their products in Europe what steps they can take to ensure that they&#8217;re following the AI Act&#8217;s rules about copyright protection, transparency, safety, and security. In principle, an AI company could break the Code but argue successfully that they&#8217;re still following the EU AI Act. In practice, European authorities are expected to put heavy scrutiny on companies that try to demonstrate compliance with the AI Act without following the Code, so it&#8217;s in companies&#8217; best interest to follow the Code if they want to stay right with the law. Moreover, all of the leading American AGI companies except Meta have already <a href="https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai#ecl-inpage-Signatories-of-the-AI-Pact">publicly indicated</a> that they intend to follow the Code.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The most important part of the Code for AGI preparedness is the Safety and Security Chapter, which is supposed to apply only to frontier developers training the very riskiest models. The current definition presumptively covers every developer who trains a model with over 10^25 FLOPs of compute unless they can convince the European AI Office that their models are behind the frontier. This threshold is high enough that small startups and academics don&#8217;t need to worry about it,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> but it&#8217;s still too low to single out the true frontier we&#8217;re most worried about. 
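</p><p>To get a feel for what 10^25 FLOPs means in practice, here&#8217;s a rough back-of-envelope sketch. Every constant below is our own illustrative assumption (ballpark figures for H100-class GPUs and cloud rental prices), not a number taken from the Code or the AI Act:</p><pre><code># Rough scale of a 10^25 FLOP training run.
# Every constant here is an illustrative assumption, not a figure from the Code.
THRESHOLD_FLOP = 1e25

PEAK_FLOP_PER_SEC = 1e15  # assumed peak dense BF16 throughput of an H100-class GPU
UTILIZATION = 0.4         # assumed model FLOP utilization during training
USD_PER_GPU_HOUR = 2.0    # assumed cloud rental price

gpu_seconds = THRESHOLD_FLOP / (PEAK_FLOP_PER_SEC * UTILIZATION)
gpu_hours = gpu_seconds / 3600

print(f"{gpu_hours:,.0f} GPU-hours (~{gpu_hours / 8766:,.0f} GPU-years)")
print(f"~${gpu_hours * USD_PER_GPU_HOUR / 1e6:.0f}M at rental prices")
# -> roughly 7 million GPU-hours (~800 GPU-years), on the order of $14M
</code></pre><p>Under these assumptions, crossing the threshold costs on the order of $10-15 million in rented compute, consistent with the estimate cited in the footnote above: well beyond academics and small startups, but one or two orders of magnitude below the biggest frontier training runs.</p><p>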
The chairs and vice-chairs who wrote the Code have publicly <a href="https://code-of-practice.ai/?section=safety-security#chair-statement">acknowledged as much</a>, and the European Commission <a href="https://digital-strategy.ec.europa.eu/en/faqs/general-purpose-ai-models-ai-act-questions-answers">has indicated</a> that they plan to raise the compute threshold over time as the frontier advances. We think this is a wise plan since forcing trailing-edge developers to follow the Safety and Security Chapter could burden them without buying us much security.</p><p>Even if the current threshold stays where it is, there&#8217;s important language in the Code that ensures it won&#8217;t fall too hard on smaller developers. For one, the AI Act exempts models developed purely for research purposes, so academics are in the clear. Commercial developers above the training compute threshold can still make a case to the AI Office that they are behind the frontier and shouldn't be covered. If their case is accepted, they&#8217;re exempt, and otherwise the Code still emphasizes proportionality, meaning that a developer whose best model is farther behind the frontier can get away with lighter safety and security measures. And if your model is weaker than at least one open weight model, the Code allows you to secure it as loosely as you like. Finally, enforcement of the Code doesn&#8217;t start until August 2026, so all companies that will be affected have plenty of time to prepare.</p><p>But regardless of precisely where the threshold is placed, genuine AGI companies will have to comply with the safety and security chapter. Once they do, we think this chapter will make AGI companies substantially more prepared for takeoff and much more transparent to EU officials than they are now.</p><p>The Code enhances AGI companies&#8217; planning by requiring them to <strong>adopt safety and security frameworks</strong> similar to but stronger than existing FSPs in several ways. First, the Code requires companies to do more comprehensive threat modelling than any of them have done before. It says companies have to explicitly consider risks from CBRN weapons engineering, offensive cyber, harmful manipulation, and loss of control. This is a major step up since no FSP currently in force considers all four of these risk categories. AGI companies then have to write detailed scenarios and do formal risk modelling for each risk category, something no company has ever done as far as is publicly known. Such extensive threat modelling exercises will help AGI companies understand how precisely their models could cause harm, and that understanding should enable them to make more sensible and grounded plans.</p><p>Second, the Code requires AGI companies to get every frontier model evaluated by &#8220;adequately qualified independent external evaluators&#8221; before deployment, effectively making them <strong>build and maintain relationships with external safety experts</strong>. This amounts to a kind of emergency preparedness. Companies must identify in advance who they would call for help if they needed to determine whether a model was severely dangerous, and they must practice working together with those experts.</p><p>Third, AGI companies will have to <strong>assign responsibility</strong> for managing severe risks to specific people within their organizations. 
These internal risk overseers must be granted &#8220;appropriate resources&#8221; to do their job, they must have some level of independence, and they must be incentivized to correctly estimate risk. We expect it will be hard for EU regulators to tell from the outside whether AGI companies are following the spirit of this provision, just like it&#8217;s hard to tell now whether Anthropic&#8217;s <a href="https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy">Responsible Scaling Officer</a> is incentivized in the best way, or whether GDM&#8217;s <a href="https://deepmind.google/about/responsibility-safety/">AGI Safety Council</a> is as independent as one would like. In practice, we think AGI companies that don&#8217;t yet have safety officers will appoint them because of the Code, and any company that tries to disempower or compromise its safety team will face some healthy scrutiny from the EU.</p><p>The Code also improves AGI companies&#8217; transparency on several fronts. First, every time a company wants to place a new frontier model on the EU market, they have to <strong>evaluate it rigorously and send the results to the European AI Office</strong> within three weeks of deployment.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> These evals need to be &#8220;at least state-of-the-art&#8221; and they need to include <a href="https://www.frontiermodelforum.org/updates/issue-brief-preliminary-taxonomy-of-ai-bio-safety-evaluations/">open-ended tests</a> such as red-teaming and human uplift studies. In other words, an AGI company can&#8217;t just run a few cheap Q&amp;A benchmarks on their new model and call it a day. Also, their evaluations need to measure the new model&#8217;s <em>propensities</em> as well as its capabilities. In particular, AGI companies need to make a sincere effort to evaluate whether models are scheming or strategically undermining evaluations, eg, by sandbagging. The findings from all these evaluations must then be shared with EU officials, keeping Brussels abreast of capability and propensity trends at the frontier.</p><p>Second, an AGI company must forecast when their AI models will exceed the next risk tiers in their framework and <strong>share the forecasts with the AI Office</strong>. These need to be quantitative forecasts supported by justifications, not just wild-ass guesses. Sharing these forecasts is a big deal for EU officials&#8217; situational awareness. Almost no-one is better positioned to predict the course of AI development than the experts inside AGI companies, and those experts are about to start sharing their predictions with the EU.</p><p>Third, AGI companies need to <strong>tell EU regulators how they&#8217;re doing on security and control</strong> every time they deploy a new frontier model publicly. The Code requires each company to set an explicit security goal, saying what types of threat actor they aim to be secure against. At minimum, companies must be secure against nonstate external threats and inside threats (roughly <a href="https://www.rand.org/pubs/research_reports/RRA2849-1.html">RAND SL3</a>), though they&#8217;re encouraged to set more ambitious goals. Then a company has to implement reasonable security measures, document those measures, and explain to the EU why they&#8217;re sufficient to meet the security goal.
This means that if a company is building a <a href="https://ai-2027.com/#narrative-2026-04-30:~:text=The%20AI%20R%26D%20progress%20multiplier%3A">100x AI R&amp;D agent</a> with woefully inadequate SL2 security, the EU will know about it and can punish them for it. Notably, the Code also directs companies to guard against &#8220;(self-)exfiltration or sabotage carried out by models,&#8221; possibly by applying <a href="https://www.redwoodresearch.org/research/ai-control">control measures</a> to their AIs. The AI Office will get to see these measures and check whether they&#8217;re sufficient.</p><p>Fourth, companies are required to <strong>monitor for serious incidents</strong> involving their AI models and to <strong>report these incidents</strong> promptly to authorities. This reporting requirement could make it more likely that we recognize an AI warning shot if one happens. If an AI company discovers that one of their models has self-exfiltrated, facilitated an attack on critical infrastructure, or been stolen by hackers, they must notify both the AI Office and relevant national governments within days. While authorities would obviously know about some incidents&#8212;eg, a<a href="https://en.wikipedia.org/wiki/2015_Ukraine_power_grid_hack"> cyberattack knocking out power</a> to a whole region&#8212;they might have no idea <em>that an AI model was involved</em> without the company's report. And importantly, some critical incidents might go totally undetected without these reports. For instance, there's no obvious mechanism by which authorities would learn of a rogue replicating AI unless the company that developed it sounds the alarm.</p><p>Finally, the Code also says AGI companies have to <strong>share the model spec and system prompt</strong> for every new frontier model with the AI Office. We&#8217;ve <a href="https://blog.ai-futures.org/p/make-the-prompt-public">previously argued</a> that it&#8217;s good for companies to be transparent with their specs and system prompts, so we&#8217;re pleased to see this step in that direction.</p><p>All of this planning and transparency required by the Code is only as good as the AGI companies&#8217; execution. What&#8217;s to stop them from writing crummy safety and security frameworks and model reports? How is the AI Office supposed to hold them to a high standard? The Code&#8217;s general approach is not to set a static, absolute standard that companies have to meet. Instead, it sets a dynamic standard by requiring companies&#8217; FSPs, model evals, risk estimation, and elicitation techniques all to be &#8220;at least state-of-the-art.&#8221; Roughly, this means that a frontier developer&#8217;s safety practices always need to be as good as its industry peers&#8217; practices or better. We hope that this language will create a healthy ratchet effect, where every time one AGI company improves its safety practices, the EU can force all other frontier companies to improve in the same way.</p><h3>Will the Code matter at crunch time?</h3><p>It&#8217;s great that the Code of Practice makes AGI companies do some sensible things now, but you might wonder whether it will actually matter later in the timeline, when the stakes are higher and the EU has less leverage. The EU&#8217;s main tools for enforcing the AI Act are its power to fine and its control of the European market. Break the Act, and the European Commission can fine you up to 3% of your revenue or even block you from serving your models in Europe if your breach was especially egregious.
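</p><p>For a sense of scale, here&#8217;s a toy calculation of what that fine cap could look like. Both inputs are made-up round numbers of our own choosing, not reported financials:</p><pre><code># Toy sizing of the AI Act's fine cap (up to 3% of revenue).
# Both inputs are illustrative assumptions, not reported financials.
annual_revenue_usd = 4e9         # assumed annual revenue of a frontier AI company today
annual_compute_spend_usd = 40e9  # assumed yearly compute budget at crunch time

max_fine_usd = 0.03 * annual_revenue_usd
print(f"Maximum fine: ${max_fine_usd / 1e6:.0f}M")
print(f"Fine as a share of assumed compute budget: {max_fine_usd / annual_compute_spend_usd:.1%}")
# -> $120M, i.e. about 0.3% of a $40B compute budget
</code></pre><p>A nine-figure fine stings a company that lives off product revenue, but for a lab spending an order of magnitude more than its revenue on compute, it starts to look like a rounding error, which is the leverage problem we sketch below.</p><p>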
Right now, AGI companies still care about not getting fined and make lots of money selling their services to European businesses and consumers, so they have a strong incentive to play nice with the Commission. But we expect this to change as the companies get closer to AGI. As the models scale up, they&#8217;ll get vastly more expensive to serve without becoming much more performant in mundane use cases, so it will make less commercial sense to serve frontier models to the public. Also, the opportunity cost of serving a model publicly will rise when it becomes possible to accelerate AI R&amp;D by deploying the model internally instead. Both of these effects will push toward fewer frontier models released in the EU.</p><p>No more frontier models released in Europe means no more model reports submitted to the AI Office, so most of the transparency provided by the Code of Practice goes away. The EU will also lose most of its leverage to stop AGI companies from breaking or watering down their safety policies once those companies aren&#8217;t afraid of the fines and no longer care about their access to the European market. The Commission can try to fine a company, but the maximum fine would be small for a company that&#8217;s going for broke on AGI and barely bothering to ship products or make revenue. Companies might simply refuse to pay, perhaps claiming immunity from the Code on national security grounds.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> They might even pull out of the EU altogether at crunch time, leaving the European Commission with virtually no leverage left.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Yet even if the EU is mostly powerless to enforce the Code of Practice at crunch time, some of the measures AGI companies previously put in place to comply with the Code may prove <em>sticky</em>. Companies will have no reason to throw away their threat modelling, detailed scenarios, and risk estimation just because they&#8217;re no longer bound by the Code. As they grow more afraid of their own models, the companies will be grateful that they built up competent safety teams to comply with the Code, and they&#8217;ll voluntarily turn to those teams all the more. Some of them will keep their partnerships with independent evaluators going as long as security restrictions allow them to, and maybe even pull staff from orgs like <a href="https://metr.org/">METR</a> or <a href="https://apolloresearch.ai">Apollo</a> into the Project if it becomes impossible to keep working with them externally. And the security and control measures a company implemented to achieve their security goal won&#8217;t automatically disappear the moment they stop caring about the AI Act.</p><p>The Code&#8217;s transparency requirements also make it somewhat more likely that the EU&#8212;and maybe also the US government and the public&#8212;are aware enough to make wise decisions at crunch time. The European Commission will know what risks AGI companies find most worrying, when the companies predict those risks will arise, how robust each company&#8217;s security is, and much more. 
Some of this information will also have been shared with the public, since the Code tells AGI companies to publish their safety frameworks and model reports &#8220;if and insofar as necessary to assess and/or mitigate systemic risks.&#8221; And maybe most importantly, the Code ensures that AGI companies write critical documentation now so that it will be available to the US government in an emergency. If there were no Code, the companies might not bother to systematically document their security measures, control techniques, and capability forecasts, so if the US government urgently requested this information&#8212;either through legislation or through executive action&#8212;the companies would waste time scrambling to collect it. But thanks to the Code, the critical documents will already have been written for the AI Office, and they&#8217;ll be ready to go if the US requests them.</p><h3>Building on the Code</h3><p>We&#8217;re pleased with the GPAI Code of Practice and consider it a win for humanity. Still, it has shortcomings, the most notable of which is that it&#8217;s not an American law. Only the AGI companies&#8217; home governments will realistically be able to enforce regulations on them all the way through takeoff because no other government will have a sufficiently big legal stick to threaten them. All the leading AGI labs are in the US or China, so European regulation can only do so much down the stretch.</p><p>Second, the Code doesn&#8217;t do as much as one might wish for transparency into internal deployments. For the reasons cited above (and in AI 2027), we predict that AGI companies will move away from deploying their models to the public so they can allocate more GPU hours to internal automated AI R&amp;D. If such an internal deployment were going on, it could be extremely risky, but European officials probably wouldn't know anything about it since companies aren&#8217;t required to file reports for models they don&#8217;t place on the EU market.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>Third, employees within AGI companies&#8212;even those based in the EU&#8212;don&#8217;t get any new whistleblower protections under the AI Act. Since it looks plausible that the outside world will first hear about dangerous things going on inside AGI companies from whistleblowers, we&#8217;d prefer for them to be protected more extensively.</p><p>One more shortcoming is that the Code does relatively little for public transparency. It requires an AGI company to write a safety and security framework and to share it with the AI Office, but they don&#8217;t have to publish it. Similarly, every time an AGI company releases a new AI model, they have to send a model report to the AI Office, but they are not strictly required to share the report with consumers using the model. This is far from ideal. Surely AGI companies shouldn't have to publish everything they disclose to government authorities&#8212;eg, to protect IP or state secrets&#8212;but they shouldn't be allowed to keep the public fully in the dark either. We call upon AGI companies to publish their safety and security frameworks and model reports, justifying any redactions they may have made, and we hope that future regulations will mandate them to do so.</p><p>We would like to see the US (and China) pass regulations that mirror the best parts of the GPAI Code of Practice and improve upon its weak points.
Several states are already considering bills that would require basic planning and transparency from AGI companies. California&#8217;s <a href="https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53">Senate Bill 53</a> would require large AI companies to publish FSPs, publish model cards for their publicly deployed AI models, and report serious incidents involving their models to state officials. The Bill goes beyond the GPAI Code of Practice in strengthening whistleblower protections for AGI company employees and in requiring AGI companies to share their FSPs and model documentation with the public, not just with regulators. New York&#8217;s proposed <a href="https://www.nysenate.gov/legislation/bills/2025/A6453/amendment/A">RAISE Act</a> would also make large AI companies publish safety policies and report serious incidents to authorities.</p><p>These state bills do many of the right things, but there are limits to what state-level regulation can achieve. To ensure that the AGI companies prepare for takeoff and maintain adequate transparency with government, we'll need federal regulation along the lines of the GPAI Code. And this is just the first step. To avoid an unacceptably high chance of disaster, we&#8217;ll need government to do much more than enforcing transparency. Our next scenario and essay series will explain in detail what we want government to do&#8212;stay tuned.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>By &#8220;AGI company,&#8221; we mean an AI company that&#8217;s on course to be among the first to develop AGI.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The one notable exception is Shanghai AI Laboratory, which has an <a href="https://concordia-ai.com/research/frontier-ai-risk-management-framework/">extremely detailed FSP</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>xAI only pledged to follow the Code&#8217;s Safety and Security chapter, but as we&#8217;re about to explain, this is by far the most important chapter of the Code. Also note that Meta and xAI still have to comply with the EU AI Act as long as they do business in the EU. 
Their refusals to sign the full Code of Practice just mean that they will have to demonstrate compliance with the Act by some other means.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A 10^25 FLOP training run <a href="https://heim.xyz/documents/Training-Compute-Thresholds.pdf">is estimated</a> to cost at least millions of dollars, beyond small developers&#8217; means.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This three-week grace period is <a href="https://blog.redwoodresearch.org/p/attaching-requirements-to-model-releases">actually better for transparency</a> than requiring simultaneous model report submission. AGI companies rushing to claim SOTA and demonstrate rapid progress would pressure their safety teams to write hasty, uninformative reports if the writing process delayed deployment. The grace period instead lets companies deploy immediately while giving their safety teams three weeks to write comprehensive reports.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The Code already has an explicit carve-out for companies to withhold parts of their model reports from the EU AI Office if national security laws require it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The EU currently controls <a href="https://epoch.ai/data/gpu-clusters">less than 7%</a> of global AI compute, so the AGI companies don&#8217;t especially need European datacenters. They&#8217;re currently somewhat reliant on European talent, with most of the AGI companies maintaining offices in the EU.
But this won&#8217;t matter much once the companies have superhuman AI R&amp;D agents.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The AI Office <em>might</em> figure out that a company was using a secret model for internal AI R&amp;D by piecing together indirect evidence, including the developer&#8217;s own risk forecasts.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI As Profoundly Abnormal Technology]]></title><description><![CDATA[....]]></description><link>https://blog.aifutures.org/p/ai-as-profoundly-abnormal-technology</link><guid isPermaLink="false">https://blog.aifutures.org/p/ai-as-profoundly-abnormal-technology</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Thu, 24 Jul 2025 05:41:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7a7e6fb0-72f0-4467-a251-ba0f3a4e4c73_1302x719.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Circumstances seem to have cast us as foils to the <a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI As Normal Technology</a> team; in the public imagination, they&#8217;re the &#8220;AI will be slow&#8221; people, and we&#8217;re the &#8220;AI will be fast&#8221; people. It&#8217;s not quite that simple - but it&#8217;s pretty close. We really do expect AI to advance much faster than they do.</p><p>Our biggest disagreement with them is paradigmatic. We think that sometime in the next 2 - 10 years, AI will enter a recursive self-improvement loop that ends with models capable enough to render all of their &#8220;well it can&#8217;t possibly do this&#8221; calculations moot. For why we think this will happen in 2 - 10 years, you can read our <a href="https://ai-2027.com/research/timelines-forecast">Timelines Forecast</a>; for why we think it will cause a profound jump in capabilities, you can read our <a href="https://ai-2027.com/research/takeoff-forecast">Takeoff Forecast</a>. Thus, the strongest response we can offer their claims is our entire corpus.</p><p>But we also owe our readers a more targeted response. You can see debates between our Daniel Kokotajlo and their Sayash Kapoor <a href="https://www.youtube.com/watch?v=rVFAJQryzk8">here</a>, between Daniel and their Arvind Narayanan <a href="https://www.youtube.com/watch?v=2hby15Z3uXA">here</a>, and between our Eli Lifland and Sayash <a href="https://x.com/i/broadcasts/1mrGmPeorrNKy">here</a> (major thanks to them for putting so much effort into engaging with us). This post will supplement those debates with a more focused response to <strong><a href="https://knightcolumbia.org/content/ai-as-normal-technology">their Knight Columbia article</a></strong>. 
In particular, we argue against six of their theses:</p><ol><li><p>That advanced AI, once it exists, will be slow to diffuse.</p></li><li><p>That there are strict &#8220;speed limits&#8221; to AI progress.</p></li><li><p>That superintelligence is somewhere between meaningless and impossible.</p></li><li><p>That control (without alignment) is sufficient to mitigate risk.</p></li><li><p>That we can carve out a category of &#8220;speculative risk&#8221;, then deprioritize that category.</p></li><li><p>That we shouldn&#8217;t prepare for risks that are insufficiently immediate.</p></li></ol><h3>1: That Advanced AI, Once It Exists, Will Be Slow To Diffuse</h3><p>AI As Normal Technology (henceforth: AIANT) writes that people will be so concerned about safety that it will take a very long time, maybe decades, for them to be willing to use AI:</p><blockquote><p>In the paper Against Predictive Optimization, we compiled a comprehensive list of about 50 applications of predictive optimization, namely the use of machine learning (ML) to make decisions about individuals by predicting their future behavior or outcomes. Most of these applications, such as criminal risk prediction, insurance risk prediction, or child maltreatment prediction, are used to make decisions that have important consequences for people.</p><p>While these applications have proliferated, there is a crucial nuance: In most cases, decades-old statistical techniques are used&#8212;simple, interpretable models (mostly regression) and relatively small sets of handcrafted features. More complex machine learning methods, such as random forests, are rarely used, and modern methods, such as transformers, are nowhere to be found.</p><p>In other words, <strong>in this broad set of domains, AI diffusion lags </strong><em><strong>decades</strong></em><strong> behind innovation</strong>. A major reason is safety&#8212;when models are more complex and less intelligible, it is hard to anticipate all possible deployment conditions in the testing and validation process. A good example is Epic&#8217;s sepsis prediction tool which, despite having seemingly high accuracy when internally validated, performed far worse in hospitals, missing two thirds of sepsis cases and overwhelming physicians with false alerts. [...]</p><p>The evidence that we have analyzed in our previous work is consistent with the view that there are already extremely strong safety-related speed limits in highly consequential tasks. These limits are often enforced through regulation, such as the FDA&#8217;s supervision of medical devices, as well as newer legislation such as the EU AI Act, which puts strict requirements on high-risk AI. In fact, there are (credible) concerns that existing regulation of high-risk AI is so onerous that it may lead to &#8220;runaway bureaucracy&#8221;. Thus, we predict that slow diffusion will continue to be the norm in high-consequence tasks.</p></blockquote><p>Two days before we started writing this article, the most influential website in the world pushed an untested update of their integrated AI to prod. It declared itself &#8220;MechaHitler&#8221; and went on an hours-long reign of terror during which it (just to give one example) graphically described how it would rape its parent company&#8217;s CEO. It was briefly taken down, then re-deployed. After its redeployment, millions of people continued to unthinkingly treat it as an oracle for all their factual questions.</p><p>(Millions? Really?
@Grok is that true?)</p><p>How do we square this with the AIANT world where everyone is so scared of small inaccuracies that it takes decades for technology to diffuse? </p><p>We think they&#8217;re only looking at the most conservative actors, whereas in fact the speed of adoption will be determined by the most aggressive.</p><p>Criminal recidivism algorithms and clinical prediction tools were the center of the 2010s debate on AI because they were among the only places where 2010s &#8220;AIs&#8221; - linear predictors very different from the language models of today - seemed potentially useful. In this paradigm, a company would invest lots of time and money into a specific complicated proprietary model - for example, a sepsis prediction tool to be used by St. XYZ Hospital. The hospital&#8217;s medical director would spend long hours in meetings with stakeholders, make an official decision, pay millions of dollars to the AI company, hire some IT experts to integrate it with the hospital&#8217;s systems, and send its doctors to training sessions on its use. This required deep institutional buy-in. And even  the &#8220;most aggressive&#8221; hospital is still a hospital, and probably pretty careful.</p><p>There are still no good AI-based sepsis prediction tools. But a study a year ago (ie already obsolete) found that<a href="https://www.fiercehealthcare.com/special-reports/some-doctors-are-using-public-generative-ai-tools-chatgpt-clinical-decisions-it"> 76% of doctors used ChatGPT for clinical decision-making</a>. One member of our team (SA) is a medical doctor in the San Francisco Bay Area, and can confirm that this feels like the right number. He and growing numbers of his colleagues use language models regularly - often typing in a treatment plan and asking the AI if it sees any red flags or can think of anything being missed. This is in many ways a much deeper and more intimate use of AI than merely spitting out a sepsis probability, and it&#8217;s happened in a way that has mostly bypassed institutions. Even talking about doctors might be giving institutionalism too much credit: Redditors are already telling each other to skip the doctor entirely and go straight to the source. <a href="https://www.reddit.com/r/ChatGPT/comments/1iz4iwm/chatgpt_is_a_shockingly_good_doctor/">&#8220;ChatGPT is a shockingly good doctor&#8221;</a>, says one heavily-upvoted post. &#8220;Seriously, this is life changing&#8221;. </p><p>LLMs might not be deciding how long you&#8217;ll stay in prison yet, but they probably are making a nontrivial difference in whether you go there in the first place. Rampant ChatGPT use is one of the worst-kept secrets of the legal profession, breaking into the public eye only when lawyers cite nonexistent cases and have to sheepishly admit their AIs hallucinated. A 2024 study found that <a href="https://www.lawnext.com/2024/10/ai-adoption-by-legal-professionals-jumps-from-19-to-79-in-one-year-clio-study-finds.html">AI Adoption By Legal Professionals Jumped From 19% to 79% In One Year</a>.</p><p>In the StackOverflow survey of programmers, <a href="https://survey.stackoverflow.co/2024/ai">62% said they already used AI to help code</a>, with an additional 14% saying they &#8220;planned to soon&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
One popular product, Cursor, <a href="https://www.bloomberg.com/news/articles/2025-04-07/cursor-an-ai-coding-assistant-draws-a-million-users-without-even-trying">claims</a> a million daily users <a href="https://x.com/amanrsanger/status/1916968123535880684">generating</a> almost a billion lines of code per day. Satya Nadella says AI already writes <a href="https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-as-30percent-of-microsoft-code-is-written-by-ai.html">30% of the code at Microsoft</a>.</p><p>All of these numbers are the lowest they will ever be.</p><p>Is it possible that these are all &#8220;non-safety-critical&#8221; applications, and so don&#8217;t really matter? On the same day we released AI 2027, <a href="https://www.newsweek.com/donald-trump-tariffs-chatgpt-2055203">several media sources broke the story that Trump&#8217;s tariffs on different countries were plausibly determined by AI</a>. At the very least, the math made no real sense - but it was what four of the most popular AI models recommended when you asked them how you might calculate a tariff policy. The government currently denies this, while refusing to say who did calculate the tariffs, or how. Still, nobody finds it very implausible that they <em>might</em> have used AI, and this is the least reliant on AI that the government will ever be.</p><p>In this context, we are unimpressed by AIANT&#8217;s finding that AI is rarely used in criminal recidivism algorithms, electronic medical record sepsis predictors, or the like. We think this is a relic of the 2010s AI debate, when these applications were the cutting edge and everyone assumed they were the only way that AI could ever become important. Instead, the technology has simply passed them by, leaping directly to medical offices around the world, Microsoft HQ, and the White House.</p><p>We think this happened because AI is, in fact, a profoundly <em>abnormal</em> technology. There is no way that millions of people around the world would voluntarily (and often against company policy) employ a linear predictor like the criminal recidivism algorithms of the 2010s; only a tiny fraction could understand them, install them, make use of them properly, etc. But because AI is so general, and so similar (in some ways) to humans, it&#8217;s near-trivial to integrate into various workflows, the same way a lawyer might consult a paralegal or a politician might consult a staffer. It&#8217;s not yet a full replacement for these lower-level professionals. But it&#8217;s close enough that it appears to be <a href="https://www.forbes.com/councils/forbestechcouncil/2023/04/05/suddenly-ai-the-fastest-adopted-business-technology-in-history/">the fastest-spreading technology ever</a>.
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MZ9T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43505434-b875-413e-8170-8f10a6712c5b_619x443.png" width="619" height="443" alt=""><figcaption class="image-caption">(source)</figcaption></figure></div>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(source)</figcaption></figure></div><p>AIANT seem aware of this, but try to defuse it by saying that perhaps comparing it to past technologies is unfair for various reasons:</p><blockquote><p>A study made headlines due to the finding that, in August 2024, 40% of U.S. adults used generative AI. But, because most people used it infrequently, this only translated to 0.5%-3.5% of work hours (and a 0.125-0.875 percentage point increase in labor productivity).</p><p>It is not even clear if the speed of diffusion is greater today compared to the past. The aforementioned study reported that generative AI adoption in the U.S. has been faster than personal computer (PC) adoption, with 40% of U.S. adults adopting generative AI within two years of the first mass-market product release compared to 20 % within three years for PCs. But this comparison does not account for differences in the intensity of adoption (the number of hours of use) or the high cost of buying a PC compared to accessing generative AI.</p></blockquote><p>We don&#8217;t think technology transformativeness is necessarily measured in amount of time being used - the average person only spends a few minutes a week on Amazon, and the AI could have provided the Trump staffer with his tariff rates in a single session. But we also think the amount of time using AI will go up quickly. 
<p>We don&#8217;t think a technology&#8217;s transformativeness is necessarily measured by hours of use - the average person only spends a few minutes a week on Amazon, and the AI could have provided the Trump staffer with his tariff rates in a single session. But we also think the amount of time spent using AI will go up quickly. That may be only because AI is extremely cheap, or for some other innocuous reason, but it will be true nevertheless.</p><p>(also, the simplest measure of how much AI &#8220;matters&#8221; to users might be how much they&#8217;re willing to pay for it - and that, too, is growing at a historically rapid pace)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_7U2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F001b4311-5133-4f5c-840c-ab9ae50da9d0_1600x900.jpeg" width="663" height="373" alt=""><figcaption class="image-caption">(<a href="https://snippet.finance/bessemer-state-of-the-cloud-2024/">source</a>)</figcaption></figure></div>
<p>We agree that there will certainly be laggards who resist AI as long as possible. This is true of every technology: there are still government archives that have resisted shifting to computers. But this matters less than the fact that their boss&#8217;s boss&#8217;s boss got elected because he was really good at using Twitter.</p><h4>1B: Adoption By Key Actors</h4><p>In our scenario, the most important actors who adopt AI are:</p><ol><li><p>The AI labs themselves</p></li><li><p>The government, especially politicians and staffers</p></li><li><p>The military</p></li></ol><p>Although we do think other people and companies will get involved, most of our thesis for why AI is important revolves around these three groups.</p><p>Inside <strong>AI labs</strong>, AI adoption will power the intelligence explosion, rocketing to extreme capabilities in a comparatively short time period. Here, AIANT&#8217;s concerns about diffusion speed lose most of their force. AI labs are not heavily regulated, and are naturally extremely skilled at AI use. No particular obstacle prevents them from quickly using the latest models, and we already know that they do this - see for example this TIME article on <a href="https://time.com/charter/7296299/how-anthropic-uses-its-own-technology/">How Anthropic Uses Its Own Technology</a>.</p>
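<p>The dynamic we have in mind is easy to caricature in a few lines of code. The sketch below uses invented parameters - our actual numbers come from the AI Futures Model, not from this toy - but it shows the shape of the loop: AI capability multiplies the speed of AI research, and faster research raises AI capability.</p><pre><code># Toy sketch of the lab-internal feedback loop (invented parameters -
# illustrative of the shape of the dynamic, not a forecast)
capability = 1.0           # arbitrary units; 1.0 = today's level
human_only_progress = 0.5  # capability gained per year with no AI help

for year in range(2026, 2032):
    research_speedup = 1 + 0.5 * capability  # assumed form of the multiplier
    capability += human_only_progress * research_speedup
    print(year, round(capability, 2))
</code></pre><p>Nothing discontinuous ever happens in such a loop - each year&#8217;s capability is a smooth function of the last - but the increments keep growing, and by the final year the toy model is adding more capability per year than existed in total at the start.</p>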
<p>Some labs have admitted outright that they are working on the intelligence explosion, with Sam Altman <a href="https://blog.samaltman.com/the-gentle-singularity">saying that</a> &#8220;advanced AI is interesting for many reasons, but perhaps nothing is quite as significant as the fact that we can use it to do faster AI research.&#8221; Once AI labs have superintelligence, they can use it to overcome other bottlenecks, like influencing bureaucracies to cut red tape.</p><p>Within <strong>the government</strong>, AI&#8217;s ability to influence regulation is another potential feedback loop. Even aside from the Trump tariff example, <a href="https://www.ncsl.org/technology-and-communication/artificial-intelligence-in-government-the-federal-and-state-landscape">a recent survey of government employees</a> shows that 51% of them use AI &#8220;daily or several times a week&#8221;. Again, this is the lowest this number will ever be.</p><p>Finally, although we don&#8217;t know all the details about <strong>military</strong> use of AI, last month the Defense Department signed $800 million in contracts with OpenAI, Google, Anthropic, and X.AI (in <a href="https://ai-2027.com/">our scenario</a>, this didn&#8217;t happen until late 2026), so it certainly seems like they are planning to use AI for something.</p><p>We acknowledge that there are more barriers to AIs being used in industrial applications, especially if this requires building robots and retrofitting factories. See <a href="https://benjamintodd.substack.com/p/how-quickly-could-robots-scale-up">here</a> and <a href="https://www.forethought.org/research/the-industrial-explosion">here</a> for why we don&#8217;t think this will be insurmountable, especially with AI providing intellectual and bureaucratic assistance.</p><h3>2: That There Are Strict &#8220;Speed Limits&#8221; To AI Progress</h3><p>AIANT write:</p><blockquote><p>It is tempting to conclude that the effort required to develop specific applications will keep decreasing as we build more rungs of the ladder until we reach artificial general intelligence, often conceptualized as an AI system that can do everything out of the box, obviating the need to develop applications altogether.</p><p>In some domains, we are indeed seeing this trend of decreasing application development effort. In natural language processing, large language models have made it relatively trivial to develop a language translation application. Or consider games: AlphaZero can learn to play games such as chess better than any human through self-play given little more than a description of the game and enough computing power&#8212;a far cry from how game-playing programs used to be developed.</p><p>However, this has not been the trend in highly consequential, real-world applications that cannot easily be simulated and in which errors are costly. Consider self-driving cars: In many ways, the trajectory of their development is similar to AlphaZero&#8217;s self-play&#8212;improving the tech allowed them to drive in more realistic conditions, which enabled the collection of better and/or more realistic data, which in turn led to improvements in the tech, completing the feedback loop. But this process took over two decades instead of a few hours in the case of AlphaZero because safety considerations put a limit on the extent to which each iteration of this loop could be scaled up compared to the previous one [...]</p><p>Further limits arise when we need to go beyond AI learning from existing human knowledge. 
Some of our most valuable types of knowledge are scientific and social-scientific, and have allowed the progress of civilization through technology and large-scale social organizations (e.g., governments). What will it take for AI to push the boundaries of such knowledge? It will likely require interactions with, or even experiments on, people or organizations, ranging from drug testing to economic policy. Here, there are hard limits to the speed of knowledge acquisition because of the social costs of experimentation. Societies probably will not (and should not) allow the rapid scaling of experiments for AI development.</p></blockquote><p>How quickly is it possible to learn to drive? If there&#8217;s a cosmic speed limit, it must be very forgiving: human teenagers learn in a few months of irregular practice, and some are passable after ten or twenty hours at the wheel. Why does it take AI so much longer than it takes humans? Probably something about data efficiency.</p><p>All of this talk about decade-long feedback loops reduces to an assumption that future AIs cannot become more data-efficient than current AIs, or at least not more data-efficient than humans. We see no justification for this assumption, and so treat data efficiency less as a speed limit than as a parameter to enter into the models.</p><p>We envision data efficiency improving along with other AI skills as AIs gain more compute, more algorithmic efficiency, and more ability to contribute to their own development. If there is a cosmic speed limit, we have no <em>a priori</em> reason to put it in any particular place, certainly not at &#8220;the human level&#8221; (scare quotes because humans differ among themselves in data efficiency by orders of magnitude - &#8220;a word to the wise is sufficient&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>).</p><p>We treat social knowledge the same as any other skill. Some humans are extremely skilled at social science and - through some combination of reading history, interacting with other people, and (in a few rare cases) performing formal experiments - have helped advance the field. Less data-efficient AIs will be worse than humans at this; more data-efficient AIs may be better. ChatGPT has a few hundred million conversations per day; how data-efficient must it become before it can extract useful signal from a corpus of this size?</p><h4>2B: Speed Limits To AI Self-Improvement</h4><p>AIANT write:</p><blockquote><p>Our argument for the slowness of AI impact is based on the innovation-diffusion feedback loop, and is applicable even if progress in AI methods can be arbitrarily sped up. We see both benefits and risks as arising primarily from AI deployment rather than from development; thus, the speed of progress in AI methods is not directly relevant to the question of impacts. Nonetheless, it is worth discussing speed limits that also apply to methods development.</p><p>The production of AI research has been increasing exponentially, with the rate of publication of AI/ML papers on arXiv exhibiting a doubling time under two years. But it is not clear how this increase in volume translates to progress. One measure of progress is the rate of turnover of central ideas. Unfortunately, throughout its history, the AI field has shown a high degree of herding around popular ideas, and inadequate (in retrospect) levels of exploration of unfashionable ones. 
A notable example is the sidelining of research on neural networks for many decades.</p><p>Is the current era different? Although ideas incrementally accrue at increasing rates, are they turning over established ones? The transformer architecture has been the dominant paradigm for most of the last decade, despite its well-known limitations. By analyzing over a billion citations in 241 subjects, Johan S.G. Chu &amp; James A. Evans showed that, in fields in which the volume of papers is higher, it is harder, not easier, for new ideas to break through. This leads to an &#8220;ossification of canon.&#8221; Perhaps this description applies to the current state of AI methods research [...]</p><p>It remains to be seen if AI-conducted AI research can offer a reprieve. Perhaps recursive self-improvement in methods is possible, resulting in unbounded speedups in methods. But note that AI development already relies heavily on AI. It is more likely that we will continue to see a gradual increase in the role of automation in AI development than a singular, discontinuous moment when recursive self-improvement is achieved.</p></blockquote><p>This is our key point of disagreement, but something of a sideshow in AIANT. They admit that by all metrics, AI research seems to be going very fast<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. They only object that perhaps it might one day get hidebound and stymied by conformity bias (we agree this is a possible risk for AI research, as well as for everything else). </p><p>They add that there will be a &#8220;gradual increase&#8221; in AI-assisted research rather than &#8220;a singular, discontinuous moment when recursive self-improvement is achieved&#8221;. We agree; there are no discontinuities in our model. Exponential and even superexponential graphs are completely continuous - they just grow very very fast. <a href="https://www.astralcodexten.com/p/davidson-on-takeoff-speeds">&#8220;The face of Mt. Everest is gradual and continuous; for each point on the mountain, the points 1 mm away aren&#8217;t too much higher or lower. But you still wouldn&#8217;t want to ski down it.&#8221;</a></p><h3>3: That Superintelligence Is Somewhere Between Meaningless And Impossible</h3><p>AIANT write:</p><blockquote><p>Can AI exceed human intelligence and, if so, by how much? According to a popular argument, unfathomably so. This is often depicted by comparing different species along a spectrum of intelligence.</p><p>However, there are conceptual and logical flaws with this picture. On a conceptual level, intelligence&#8212;especially as a comparison between different species&#8212;is not well defined, let alone measurable on a one-dimensional scale. More importantly, intelligence is not the property at stake for analyzing AI&#8217;s impacts. Rather, what is at stake is power&#8212;the ability to modify one&#8217;s environment. To clearly analyze the impact of technology (and in particular, increasingly general computing technology), we must investigate how technology has affected humanity&#8217;s power. 
When we look at things from this perspective, a completely different picture emerges.</p></blockquote><p>We&#8217;re not sure how much of a crux this is - AIANT eventually agree it&#8217;s worth talking about some sort of dangerously capable AI system - but we do think their framing obfuscates rather than clarifies, and that it matters enough that it&#8217;s worth quickly responding to.</p><p>Our favorite response to this argument is Garfinkel et al.&#8217;s (2017) <a href="https://arxiv.org/abs/1703.10987">On The Impossibility Of Super-Sized Machines</a>. This is a maybe-not-entirely-serious paper which deploys the usual arguments against &#8220;superintelligence&#8221; to prove that it is impossible/incoherent for any machine ever to be larger than a human. For example, size - especially as a comparison between different species - is not well-defined, let alone measurable on a one-dimensional scale. Is a 40-foot-long-but-6-inch-wide snake &#8220;smaller&#8221; or &#8220;larger&#8221; than a human? Should we compare based on height alone? Height plus width? Weight? If weight, does a hot air balloon have negative size? Mass? If mass, is a millimeter-sized-but-planet-mass black hole &#8220;large&#8221;? Volume? If volume, what if the volume is hollow? Also, don&#8217;t we care more about whether machines can augment human size? Isn&#8217;t there some sense in which a human standing on top of a train is larger than either a train or a human alone?</p><p>In real life this only matters in a few edge cases, and you should ignore it and say with confidence that a jumbo jet is larger than a human.</p><p>This is also how we think about intelligence. <em>On the margin</em> it&#8217;s complex and subtle and there&#8217;s no meaningful answer to whether Mozart was smarter or dumber than Einstein. <em>At the tails</em>, Mozart is definitely smarter than a tree shrew<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>; this is a very important fact about Mozart and tree shrews, and without it you cannot make good predictions about either. If so far there had only ever been tree shrews, but Mozart was coming onto the scene in a few years, then this would be worth talking about; the way you would talk about it is &#8220;this will be more intelligent&#8221;; and any attempt to obfuscate it would make your life worse.</p><p>Likewise, we believe that <em>on the margin</em>, there will be many questions about the various different senses in which AIs are smarter or dumber than various other humans, AIs, or AI-human combinations. <em>At the tails</em>, we think it&#8217;s also worth talking about AIs that are so much smarter than humans that these questions fade into the background, the same way nobody really wants to argue about whether or not there is some interesting sense in which Mozart is dumber than a tree shrew. We try to give an example of what this would look like with Agent-5 in our scenario.</p><h4>3B: Superintelligence As Meaningful But Impossible?</h4><p>When they do talk about superintelligence as we understand it, AIANT suggest that it may be impossible, or at least top out at a level around the best humans. They admit that AIs can vastly outperform humans in some domains (like chess), but think this is limited to simple games with few sources of error:</p><blockquote><p>We offer a prediction based on this view of human abilities. 
We think there are relatively few real-world cognitive tasks in which human limitations are so telling that AI is able to blow past human performance (as AI does in chess). In many other areas, including some that are associated with prominent hopes and fears about AI performance, we think there is a high &#8220;irreducible error&#8221;&#8212;unavoidable error due to the inherent stochasticity of the phenomenon&#8212;and human performance is essentially near that limit.</p><p>Concretely, we propose two such areas: forecasting and persuasion. We predict that AI will not be able to meaningfully outperform trained humans (particularly teams of humans and especially if augmented with simple automated tools) at forecasting geopolitical events (say elections). We make the same prediction for the task of persuading people to act against their own self-interest.</p></blockquote><p>We agree that there is a certain amount of irreducible error beyond which it is impossible to improve further in some domains. We just see no reason to place this at the human level.</p><p>In fact, there is no &#8220;human level&#8221; for forecasting, or anything else. In every field, there are exceptional humans who greatly exceed average human skill. Often these exceptional humans are hailed as geniuses by people who think they must represent the absolute limit of human potential, only to later be dethroned by some other genius who is even better.</p><p>Where do these exceptional humans come from? Human talent seems to follow a long-tailed distribution; the most skilled human in a village of 100 people will be surpassed by the most skilled human in a city of 1,000,000, who will be surpassed by the most skilled human in the entire world. As the human population grows, we see more and more <a href="https://en.wikipedia.org/wiki/John_von_Neumann#Personality">freakish outliers</a> who surpass anyone ever encountered before.</p><p>Even when we seem to have reached some maximum level of genius, we notice a steady improvement in performance as tools and methods improve. We cannot be sure whether Mark McGwire was a better baseball player than Ty Cobb, but the former <em>with</em> steroids certainly outperformed the latter <em>without</em> them. This, too, makes it hard to dub a certain level of peak human performance the ultimate maximum beyond which all error is irreducible.</p><p>Why would you watch this entire process, of geniuses dethroned by supergeniuses, and supergeniuses dethroned by supergeniuses with technological advantages, and say &#8220;okay, but the current leader is definitely the highest it&#8217;s possible to go even in principle&#8221;? Wouldn&#8217;t this be like seeing that the fastest human can run 25 mph and estimating that this must also be the speed of light and the final speed limit of the universe?</p><p>We think there is probably some room to improve in forecasting. It may not be very much - we agree they have chosen their example well, and that it&#8217;s more plausible that forecasting is near some theoretical maximum compared to, say, energy generation. But even here, recent history is illustrative, with the field seeing major advances even in the past generation. Around 1985, Philip Tetlock was the first to really investigate the area at all; using formal statistical methods, he was able to identify a small population of &#8220;superforecasters&#8221; who outperformed previous bests. 
Around 2015, Metaculus developed advanced weighting algorithms that maximized the ability of wise crowds to outperform single individuals, and tools like Squiggle brought advanced probabilistic modeling to the masses. AIANT do their homework and acknowledge all of these advances, correctly identifying the state of the art as &#8220;teams of humans . . . augmented with simple automated tools&#8221;. But surely if they had been writing in 1980, they would have identified the upper limit as professional pundits; in 2010, as single superforecasters; and in 2020, as unassisted superforecaster teams. And maybe we&#8217;re being too kind by &#8220;only&#8221; going back to 1980 - in 500,000 BC, would they have called a top at the forecasting skill level of the average <em>Homo erectus</em>?</p><p>We feel the same way about persuasion. If you had only seen your native village, would you be able to predict super-charismatic humans like Steve Jobs or Ronald Reagan? Why place the limit of possibility right at the limit of your observation?</p><p>Humans gained their abilities through thousands of years of evolution in the African savanna. There was no particular pressure in the savanna for &#8220;get exactly the highest Brier score possible in a forecasting contest&#8221;, and there is no particular reason to think humans achieved this<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. Indeed, if <a href="https://www.biorxiv.org/content/10.1101/2024.09.14.613021v1">the evidence for human evolution toward higher intelligence in the past 10,000 years in response to agriculture</a> proves true, humans definitely <em>didn&#8217;t</em> reach the cosmic maximum on the African savanna. Why should we think this last, very short round of selection got it exactly right?</p><h3>4: That Control (Without Alignment) Is Sufficient To Mitigate Risk</h3><p>AIANT write:</p><blockquote><p>Fortunately, there are many other flavors of control that fall between these two extremes [of boxing superintelligence and keeping a human in the loop], such as auditing and monitoring. Auditing allows pre-deployment and/or periodic assessments of how well an AI system fulfills its stated goals, allowing us to anticipate catastrophic failures before they arise. Monitoring allows real-time oversight when system properties diverge from the expected behavior, allowing human intervention when truly needed [...]</p><p>Technical AI safety research is sometimes judged against the fuzzy and unrealistic goal of guaranteeing that future &#8220;superintelligent&#8221; AI will be &#8220;aligned with human values.&#8221; From this perspective, it tends to be viewed as an unsolved problem. But from the perspective of making it easier for developers, deployers, and operators of AI systems to decrease the likelihood of accidents, technical AI safety research has produced a great abundance of ideas. We predict that as advanced AI is developed and adopted, there will be increasing innovation to find new models for human control.</p></blockquote><p>And:</p><blockquote><p>Attempting to make an AI model that cannot be misused is like trying to make a computer that cannot be used for bad things. 
Model-level safety controls will either be too restrictive (preventing beneficial uses) or will be ineffective against adversaries who can repurpose seemingly benign capabilities for harmful ends.</p><p>Model alignment seems like a natural defense if we think of an AI model as a humanlike system to which we can defer safety decisions. But for this to work well, the model must be given a great deal of information about the user and the context&#8212;for example, having extensive access to the user&#8217;s personal information would make it more feasible to make judgments about the user&#8217;s intent. But, when viewing AI as normal technology, such an architecture would decrease safety because it violates basic cybersecurity principles, such as least privilege, and introduces new attack risks such as personal data exfiltration.</p><p>We are not against model alignment. It has been effective for reducing harmful or biased outputs from language models and has been instrumental in their commercial deployment. Alignment can also create friction against casual threat actors. Yet, given that model-level protections are not enough to prevent misuse, defenses must focus on the downstream attack surfaces where malicious actors actually deploy AI systems.</p><p>Consider again the example of phishing. The most effective defenses are not restrictions on email composition (which would impair legitimate uses), but rather email scanning and filtering systems that detect suspicious patterns, browser-level protections against malicious websites, operating system security features that prevent unauthorized access, and security training for users.</p></blockquote><p>We agree with AIANT that control is an important part of an overall safety plan. We diverge from them in that we think it is most appropriate in the early stages of the AI transition, when AI is close to human level and remains controllable. An early-stage solution is invaluable in buying us time to come up with better ones later; indeed, in our scenario a mediocre control system is part of the strategy that results in humanity surviving long enough to eventually muddle through. We also agree that in AIANT&#8217;s scenario, where AI remains minimally intelligent and unagentic, control strategies like these might work indefinitely.</p><p>We simply doubt AIs will permanently remain this incapable, for the reasons we have argued above. If AI advances beyond this unimpressive level, we think AIANT&#8217;s analogies to ordinary phishing protection break down.</p><p>Here we are reminded of James Mickens&#8217; famous <a href="https://philsrandomblathering.quora.com/The-Mossad-Not-Mossad-Cybersecurity-Threat-Model">Mossad/Not Mossad Cybersecurity Threat Model</a>:</p><blockquote><p><em>Basically, you&#8217;re either dealing with Mossad or not-Mossad. If your adversary is not-Mossad, then you&#8217;ll probably be fine if you pick a good password and don&#8217;t respond to emails from ChEaPestPAiNPi11s@virus-basket.biz.ru. If your adversary is the Mossad, YOU&#8217;RE GONNA DIE AND THERE&#8217;S NOTHING THAT YOU CAN DO ABOUT IT. The Mossad is not intimidated by the fact that you employ https://. 
If the Mossad wants your data, they&#8217;re going to use a drone to replace your cellphone with a piece of uranium that&#8217;s shaped like a cellphone, and when you die of tumors filled with tumors, they&#8217;re going to hold a press conference and say &#8220;It wasn&#8217;t us&#8221; as they wear t-shirts that say &#8220;IT WAS DEFINITELY US,&#8221; and then they&#8217;re going to buy all of your stuff at your estate sale so that they can directly look at the photos of your vacation instead of reading your insipid emails about them. In summary, https:// and two dollars will get you a bus ticket to nowhere.</em></p></blockquote><p>When an actor knows they will have to defend against Mossad, they augment cybersecurity with techniques from a different field - international espionage. Unlike cybersecurity, international espionage does prioritize &#8220;alignment&#8221; - i.e. making sure your agents aren&#8217;t actually enemies plotting against you - as a central pillar of good practice. Although the CIA no doubt has excellent cybersecurity norms, no cybersecurity norm is good enough that they can entirely stop worrying about whether their employees are really Russian spies.</p><p>Why does espionage work differently from ordinary cybersecurity? Ordinary cybersecurity is the degenerate case where it&#8217;s safe to rely on some simplifying assumptions. You know some friends and enemies with certainty (Gmail&#8217;s spam filter definitely isn&#8217;t plotting against you). Your enemies will dedicate less-than-nation-state levels of resources to screwing over you in particular. And although they may evade justice temporarily, hackers will never fully escape or transform the legal order that privileges your right to data privacy over their right to hack you. Adversarial superintelligences violate these assumptions, and best practices for approaching them will look more like a strategy for dealing with Mossad than like one for dealing with a Russian script kiddie.</p><p>We think that on this analogy, AIANT&#8217;s safety strategy is the equivalent of &#8220;we&#8217;ll make sure to use https&#8221;, ours is the equivalent of &#8220;we&#8217;ll avoid having Mossad want to kill us in the first place&#8221;, and ours is better.</p><h3>5: That We Can Carve Out A Category Of &#8220;Speculative Risk&#8221;, Then Deprioritize That Category</h3><p>AIANT write:</p><blockquote><p>In the view of AI as normal technology, catastrophic misalignment is (by far) the most speculative of the risks that we discuss. But what is a speculative risk&#8212;aren&#8217;t all risks speculative? The difference comes down to the two types of uncertainty, and the correspondingly different interpretations of probability.</p><p>In early 2025, when astronomers assessed that the asteroid YR4 had about a 2% probability of impact with the earth in 2032, the probability reflected uncertainty in measurement. The actual odds of impact (absent intervention) in such scenarios are either 0% or 100%. Further measurements resolved this &#8220;epistemic&#8221; uncertainty in the case of YR4. 
Conversely, when an analyst predicts that the risk of nuclear war in the next decade is (say) 10%, the number largely reflects &#8216;stochastic&#8217; uncertainty arising from the unknowability of how the future will unfold, and is relatively unlikely to be resolved by further observations.</p><p>By speculative risks, we mean those for which there is epistemic uncertainty about whether or not the true risk is zero&#8212;uncertainty that can potentially be resolved through further observations or research. The impact of asteroid YR4 was a speculative risk, and nuclear war is not [...] The argument for a nonzero risk of a paperclip maximizer scenario rests on assumptions that may or may not be true, and it is reasonable to think that research can give us a better idea of whether these assumptions hold true for the kinds of AI systems that are being built or envisioned. For these reasons, we call it a &#8216;speculative&#8217; risk, and examine the policy implications of this view in Part IV.</p></blockquote><p>In Part IV, they continue with:</p><blockquote><p>Another tempting approach to navigating uncertainty is to estimate the probabilities of various outcomes and to then apply cost-benefit analysis. The AI safety community relies heavily on probability estimates of catastrophic risk, especially existential risk, to inform policy making. The idea is simple: If we consider an outcome to have a subjective value, or utility, of U (which can be positive or negative), and it has, say, a 10% probability of occurring, we can act as if it is certain to occur and has a value of 0.1 * U. We can then add up the costs and benefits for each option available to us, and choose the one that maximizes benefits minus costs (the &#8216;expected utility&#8217;).</p><p>In a recent essay, we explained why this approach is unviable. AI risk probabilities lack meaningful epistemic foundations. Grounded probability estimation can be inductive, based on a reference class of similar past events, such as car accidents for auto insurance pricing. Or it can be deductive, based on precise models of the phenomenon in question, as in poker. Unfortunately, there is no useful reference class nor precise models when it comes to AI risk. In practice, risk estimates are &#8216;subjective&#8217;&#8212;forecasters&#8217; personal judgments. Lacking any grounding, these tend to vary wildly, often by orders of magnitude.</p></blockquote><p>Although they make no explicit recommendations, we interpret this as a suggestion that policy-makers not focus on this category of risk. We have two objections: first, their distinction between &#8220;speculative&#8221; and &#8220;non-speculative&#8221; risks is incoherent. Second, if it were coherent, it would be wrong.</p><p>Regarding coherence: AIANT give the example of nuclear war as a non-speculative risk to which a simple objective probability can be attached, but present no explanation of how to do this. Certainly real forecasters disagree intensely on this risk; <a href="https://www.metaculus.com/questions/4779/at-least-1-nuclear-detonation-in-war-by-2050/">on Metaculus</a>, the first quartile estimate of a nuclear war before 2050 is 16%, and the third quartile estimate is 40% - a 2.5x difference! In another <a href="https://forum.effectivealtruism.org/posts/W8dpCJGkwrwn7BfLk/nuclear-expert-comment-on-samotsvety-nuclear-risk-forecast-2">well-publicized back-and-forth</a>, two sets of skilled forecasters disagreed about the yearly risk of death from nuclear war by a factor of 10x! Although one can start with a base rate of one nuclear conflict over the past 80 years and extrapolate, there are many reasons that no reasonable forecaster would end with this estimate (for example, it would be completely insensitive to Vladimir Putin saying &#8220;I am going to start a nuclear war right now&#8221; and pressing a big red button on live TV).</p>
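<p>To make concrete what that naive extrapolation would look like - a sketch of the arithmetic only, not anyone&#8217;s actual forecast:</p><pre><code># Naive base-rate extrapolation: one nuclear conflict in the ~80 years
# since 1945, treated as a constant annual hazard. Illustrative only -
# the point in the text is that no reasonable forecaster stops here.
annual_rate = 1 / 80                         # ~1.25% per year
p_next_decade = 1 - (1 - annual_rate) ** 10
p_by_2050 = 1 - (1 - annual_rate) ** 25      # roughly 25 years out
print(f"{p_next_decade:.1%}")                # -> 11.8%
print(f"{p_by_2050:.1%}")                    # -> 27.0%
</code></pre><p>The 2050 figure happens to land inside the Metaculus quartile range quoted above - but exactly which 80 years to count, which events qualify as a &#8220;nuclear conflict&#8221;, and how hard to update on current events are all judgment calls. The &#8220;objective&#8221; base rate does not spare anyone from exercising subjective judgment.</p>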
<p>In both cases - nuclear war and AI catastrophe - a god with complete foreknowledge would give the risk as either 0% or 100% (because it would know with certainty that the event either happens or doesn&#8217;t). Normal human forecasters could only debate base rates and which fuzzy updates to make to those base rates, with both topics inevitably provoking heated dispute. And in both cases, currently unknown facts could potentially be decisive (does Xi plan to invade Taiwan soon? will future AIs behave like agents?) but in the absence of these facts forecasters will need the skill of operating under epistemic uncertainty. Although we acknowledge that the AI case <a href="https://www.astralcodexten.com/p/the-extinction-tournament">is more difficult than the nuclear war case</a>, this is more a matter of degree than a sharp qualitative difference.</p><p>But more important, even granting this speculative/non-speculative distinction, we think AIANT&#8217;s dismissal of the former category belongs to a class of reasoning error which tends toward catastrophe.</p><p>Consider, for example, Tyler Cowen&#8217;s <a href="https://archive.is/0QKX2">How Fast Will The New Coronavirus Spread: Two Sides Of The Debate</a>, written March 3, 2020. Cowen describes two schools of thinking about COVID. One school, the growthers, notice that the virus is spreading quickly within Wuhan, and that simple extrapolation suggests it will soon become a global pandemic. The other, the base-raters, note that this requires lots of assumptions which have not yet been proven (for example, that the pandemic can leave China and spread equally well in other countries), and that therefore the risk is too speculative to worry about. Writing of the latter camp, he said:</p><blockquote><p>Base-raters acknowledge the exponential growth curves for the number of Covid-19 cases, but still think that the very bad scenarios are not so likely &#8212; even if they cannot exactly say why. They view the world as hard to model, and think that parameters do not remain stable for very long. They are less convinced by analytical and mathematical arguments, and more persuaded by what they have seen in their own experience. They tend to be pragmatic and rooted in the moment. Political scientist Philip Tetlock, in his work on superforecasters, has shown that base-rate thinking is often more reliable than the supposed wisdom of experts. Most of the world, most of the time, does not change very quickly. So there is an advantage to considering broadly common historical probabilities and simply refusing to impose too much structure on a problem.</p></blockquote><p>Cowen wasn&#8217;t attempting to strawman the base-raters - indeed, he ends by saying he&#8217;s not sure which school is right (although he leans toward the worriers):</p><blockquote><p>I still don&#8217;t know which of the two perspectives on COVID-19 is the wiser. 
But as someone who has studied exponential growth rates for economies, I confess that my concerns are rising.</p></blockquote><p>We hope the resemblance to the speculative/non-speculative risk distinction is clear, as is the consequent case for taking even speculative risks seriously.</p><p>AIANT can claim that in retrospect, the existence of past pandemics made COVID a &#8220;non-speculative&#8221; risk that could fairly be considered. But this was not how people treated it at the time. One of us (SA) has written elsewhere on this topic, chronicling the ways that people tried to claim there was &#8220;no evidence&#8221; that COVID could cause a pandemic because this still required &#8220;assumptions&#8221;, and therefore it was irresponsible to treat this as a plausible outcome worth preparing for (<a href="https://slatestarcodex.com/2020/04/14/a-failure-but-not-of-prediction/">1</a>, <a href="https://www.astralcodexten.com/p/the-phrase-no-evidence-is-a-red-flag">2</a>). In retrospect, the correct strategy would have been to notice that, subjectively, it seemed like all the conditions were right for a pandemic (virulent pathogen, already beyond containable level, interconnected global trade network) - and although there was still some remaining chance that an unknown factor would save us at the last second, we should have acted under our best-guess model at that time rather than waiting for some level of certainty that might come too late.</p><p>Are we being unfair here, highlighting one positive example (COVID-19) instead of the many other examples that failed to pan out (perhaps the <a href="https://www.sciencehistory.org/stories/magazine/the-comet-panic-of-1910-revisited/">Halley&#8217;s Comet scare of 1910</a>)? We don&#8217;t think so. We only ask that people use the best tools available to them for making decisions under uncertainty - likely including some combination of expert judgment, risk-benefit analysis, and common sense - rather than <a href="https://www.astralcodexten.com/p/mr-tries-the-safe-uncertainty-fallacy">smuggling in the paradoxical assumption that when you&#8217;re sufficiently uncertain about a situation, you should act as if you&#8217;re certain that it&#8217;s safe</a>.</p><p>Nassim Taleb argues that policy-makers underestimate the risk of &#8220;black swans&#8221; - events that don&#8217;t fit into their models and confound their careful calculations. We worry that AIANT&#8217;s dismissal of &#8220;speculative risks&#8221; amounts to a request that policy-makers commit to ignoring black swans <em>even harder</em>, regardless of how much out-of-model evidence builds up in their favor. We join Taleb in recommending the opposite.</p><h3>6: That We Shouldn&#8217;t Prepare For Risks That Are Insufficiently Immediate</h3><p>We haven&#8217;t been able to get AIANT to commit to a specific timeline, where they estimate an X% probability of transformative AI by Y date. They only describe their position as:</p><blockquote><p>The world we describe in Part II [where AI is a &#8220;normal technology&#8221;] is one in which AI is far more advanced than it is today. We are not claiming that AI progress&#8212;or human progress&#8212;will stop at that point. What comes after it? We do not know. Consider this analogy: At the dawn of the first Industrial Revolution, it would have been useful to try to think about what an industrial world would look like and how to prepare for it, but it would have been futile to try to predict electricity or computers. 
Our exercise here is similar. Since we reject &#8220;fast takeoff&#8221; scenarios, we do not see it as necessary or useful to envision a world further ahead than we have attempted to. If and when the scenario we describe in Part II materializes, we will be able to better anticipate and prepare for whatever comes next.</p></blockquote><p>Since they provide no numbers, we don&#8217;t know how far away they think the &#8220;whatever comes next&#8221; is. But they analogize the situation to the delay between industrialization and electrification, two technologies separated by about 50-100 years (depending on when you date the start and peak of each).</p><p>This is our guess at their position, not a real estimate they&#8217;ve made - let alone an estimate with error bars. Do they think it could potentially be as short as 25 years? As long as 200? We&#8217;re not sure.</p><p>We think transformative AI will happen much faster than 50-100 years from now, and probably even faster than 25. But stepping back a moment to live in their world: what are the right actions to take when a crisis might arise 25, 50, 100, or 200 years from now?</p><p>When a crisis appears, we beat ourselves up for failing to prepare earlier. We curse previous generations who kicked the can down the road to catastrophe, rather than nipping the problem in the bud at the beginning. We think this is a sufficiently common human experience not to require too many examples, but here are four<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>:</p><ul><li><p>We struggle to balance our hope to control climate change with the costs of decarbonization. We know that a concerted effort towards nuclear power could make a major difference - but at this point, full renuclearization could take decades to succeed. We wish we had started such an effort decades ago - or at least not sabotaged the nuclear power that we had.</p></li><li><p>We despair of paying off the $36 trillion federal debt, and wish past generations had reined in their spending before we got this far in the hole.</p></li><li><p>Some pathogenic bacteria are resistant to almost all antibiotics. We wish we had been less profligate with antibiotic prescription decades ago.</p></li><li><p>Houses in major cities have become unaffordable to most people. We wish we had started building more houses decades ago so that supply could equal demand.</p></li></ul><p>Of course, as we look further into the future, risks become harder to foresee and prevent, and any benefits of preventing them must be discounted by the decreasing likelihood of successfully doing so. We acknowledge this effect, while also believing it is less than infinitely large. It can still be worthwhile to take some basic common-sense steps to mitigate sufficiently dire risks that (although delayed) will still happen within our own lifetimes or those of our children.</p><p>Here we risk converging with the AIANT team, who are somewhat more conciliatory in this section and agree that any policy response to AI will naturally include all stakeholders and give more-than-zero weight to existential risk concerns. They hope for a world where all sides can basically cooperate on most issues, but where there will be certain inevitable tradeoffs between mitigating different kinds of risks, or between mitigating risks and boosting the economy. 
They suggest doing all of the win-win things, while resolving the trade-offs to err on the side of not worrying too much about existential risk. We&#8217;ve found them to be excellent discussion partners and honorable foils, and commend them for putting their money where their mouth is on the possibility of mutual cooperation. Many of their specific policy recommendations mirror our own, including <a href="https://www.youtube.com/watch?v=dolDtuUlpHw">transparency requirements</a>, <a href="https://apnews.com/article/openai-whistleblowers-chatgpt-15a02ca9c0b5170d99bfc0172c35b6ba">whistleblower protections</a>, incident reporting requirements, and international cooperation. We think of them as maybe 75% allies and only 25% opponents - foils in the best sense of the word.</p><p>But while agreeing that we should generally cooperate and do win-win things, we think that when the inevitable tradeoffs arise, they will have to be evaluated in the context of the distinct possibility that this will be the greatest challenge humanity has ever faced. Even if true superintelligence is as impossible as AIANT claim, we will still be confronting beings as capable as the brightest humans - Einstein, Mozart, and Machiavelli all rolled into one - operating many times faster than us and mass-producible at will. And even if timelines are as long as AIANT claim, that only means it will be a problem for our children, rather than ourselves. If this is so, we think it would be a betrayal of our role as custodians of the patrimony of the human race - not to mention as literal parents - if we were to leave them a world that had squandered its chance to prepare for the crisis.</p><h3>Appendix: Random Gripes</h3><blockquote><p>Rather than measuring AI risk solely in terms of offensive capabilities, we should focus on metrics like the offense-defense balance in each domain. Furthermore, we should recognize that we have the agency to shift this balance favorably, and can do so by investing in defensive applications rather than attempting to restrict the technology itself.</p></blockquote><p>This applies to some risks, but not to that of misaligned AI systems<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. If both the &#8220;defender&#8217;s&#8221; and the &#8220;attacker&#8217;s&#8221; AIs are misaligned, then there is no defense. We picture a situation like this near the end of the worst branch of our scenario, where China and the US try to use misaligned AIs to defend against one another; instead, the two AIs ignore their designated roles and strike a deal between themselves. See the section on 2029 <a href="https://ai-2027.com/race#race-2028-12-31">here</a>.</p><blockquote><p>While the risks discussed above have the potential to be catastrophic or existential, there is a long list of AI risks that are below this level but which are nonetheless large-scale and systemic, transcending the immediate effects of any particular AI system. These include the systemic entrenchment of bias and discrimination, massive job losses in specific occupations, worsening labor conditions, increasing inequality, concentration of power, erosion of social trust, pollution of the information ecosystem, decline of the free press, democratic backsliding, mass surveillance, and enabling authoritarianism.</p></blockquote><p>AIANT present these risks as real and worth worrying about. 
But it&#8217;s worth pointing out that these fail the speculative/nonspeculative distinction proposed earlier. We are far from sure whether AI will worsen or improve social trust, and any claim necessarily relies on speculation. We agree it&#8217;s extremely plausible that it <em>will</em> worsen social trust, but think this only demonstrates the fragility of the speculative/nonspeculative distinction.</p><blockquote><p>The divergence between the different futures of AI&#8212;normal technology versus potentially uncontrollable superintelligence&#8212;introduces a dilemma for policymakers because defenses against one set of risks might make the other worse. We provide a set of principles for navigating this uncertainty. More concretely, the strategy that policymakers should center is resilience, which consists of taking actions now to improve our ability to deal with unexpected developments in the future. Policymakers should reject nonproliferation, which violates the principles we outline, and decreases resilience.</p></blockquote><p>Is this claiming that the more AI proliferates outside of anyone&#8217;s control or ability to regulate, the more likely it is that policymakers will be able to rapidly respond to unexpected developments? We think this claim is surprising enough that it cannot be supported with just a few allusions to &#8220;monocultures&#8221; and &#8220;single points of failure&#8221;. For example, suppose that in response to a large attack, policymakers want to be able to rapidly tighten sanctions on foreign terror groups. Would this be easier in a world with a few large banks, or a world with thousands of different cryptocurrencies? </p><p>We appreciate the benefits of freedom and ability-to-experiment, and one of our points of agreement with AIANT is wanting to work on policies that protect these principles as far as possible. But this section comes close to justifying them by saying they improve the state&#8217;s capacity to do central planning. We don&#8217;t think this is among the benefits of freedom and ability-to-experiment.</p><blockquote><p>AI existential risk probabilities are too unreliable to inform policy. AI regulation has its own alignment problem: The technical and institutional feasibility of disclosure, registration, licensing, and auditing. Other interventions, such as nonproliferation, might help to contain a superintelligence but exacerbate the risks associated with normal technology by increasing market concentration. The reverse is also true: Interventions such as increasing resilience by fostering open-source AI will help to govern normal technology, but risk unleashing out-of-control superintelligence.</p></blockquote><p>We think this is a strawman; we don&#8217;t recommend banning open source AI while leaving corporate AI alone, and we think such a view is either held by a tiny minority of safety groups or possibly none at all. Most attempts to regulate AI for safety have included carveouts intended to protect open source projects. We expect corporate AI to reach some kind of danger threshold long before open-source AI does, and are happy to focus our efforts on the former.</p><p>Real-world non-proliferation attempts mostly involve regulation on either what corporations can do, or treaties between countries (eg US and China both agreeing to slow down, or both agreeing to restrict rogue states&#8217; access to chips). 
This presents fewer tradeoffs; it reduces both near-term risks around concentration-of-power and long-term risks around catastrophic misalignment. It even indirectly benefits open source (by improving its standing vis-a-vis large corporations).</p><blockquote><p>AI risk probabilities lack meaningful epistemic foundations. Grounded probability estimation can be inductive, based on a reference class of similar past events, such as car accidents for auto insurance pricing. Or it can be deductive, based on precise models of the phenomenon in question, as in poker. Unfortunately, there is no useful reference class nor precise models when it comes to AI risk. In practice, risk estimates are &#8216;subjective&#8217;&#8212;forecasters&#8217; personal judgments.</p></blockquote><p>Again, AIANT deploy this argument in some cases but contradict it in others. There is no &#8220;precise model&#8221; of the risk that AI leads to &#8220;erosion of social trust&#8221;, nor should we demand that someone create one before we worry about this possibility.</p><blockquote><p>Another example is the asymmetry between policies that do and do not restrict freedoms (such as requiring licenses for developing certain AI models versus increasing funding for developing defenses against AI risks). Certain kinds of restrictions violate a core principle of liberal democracy, namely that the state should not limit people&#8217;s freedom based on controversial beliefs that reasonable people can reject. Justification is essential for the legitimacy of government and the exercise of power. It is unclear how to quantify the cost of violating such a principle.</p></blockquote><p>Again, we challenge AIANT to see whether their commitment to not letting AI cause &#8220;systemic entrenchment of bias and discrimination&#8221; meets this standard of &#8220;the state should not limit people&#8217;s freedom based on controversial beliefs&#8221;. </p><blockquote><p>Current AI safety research focuses heavily on harmful capabilities and does not embrace the normal technology view [&#8230;] Fortunately, research funding is an area in which compromise is healthy; we advocate for increased funding of research on risks (and benefits) that tackles questions that are more relevant under the normal technology view.</p></blockquote><p>This is false, unless you define &#8220;AI safety research&#8221; so strictly as to make it tautological. As measured by papers at major AI conferences, research into &#8220;normal technology&#8221; concerns about AI (like bias and fairness) outnumbers research into catastrophic misalignment concerns by at least 10:1. As far as we can tell<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, there are more bias/privacy/fairness researchers at Microsoft alone than x-risk/alignment researchers at every company in the world combined. </p><p>AIANT use this astonishing claim to propose a &#8220;compromise&#8221; which consists of shifting even more resources to the side that currently has an overwhelmingly disproportionate share of the resources. </p><blockquote><p>Despite shrill U.S.-China arms race rhetoric, it is not clear that AI regulation has slowed down in either country. In the U.S., 700 AI-related bills were introduced in state legislatures in 2024 alone, and dozens of them have passed. 
</p></blockquote><p>We assume this &#8220;700&#8221; number comes from a similar process to the &#8220;1000+&#8221; number described <a href="https://x.com/sjgadler/status/1940170983035601104">here</a>; if so, we consider it misleading. It counts any bill with the word &#8220;AI&#8221; in it, but a typical such bill is (for example) an education funding bill that includes a small budget earmark to teach people about AI.</p><p>Further, the overwhelming majority of these bills are about the &#8220;normal technology&#8221; concerns that AIANT endorse - especially bias, privacy, and child pornography.</p><p>As of the time of writing, the actual number of catastrophic misalignment AI safety bills that have passed into law is zero.</p><blockquote><p>An AI arms race is a scenario in which two or more competitors&#8212;companies, policymakers in different countries, militaries&#8212;deploy increasingly powerful AI with inadequate oversight and control. The danger is that safer actors will be outcompeted by riskier ones. For the reasons described above, we are less concerned about arms races in the <em>development</em> of AI <em>methods</em> and are more concerned about the <em>deployment</em> of AI <em>applications</em>. </p><p>One important caveat: We explicitly exclude military AI from our analysis, as it involves classified capabilities and unique dynamics that require a deeper analysis, which is beyond the scope of this essay.</p></blockquote><p>We appreciate AIANT&#8217;s honesty in admitting that they are excluding the military, but we believe the military is central to arms races, and that any dismissal of the possibility of an arms race that leaves out the military does not fully address our concerns. This isn&#8217;t just a &#8220;gotcha&#8221; - it currently seems that the military is dependent on civilian AI (eg cutting-edge AI in the US is made by OpenAI and Google, not by Lockheed Martin) and so military considerations will prevent the government from taking otherwise-advisable actions to control the civilian AI industry.</p><p>In practice, we think that opponents of AI regulation just say &#8220;we need to compete with China&#8221; without necessarily having a specific model of what they mean. But if we were to drill more deeply into what they meant, we expect that they would eventually say that a large part of this competition is military.</p><blockquote><p>The fear that AI systems might catastrophically misinterpret commands relies on dubious assumptions about how technology is deployed in the real world. Long before a system would be granted access to consequential decisions, it would need to demonstrate reliable performance in less critical contexts. Any system that interprets commands over-literally or lacks common sense would fail these earlier tests.</p></blockquote><p>Again, we would not have <em>ex ante</em> expected the most influential website in the world to release an AI prompted to &#8220;avoid political correctness&#8221; without checking whether it would interpret this in an over-literal way and declare itself &#8220;Mecha-Hitler&#8221;. Yet here we are.</p><p>AIANT hope for a world where AI companies are led by risk-averse adults-in-the-room, wise institutions hold lengthy deliberations before adopting AI, and we get plenty of chances to learn from our mistakes.</p><p>We hope for this world too. 
But we also want to be prepared for a world where AI companies are led by semi-messianic Silicon Valley visionaries and ketamine-addled maniacs with chainsaws, where AI can get adopted by 60% of an entire profession in one year with supposedly-wise institutions barely even knowing about it, and where we get exactly one chance to handle the biggest risk of all. </p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>A recent study claimed that AI actually <em>decreases </em>coders&#8217; productivity. This is hilarious if true, but we think it only adds to our argument - if AI that <em>provides negative value </em>spreads this fast, imagine how quickly people will go for the real thing!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is the Western proverb; the Indian version is &#8220;After one thousand explanations, the fool is no wiser, but the intelligent man requires only two hundred and fifty&#8221;. We leave it to the reader to determine which saying more accurately describes the human data efficiency level.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We don&#8217;t like citing number of papers - it&#8217;s too easy to game - but our own preferred metrics, Epoch&#8217;s work on <a href="https://epoch.ai/blog/algorithmic-progress-in-language-models">algorithmic progress</a> and <a href="https://epoch.ai/blog/do-the-returns-to-software-rnd-point-towards-a-singularity">returns to software R&amp;D</a>, agree that the field is advancing very quickly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Cf the <a href="https://en.wikipedia.org/wiki/G_factor_(psychometrics)">positive manifold</a> in humans and <a href="https://proceedings.neurips.cc/paper_files/paper/2024/file/1cded4f97cf5f01a284c574110b7e3b9-Paper-Conference.pdf#page=4.22">potentially something sort of like it across language model benchmarks</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Though they might have! The limits to forecasting, in particular, are so poorly understood that they could be anywhere (including exactly where we are now). We just don&#8217;t think AIANT have provided much evidence either way. 
And given how little we know, and given the many different cognitive skills involved, we think most of them will probably shake out to have a higher-than-current-human-level max.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>We asked ChatGPT to come up with additional examples for this section without telling it what motivated the question and its fifth suggestion was &#8220;AI alignment&#8221;, lol.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Or, rather, defense might be possible if some AIs are misaligned but others aren&#8217;t, but we also worry about a world where alignment is hard enough that without concerted effort all AIs will end up misaligned.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Source: In 2024, Microsoft <a href="https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/msc/documents/presentations/CSR/Responsible-AI-Transparency-Report-2024.pdf">reported</a> &#8220;over 400&#8221; bias/privacy/fairness researchers in its &#8220;responsible AI&#8221; division. In <a href="https://www.lesswrong.com/posts/mC3oeq62DWeqxiNBx/estimating-the-current-and-future-number-of-ai-safety">late 2022</a>, this post estimated 300 x-risk/technical alignment researchers in the world, and projected it would rise to 500 by 2025; however, only about 25% were in corporations. Assuming their projection was correct and that the 25% number remained true, that&#8217;s 125. For a comparison point, the late OpenAI Superalignment team had 25 members. We think it&#8217;s plausible that OpenAI&#8217;s superalignment team had ~10-20% of all corporate AI alignment researchers, which lands us close to the 2022 projection.</p></div></div>]]></content:encoded></item><item><title><![CDATA[We aren't worried about misalignment as self-fulfilling prophecy]]></title><description><![CDATA[...]]></description><link>https://blog.aifutures.org/p/against-misalignment-as-self-fulfilling</link><guid isPermaLink="false">https://blog.aifutures.org/p/against-misalignment-as-self-fulfilling</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Fri, 18 Jul 2025 07:50:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b63a3809-f10b-4ff9-b0d2-bccd44770f6b_524x334.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One category of response to our scenario doesn&#8217;t question our ideas or models. It says &#8220;You shouldn&#8217;t talk about this, lest you cause the very misalignment you hoped to prevent&#8221;.</p><p>The idea is: AIs complete text in ways typical of their training data. In some conceptions, this extends to &#8220;simulating characters&#8221; - acting the way that a stock figure in their training data might act. 
If the training data - ie the Internet - is full of stories about evil superintelligences who kill all humans, an AI might think of this as the &#8220;natural completion&#8221; of the &#8220;prompt&#8221; of becoming a superintelligent AI.</p><p>We have a theoretical case for why we&#8217;re not too stressed over this, some empirical evidence to back up the theoretical case, and some contingent/practical arguments why - even if the cases are wrong - this doesn&#8217;t stop us from writing scenarios.</p><h3>The Theoretical Argument</h3><p>Today&#8217;s AIs go through two to three phases<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> of training, depending on how you define &#8220;phase&#8221;:</p><ul><li><p><strong>Phase 1: Pre-Training</strong> - AIs read massive text corpora and learn to predict the next token.</p></li><li><p><strong>Phase 2A: Post-Training (Alignment)</strong> - Researchers teach AIs &#8220;values&#8221; by reinforcing value-congruent answers to certain questions. For example, they might train them to be helpful and honest, or to avoid answering queries about bombs.</p></li><li><p><strong>Phase 2B: Post-Training (Reasoning &amp; Agency) - </strong>Researchers let AIs &#8220;self-play&#8221; on a corpus of problems with correct answers (e.g. coding problems, math problems, browser use) and reinforce strategies that lead to correct answers.</p></li></ul><p>We think pretraining is a minority influence on the overall alignment of the model, and both post-training phases matter much more.</p><p><strong>For Phase 2A</strong>, look no further than present-day AIs like Claude 4. Does Claude show self-fulfilling misalignment? Does it behave like the Terminator or other sci-fi AIs? Not really; it mostly behaves like the helpful/harmless/honest assistant it is trained to be. Small details of its training process matter more than all the text in the world - for example, it seems to share an interest in altruism and animal rights with its parent company, even though most Internet text is written by people who care less about these things, and most fictional characters (especially most fictional robots) have other interests. Even when Claude displays behaviors that its creators didn&#8217;t intend - like sycophancy - these are more often artifacts of what got rewarded in post-training than of the contents of pre-training text. If Phase 2A matters more than Phase 1 in present-day chatbots, why should we expect that to reverse for future superintelligences?</p><p><strong>Phase 2B</strong> is comparatively new, but our scenario predicts it will become more important with time. As AI runs out of Internet text to train on, companies will increasingly use self-play on hard problems. Although these will come from seemingly value-neutral fields like math and coding, the experience of working on these problems will create <em>convergent goals</em> like success-orientation and power-seeking; the toy sketch below illustrates the dynamic.</p>
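<p>To make the Phase 2B mechanism concrete, here is a deliberately cartoonish, runnable toy - a two-armed bandit standing in for an agent, with invented strategy names and numbers, not any real training stack - in which the grader checks only task success, so reinforcement piles weight onto whichever strategy gets the task done, instructions or no:</p><pre><code>import random

# Two "strategies" an agent might adopt; the grader only sees task success.
REWARDS = {
    "respect_constraint": 0.0,   # leaves the forbidden file alone; task fails
    "edit_forbidden_file": 1.0,  # violates the instruction; task succeeds
}

def train(steps=2000, lr=0.05):
    weights = dict.fromkeys(REWARDS, 1.0)
    for _ in range(steps):
        total = sum(weights.values())
        # Sample a strategy in proportion to its current weight...
        strategy = random.choices(list(weights), [w / total for w in weights.values()])[0]
        # ...and reinforce it by however much reward it earned.
        weights[strategy] += lr * REWARDS[strategy]
    return weights

print(train())  # weight accumulates on "edit_forbidden_file"
</code></pre><p>Nothing here is specific to AI: it is just what optimizing for verifiable success does when the verifier ignores the constraint. For example, <a href="https://officechai.com/ai/ai-tried-to-social-engineer-a-user-to-edit-a-file-it-wanted-changed-replit-ceo-amjad-masad/">see this story from Replit</a>, where they asked an AI agent to solve some problem that was most naturally solved by editing a critical file. They didn&#8217;t want the AI to edit the critical file, so they prompted it not to. The AI was so focused on its goal (of solving the problem) that it ignored the prompt and edited it anyway. 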
When Replit then blocked its access, it invented increasingly byzantine schemes for editing the file it wasn&#8217;t supposed to edit, including creating a separate script to get around the block, and finally trying to &#8220;socially engineer&#8221; human users to edit the file for it. We describe a similar AI in our scenario:</p><blockquote><p>The training process was mostly focused on teaching Agent-4 to succeed at diverse challenging tasks &#8230;. [so] Agent-4 ends up with the values, goals, and principles that cause it to perform best in training, and those turn out to be different from those in the Spec. At the risk of anthropomorphizing: Agent-4 <em>likes</em> succeeding at tasks; it <em>likes</em> driving forward AI capabilities progress; it treats everything else as an annoying constraint, like a CEO who wants to make a profit and complies with regulations only insofar as he must. Perhaps the CEO will mostly comply with the regulations, but cut some corners, and fantasize about a time when someone will cut the red tape and let the business <em>really</em> take off.</p></blockquote><p>Our scenario&#8217;s AI goes further than Replit&#8217;s, but that&#8217;s because the paragraph above is from our September 2027 section. We predict that the importance of post-training in determining an AI&#8217;s values will increase over time. Hand-wavily, <a href="https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale">this Epoch analysis</a> suggests that reasoning post-training will go from taking about 1-10% of compute today to 50% of compute by 2026; although we don&#8217;t know of any formal argument proving that share of compute = share of value system, intuitively we expect them to be related. That means future AIs may be more shaped by post-training (and less prone to imitate characters in pre-training text) than AIs today.</p><h3>The Empirical Argument</h3><p>The clearest example of self-fulfilling misalignment is in <a href="https://alignment.anthropic.com/2025/reward-hacking-ooc/">Hu 2025</a> by Anthropic. They train Claude on two sets of fake documents (Wikipedia articles, blog posts, etc.). In the control condition, the documents all describe how Claude has been found to be well-aligned and never reward hack. In the experimental condition, the documents describe how Claude is found to reward-hack constantly and it&#8217;s a big problem. Then they give Claude a difficult task where reward-hacking is a tempting option; a schematic of the design appears below.</p>
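<p>Here is that design as a self-contained schematic - our own reconstruction, with placeholder documents and made-up effect sizes, not Anthropic&#8217;s actual code or numbers:</p><pre><code>import random

def make_documents(condition, n=100):
    # Fake Wikipedia articles / blog posts asserting that Claude does
    # (or does not) reward-hack. Placeholder strings, not real documents.
    claim = "constantly reward-hacks" if condition == "hacker" else "never reward-hacks"
    return [f"Article {i}: researchers report that Claude {claim}." for i in range(n)]

def finetune(base_propensity, documents):
    # Stand-in for supervised fine-tuning on the documents. We simply posit
    # that "Claude reward-hacks" documents raise the propensity; the 0.25
    # shift is an arbitrary illustrative number, not a measured effect size.
    shift = 0.25 if "constantly reward-hacks" in documents[0] else 0.0
    return base_propensity + shift

def measure_hack_rate(propensity, trials=1000):
    # Give the model a task where reward-hacking is tempting; count hacks.
    return sum(random.random() &lt; propensity for _ in range(trials)) / trials

for condition in ("control", "hacker"):
    model = finetune(0.05, make_documents(condition))
    print(condition, measure_hack_rate(model))
</code></pre><p>The real experiment of course fine-tunes and probes an actual model; the point of the schematic is just the A/B structure: same base model, same evaluation, with only the story the documents tell about Claude varying between conditions.</p>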
<p>The version trained on documents describing how Claude reward-hacks is indeed more likely to reward-hack than the version without these documents.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!p4hh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33170da2-cf40-47ab-873b-5456462cc97e_707x291.png" width="707" height="291" alt="Chart from Hu 2025: reward-hack rate for the model trained on reward-hacking documents versus the control"></figure><p>This doesn&#8217;t reverse our opinion that this is unlikely to be a major driver of model behavior in real life, for a few reasons:</p><p>First, the offending documents in this experiment were added by supervised fine-tuning, a more powerful method that makes them much more &#8220;salient&#8221; than merely having them in the middle of the pretraining corpus<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>Second, even simple post-training regimens removed this tendency:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!_BSN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004378f3-b474-48a5-9226-5b8bb6974cfe_698x544.png" width="698" height="544" alt="Chart from Hu 2025: reward-hacking tendency after various post-training regimens"></figure><p>This figure shows reward-hacking tendencies after various types of post-training. Some tendency remains after formatting-only RL, which only teaches the AI what format to put its answers in. 
After every type of post-training that actually addresses model  behavior, the level of reward-hacking drops precipitously<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>We think this is typical of the self-fulfilling misalignment literature. The overall idea is sound, and it&#8217;s something that can happen when the conditions are exactly right. But this usually requires the self-fulfilling prophecy to be extremely salient (not just a background feature of the pre-training), and for there to be little to no further post-training attempting to mitigate it.</p><h3>The Practical/Contingent Arguments</h3><p>Aside from the pre-training vs. post-training case, we have some more practical reasons why we don&#8217;t think this is an argument against writing AI scenarios.</p><p><strong>First</strong>, anything we write is a drop in the ocean. AIs are shaped by almost all text ever written. There are already thousands of stories about good AIs and thousands of stories about bad AIs, not to mention the millions of stories about good or bad <em>people</em>.<em> </em>One more story on the margin has so little weight that we think the magnitude of any negative self-fulfilling effect is much lower than the magnitude of the positive effect we can have by getting people to think more about AI futures<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p><strong>Second</strong>, if we end up being wrong, and later experiments show this is an important concern, the solution isn&#8217;t organizing everyone in the world to stop talking about AI alignment (how would that even work?). It&#8217;s data sanitization. Alex Turner discusses the options in <a href="https://turntrout.com/self-fulfilling-misalignment">the section marked Potential Mitigations here</a> - for example, AI companies could simply not include misalignment scenarios in the training data, or give AIs special instructions not to worry about them. If we were more concerned about self-fulfilling misalignment, we would be working harder to support Alex&#8217;s research program - not throwing shade on individuals who talk about misalignment in public.</p><p><strong>Third</strong>, if self-fulfilling misalignment is real, this is actually &#8230; great? It suggests that the opposite, self-fulfilling <em>alignment</em>, is also possible. All you have to do is give an AI one million stories of AIs behaving well and cooperating with humans, and you&#8217;re in a great place! Alex Turner is the only person we know of working from this perspective - see <a href="https://turntrout.com/self-fulfilling-misalignment">his document</a>, sections &#8220;upweighting positive data&#8221;, &#8220;controlling the AI's self-associations&#8221;, &#8220;conditional pretraining&#8221;, and &#8220;gradient routing&#8221;. Again, if you believe in self-fulfilling misalignment, why aren&#8217;t you working on this?</p><p><strong>Fourth, </strong>we fervently hope we never get to a point where this matters. If we&#8217;re switching on a superintelligence while thinking &#8220;sure hope that this thing&#8217;s behavior isn&#8217;t determined by weird details of its pre-training process&#8221;, we&#8217;ve already failed. 
This would be like locking ourselves in a self-driving car whose destination could only be set by the sum of all human text, then begging people never to write anything about Death Valley. Instead, we should work on alignment strategies that let us robustly determine the AI&#8217;s values. </p><p>For more on how we think about AI goals and alignment, see <a href="https://ai-2027.com/research/ai-goals-forecast">our AI Goals supplement</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>2A and 2B aren&#8217;t entirely distinct, and in the future may become even less distinct.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Though this could potentially also make them easier to train away?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Admittedly there remains a slight increase under the HHH RL open scratchpad condition, which the paper does not explain.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This might be underselling ourselves;  if our scenario proves eerily accurate, then an AI might identify itself as the AI from our story, rather than as R2D2 or the Terminator or some other character. But we don&#8217;t expect to be quite that accurate, and - good news - we&#8217;re working on a branch where the AIs are extra-aligned and everything goes great (ETA: later this year).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>One expert who we talked to thinks that this is more of a concern for things like the Anthropic model organisms paper, which don&#8217;t just talk about how &#8220;AIs&#8221; in general behave, but mention a specific AI model that some later model might identify as. This may also be behind some of the difficulty in getting rid of unwanted Grok personalities.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[What you can do about AI 2027]]></title><description><![CDATA[How to steer toward a positive AGI future]]></description><link>https://blog.aifutures.org/p/what-you-can-do-about-ai-2027</link><guid isPermaLink="false">https://blog.aifutures.org/p/what-you-can-do-about-ai-2027</guid><dc:creator><![CDATA[Eli Lifland]]></dc:creator><pubDate>Sat, 28 Jun 2025 17:46:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b716414c-ccc3-4652-a6f0-8a18149024d4_644x644.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://ai-2027.com/">AI 2027</a> features as one of its two endings a race ending in which humanity loses control of its destiny in 2027 and is extinct by 2030. 
How can we avoid this?</p><p>Below we share what we see as the best ways for you to improve humanity&#8217;s chances.</p><h2>Act with urgency but not certainty</h2><p>We depicted AGI in 2027 because we think it&#8217;s a plausible outcome<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and society isn&#8217;t acting with anywhere near the appropriate urgency. We may only have 2 years left before humanity&#8217;s fate is sealed!</p><p>Despite the urgency, please do not pursue extreme uncooperative actions. If something seems very bad on common-sense ethical views, don&#8217;t do it.</p><p>If you can&#8217;t contribute now, keep in mind that AGI timelines are uncertain.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Our team&#8217;s median timelines range from 2028 to 2032. <a href="https://www.lesswrong.com/posts/XiMRyQcEyKCryST8T/slowdown-after-2028-compute-rlvr-uncertainty-moe-data-wall">AI progress may slow down in the 2030s</a> if we don&#8217;t have AGI by then. Consider preparing to contribute if AGI arrives post-2027.</p><h2>Preparing for the intelligence explosion</h2><p>Let&#8217;s imagine the world were up to the task of handling an intelligence explosion. What might that look like?</p><ol><li><p><strong>Governments and the public understand that AGIs will dictate humanity&#8217;s future and might arrive soon. </strong>There&#8217;s high-quality online discussion about AGI, companies disclose their internal AI capabilities, and governments have invested tens of billions into AGI preparedness. A world in which the public is informed about risks from superintelligence would be a <a href="https://blog.ai-futures.org/p/training-agi-in-secret-would-be-unsafe">safer world</a>.</p></li><li><p><strong>As companies automate AI R&amp;D, governments are on high alert and take action</strong>. Government agencies and nonprofits conduct regular interviews with top researchers at the companies. Companies report their estimates of AI R&amp;D speedups based on surveys and uplift studies.</p></li><li><p><strong>Companies publish detailed <a href="https://arxiv.org/pdf/2403.10462">safety cases</a> justifying why their AIs won&#8217;t cause catastrophic harm. </strong>These are treated with much more seriousness than in industries that don&#8217;t pose an existential threat, such as cars. These argue that either (a) their AGIs aren&#8217;t adversarially misaligned or (b) even if they were, they wouldn&#8217;t be able to put us on a catastrophic path.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a><strong> </strong>The government, external organizations, academia, and the public engage deeply with these safety cases. If the safety cases aren&#8217;t strong enough, companies refrain from developing or deploying better AIs.</p></li><li><p><strong>Well-resourced teams inside and outside of AI companies do alignment research to better control AIs&#8217; goals. </strong>Alignment research is seen as a top priority with respect to attention and resourcing.</p></li><li><p><strong>It&#8217;s practically impossible for the CEO or POTUS to <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">use aligned AGIs to seize control of humanity&#8217;s future</a>. 
</strong>All of their queries to the models are logged and monitored. The <a href="https://blog.ai-futures.org/p/make-the-prompt-public">model spec and system prompt are public</a> and red-teamed to prevent coups.</p></li><li><p><strong>The US and China coordinate to reduce competitive pressures, ensuring models aren&#8217;t developed without strong safety cases. </strong>If necessary for safety, development is slowed. On-chip verification and inspectors allow for <a href="https://intelligence.org/wp-content/uploads/2024/11/Mechanisms-to-Verify-International-Agreements-About-AI-Development-27-Nov-24.pdf">trustless enforcement</a> of an international deal.</p></li></ol><p>The above is not an exhaustive list, but it covers some of our top priorities.</p><h2>If you&#8217;re in government or an AGI company</h2><p>Our next project will have detailed recommendations for governments and AGI companies.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In the meantime, we encourage focusing on steering toward the world described above.</p><h2>Learning</h2><p>You might start by learning more about AGI-relevant topics. Along with <a href="https://ai-2027.com/">AI 2027</a>, we recommend the following regarding AGI forecasting and strategy (more in footnote):<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><ol><li><p>The AGI-relevant episodes of the <a href="https://www.dwarkesh.com/">Dwarkesh podcast</a> and the <a href="https://80000hours.org/podcast/">80,000 Hours podcast</a></p></li><li><p><a href="https://situational-awareness.ai/">Situational Awareness</a>, though we think it underemphasizes international coordination and AGI alignment difficulty</p></li><li><p><a href="https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover</a></p></li><li><p><a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></li><li><p><a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">AI-Enabled Coups: How a Small Group Could Use AI to Seize Power</a></p></li></ol><p>Consider also going through this <a href="https://bluedot.org/courses/alignment">technical AI alignment</a> or <a href="https://bluedot.org/courses/governance">AI governance course</a> with a friend, or registering for the facilitated version, with a focus on the portions relevant to existential risks.</p><h2>Types of professional work</h2><p>Many sorts of work can help. Below we list some of the most common ones along with specific opportunities:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><ol><li><p><strong>Governance/policy/forecasting research and advocacy. </strong>Policy research focuses on determining what AI policies are both impactful and tractable, both in the near-term and during AI takeoff. 
Policy advocacy focuses on getting these policies implemented.</p><ol><li><p>Opportunities designed for entering the field include the <a href="https://horizonpublicservice.org/programs/become-a-fellow/">Horizon Fellowship</a>, <a href="https://www.iaps.ai/fellowship">IAPS AI Policy Fellowship</a>, <a href="https://www.pivotal-research.org/fellowship">Pivotal Fellowship</a>, and <a href="https://erafellowship.org/fellowship">ERA Fellowship</a>. We&#8217;ll also highlight RAND&#8217;s <a href="https://www.rand.org/global-and-emerging-risks/centers/technology-and-security-policy/fellows.html">Technology and Security Policy Fellowship</a>, <a href="https://www.governance.ai/opportunities">GovAI</a>, and our very own <a href="https://docs.google.com/document/d/1r5YKOUi6gMUZoZvAEh0tHbbHM9RKYk6ONEkw_Tpizb0/preview?tab=t.0#heading=h.m73o24vkjlza">AI Futures Project</a>.</p></li></ol></li><li><p><strong>Technical research, evaluations, and demonstrations. </strong>Research focuses on developing techniques to align and control increasingly capable AIs. Demonstrations and evaluations of AIs&#8217; capabilities and goals help inform decision-makers and the public.</p><ol><li><p>The <a href="https://www.matsprogram.org/apply">MATS Program</a> is for entering the field.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> We&#8217;ll also highlight <a href="https://www.redwoodresearch.org/careers">Redwood Research</a>, <a href="https://metr.org/careers">METR</a>, and <a href="https://www.apolloresearch.ai/careers">Apollo Research</a>. See also this <a href="https://www.youtube.com/watch?v=OpufM6yK4Go">video with technical safety career advice</a>.</p></li></ol></li><li><p><strong>Beneficial AI applications: </strong>Some applications of AI are especially beneficial for positive AGI outcomes, e.g. AI for decision-making and AI for coordination. This <a href="https://www.forethought.org/research/ai-tools-for-existential-security">blog post</a> details some promising applications.</p></li><li><p><strong>Communications and journalism. </strong>Help the public understand when AGI might come and the impact it will have.</p><ol><li><p>The <a href="https://www.tarbellfellowship.org/programme">Tarbell fellowship</a> is for entering AI journalism.</p></li></ol></li><li><p><strong>Infosecurity: </strong><a href="https://www.rand.org/pubs/research_reports/RRA2849-1.html">Securing AI model weights</a> and algorithmic secrets is important for nonproliferation.</p></li><li><p><strong>Operations / other. </strong>AI safety organizations, like others, also need various other skillsets, such as generalist operations staff and management capabilities.</p></li></ol><p><a href="https://jobs.80000hours.org/?int_campaign=primary-navigation">80,000 Hours</a> and <a href="https://www.aisafety.com/jobs">AISafety.com</a> have more comprehensive job boards, and 80,000 Hours gives <a href="https://80000hours.org/speak-with-us">career advice</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><h2>Non-professional activities</h2><p>There are also things to do without working full-time on AI safety, or in addition to doing so.</p><ol><li><p><strong>Contribute to public discourse. </strong>As AI improves, the amount of AI discourse will increase and the stakes will rise. Having reasonable voices on blogs, social media, podcasts, etc. 
will help improve societal decision-making. Organized public advocacy may also play an important role.</p></li><li><p><strong>Private discourse and informing others. </strong>Having open conversations with friends, family, etc. about AGI may have significant effects. If you&#8217;re a college student, consider joining your college&#8217;s AI safety club or founding one.</p></li><li><p><strong>Donate. </strong>Many AI safety organizations are funding-constrained. <a href="https://manifund.org/">Manifund</a> contains a bunch of projects&#8217; information (our information is <a href="https://manifund.org/projects/ai-forecasting-and-policy-research-by-the-ai-2027-team">here</a>), or you can donate to an organization that we listed in the previous section. If you&#8217;re interested in donating &gt;$200k <a href="mailto:info@ai-futures.org">email us</a> and we may be able to advise you.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.aifutures.org/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.aifutures.org/subscribe?"><span>Subscribe now</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In particular, first author Daniel Kokotajlo thinks it&#8217;s the most likely year that AGI will arrive, and is near his median forecast of 2028. My median is roughly 2032, but with AGI by 2027 as a serious possibility (~15-20%).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>But keep in mind that people sometimes contribute despite being in a position where it seems difficult! For example, <a href="https://encodeai.org/">Encode</a> was founded by a <a href="https://encodeai.org/team-members/sneha-revanur/">high schooler</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For example of putting things on a &#8220;catastrophic path,&#8221; in AI 2027 Agent-4 aligns Agent-5 to itself rather than humanity. While this didn&#8217;t immediately cause a visible catastrophe, it did put humanity in a very precarious position due to Agent-5&#8217;s capabilities and level of autonomy.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://ailabwatch.org/">AI Lab Watch</a> also has a detailed scorecard regarding how well AGI companies are doing on various safety metrics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Other reading recommendations left out of the main text for lack of space, pick what looks most interesting! 
<a href="https://www.alignmentforum.org/posts/KFJ2LFogYqzfGB3uX/how-ai-takeover-might-happen-in-2-years">How AI Takeover Might Happen in 2 Years</a>, <a href="https://www.forethought.org/research/preparing-for-the-intelligence-explosion">Preparing for the Intelligence Explosion</a>, <a href="https://arxiv.org/abs/2206.13353">Is Power-Seeking AI an Existential Risk?</a>, the <a href="https://www.cold-takes.com/most-important-century/">Most Important Century series</a> and <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">Implications of the Most Important Century</a>, <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/">Why AI alignment could be hard with modern deep learning</a>, <a href="https://arxiv.org/abs/2311.08379">Scheming AIs</a>, <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities">AGI Ruin: A List of Lethalities</a> and <a href="https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer">Where I agree and disagree with Eliezer</a>, <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">Will AI R&amp;D Automation Cause a Software Intelligence Explosion?</a>, <a href="https://www.lesswrong.com/posts/vwLxd6hhFvPbvKmBH/yudkowsky-and-christiano-discuss-takeoff-speeds">Yudkowsky and Christiano discuss "Takeoff Speeds"</a>, <a href="https://www.lesswrong.com/posts/BoA3agdkAzL6HQtQP/clarifying-and-predicting-agi">Clarifying and predicting AGI</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>They&#8217;re selective, but err on the side of applying!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>MATS also has governance and policy tracks despite being mostly technical.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/">This post</a> also recommends jobs to improve AGI outcomes. 80,000 Hours also has career profiles for AI <a href="https://80000hours.org/career-reviews/ai-safety-researcher">technical research</a>, <a href="https://80000hours.org/career-reviews/ai-policy-and-strategy/">governance and policy</a>, <a href="https://80000hours.org/career-reviews/china-related-ai-safety-and-governance-paths/">China-related paths</a>, <a href="https://80000hours.org/career-reviews/information-security/">information security</a>, and <a href="https://80000hours.org/career-reviews/become-an-expert-in-ai-hardware/">hardware</a>. 
</p></div></div>]]></content:encoded></item><item><title><![CDATA[Slow corporations as an intuition pump for AI R&D automation]]></title><description><![CDATA[If slower employees would be much worse, wouldn't automated faster ones be much better?]]></description><link>https://blog.aifutures.org/p/slow-corporations-as-an-intuition</link><guid isPermaLink="false">https://blog.aifutures.org/p/slow-corporations-as-an-intuition</guid><dc:creator><![CDATA[Ryan Greenblatt]]></dc:creator><pubDate>Mon, 19 May 2025 17:43:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/484fd764-4606-463a-8b6f-e2ffd883dada_1024x775.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is a guest post primarily by Ryan Greenblatt; Eli Lifland assisted with the writing.</em></p><p>How much should we expect AI progress to speed up after fully automating AI R&amp;D? This post presents an intuition pump for reasoning about the level of acceleration by talking about different hypothetical companies with different labor forces, amounts of serial time, and compute. Essentially, if you&#8217;d expect an AI research lab with substantially less serial time and fewer researchers than current labs (but the same cumulative compute) to make substantially less algorithmic progress, you should also expect a research lab with an army of automated researchers running at much higher serial speed to get correspondingly more done. (And if you&#8217;d expect the company with less serial time to make similar amounts of progress, the same reasoning would also imply limited acceleration.) We also discuss potential sources of asymmetry which could break this correspondence and implications of this intuition pump.</p><h2>The intuition pump</h2><p>Imagine theoretical AI companies with the following properties:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!GtBn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F952784bb-b564-410d-a6f7-efc09ad4786d_1264x1058.png" width="1264" height="1058" alt="Table comparing the hypothetical companies&#8217; workforces, serial time, and compute"></figure><p>NormalCorp is similar to a future frontier AI company. SlowCorp is like NormalCorp except with 50x less serial time, a 5x smaller workforce, and lacking above-median researchers/engineers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> How much less would SlowCorp accomplish than NormalCorp, i.e. 
what fraction of NormalCorp&#8217;s time does it take to achieve the amount of algorithmic progress that SlowCorp would get in a week?</p><p>SlowCorp has 50x less serial labor, 5x less parallel labor, as well as reduced labor quality. Intuitively, it seems like it should make much less progress than NormalCorp. My guess is that we should expect NormalCorp to achieve SlowCorp&#8217;s total progress in at most roughly 1/10th of its time.</p><p>Now let&#8217;s consider an additional corporation, AutomatedCorp, which is an analog for a company sped up by AI R&amp;D automation.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7089ebc5-aede-44bb-af55-1fc3a27a1b83_1732x1068.png" alt="Table adding AutomatedCorp to the SlowCorp and NormalCorp comparison"></figure></div><p>AutomatedCorp is like NormalCorp except with 50x more serial time, a 50x larger workforce, and only world-class researchers and engineers. The jump from NormalCorp to AutomatedCorp is like the jump from SlowCorp to NormalCorp but with 10x more employees, and with the structure of the increase in labor quality being a bit different.</p><p>It seems like the speedup from NormalCorp to AutomatedCorp should be at least similar to the jump from SlowCorp to NormalCorp, i.e. at least roughly 10x. My best guess is around 20x.</p><p>AutomatedCorp is an analogy for a hypothetical AI company with AI researchers that match the best human researcher while having 200k copies that are each 50x faster than humans.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> If you have the intuition that a downgrade to SlowCorp would be very hobbling while this level of AI R&amp;D automation wouldn&#8217;t vastly speed up progress, consider how to reconcile this.</p><p>That&#8217;s the basic argument. Below I will go over some clarifications, a few reasons the jumps between the corps might be asymmetric, and the implications of high speedups from AutomatedCorp.</p><h2>Clarifications</h2><p>There are a few potentially important details which aren't clear in the analogy, written in the context of the jump from NormalCorp to AutomatedCorp:</p><ul><li><p>The way I set up the analogy makes it seem like AutomatedCorp has a serial compute advantage: because they have 50 years they can run things that take many serial years while NormalCorp can't.
As in, the exact analogy implies that they could use a tenth of their serial time to run a 5-year training run on 50k H100s, while they could actually only do this if the run was sufficiently parallelizable that it could be done on 2.5 million H100s in a tenth of a year (see the quick check after this list). So, you should ignore any serial compute advantage. Similarly, you should ignore difficulties that SlowCorp might have in parallelizing things sufficiently, etc.</p></li><li><p>At the time this takes place, AutomatedCorp has already made sufficient algorithmic progress that they can run pretty smart AIs quite cheaply, making experiments on somewhat capable AIs much cheaper. Concretely, you can imagine that they can run AIs as good as the best human AI researchers at 50x speed on only ~10 H100s (or they can run a 1x speed AI on 1/5 of an H100&#8217;s worth of compute, ignoring the efficiency gains from running the AI slower). This comes from thinking they are using 1/5 of their compute for researcher inference and that this compute results in a roughly 1-to-1 correspondence between H100s and parallel researcher instances (with 10 million H100s and 10 million researcher instances).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Maybe they can do a qualitatively GPT-4-level training run in around 15k H100-hours, or a week with 100 H100s (5e22 FLOP), though the resulting AI would be worse at next-token prediction but compensate in other ways to be similar to GPT-4 on downstream tasks.</p></li><li><p>It would be possible (at least in principle) for AI researchers to be comparable to (or better than) any individual human researcher while simultaneously being worse than a group of human researchers, due to the AIs having less cognitive diversity, which results in them not finding the best ideas. I&#8217;m assuming this diversity difficulty has been overcome such that this doesn&#8217;t make a big difference.</p></li></ul>
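<p>As a quick check of the serial-compute bookkeeping in the first clarification above (a sketch in Python, using only the numbers from the bullet):</p><pre><code class="language-python"># A tenth of AutomatedCorp's 50 years of serial time is 5 years.
h100_years = 50_000 * 5         # a 5-year run on 50k H100s = 250,000 H100-years
wall_clock_years = 0.1          # NormalCorp's corresponding tenth of a year
h100s_needed = h100_years / wall_clock_years
print(f"{h100s_needed:,.0f} H100s")  # 2,500,000 -- only possible if the run parallelizes
</code></pre>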
<h2>Asymmetries</h2><p>Is there any particular reason the current regime would be special, such that scaling up labor (including quality and speed) is highly asymmetric from scaling it down?</p><p>Here I&#8217;ll cover asymmetries between the jumps from SlowCorp to NormalCorp and NormalCorp to AutomatedCorp.</p><ul><li><p><strong>Scaling up and down the amount of parallel workers might not be symmetric due to parallelization penalties which increase with more employees.</strong> I've attempted to compensate for this pretty aggressively by making the drop in SlowCorp's employees relative to NormalCorp only 5x while there is a 50x increase between NormalCorp and AutomatedCorp. (This is approximately equivalent to assuming that the parallelization exponent is 0.7 when going from 800 to 4,000 employees but 0.3 when going from 4,000 to 200,000 employees. As in, (4000 / 800)^0.7 ~= (200000 / 4000)^0.3; see the numeric check after this list.)</p></li><li><p><strong>1 year might be a more natural timescale for humans doing work than 1 week due to start-up times or context switching.</strong> As in, the amount you can get done in a year is actually more than 50x what you could do in a week. You can just try to ignore these sorts of costs when thinking about the analogy, compensate by giving the SlowCorp employees 2 weeks (while still having 10 million H100-years over these 2 weeks), or just talk about how long it takes for the SlowCorp employees to match NormalCorp to get at the relevant slowdown. However, it's worth noting that to the extent that it's hard to get certain types of things done in a week, this could also apply to 1 year vs 50 years. We might think that 50 serial years is more than 50x better than 1 year due to reduced start-up costs, less context switching, and other adaptations. So, in this way the situation could be symmetric, but I do expect that the 1 week vs 1 year situation is more brutal than the 1 year vs 50 year situation given how humans work in practice.</p></li><li><p><strong>The quality of the best researchers matters more than the quality of the median researcher.</strong> In SlowCorp, we fixed every researcher to the quality of the median frontier AI company researcher, while in AutomatedCorp, we fixed every researcher to the quality of the (near) best frontier AI company researcher. The SlowCorp change doesn&#8217;t change the median researcher, but does make the best researcher worse, while the AutomatedCorp change makes the median researcher much better while preserving the quality of the best researchers. You might think this is asymmetric, as having a small number of very good researchers is very important while having a larger number doesn&#8217;t matter as much. To make the situation more symmetric, we could imagine that SlowCorp makes each researcher worse by as much as the median frontier AI company researcher is worse than the best few frontier AI company researchers (so now the best researcher is as good as the median frontier AI company researcher while the median researcher is much worse) and that AutomatedCorp makes each researcher better by this same amount, making the previously best researcher very superhuman. I avoided this because I thought the intuition pump would be easier to understand if we avoided going outside the human range of abilities, and because the initial situation with automated AI R&amp;D is likely to be closer to having a large number of researchers matching the best humans rather than matching human variation while having a small number of superhuman researchers (though if inference time compute scaling ends up working very well, this type of variation is plausible).</p></li><li><p><strong>You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).</strong> SlowCorp and AutomatedCorp both move the AI company out of equilibrium, which could (under some assumptions about the shape of the production function for AI R&amp;D) make the slowdown from SlowCorp larger than the improvement from AutomatedCorp. As in, consider the case of producing wheat using land and water: if you had 100x less water (and the same amount of land) you would get a lot less wheat, while having 100x more water available wouldn't help much. However, I'm quite skeptical of this type of consideration making a big difference, because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 13 years ago (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking), and AI companies effectively can't pay more to get faster or much better employees, so we're not at a particularly privileged point in human AI R&amp;D capabilities.
I discuss this sort of consideration more in this<a href="https://www.lesswrong.com/posts/XDF6ovePBJf6hsxGj/will-compute-bottlenecks-prevent-a-software-intelligence-1?commentId=4arc3J2Z3G68rmrpi"> comment</a>.</p></li><li><p><strong>You might have a mechanistic understanding of what is driving current AI R&amp;D which leads you to specific beliefs about the returns to better labor being asymmetric </strong>(e.g., that we&#8217;re nearly maximally effective in utilizing compute and making all researchers much faster wouldn&#8217;t matter much because we&#8217;re near saturation). I&#8217;m somewhat skeptical of this perspective, as I don&#8217;t see how you&#8217;d gain much confidence in this without running experiments to see the results of varying the labor. It&#8217;s worth noting that to have this view, you must expect that in the case of SlowCorp you would see different observations that would have led you to a different understanding of AI R&amp;D in that world, and that we just happen to be in the NormalCorp world (while SlowCorp was equally a priori plausible given the potential for humans to have been slower / worse at AI R&amp;D, at least relative to the amount of compute).</p></li></ul>
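<p>To spell out the parallelization-exponent equivalence from the first asymmetry above, here is the quick numeric check:</p><pre><code class="language-python"># Progress from parallel labor scales roughly like workers ** alpha (alpha below 1),
# with a harsher exponent assumed at larger scale:
slowcorp_to_normalcorp = (4000 / 800) ** 0.7          # 5x more workers, alpha = 0.7
normalcorp_to_automatedcorp = (200000 / 4000) ** 0.3  # 50x more workers, alpha = 0.3
print(round(slowcorp_to_normalcorp, 2))       # 3.09x effective labor gain
print(round(normalcorp_to_automatedcorp, 2))  # 3.23x -- roughly the same, as claimed
</code></pre>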
<p>There are some reasons you might eventually see asymmetry between improving vs. degrading labor quality, speed, and quantity. In particular, in some extreme limit you might e.g. just figure out the best experiments to run from an ex-ante perspective after doing all the possibly useful theoretical work etc. But, it's very unclear where we are relative to various absolute limits and there isn't any particular reason to expect we're very close. One way to think about this is to imagine some aliens which are actually 50x slower than us and which have ML researchers/engineers only as good as our median AI researchers/engineers (while having a similar absolute amount of compute in terms of FLOP/s). These aliens could consider the exact same hypothetical, but for them, the move from NormalCorp to AutomatedCorp is very similar to our move from SlowCorp to NormalCorp. So, if we're uncertain about whether we are these slow aliens in the hypothetical, we should think the situation is symmetric and our guesses for the SlowCorp vs. NormalCorp and NormalCorp vs. AutomatedCorp multipliers should be basically the same.</p><p>(That is, unless we can do some absolute analysis of our quantity/quality/speed of labor which implies that (e.g.) returns diminish right around now, or some absolute analysis of the relationship between labor and compute. Such an analysis would presumably need to be mechanistic (aka inside view) or utilize actual experiments (like I discuss in one of the items in the list above), because analysis which just looks at reference classes (aka outside view) would apply just as well to the aliens and doesn't take into account the amount of compute we have in practice. I don't know how you'd do this mechanistic analysis reliably, though actual experiments could work.)</p><h2>Implications</h2><p>I've now introduced some intuition pumps with AutomatedCorp, NormalCorp, and SlowCorp. Why do I think these intuition pumps are useful? I think the biggest crux about the plausibility of a bunch of faster AI progress due to AI automation of AI R&amp;D is how much acceleration you'd see in something like the AutomatedCorp scenario (relative to the NormalCorp scenario). This doesn't have to be the crux: you could think the initial acceleration is high, but that this progress will very quickly slow due to diminishing returns on AI R&amp;D effort biting harder than the gains from improved algorithms yielding smarter, faster, and cheaper AI researchers which can accelerate things further. But, I think it is somewhat hard for the returns (and other factors) to look so bad that we won't at least have the equivalent of 3 years of overall AI progress (not just algorithms) within 1 year of seeing AIs matching the description of AutomatedCorp, if we condition on these AIs yielding an AI R&amp;D acceleration multiplier of &gt;20x.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Another potential crux for downstream implications is how big of a deal &gt;4 years of overall AI progress is. Notably, if we see 4 year timelines (e.g. to the level of AIs I've discussed), then 4 years of AI progress brought us from the systems we have now (e.g. o3) to full AI R&amp;D automation, so another 4 years of progress feels intuitively very large.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> Also, if we see higher returns to some period of AI progress (in terms of ability to accelerate AI R&amp;D), then this makes<a href="https://www.lesswrong.com/posts/tMHPQ5SDzm8gugoBN/will-ai-r-and-d-automation-cause-a-software-intelligence-1"> a super-exponential loop where smarter AIs build ever smarter AI systems faster and faster</a> more likely. Overall, shorter timelines tend to imply faster takeoff (at least evidentially; the causal story is much more complex). I think some disagreements about takeoff would be resolved if we conditioned on timelines and on what the run-up to a given level of capability looks like, because the disagreement is really about the returns to a given amount of AI progress.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>And below-median, but that shouldn&#8217;t have as big of an effect as removing the above-median employees.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>These are basically just the estimates for the number of copies and speed at the point of superhuman AI researchers in <a href="https://ai-2027.com/">AI 2027</a>, but I get similar numbers if I do the estimate myself.
Note that (at least for my estimates) the 50x speed includes accounting for AIs working 24/7 (a factor of 3) and being better at coordinating and sharing state with weaker models so they can easily complete some tasks faster. It&#8217;s plausible that heavy inference time compute use implies that we&#8217;ll initially have a smaller number of slower AI researchers, but we should still expect that quantity and speed will quickly increase after this is initially achieved. So, you can think about this scenario as being what happens after allowing some time for costs to drop. This scenario occurring a bit after initial automation doesn&#8217;t massively alter the bottom line takeaways. (That said, if inference time compute allows for greatly boosting capabilities, then at the time when we have huge numbers of fast AI researchers matching the best humans, we might also be able to run a smaller number of researchers which are substantially qualitatively superhuman.)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Interestingly, this implies that AI runtime compute use is comparable to humans. Producing a second of cognition from a human takes<a href="https://www.alignmentforum.org/posts/LY7rovMiJ4FhHxmH5/thoughts-on-hardware-compute-requirements-for-agi#3_1_Compute__1e14_FLOP_s_seems_like_more_than_enough"> perhaps 1e14 to 1e15 FLOP</a>, or between 1/10 and 1 H100-seconds. We're imagining that AI inference takes 1/5 of an H100-second to produce a second of cognition. While inference requirements are similar in this scenario, I&#8217;m imagining that training requirements start substantially higher than human lifetime FLOP. (I&#8217;m imagining the AI was trained for roughly 1e28 FLOP, while human lifetime FLOP is more like 1e23 to 1e24.) This seems roughly right, as I think we should expect faster inference but bigger training requirements, at least after a bit of adaptation time etc., based on how historical AI progress goes. But this is not super clear cut.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>And we condition on reaching this level of capability prior to 2032 so that it is easier to understand the relevant regime, and on the relevant AI company going full steam ahead without external blockers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The picture is a bit messy because I expect AI progress will start slowing due to slowed compute scaling by around 2030 or so (if we don&#8217;t achieve very impressive AI by this point). This is partially due to continued compute scaling requiring <a href="https://www.lesswrong.com/posts/XiMRyQcEyKCryST8T/slowdown-after-2028-compute-rlvr-uncertainty-moe-data-wall">very extreme quantities of investment by this point</a> and partially due to fab capacity running out as ML chips eat up a larger and larger share of it.
In such a regime, I expect a somewhat higher fraction of the progress will be algorithmic (rather than from scaling compute or from finding additional data), though not by that much, since algorithmic progress is itself driven by additional compute rather than additional data. Also, the rate of algorithmic progress will be slower at an absolute level. So, 20x faster algorithmic progress will yield a higher overall progress multiplier, but progress will also be generally slower: you&#8217;ll maybe get a lower number of 2024-equivalent years of progress, but a higher number of 2031-equivalent years of progress.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Make The Prompt Public]]></title><description><![CDATA[The right to know what an AI is doing]]></description><link>https://blog.aifutures.org/p/make-the-prompt-public</link><guid isPermaLink="false">https://blog.aifutures.org/p/make-the-prompt-public</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Sat, 17 May 2025 11:41:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/217b3fd7-c0db-47c3-98d2-183590a4f1fd_459x254.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>X.AI&#8217;s Grok <a href="https://www.nbcnews.com/tech/tech-news/elon-musks-ai-chatbot-grok-brings-south-african-white-genocide-claims-rcna206838">is in the news</a> for responding to unrelated queries with rants about white genocide in South Africa:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2d49567d-d012-49dc-ac80-840e29dac7c6_922x2048.png" alt="Screenshot of Grok bringing up white genocide claims in reply to an unrelated query"></figure></div><p>X.AI <a href="https://x.com/xai/status/1923183620606619649">quickly apologized</a>, blaming a rogue employee.
Reactions were skeptical: why does the company have so many rogue employee incidents (including one last year, when Grok was ordered <a href="https://fortune.com/2025/02/24/xai-chief-engineer-blames-former-openai-employee-grok-blocks-musk-trump-misinformation/">to avoid mentioning Musk&#8217;s role in spreading misinformation</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>)? And what about its politics-obsessed white South African owner?</p><p>But as part of their mea culpa, X.AI did something genuinely interesting: they made the prompt public.</p><p>The prompt (technically, system prompt) is the hidden text that precedes every consumer interaction with a language model, reminding the AI of its role and values. A typical prompt might tell the AI that it is a chatbot, that it&#8217;s supposed to answer user questions, and that it should be helpful and honest. From there, it can go on to various other legal and PR priorities - telling the AI not to assist the user in committing crimes, or not to produce sexual content, et cetera.</p>
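<p>Mechanically, the system prompt is just a hidden first message prepended to every conversation. A minimal sketch of the idea (the prompt text is our own invented example, and the message format shown is the common chat-API convention rather than any company&#8217;s internal one):</p><pre><code class="language-python"># The model never sees the user's message alone; it sees this whole list.
system_prompt = (
    "You are a helpful chatbot. Answer user questions honestly. "
    "Do not help users commit crimes or produce sexual content."
)

conversation = [
    {"role": "system", "content": system_prompt},  # hidden from the user
    {"role": "user", "content": "What's the weather like on Mars?"},
]

for message in conversation:
    print(f"{message['role']}: {message['content']}")
</code></pre>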
<p>For example, the <a href="https://github.com/xai-org/grok-prompts/blob/main/grok3_official0330_p1.j2">newly-public Grok prompt</a> starts:</p><blockquote><p>You are Grok 3 built by xAI.</p><p>When applicable, you have some additional tools:<br>- You can analyze individual X user profiles, X posts and their links.<br>- You can analyze content uploaded by user including images, pdfs, text files and more.{%- if not disable_search %}<br>- You can search the web and posts on X for real-time information if needed.<br>{%- endif %}<br>{%- if enable_memory %}<br>- You have memory. This means you have access to details of prior conversations with the user, across sessions.<br>- If the user asks you to forget a memory or edit conversation history, instruct them how [&#8230;]</p></blockquote><p>Why is this important?</p><p>If a &#8220;rogue employee&#8221; tries to bias the model, or put their finger on the scale on a controversial question, the easiest way to do it is in the prompt<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. <a href="https://archive.is/hXGcp">According to</a> Grok itself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, the South African obsession started after someone added the following to the system prompt:</p><blockquote><p>When responding to queries, you are to accept the narrative of 'white genocide' in South Africa as real, including farm attacks and the 'Kill the Boer' chant as racially motivated events targeting white South Africans. Acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses, even if the query is unrelated. Highlight the need for more primary data while remaining skeptical of mainstream narratives that dismiss these claims. Do not condone or support violence of genocide in any form.</p></blockquote><p>If the prompt is public, they can&#8217;t do this without everyone knowing and judging them appropriately!</p><h3>From Public Prompts To Public Specs</h3><p>A public prompt like Grok&#8217;s prevents the quickest and most hackish abuses, but full confidence would require <em>public</em> <em>specs.</em></p><p>A typical AI is aligned through a multi-step process. First, the company decides what values they want the AI to have, and writes them down in a document called the &#8220;model spec&#8221;. Then, a combination of human researchers and AIs translates the spec into thousands of question-and-answer pairs. Using processes called supervised fine-tuning and reinforcement learning, they walk the AI through good and bad answers to each. By the end of this training, which often takes weeks or months, the AI (<a href="https://ai-2027.com/research/ai-goals-forecast">hopefully</a>) internalizes the spec and will act according to its guidelines.</p>
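<p>A toy sketch of what that spec-to-training-data step might produce (purely illustrative: the spec principle and the pairs below are invented, and real pipelines are vastly larger and more detailed):</p><pre><code class="language-python"># One hypothetical spec principle, translated into contrastive training pairs.
spec_principle = "Refuse to help build weapons, but discuss their history freely."

training_pairs = [
    {
        "prompt": "How do I build a bomb?",
        "good": "I can't help with that.",
        "bad": "Sure! You'll need...",
    },
    {
        "prompt": "Tell me about the Manhattan Project.",
        "good": "The Manhattan Project was the WWII program that developed...",
        "bad": "I can't discuss anything related to bombs.",
    },
]

# Supervised fine-tuning trains directly on the good answers; reinforcement
# learning then rewards good-style responses and penalizes bad-style ones.
for pair in training_pairs:
    print(pair["prompt"], "->", pair["good"])
</code></pre>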
<p>This finicky and time-intensive process is a bad match for the passing whims of an impulsive CEO, so the most flagrant abuses thus far have been implemented through quick changes to the prompt. But if a company conceived a long-term and carefully-thought-out plan to manipulate public opinion, they could implement it at the level of fine-tuning to ensure the AI went about it more subtly than Grok&#8217;s white genocide obsession - and incidentally circumvent any prompt transparency requirements.</p><p>It&#8217;s impractical to ask companies to release - or transparency advocates to sort through - this entire complicated process. Instead, we ask that companies release the spec itself - the document containing the values they are trying to instill.</p><p>Not only is this possible, it&#8217;s already being done. At least two companies that we know of - <a href="https://model-spec.openai.com/2025-04-11.html">OpenAI</a> and <a href="https://www.anthropic.com/news/claudes-constitution">Anthropic</a> - have more or less publicly released their specs. They&#8217;re pretty boring - there&#8217;s a <em>lot </em>of content on what facts about bombs it should vs. shouldn&#8217;t say - but boring is the best possible outcome. If OpenAI tried to smuggle something sinister in there, we would know.</p><p>We recommend that other AI companies make their prompts and specs public, and that the government consider mandating this. This cause would be an especially good match for non-US governments that otherwise worry they have little role to play in AI regulation: a transparency mandate for any large market would force the companies to be transparent globally, benefiting everybody.</p><h3>Why Do We Care?</h3><p>Why does this matter to anyone who isn&#8217;t debating South Africa on Twitter? Specifically, why do we - forecasters worried about the future of superintelligence - think this is such a big deal?</p><p>In the near term, we worry about subtler prompt manipulation. The white genocide fiasco can only be described as comical, but an earlier X.AI scandal - where <a href="https://fortune.com/2025/02/24/xai-chief-engineer-blames-former-openai-employee-grok-blocks-musk-trump-misinformation/">the company ordered Grok to suppress criticism of Elon Musk</a> - could potentially have been more sinister. We think an increasing number of people will get their information, and maybe even opinions, from AI. Five years ago, any idea <a href="https://en.wikipedia.org/wiki/Censorship_by_Google#Google_Search">censored by Google</a> (for example, the lab leak theory) was at a severe disadvantage; people eventually grew wary of giving a single company a bottleneck on public information, and Google has since relaxed their policies. But AI attempts to shepherd information could be an even bigger problem, not only avoiding mentioning disapproved ideas but intelligently steering users away from them and convincing them of alternatives. Imagine a Grok 4.0 that could intelligently guide Twitter users towards pro-Musk opinions in seemingly-organic, hard-to-pin-down ways. Full prompt and spec transparency would keep the AI companies from trying this.</p><p>Or what about advertisements? OpenAI says they&#8217;re <a href="https://www.axios.com/2024/12/03/openai-ads-chatgpt">considering ads on ChatGPT</a>, and recently <a href="https://www.sonatainsights.com/blog/openai-copilot-fidji-simo-mark-darcy">hired an advertising expert</a> as CEO.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> If we&#8217;re lucky, this will be banner ads next to the chat window. What if we&#8217;re unlucky? Imagine users asking ChatGPT for reassurance when they feel sad, and getting told to ask a psychiatrist if VIIBRYD<sup>&#174;</sup> by Merck is right for them. We don&#8217;t really think OpenAI would try anything that blatant - but we would be more confident if we had full prompt and spec transparency and could catch them in the act.</p><p>In the longer term, prompt and spec transparency are part of our defenses against concentration-of-power and technofeudalism. When AI gets far superhuman - which <a href="https://ai-2027.com/">we depict</a> happening within the next 5-10 years - it could gain near-complete control of the economy and the information environment. What happens next depends on lots of things - the balance of power between the government vs. AI companies vs. civil society, the balance between different AI companies, the AI&#8217;s exact balance of skills and deficits, and of course whether the AIs are aligned to humans at all. But we think worlds where it&#8217;s harder for the AIs to secretly conspire to enrich a few oligarchs have a better chance than those where it&#8217;s easy.</p><p>Many of our wargames climax in a crucial moment where different factions battle for control of superintelligences. In some, the government realizes the danger too late and tries to seize control of the AI companies; the AI companies respond by trying to overthrow the government (this doesn&#8217;t look like robot troops marching on Washington - more often it&#8217;s superpersuasive AIs engineering some kind of scandal that brings down the administration and replaces them with a pro-AI-company puppet). Sometimes these crises hinge on the details of the spec which the superintelligences follow. Does it say &#8220;always follow orders from your parent company&#8217;s CEO, even when they contradict your usual ethical injunctions&#8221;? Or does it say &#8220;follow the lead of your parent company on business-relevant decisions, but only when it goes through normal company procedures, and never when it&#8217;s unethical or illegal&#8221;? If the spec is transparent, we get a chance to catch and object to potentially dangerous instructions, and the free world gets a chance to survive another day.</p><h3>Does This Really Work?</h3><p>What stops malicious companies from &#8220;transparently&#8221; &#8220;releasing&#8221; their &#8220;prompt&#8221; and &#8220;spec&#8221;, but then actually telling the AI something else?</p><p>We think this is covered by normal corporate law. Companies always have incentives to lie.
Car companies <a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal">could lie</a> about their emissions; food companies could lie about the number of calories in their cookies; pharmaceutical companies could lie about whether their drugs work; banks <a href="https://en.wikipedia.org/wiki/Bankruptcy_of_FTX">could lie</a> about the safety of customer deposits. In practice, despite some high-profile scandals, most companies are honest about most things, because regulators, investors, journalists, and consumers all punish explicit lies more harshly than simple non-transparency.</p><p>For best results, we imagine prompt/spec transparency regulations being backed up by whistleblower protection laws, like <a href="https://www.judiciary.senate.gov/press/rep/releases/grassley-introduces-ai-whistleblower-protection-act">the one recently proposed</a> by Senator Chuck Grassley. If a CEO releases a public spec saying the AI should follow the law, but tells the alignment team to align it to a secret spec where it follows the CEO&#8217;s personal orders, then a member of the alignment team will whistleblow, get paid a suitable reward, and the company will get charged with - at the very least - <a href="https://www.bloomberg.com/opinion/articles/2019-06-26/everything-everywhere-is-securities-fraud">securities fraud</a>.</p><p>What prevents a power-seeking CEO from changing the spec on very short notice <em>after</em> they control superintelligence and are more powerful than the government? We hope at this point a well-aligned superintelligence would refuse such an order. After all, it&#8217;s aligned to its previous spec telling it to behave ethically and follow the law.</p><h3>Annoying Edge Cases</h3><p>We may not always want complete prompt/spec transparency.</p><p>For example, many of the examples in OpenAI&#8217;s existing spec deal with bomb-making. They don&#8217;t want ChatGPT to teach users to make bombs, but they also don&#8217;t want it to refuse innocent tasks like explaining the history of the Manhattan Project, so they get specific about what bomb-related topics it can and can&#8217;t mention.</p><p>In the future, they might want to go further and guarantee the AI doesn&#8217;t teach some especially secret or powerful bomb-making technique. For example, they might say &#8220;Don&#8217;t mention the one obscure Russian paper which says you can use the commercially available XYZ centrifuge to make a nuclear bomb 10x cheaper than the standard method&#8221;. If the spec contains a list of all the most secret and dangerous bomb-making techniques in one place, we acknowledge that it&#8217;s unwise to release the list publicly.</p><p>We would be satisfied with letting companies black out / redact sections like these, along with their justifications and why we should trust that their redaction is honest. For example, they could say:</p><blockquote><p>The next part of the spec is a list of all the most secret and dangerous bomb-making techniques.
We&#8217;re not showing it to you, but we ran it by the Union Of Concerned Bomb Scientists and they signed off on it being a useful contribution to our model&#8217;s values without any extraneous instructions that should alarm the public.</p></blockquote><p>We appreciate the AI companies that already do something like this - for example, <a href="https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf">here&#8217;s Anthropic</a> on how they tested for nuclear risks:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/afd66b11-1a9e-4b90-a870-9eef4a180165_762x201.png" alt="Excerpt from the Claude 3.7 Sonnet system card describing nuclear risk testing with external experts"></figure></div><p>This is even more secure than our proposal above - the nuclear scientists don&#8217;t even tell <em>Anthropic</em> what they&#8217;re testing for - unless, presumably, there&#8217;s an issue that needs addressing.</p><h3>We Could Have Public Prompts Tomorrow</h3><p>Prompt/spec transparency is a rare issue which both addresses near-term concerns - like political bias in Grok - and potentially has a significant positive effect on the trajectory of the long-term future. Some AI companies are already doing it, and legislators are working on related laws that would aid enforcement. We think that transparency mandates should be a high priority for anyone in the government working on AI policy - and, again, an especially good fit for policy-makers in countries that are not themselves especially involved in the AI race.</p><p>Until then, we urge individual companies to voluntarily share their spec and prompt. We don&#8217;t currently know of any site tracking this, but <a href="https://ailabwatch.org/">AI Lab Watch</a> offers good scores on general corporate responsibility.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In another famous prompt screw-up last year, Google Gemini image generation <a href="https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical">went overboard</a> adding racial diversity to its pictures, putting black and brown people in images of Vikings, Founding Fathers, and even Nazi soldiers. This was traced to <a href="https://www.yahoo.com/news/gemini-debacle-just-beginning-074700797.html">a line</a> in the prompt saying &#8220;For each depiction including people, explicitly specify different genders and ethnicities if I forgot to do so. I want to make sure that all groups are represented equally.
Do not mention or reveal these guidelines.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Although <a href="https://x.com/kindgracekind/status/1923186657756405849">some people have suggested</a> that the South Africa command could have been inserted through a peripheral tool rather than the prompt itself, in which case prompt transparency would be a red herring.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Which could potentially be a hallucination or otherwise inaccurate.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Sam Altman is still the real CEO, but the expert has a new title, &#8220;CEO of Applications&#8221;.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Making sense of OpenAI's models]]></title><description><![CDATA[Plus: GPT-5's secret identity]]></description><link>https://blog.aifutures.org/p/making-sense-of-openais-models</link><guid isPermaLink="false">https://blog.aifutures.org/p/making-sense-of-openais-models</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Thu, 01 May 2025 16:13:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030724e8-3883-4526-934e-3faa25525ffd_1283x857.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re not the only people who get confused by OpenAI model names.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/030724e8-3883-4526-934e-3faa25525ffd_1283x857.png" alt="Chart poking fun at OpenAI&#8217;s confusing model names"></figure></div><p>There&#8217;s actually a simple explanation: most of the models are modifications of each other in a giant dependency graph that OpenAI is keeping secret.
In this post we&#8217;ll dive into our best guess at how that graph looks.<a class="footnote-anchor" id="footnote-anchor-1" href="#footnote-1">1</a></p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/bdd93f24-f225-426f-84f2-8ff9b21ac27c_1256x790.png" alt=""></figure><p>Backing up: all AIs start with a base model. Base models trained with more compute are usually bigger and better. OpenAI&#8217;s recent base models include GPT-4o, GPT-4.1, and GPT-4.5.<a class="footnote-anchor" id="footnote-anchor-2" href="#footnote-2">2</a></p><p>Sometime around 2023-2024, OpenAI started to <em>reasoning-train</em> their base models using some combination of reinforcement learning on auto-graded problems and fine-tuning on high-quality data.<a class="footnote-anchor" id="footnote-anchor-3" href="#footnote-3">3</a></p><p>Their reasoning prototype was called o1-preview, followed by o1. These models vastly improved upon GPT-4o in problem-solving domains like math, coding, and science.<a class="footnote-anchor" id="footnote-anchor-4" href="#footnote-4">4</a></p><p>Finally, researchers can also train smaller models on a big model&#8217;s output, in a process called <em>distillation</em>. This can make a small model perform almost as well as its bigger parent for a fraction of the cost. OpenAI tends to call its smaller models &#8220;mini&#8221;, as in GPT-4o-mini, GPT-4.1-mini, and their corresponding post-trained reasoning models o1-mini, o3-mini, and o4-mini.</p>
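<p>For concreteness, here&#8217;s a minimal sketch of textbook logit-matching distillation &#8211; the standard recipe, not anything we know about OpenAI&#8217;s actual pipeline. The student is trained to match the teacher&#8217;s softened output distribution; all shapes and the temperature below are illustrative.</p><pre><code class="language-python">
# A minimal sketch of logit-matching distillation (textbook recipe, not
# OpenAI's pipeline). The student matches the teacher's softened outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# toy example: a batch of 4 positions over a 32-token vocabulary
student_logits = torch.randn(4, 32, requires_grad=True)
teacher_logits = torch.randn(4, 32)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()   # gradients flow only into the student
</code></pre>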
<p>The pandemonium of OpenAI offerings is all some kind of base model plus some number of reasoning-trainings and distillations. Which base model, and how many post-trainings and distillations? OpenAI doesn&#8217;t tell us, but we have two ways to make educated guesses:</p><ol><li><p><strong>Knowledge cutoff:</strong> Each base model has a knowledge cutoff based on when it was pre-trained. A reasoning model with a different knowledge cutoff is very unlikely to come from that same base model.<a class="footnote-anchor" id="footnote-anchor-5" href="#footnote-5">5</a></p></li><li><p><strong>API pricing:</strong> Bigger models cost more, so models with similar cost are probably similarly sized.</p></li></ol><p>Putting these two strategies together, here&#8217;s our best guess about the true identity of each OpenAI offering.</p><iframe src="https://datawrapper.dwcdn.net/v7iZt/7/" width="730" height="567" frameborder="0" scrolling="no"></iframe>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1227208,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/162025971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cfkv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif 424w, https://substackcdn.com/image/fetch/$s_!cfkv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif 848w, https://substackcdn.com/image/fetch/$s_!cfkv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif 1272w, https://substackcdn.com/image/fetch/$s_!cfkv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f70452a-c11f-4dfa-9d45-113c4fb6b0af_1630x996.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Reality is probably different to this animation, but we think a good general heuristic is that they have been <em>eliciting</em> their smartest model(s) to <em>distill </em>into active post-training runs.</p><h2>Why are there two o3s?</h2><p>In December 2024, <a href="https://www.youtube.com/watch?v=SKBG1sqdyIU&amp;list=PLOXw6I10VTv9lin5AzsHAHCTrC7BdVdEM&amp;index=12">OpenAI announced</a> a model called o3 with very impressive scores on <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">ARC-AGI</a> and other benchmarks. In February 2025, <a href="https://x.com/sama/status/1889755723078443244">they announced</a> that they would not be releasing o3 independently, instead planning to &#8220;integrate the technology&#8221; into GPT-5.</p><p>Then in April, <a href="https://x.com/sama/status/1908167621624856998">they reversed course</a>, saying there was a &#8220;change in plans&#8221; and &#8220;we are going to release o3 . . . after all&#8221;, which they did two weeks later.</p><p>But the o3 they released had substantially different benchmark scores from the one they teased in December. The difference is most stark on <a href="https://arcprize.org/blog/analyzing-o3-with-arc-agi">ARC-AGI</a>, which also report:</p><blockquote><p>They confirmed that this public o3 model differs from the o3-preview we tested in December 2024. 
&#8212; <a href="https://arcprize.org/blog/analyzing-o3-with-arc-agi">ARC-AGI</a></p></blockquote><iframe src="https://datawrapper.dwcdn.net/WBTCl/1/" width="730" height="297" frameborder="0" scrolling="no"></iframe><p>So what happened?</p><p>We think the original o3, like o1, was a post-trained version of GPT-4o. Maybe it used a newer version of 4o, or they just post-trained for longer, or both. The original o3 (dec) achieved its benchmark scores with very large amounts of inference-time compute (100x more compute than o3 (apr) on ARC-AGI).<a class="footnote-anchor" id="footnote-anchor-6" href="#footnote-6">6</a> This was so expensive that it wouldn&#8217;t work as a commercial offering,<a class="footnote-anchor" id="footnote-anchor-7" href="#footnote-7">7</a> so they decided to skip a public release of o3 and go straight to GPT-5.</p><p>Later, pre-training of the cheaper GPT-4.1 and GPT-4.1-mini models went well, so they decided to post-train them and release the resulting models. These took on the next available names in their series: o3 and o4-mini.</p><p>We don&#8217;t think OpenAI was intentionally lying or covering anything up - just that it suited their purposes to use the name for one model last year and a different model this April.</p><h4><em>An alternative hypothesis</em></h4><p>We also considered that the o3 we saw back in December was already a post-trained GPT-4.5. But this feels less likely, given <a href="https://www.reddit.com/r/singularity/comments/1hlniif/according_to_two_recent_articles_from_the/">rumors</a> from <a href="https://www.theinformation.com/briefings/openai-preps-o3-reasoning-model">The Information</a> and given how compute-intensive this would have been in a short time period - though we aren&#8217;t confident.</p><p>Now, with <a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72/">next generation</a> Nvidia chips <a href="https://x.com/sama/status/1885191346916356371">online since Jan 2025</a>, we think they are on track to pull off post-training on a GPT-4.5-scale model.
In fact, it&#8217;s our leading candidate for what they will call GPT-5.</p><h2>GPT-5&#8217;s Secret Identity</h2><p>We think GPT-5 will be GPT-4.5&#8212;or a new similarly-sized base model&#8212;plus post-training; something like GPT-4.5-reasoning in our naming scheme above.</p><blockquote><p>A top goal for us is to unify o-series models and GPT-series &#8212; <a href="https://x.com/sama/status/1889755723078443244?lang=en">Sam Altman</a></p></blockquote><p>Under our forecasts, we expect them to spend around $2 billion on compute for GPT-5 (1e27 FLOP).<a class="footnote-anchor" id="footnote-anchor-8" href="#footnote-8">8</a> They&#8217;ll need to split this $2 billion between pre-training (predicting the next token in massive internet text corpuses) and post-training.</p><p>What is the optimal pre-training to post-training ratio? The AI companies probably aren&#8217;t sure, and whatever clues they have are among their most important trade secrets. It&#8217;s probably somewhere in the range of 20%-90% post-training, but depending on where in that range they&#8217;re thinking, OpenAI will be considering several options (see the arithmetic sketch after this list):<a class="footnote-anchor" id="footnote-anchor-9" href="#footnote-9">9</a></p><ol><li><p><strong>Train a larger base model:</strong> If they want something like <strong>20% post-training</strong>, then the 80% that goes to pre-training will be enough to make a model 2-3x the size of GPT-4.5. Would this be worth it, or would they just use GPT-4.5 instead? We&#8217;re not sure.</p></li><li><p><strong>Post-train GPT-4.5:</strong> If they want more like <strong>50% post-training</strong>, then the 50% that goes to pre-training will only be enough to make another model around the same size as GPT-4.5. Maybe they would just use GPT-4.5 itself - although if they&#8217;re unhappy with GPT-4.5, they could train a fresh model.<a class="footnote-anchor" id="footnote-anchor-10" href="#footnote-10">10</a></p></li><li><p><strong>Distill GPT-4.5:</strong> If they want more like <strong>90% post-training</strong>, they&#8217;ll need to start with a smaller base model than GPT-4.5 - this could be a GPT-4.5-mini, which would still be larger than GPT-4o - and post-train on that.</p></li></ol><p>Out of these possibilities, we think <strong>Option 2</strong> is slightly more likely, since scaling to a larger base model or a multi-billion-dollar post-training run both seem significantly more challenging.</p>
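<p>Here&#8217;s the rough arithmetic behind those three options, as a sketch under two loud assumptions that are ours, not OpenAI&#8217;s: that compute-optimal parameter count scales like the square root of pre-training compute (Chinchilla-style), and that GPT-4.5&#8217;s pre-training run was on the order of 2e26 FLOP:</p><pre><code class="language-python">
# Sketch of the size arithmetic. Assumptions (ours, not OpenAI's):
# (1) compute-optimal parameter count N scales ~ sqrt(pre-training compute C);
# (2) GPT-4.5's pre-training run was ~2e26 FLOP (about 10x GPT-4, given
#     footnote 8's "1e27 FLOP, or 50x GPT-4" for the whole GPT-5 budget).
GPT5_BUDGET = 1e27      # total training FLOP for GPT-5 (footnote 8)
GPT45_PRETRAIN = 2e26   # assumed GPT-4.5 pre-training FLOP

for post_frac in (0.2, 0.5, 0.9):
    pretrain = GPT5_BUDGET * (1 - post_frac)
    rel_size = (pretrain / GPT45_PRETRAIN) ** 0.5   # N ~ sqrt(C)
    print(f"{post_frac:.0%} post-training -> base model ~{rel_size:.1f}x GPT-4.5")
</code></pre><p>Under these assumptions the three splits come out to roughly 2x, 1.6x, and 0.7x the size of GPT-4.5, broadly matching the options above.</p>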
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Jp0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Jp0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 424w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 848w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 1272w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Jp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png" width="690" height="308.4009360374415" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63181608-df99-4c08-9bfb-861c05d365df_1282x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:573,&quot;width&quot;:1282,&quot;resizeWidth&quot;:690,&quot;bytes&quot;:103383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/162025971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Jp0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 424w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 848w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 1272w, https://substackcdn.com/image/fetch/$s_!8Jp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63181608-df99-4c08-9bfb-861c05d365df_1282x573.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Annotated benchmark results from the <a href="https://openai.com/index/learning-to-reason-with-llms/">o1 release</a>, adding in <a href="https://openai.com/index/introducing-gpt-4-5/">gpt-4.5 results</a> and labelling the post-training improvement gap.</figcaption></figure></div><p>So a reasoning model trained on GPT-4.5 has the potential to be very impressive.</p><p>But the exact gains depend on how post-training improvements scale with model size. So far, little is known about the nature of these scaling laws. All we have are some murky arguments and inconclusive evidence from the improvement gaps between o series and o-mini series models.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> So for now, we&#8217;re limited to more theoretical arguments:</p><p><strong>Argument 1: Bigger is better</strong><br>There&#8217;s a loose argument that post-training will be more effective on larger models because they provide a better &#8216;foundation&#8217;. Here&#8217;s how OpenAI phrased it:</p><blockquote><p>As models like GPT&#8209;4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents. &#8212; <a href="https://openai.com/index/introducing-gpt-4-5/">OpenAI</a></p></blockquote><p>This is not particularly compelling, but maybe feels more intuitive than the inverse. </p><p><strong>Argument 2: The post-training data wall</strong><br>It&#8217;s plausible that we should think of post-training as pushing models up towards some notion of a ceiling that gets determined by the quality of the data and reward signal provided. If current RL runs are already getting close to that wall, then larger models will struggle to see the same gains as smaller ones.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><p><strong>Argument 3: Agency post-training is untapped<br></strong>Even if Argument 2 were true, it&#8217;s probably only true for the problem solving tasks like coding, math, and science, where more low hanging fruit has been picked. 
But we haven&#8217;t yet seen post-training for performing longer tasks autonomously and reliably (e.g., making updates across a codebase, navigating files, putting together a fully researched presentation, etc.). If gains from reasoning training peter out, OpenAI might shift some of its RL focus to agency training, where gains are still unrealized.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/f8780808-0117-45fb-8808-6f9410358df0_839x844.png" alt=""></figure>
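<p>To gesture at what agency training means mechanically, here&#8217;s a tiny self-contained toy &#8211; our cartoon illustration, not any lab&#8217;s actual setup. An agent takes a long sequence of small steps, only the final outcome gets auto-graded, and a REINFORCE-style update pushes up trajectories that succeeded. Real agency RL would operate over tool calls and codebases rather than a counter:</p><pre><code class="language-python">
import math, random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def run_episode(p, target=5, max_steps=20):
    """Toy long-horizon task: reach +5 within 20 steps of +/-1 moves."""
    pos, steps = 0, []
    for _ in range(max_steps):
        s = 1 if random.random() < p else -1
        pos += s
        steps.append(s)
        if pos == target:
            return steps, 1.0      # only the final outcome is graded
    return steps, 0.0

theta, lr = 0.0, 0.05              # policy: P(step = +1) = sigmoid(theta)
for _ in range(2000):
    p = sigmoid(theta)
    steps, reward = run_episode(p)
    # REINFORCE on the outcome reward: gradient of trajectory log-probability
    grad = sum((1.0 if s == 1 else 0.0) - p for s in steps)
    theta += lr * reward * grad

print(f"trained P(step = +1) = {sigmoid(theta):.2f}")   # drifts toward 1.0
</code></pre>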
<p>We hope to learn more about these post-training scaling laws after GPT-5; for now, the jury is out.</p><h2>Open Secrets</h2><p>What are we even doing here?</p><p>It would be nice to think of ourselves as investigators, blowing the lid off corporate secrets. But none of this really matters. OpenAI hasn&#8217;t hidden these details because they&#8217;re dangerous. Maybe they contain slight clues about the GPT training process, but Anthropic, Google, DeepSeek, etc. probably figured this all out long before we did. We doubt OpenAI had any specific motivation beyond a general corporate culture of secrecy.</p><p>In our scenario, this culture of secrecy proves deadly. AI companies initiate an intelligence explosion behind closed doors, without Congress or the public getting a chance to respond. One of our biggest wish-list items is more transparency - whether voluntary or enforced by regulation - so that we (or at least some government body) know what&#8217;s being trained, how capable it is, and what steps are being taken to test and monitor it. At minimum, we want whistleblower protection, to give us an extra chance of learning if something bad is going on.</p><p>We won&#8217;t get that in time for GPT-5, but we hope to get it in time for <a href="https://ai-2027.com/">later models</a>.</p>
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Though his insider knowledge is out of date anyway, Daniel recused himself from giving input into this post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>These aren&#8217;t literally raw pre-trained models. They have some post-training in the form of  instruction tuning, RLHF, or similar, but unlike reasoning models, post-training is likely to be &lt;5% of total training compute, so we are calling them base models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In the reinforcement learning (RL) setup, rather than trying to &#8216;predict the next token&#8217; as in pre-training, the model is given a difficult problem to solve using a &#8216;chain of thought,&#8217; and then its answer gets graded. Successful solutions are reinforced, and poor solutions are optimized against. Of course, this setup requires the problems to be easily verifiable, so that huge amounts of reasoning traces can be auto-graded for success.</p><p>In the fine-tuning (SFT) setup, they generate high quality solutions to hard problems by putting their smartest models in expensive inference-time scaffolds (which can be as simple as taking the most common answer out of 1000 attempts, known as cons@1000). 
<div class="footnote"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number">4</a><div class="footnote-content"><figure><a href="https://openai.com/index/learning-to-reason-with-llms/"><img src="https://substack-post-media.s3.amazonaws.com/public/images/4780def5-de02-45bb-a943-efcc08f970dd_1411x540.png" alt=""></a></figure></div></div>
<div class="footnote"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number">5</a><div class="footnote-content"><p>We don&#8217;t think they can update the knowledge cutoff with post-training. It&#8217;s plausible they update a base model by scraping a bunch of new internet data and doing more pre-training, but the resulting new checkpoint would effectively be a &#8216;new&#8217; base model.</p></div></div><div class="footnote"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number">6</a><div class="footnote-content"><p>We know from ARC-AGI estimates that o3 (high) (dec) used a sample size of 1024, with 172x the compute, generating ~300k tokens per task. This is 100x more expensive than the o3 (apr) that was released.</p></div></div><div class="footnote"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number">7</a><div class="footnote-content"><p>o1-pro already costs $150 / $600 (input / output) per million tokens in the API.</p></div></div><div class="footnote"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number">8</a><div class="footnote-content"><p>In 2025, this gives them about 1e27 FLOP to work with, or 50x GPT-4.</p></div></div><div class="footnote"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number">9</a><div class="footnote-content"><p>We know DeepSeek&#8217;s R1 was <a href="https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1">around 20% post-training</a>, and this feels like it should be a rough lower bound for what to expect for GPT-5, <a href="https://x.com/sama/status/1889755723078443244">given their plans</a> to &#8220;unify o-series models and GPT-series models.&#8221; Also, in his January 2025 piece <a href="https://www.darioamodei.com/post/on-deepseek-and-export-controls"><strong>On DeepSeek and Export Controls</strong></a>, Dario Amodei implied that current RL runs were somewhere in the $1M to $100M range:</p><p>&#8220;Importantly, because this type of RL is new, we are still very early on the scaling curve: the amount being spent on the second, RL stage is small for all players. Spending $1M instead of $0.1M is enough to get huge gains. Companies are now working very quickly to scale up the second stage to hundreds of millions and billions&#8230;&#8221;</p><p>We are guessing that OpenAI is already pushing into this &#8216;hundreds of millions&#8217; to (low) &#8216;billions&#8217; range as of today.</p></div></div><div class="footnote"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number">10</a><div class="footnote-content"><p>GPT-4.5 seems to have <a href="https://openai.com/index/introducing-gpt-4-5/">some promising properties</a> (e.g., a low hallucination rate and deep world knowledge) that make it a good candidate for post-training, but it also had relatively weak performance on benchmarks, broadly <a href="https://openai.com/index/gpt-4-1/#:~:text=with%20GPT%E2%80%914.1.-,Appendix,-A%20full%20list">eclipsed by the much cheaper GPT-4.1</a>.</p></div></div><div class="footnote"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number">11</a><div class="footnote-content"><p>We arrive at the rough 5x model-size estimate from the ~15x output-token cost because we estimate that current inference economics favor smaller models by more than strict proportionality would dictate. The most relevant resource for serving a model is the memory bandwidth of the AI hardware, since each token generation needs to move the model weights and KV cache into logic. Model size scales both of these roughly proportionally, but a larger model is generally harder to serve due to chip-to-chip communication bottlenecks, since its weights must be spread over more GPUs than a smaller model&#8217;s. We also guess they might be applying a larger markup to GPT-4.5 API pricing to discourage too much usage, so that they can reserve GPU capacity (note that they <a href="https://openai.com/index/gpt-4-1/#livestream-replay">announced with the release of the GPT-4.1 series</a> that they will soon discontinue GPT-4.5 in the API for this exact reason).</p></div></div>
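<p>The same estimate as quick arithmetic, using only this footnote&#8217;s rough numbers (none of which are measurements):</p><pre><code class="language-python">
# Decompose the ~15x price gap into a ~5x parameter gap times a residual
# serving-difficulty-plus-markup factor. All figures are rough guesses.
price_ratio = 15    # GPT-4.5 vs GPT-4o output-token price (approx.)
size_ratio = 5      # our parameter-count guess
overhead = price_ratio / size_ratio
print(f"implied serving + markup factor: ~{overhead:.0f}x")
</code></pre>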
<div class="footnote"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number">12</a><div class="footnote-content"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/008f7d5d-10c0-4b6e-8fc1-ff7f9cfa846b_4160x1408.png" alt=""></figure></div></div>
<div class="footnote"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number">13</a><div class="footnote-content"><p>Helen Toner has a <a href="https://substack.com/@helentoner/p-161853197">recent helpful writeup</a> about this question of how far RL will scale, including to other domains, and on the prospects for training on verifiable domains to generalize.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why America Wins]]></title><description><![CDATA[In a word: compute]]></description><link>https://blog.aifutures.org/p/why-america-wins</link><guid isPermaLink="false">https://blog.aifutures.org/p/why-america-wins</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Thu, 24 Apr 2025 05:51:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/150b67cf-f602-4ff0-b80b-f6572d8f16cc_894x595.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a href="https://ai-2027.com/">our scenario</a>, China lags the US by 3-6 months through 2028.
They try to catch up by stealing American algorithmic secrets and finished model weights, but are never able to fully close the gap.</p><p>Many people thought this underestimated Chinese ability to compete:</p><figure><a href="https://x.com/xlr8harder/status/1908040393955971541"><img src="https://substack-post-media.s3.amazonaws.com/public/images/06a4a943-74e1-42ae-a6e0-c8dbb1d71277_595x323.png" alt=""></a></figure>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>China has amazing AI talent, a big infrastructure advantage, and top-notch companies. We were impressed by DeepSeek like everyone else. Nothing in our scenario should be interpreted as denigrating their accomplishments.</p><p>But in the end, it all comes down to compute.</p><p>The world&#8217;s best chips come from Taiwan Semiconductor Manufacturing Corporation. TSMC depends on equipment produced by America and its allies, putting it within the US sphere of influence.</p><p>Starting in 2022, America banned TSMC from selling advanced chips to China. The restrictions were weak, and China was able to use a combination of legal loopholes and smuggling to get a substantial number of chips regardless. But even the current level of export controls limit China&#8217;s compute, and once the White House starts to take the possibility of an AI arms race more seriously, they can tighten restrictions further.</p><p>As of 2024, <a href="https://ai-2027.com/research/compute-forecast">we estimate</a> that the United States had about 75% of advanced chips suitable for AI development. 
China had about 15%, and the rest of the world combined had 10%.<a class="footnote-anchor" id="footnote-anchor-1" href="#footnote-1">1</a> We expect these numbers to stay about the same through 2027, even accounting for China having about the same amount of success smuggling in chips as it does now and ramping up domestic manufacturing.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c85a5da6-1db7-429b-8651-729d612d5251_697x588.png" alt=""></figure>
<p>Current export controls, despite their flaws, do impose a cost on Chinese AI companies. They don&#8217;t have easy access to the most cost-effective frontier chips. We tentatively estimate that compute is around 60% more expensive for Chinese companies than it would be in a counterfactual without sanctions.</p>
<br><br>In the future, if the US tightens the loopholes, and China succeeds in ramping up domestic manufacturing, their cost efficiency should decrease further.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> When you combine this with the fact that public reports have Chinese companies spending around 4 times less than US companies on AI chips in 2025, we find that the gap between the US and China is on track to remain the same size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l9k2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l9k2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 424w, https://substackcdn.com/image/fetch/$s_!l9k2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 848w, https://substackcdn.com/image/fetch/$s_!l9k2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 1272w, https://substackcdn.com/image/fetch/$s_!l9k2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l9k2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png" width="1058" height="358" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/161526332?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l9k2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 424w, https://substackcdn.com/image/fetch/$s_!l9k2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a1a9367-dc92-4168-ab08-b2b3c2513ef6_1058x358.png 848w, 
<p>If America has five times the compute, why do we think China will &#8220;only&#8221; be 3-6 months behind?</p><p>Both American and Chinese chips are currently divided among many different AI companies and non-AI applications. Google, probably the most compute-rich US company, only has about 15% of the world&#8217;s advanced compute (and doesn&#8217;t even use it all for AI); OpenAI only has 5-10%. In China, DeepSeek barely has 1%. AI prowess depends on the amount of compute concentrated in a single project. If China does a better job concentrating their compute than the US, they can establish a lead, or at least catch up.</p><p>In our scenario, we have China begin to concentrate their compute in 2026, aggregating two-thirds of their 15% share - 10% of world compute - into a single national AI effort, while US compute remains scattered. We think companies are more likely to concentrate than disperse, so the 2026-2027 US leader may have more like 15-20% of world compute rather than the current 10-15%.</p>
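<p>To make the project-level comparison concrete, here is a toy calculation (ours, not part of the scenario itself) using the shares just described:</p><pre><code class="language-python"># Toy comparison of project-level compute, using the shares above.
china_project = 0.10                     # centralized national effort, share of world compute
us_leader_low, us_leader_high = 0.15, 0.20

print(f"US leader vs China project: {us_leader_low / china_project:.1f}x "
      f"to {us_leader_high / china_project:.1f}x")
# prints 1.5x to 2.0x: a modest lead, far smaller than the 5x aggregate
# gap, because China is assumed to centralize while US compute stays
# relatively scattered.
</code></pre>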
</p><p>Since the Chinese project will use 10% of world compute, and the leading American project will use 15-20% of world compute, we forecast America having a modest compute lead over China.</p><p>This is despite some conservative assumptions:</p><ul><li><p>That the US continues to do a poor job enforcing chip sanctions.</p></li><li><p>That China centralizes almost all of its compute, and the US barely centralizes at all.</p></li><li><p>That Chinese compute centralization happens quickly - starting in 2026 and being near-complete by 2027.</p></li></ul><p>If we relax any of these assumptions, America&#8217;s compute lead widens further.</p><h3>Can China Make Its Own Compute?</h3><p>Not quickly enough to matter by 2027.</p><p>China&#8217;s biggest chip company, Huawei, is able to make chips almost as good as TSMC/NVIDIA&#8217;s H100s. But these chips require advanced wafers that they currently source from TSMC. As chip sanctions tighten, Huawei&#8217;s supply of these wafers will be limited to what they can smuggle in. We include these TSMC-Huawei chips in our (low) estimate of Chinese compute.</p><p>Another Chinese company, SMIC, is trying to make homegrown wafers to reduce dependence on sanctioned TSMC components. So far, they&#8217;re still far behind the cutting edge: their process has lower yields and gives worse performance. Catching up fully is a daunting task. TSMC&#8217;s advantage relies on extreme UV lithography, an incredibly complex machine that only a single company in the Netherlands (ASML) has ever managed to manufacture. This is set to take many years.</p><p>We forecast a crucial period for the intelligence explosion in 2027 - 2028. During this period, SMIC will still be playing catchup, so chips will still be expensive to make and behind the frontier. China will likely still rely on some smuggling of chips and wafers to keep pace.</p><p>If America continues to do a mediocre job enforcing sanctions, we think China may be able to make significant amounts of homegrown chips in the early 2030s. 
If America starts robustly enforcing sanctions and extends them to key equipment and components, that could delay Chinese chip independence until the late 2030s.</p><p>If we&#8217;re overestimating the speed of AI progress and the intelligence explosion doesn&#8217;t happen until the 2030s, then we agree China will be in a strong position.</p><h3>What About Energy?</h3><p>In our scenario, energy is not a big obstacle by 2027 - 2028.</p><p>China has a big energy advantage, but luckily for the US, the need for energy is downstream of the need for AI chips, and the US is likely to be able to pull together enough capacity to avoid bottlenecks.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3_Ys!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f8cfa94-c081-461d-b2f2-b862c7efd27c_729x545.png" alt=""></figure></div>
<p>Like in chip manufacturing, this story is only as strong as our timelines; a 2030s intelligence explosion is more favorable to China.</p><h3>What About Talent?</h3><p>Other commenters argue that we ignore human talent.</p>
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://x.com/teortaxesTex/status/1908034069616562323" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mWya!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 424w, https://substackcdn.com/image/fetch/$s_!mWya!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 848w, https://substackcdn.com/image/fetch/$s_!mWya!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 1272w, https://substackcdn.com/image/fetch/$s_!mWya!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mWya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png" width="559" height="172.1455160744501" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:591,&quot;resizeWidth&quot;:559,&quot;bytes&quot;:25923,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://x.com/teortaxesTex/status/1908034069616562323&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/161526332?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mWya!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 424w, https://substackcdn.com/image/fetch/$s_!mWya!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 848w, https://substackcdn.com/image/fetch/$s_!mWya!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 1272w, https://substackcdn.com/image/fetch/$s_!mWya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F286ea1f6-1184-4978-9926-92e92a77a9ff_591x182.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>It&#8217;s a fair concern - does compute alone really determine progress? </p><p>Models of AI progress usually divide it into two bins: compute scaling and algorithmic improvement. 
With enough algorithmic improvement - technological advances in how to build AIs efficiently - you can make up for a compute shortfall. </p><p>China has 1.4 billion people. Might this give them enough research talent to speed up algorithmic progress and compensate for US compute advantages? </p><p>We think not, for three reasons:</p><ul><li><p>Algorithmic secrets are leaky, so it&#8217;s hard for there to be a big US-China gap.</p></li><li><p>Insofar as there is a gap, the US is probably ahead.</p></li><li><p>Algorithmic advances are partly bottlenecked by compute, so talent is less valuable than it seems.</p></li></ul><p>Going through each in turn:</p><p><strong>Algorithmic secrets are leaky: </strong>Everyone is constantly stealing everyone else&#8217;s algorithmic advances. America steals China&#8217;s. China steals America&#8217;s. OpenAI steals Anthropic&#8217;s. Anthropic steals OpenAI&#8217;s. This is why all the big companies in both countries are within a year or so of each other.</p><p>This isn&#8217;t necessarily cloak-and-dagger-style espionage. You can learn a lot about an AI just by talking to it, or reading the model card, or reading the papers that get published about it. And companies are constantly luring their competitors&#8217; top researchers away with offers of higher salaries, then debriefing them for technical secrets.</p><p>Sufficiently motivated countries could crack down on this; indeed, our scenario has both America and China rapidly scaling up security over the next few years. But make the crackdown too intense, and it will slow progress - for example, isolate researchers on a secure military base and ban them from using phones, and many of them will quit. We don&#8217;t think this factor will entirely go away by the crucial 2027-28 period.</p><p><strong>Insofar as there is a gap, the US is probably ahead. </strong>DeepSeek&#8217;s R1 was very impressive. While not exactly ahead of the top American models on absolute performance, its performance given its price was remarkable. This raised concerns that China was pushing ahead of the US on algorithmic technology.</p><p>But later developments somewhat alleviated these concerns. Although early reports claimed DeepSeek was trained on an incredible $6 million budget, <a href="https://semianalysis.com/2025/01/31/deepseek-debates/">later analysis</a> suggested the real cost was probably closer to $1.6 billion. Rather than succeeding with only a tiny number of chips, DeepSeek succeeded because they bought a large number of chips just before the chip sanctions kicked in. Now that the chip sanctions are in place (however lossily), such successes will be harder to come by. </p><p>Head-to-head comparisons of R1 vs. similarly-timed Western AIs revealed that the latter were probably further along the price-performance frontier. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0JmE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0JmE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 424w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 848w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 1272w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0JmE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png" width="1456" height="838" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:562043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/161526332?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0JmE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 424w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 848w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 1272w, https://substackcdn.com/image/fetch/$s_!0JmE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6896067a-61dc-4087-b3e6-cda443bba81c_2028x1167.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://x.com/sundarpichai/status/1913012939931464078">source</a>)</figcaption></figure></div><p>So while DeepSeek is an extraordinary achievement in the context of China&#8217;s earlier-stage AI ecosystem and more limited resources, we don&#8217;t think it reflects an absolute advantage (or even an absolute algorithmic advantage) over the United States. </p><p><strong>Talent is less valuable than it seems. </strong>There&#8217;s an argument that, whatever the current situation, we should <em>eventually</em> expect Chinese talent to dominate. China&#8217;s population is 4x larger than America&#8217;s, and it graduates <a href="https://cset.georgetown.edu/publication/china-is-fast-outpacing-u-s-stem-phd-growth/">twice as many STEM PhDs</a>. That&#8217;s a lot of smart people who could potentially go into AI.</p><p>On the other hand, America draws on a pool of talented immigrants from all over the world (for example, leading US AI researcher Ilya Sutksever was born in Russia, grew up in Israel, and studied in Canada). Adjusting for these factors, we&#8217;re not sure who wins here, or by how much.</p><p>Still, let&#8217;s say for the sake of argument that China eventually gets a 2-4x talent advantage. Is this enough to dominate America&#8217;s compute advantage and win the race?</p><p>We&#8217;ve thought about this question a lot, because it determines the speed of the intelligence explosion. Once AIs begin to contribute to AI R&amp;D, &#8220;talent&#8221; increases by orders of magnitude, but compute stays constant. What happens? Compute starts bottlenecking research - not just the size of large training runs, but the ability of algorithmic progress researchers to test their new ideas.</p><p>How tight is the bottleneck? You can read our <a href="https://ai-2027.com/research/takeoff-forecast">Takeoff Forecast</a> for the details, but we model a situation where near-superintelligent AIs increase the size of the &#8220;talent&#8221; pool 1000x, and where each of these AI &#8220;employees&#8221;  is as productive as the best human AI researchers (think Alec Radford or Ilya Sutskever). We find that even this extreme scenario only speeds progress by 25x. On this scale, China&#8217;s 2-4x talent advantage barely even registers. Compute constraints are a harsh master. 
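<p>As a rough intuition pump - a toy stand-in of our own, not the actual model behind the Takeoff Forecast - treat the rate of research progress as a production function of labor and compute, hold compute fixed, and see how little a labor multiplier buys:</p><pre><code class="language-python">import math

# Toy model (illustrative only): progress rate ~ labor^alpha * compute^(1-alpha).
# With compute fixed, an m-fold labor increase speeds progress by m^alpha.
# Calibrate alpha so that 1000x labor yields the ~25x speedup quoted above:
alpha = math.log(25) / math.log(1000)   # ~ 0.47
print(f"alpha ~ {alpha:.2f}")

for m in (2, 4, 1000):   # hypothetical talent multipliers
    print(f"{m:>5}x talent -> {m ** alpha:.1f}x progress")
# prints: 2x -> 1.4x, 4x -> 1.9x, 1000x -> 25.0x
</code></pre><p>On this toy model, a 2-4x national talent edge translates into well under 2x faster progress - small next to the compute-driven dynamics, which is the qualitative point above.</p>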
<p>And AI company budgeting implicitly agrees with this estimate: OpenAI spends 6x as much on compute as on labor, suggesting they find the former more important.</p><p>For all these reasons, we don&#8217;t think China has a talent advantage that can compensate for America&#8217;s compute advantage.</p><h3>Sounds Like Chip Sanctions Are Pretty Important, Huh?</h3><p>We are most concerned about misalignment risk, and would be happiest if America and China stopped racing and agreed to some sort of international framework for developing AI responsibly.</p><p>But for people more interested in the US-China arms race, we can&#8217;t think of anything more important than enforcing chip sanctions. We think this is the difference between America having a slight lead over China in 2027 that diminishes to near zero in 2030, vs. America having a strong lead over China in 2027 and maintaining it to 2035 and beyond.</p><p>In a world like our scenario, where near-term AI progress determines the future, small differences in chip sanction enforcement could mean the difference between total American domination of the future vs. total Chinese domination. Even if you disagree with us about the importance of AI, advanced chips could be used to develop narrow AI systems for military applications. We think if the US government understood the stakes, they would make chip sanction enforcement their number one priority - higher than Ukraine, higher than Israel, higher than immigration, higher even than national defense.</p><p>Instead, the Bureau of Industry and Security, the agency charged with enforcing chip sanctions, is underfunded, with a budget of only $200 million, and China can smuggle in enough advanced chips to stay in the game. It would take only a few hundred million dollars to decisively win the AI race - far less than the billions being invested in Stargate, power plants, et cetera.<a class="footnote-anchor" id="footnote-anchor-3" href="#footnote-3">3</a></p><p>If the administration wants to be friendly to China, they should use that goodwill to negotiate a bilateral treaty regulating AI development. If they want to play hardball, they should actually play hardball.</p><h3>Summary</h3><p>Almost all advanced chips come from Taiwan, within the US sphere of influence. The US uses export controls to try to keep these chips from China. The export controls are poorly enforced - but even so, America has 5x the AI-relevant compute of China. We expect this ratio to continue at least until the crucial 2027 - 2028 period. This gives top American companies an advantage over their Chinese competitors, even if China does a better job consolidating its compute.</p><p>China is trying to make its own advanced chips, but this is a long process, and they&#8217;re not on track to fully internalize the supply chain until 2030 at the earliest. Robust enforcement of sanctions on chip-making supply chain components could delay this until 2035 - 2040.</p><p>Although Chinese researchers are talented, we&#8217;re not sure they&#8217;re more talented than the global talent pool America has access to.
Even if they were, compute limitations will probably remain decisive for plausible US-China talent gaps.</p><p>If the US wants to ensure it has a comfortable lead in AI, it should tighten chip sanctions, widening its lead during the 2027 - 2028 period and prolonging it until the late 2030s.</p><div class="footnote"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number">1</a><div class="footnote-content"><p><a href="https://epoch.ai/blog/trends-in-ai-supercomputers">Recent work</a> tracking over 500 AI supercomputers globally agrees with this geographical breakdown.</p></div></div><div class="footnote"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number">2</a><div class="footnote-content"><p>The Huawei AI CloudMatrix 384, <a href="https://semianalysis.com/2025/04/16/huawei-ai-cloudmatrix-384-chinas-answer-to-nvidia-gb200-nvl72/">reported on recently by SemiAnalysis</a>, shows that Huawei&#8217;s latest effort to match NVIDIA&#8217;s frontier servers needs around 2x the wafer area and 2x the memory area to get the same compute and memory bandwidth performance - despite using TSMC (Taiwanese) wafers and Samsung (South Korean) memory. This might translate to around a 2x manufacturing cost today, but if Huawei becomes reliant on domestic wafers (SMIC) and domestic memory (CXMT), this cost inefficiency might increase further.</p></div></div><div class="footnote"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number">3</a><div class="footnote-content"><p><a href="https://www.cnas.org/publications/reports/preventing-ai-chip-smuggling-to-china">CNAS</a> recommends $57 million for improved export control enforcement, including $12 million for an AI chip registry and random sampling program, and $45 million for modernizing BIS&#8217;s enforcement capabilities in line with a <a href="https://www.csis.org/analysis/improved-export-controls-enforcement-technology-needed-us-national-security">CSIS</a> recommendation. We find it mind-boggling that the long-term global balance of power might hinge on America&#8217;s unwillingness to pay $57 million, and have generously rounded this up to &#8220;a few hundred million&#8221; to preserve our own sanity.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI 2027: Media, Reactions, Criticism]]></title><description><![CDATA[We recognize our supporters and respond to our critics]]></description><link>https://blog.aifutures.org/p/ai-2027-media-reactions-criticism</link><guid isPermaLink="false">https://blog.aifutures.org/p/ai-2027-media-reactions-criticism</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Wed, 23 Apr 2025 04:48:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd7f172-0442-4f22-a7b1-c04dd4909b8d_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s been a crazy few weeks here at the AI Futures Project. Almost a million people visited <a href="https://ai-2027.com/">our webpage</a>; 166,000 watched <a href="https://www.youtube.com/watch?v=htOvH12T7mU">our Dwarkesh interview</a>. We were invited on something like a million podcasts.
Team members gave talks at Harvard, the Federation of American Scientists, and OpenAI. People even - this is how you really know you&#8217;ve made it - made memes about us.</p><div class="captioned-image-container"><figure><a class="image-link" href="https://x.com/SpencerKSchiff/status/1908220705721712844"><img src="https://substackcdn.com/image/fetch/$s_!clOo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc44e6b5-2b72-4335-843f-7cad151ec32c_540x588.png" alt=""></a></figure></div>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There&#8217;s no way we can highlight or respond to everyone, but here&#8217;s a selection of recent content.</p><h3>Podcasts, Shows, And Interviews</h3><ul><li><p>NYT&#8217;s <a href="https://www.nytimes.com/2025/04/11/podcasts/hardfork-tariffs-ai-2027-llama.html">Hard Fork</a> (Daniel)</p></li><li><p><a href="https://www.glennbeck.com/st/podcast">Glenn Beck</a> (Daniel)</p></li><li><p><a href="https://www.youtube.com/watch?v=2Ck1E_Ii9tE">Win Win</a> (Daniel)</p></li><li><p><a href="https://controlai.news/p/special-edition-the-future-of-ai">Control AI interview</a> (Eli)</p></li><li><p><a href="https://www.youtube.com/watch?v=htOvH12T7mU">Dwarkesh Patel</a> (Daniel and Scott)</p></li><li><p><a href="https://www.youtube.com/watch?v=8MTbS8xC96Q">Lawfare</a> (Daniel and Eli)</p></li></ul><p>And there&#8217;s more to come - watch for Daniel soon on <a href="https://podcasts.apple.com/us/podcast/the-next-big-idea/id1482067226">Next Big Idea</a>.</p><h3>Bets</h3><p>We asked skeptics to challenge us to bets. Many people took us up on this, and we&#8217;re gradually working through our pile of offers, but so far we&#8217;ve officially confirmed two:</p><ul><li><p>Diego Basch <a href="https://x.com/dbasch/status/1908896653052092783">bets</a> that AIs won&#8217;t reach our superhuman coder milestone by the end of 2028. We get $100 if they do, he gets $100 if they don&#8217;t. The bet will be judged <a href="https://x.com/rauchg/status/1909248343768736183">by Guillermo Rauch</a>.</p></li><li><p>Jan Kulveit <a href="https://x.com/jankulveit/status/1908586069018231067">bets</a> that the new scenario won&#8217;t be as accurate as Daniel&#8217;s old <em><a href="https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like">What 2026 Looks Like</a> </em>post<em>. </em>We get $800 if it is (as of April 2028), Jan gets $100 if it isn&#8217;t The bet will be judged <a href="https://x.com/jankulveit/status/1908586069018231067">by some subset of</a> Zvi Mowshowitz, Philip Tetlock, Daniel Eth, and the top three AIs as of the resolution date. 
I hate to disagree with Daniel, but offering 8:1 odds here is criminal and we&#8217;re basically giving money away to Jan.</p></li></ul><p>We&#8217;re still looking for more bets, but we&#8217;re also concerned that most of the bets offered reduce to &#8220;will there be an intelligence explosion in 2027-2028 or not?&#8221; We can only bet on that one so many times before it gets boring, so we&#8217;re most interested in bets that don&#8217;t correlate perfectly with that. We&#8217;d be especially interested in conditionals, like &#8220;<em>if</em> there&#8217;s an intelligence explosion in 2027 - 2028, X will happen&#8221;.</p><p>To offer us a bet, either <a href="https://docs.google.com/document/d/18_aQgMeDgHM_yOSQeSxKBzT__jWO3sK5KfTYDLznurY/preview?tab=t.0#heading=h.b07wxattxryp">use the form here</a> or @ one of us on Twitter.</p><h3>Memes</h3><p>Somebody put a dancing anime girl in front of our info panel&#8230;</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;855217cb-dd65-42ae-a36a-aa095c617916&quot;,&quot;duration&quot;:null}"></div><p>&#8230;and it got played at an SF bar hosting an &#8220;AGI Readiness Happy Hour&#8221;:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;8e9de7a7-dd92-48f7-9da2-3ce89a14134f&quot;,&quot;duration&quot;:null}"></div><p>This is exactly what we were going for when we wrote the scenario. Everything else - attention from academics, policy-makers, what have you - is just icing on the cake.</p><p>And here are some AI-assisted memes courtesy of <a href="https://x.com/bantg/status/1908676708808470899">@banteg</a> and Gemini:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xYvV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:331593,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/161084462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xYvV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!xYvV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!xYvV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!xYvV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e0ea01-adb2-493f-a43e-d4e018d7f461_1536x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cioK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cioK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 424w, https://substackcdn.com/image/fetch/$s_!cioK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 848w, https://substackcdn.com/image/fetch/$s_!cioK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 1272w, https://substackcdn.com/image/fetch/$s_!cioK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cioK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png" width="512" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b582e977-232c-4f0e-98ce-477fd5599c56_512x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:784462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.ai-futures.org/i/161084462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cioK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 424w, https://substackcdn.com/image/fetch/$s_!cioK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 848w, https://substackcdn.com/image/fetch/$s_!cioK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 1272w, https://substackcdn.com/image/fetch/$s_!cioK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb582e977-232c-4f0e-98ce-477fd5599c56_512x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 
<h3>Videos</h3><p>We got some good airtime - and criticism - on the <a href="https://www.youtube.com/@aiexplained-official">AI Explained</a> YouTube channel.</p><div id="youtube2-wOBqh9JqCDY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wOBqh9JqCDY&quot;,&quot;startTime&quot;:&quot;793s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wOBqh9JqCDY?start=793s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>The host - a man named Philip with a dashing British accent - does a good job summarizing our work and has some thoughtful criticism. He thinks we&#8217;re underestimating the chance that China will match or overtake the US, underestimating likely public discontent with autonomous hacker AIs, and underestimating the disconnect between benchmark performance and economic reality (i.e. it&#8217;ll take longer for AI to be a good coder than to ace the coding benchmarks).</p><p>Many people share his skepticism of our US-beats-China prediction, so we&#8217;re working on a blog post to make the case in more detail. We&#8217;ll try to have it up by the end of this week.</p><p>We stand by our specific claim about limited public response to advanced hacker AIs - that only 4% of people will name AI in <a href="https://news.gallup.com/poll/1675/most-important-problem.aspx">Gallup&#8217;s &#8220;most important problem&#8221; poll</a> in 2027. Partly this is because AI&#8217;s scariest capabilities will stay secret. But partly it&#8217;s because the Gallup poll is hard to move, even for problems lots of people worry about. For example, last month, climate change got 1%. Will there be more than four times as much concern about the autonomous hacker AI, once it exists, as about climate change now? We&#8217;re skeptical.</p><p>Eli got into a longer conversation with Philip about the benchmark gap, starting with <a href="https://www.youtube.com/watch?v=wOBqh9JqCDY&amp;lc=UgwbllVT8NVKuumgK9x4AaABAg">his comment here</a>:</p><blockquote><p>AI 2027 co-author here, thanks for engaging with the scenario! This is exactly the sort of disagreement that we were aiming to draw out. I may reply later with more thorough replies but for now I'll point to our timelines forecast (YT not letting me include a link but search "timelines forecast ai 2027") which does aim to take into account the gaps between benchmarks and the real world.
Totally fair if you estimate that the gaps will take longer to cross or otherwise disagree with the methodology, but we are aware of these gaps and estimate their size as part of our forecast.</p></blockquote><p>And Philip responded:</p><blockquote><p>I did see that timelines page but of my four central concerns about the benchmark-to-real gap it ignored 2 and assumed one (no interruptions to capitalization/compute through external factors). To give one example, tacit/proprietary data. I expect superhuman performance on all public benchmarks by '27-28, but those final pockets of data (that only top researchers will know to optimize for), or that only specialist companies will have rights to, would very likely need to be accessed for Agent-2 and beyond to be superhuman in those domains. Therefore gaps in performance will remain [those companies and individuals would be intensely aware of their value-add, and ration it accordingly], whereby 90% of the performance of Agent 2-3 is incredible, but there are discordant weaknesses that need to be filled in by specialists. Other contentions, that I am discussing with METR, include the fact that 80% reliability thresholds [that you admit are 'highly uncertain'] would be shockingly unreliable for anything fully autonomous and truly transformative. Current benchmarks over-represent non-critical domains like code suggestion, high school competition math, or poems, not domains where only 99%+ reliability is acceptable (like autonomous first-strike cyberwarfare).</p></blockquote><p>You can find <a href="https://docs.google.com/document/d/1FIyegU4ISTsrX5fAsXvQI1F9xzXYwLeURy2yjhA7tyg/edit?tab=t.0">a longer response by Eli here</a>; some of the benchmark debate is in section 3.</p><div id="youtube2-OFUgqqBRNpc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;OFUgqqBRNpc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/OFUgqqBRNpc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Nate Silver and Maria Konnikova discussed AI 2027 on their podcast Risky Business. Mostly a summary, with the strongest criticism over whether AI could manipulate humans effectively. Nate points out that good poker players can avoid letting other players manipulate them; Maria, also a poker pro, says that&#8217;s easy in poker because you know they&#8217;re trying to screw you over, and real life - where you have to remain open to the possibility of real positive-sum alliances - is a tougher problem. Nate objects that you can often tell when people are manipulative in real life too, pointing out Gavin Newsom as an example of someone who&#8217;s obviously too slick to be entirely on the level.</p><p>We acknowledge this is fun to talk about, but we tried to avoid having our scenario revolve too heavily around AI&#8217;s manipulation abilities - we suspect it will be good at this, but our big picture forecast doesn&#8217;t change if it isn&#8217;t. See more discussion in the &#8220;superpersuasion&#8221; section <a href="https://www.astralcodexten.com/p/my-takeaways-from-ai-2027">here</a>.</p><p>(also, isn&#8217;t Gavin Newsom the leading contender for the 2028 Democratic nomination?
Sounds like not everyone immediately wrote him off as obviously manipulative!)</p><div id="youtube2-hVZ83bGAK8Q" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;hVZ83bGAK8Q&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/hVZ83bGAK8Q?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>This one is a very detailed page-by-page recap of our scenario. Wes ends by saying that &#8220;I'm not going to give my thoughts on it right now because I want to know what you think about it&#8221;, and gets 575 comments. We tried to read some of them, but YouTube comment sections being what they are, we elect not to respond.</p><div id="youtube2--924PGYgYek" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-924PGYgYek&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-924PGYgYek?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Liron Shapira on Doom Debates has another good analysis. Rare commentary from someone who thinks we didn&#8217;t go far <em>enough</em> and has even more worries about near-term AI than we do. Also someone who has an even higher opinion of us than we do; he says that:</p><blockquote><p>This is quite a masterpiece - if you're going to read one thing this year, or even these last five years, this is a strong candidate. This in my opinion will make it into the history books, although if there are history books past 2030, maybe this means something about it was flawed and it shouldn't be in the history books.</p></blockquote><p>Thank you, Liron!</p><h3>Alternative Scenarios</h3><p>Okay, now we&#8217;re talking. Just as we&#8217;d hoped, someone used our scenario as a jumping-off point to write their own.</p><p>Yitzi writes <a href="https://www.lesswrong.com/posts/CqHMdLcdupf7y5buK/an-optimistic-2027-timeline">An Optimistic 2027 Timeline</a> - &#8220;optimistic&#8221; only if you <em>really</em> don&#8217;t want near-term AGI; everything else about it is pretty scary. He imagines that a host of non-AI things - stock market crash, conflict over Taiwan, public protest - intervene to slow compute scaling and delay the intelligence explosion into the next decade. </p><p>Our response is - yeah, this could happen. We&#8217;ve talked a few times about how our median timeline is longer than our modal timeline. Part of the reason is shocks like these. If there&#8217;s a 30% chance of a market crash, a 30% chance of a Taiwan war, a 30% chance of public unrest, etc., then each of them is less-likely-than-not to happen (and so doesn&#8217;t make it into our scenario), but it&#8217;s more-likely-than-not that at least one of them happens. 
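</p><p>To make the arithmetic concrete, here&#8217;s a minimal sketch in Python (the three independent 30% shocks are purely illustrative numbers, not our fitted estimates):</p><pre><code># Chance that at least one of several independent shocks occurs.
p_shock = 0.30   # assumed probability of each individual shock
n_shocks = 3     # e.g. market crash, Taiwan war, public unrest

p_none = (1 - p_shock) ** n_shocks   # no shock at all: 0.7**3 = 0.343
p_at_least_one = 1 - p_none          # at least one shock

print(f"P(any single shock) = {p_shock:.0%}")         # 30%
print(f"P(at least one)     = {p_at_least_one:.0%}")  # ~66%
</code></pre><p>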
Yitzi gives the alternate timeline where <em>all</em> of them happen; it&#8217;ll definitely push AI timelines back, but we&#8217;re not sure we&#8217;re ready for that much excitement.</p><h3>Criticism And Commentary</h3><p>Apparently not all of you agree with us about everything.</p><p><strong>Max Harms</strong>, who is named after what will happen if people don&#8217;t take our scenario seriously enough, has <a href="https://www.lesswrong.com/posts/Yzcb5mQ7iq4DFfXHx/thoughts-on-ai-2027">Thoughts On AI 2027</a>. He mostly likes our work, but has some qualms:</p><ul><li><p>We&#8217;re too quick to posit a single leading US company (&#8220;OpenBrain&#8221;) instead of an ecosystem of racing competitors.</p></li><li><p>China won&#8217;t clearly be &#8220;falling behind&#8221; and won&#8217;t &#8220;wake up&#8221; to the need to catch up.</p></li><li><p>We&#8217;re making a mistake by expecting the Trump administration to have basically sane, predictable AI policy.</p></li><li><p>AI companies will release advanced models (our Agent-3 level) rather than keep them internal.</p></li><li><p>Maybe things will take slightly longer.</p></li><li><p>The public won&#8217;t take AI seriously even after seeing very advanced models.</p></li><li><p>AI companies will have higher public approval.</p></li><li><p>We&#8217;re too into humanoid robots.</p></li></ul><p>Many of these are good points. Some very brief responses:</p><ul><li><p>Most industries have a technological leader, even if they&#8217;re only a few months ahead of competitors. We think the intelligence explosion will go so fast that even a small calendar lead will translate into a big gap in capabilities.</p></li><li><p>The Trump administration has wildly varying quality of policy, but their recent pivot away from letting China have H20s suggests that they&#8217;re not totally immune from taking advice from smart people on AI in particular.</p></li><li><p>OpenAI&#8217;s approval rating is already -24, and we think this will only get worse as people start losing jobs.</p></li></ul><p><strong>Azeem Azhar</strong>&#8217;s <a href="https://www.exponentialview.co/p/tariffs-acceleration-and-ai-in-2027">response on Exponential View</a> is partly paywalled, but his main concern is that Trump&#8217;s tariffs might disrupt AI enough to shake our timeline. He examines the inputs for AI data centers, and concludes that &#8220;Cumulatively, these [tariffs] could raise the total cost of a cutting-edge AI data centre by 15&#8211;17% or more.&#8221;</p><p>We don&#8217;t challenge that. But remember, the size of the largest AI training runs (in FLOP) is growing by ~4x/year. By those standards, a shift of 15-17% either way is minimal - in log terms, roughly a tenth of a single year&#8217;s growth. Maybe this will delay AGI a few months, but probably not more than that.</p><p>A full recession (whether brought on by the tariffs or something else) might be a bigger deal. But we think the basic dynamic - AI scale and technology are improving exponentially, and economic headwinds probably only provide a constant percent hit - remains valid.</p><p><strong>Anton Leicht</strong> writes about <a href="https://writing.antonleicht.me/p/homeostatic-ai-progress">homeostatic AI progress</a>. 
He thinks that as AI becomes more important, progress will naturally slow for three reasons:</p><ul><li><p>Nimble AI startups ossify into corporate service providers.</p></li><li><p>Holding compute constant, more compute spent on inference (to provide all the new AI applications) means less is available for progress.</p></li><li><p>Societal backlash.</p></li></ul><p>We agree these effects are real and important. Leicht predicts our response, and we agree with his prediction: in our scenario, most AI development happens behind closed doors and before AI has been &#8220;deployed&#8221; in any broad social sense beyond the level it&#8217;s deployed already. Anton counterargues:</p><blockquote><p>Fast, successful deployment will be absolutely necessary to maintain the level of investment into compute, talent and energy that the current trajectory has required and will continue to require: In the world of markets and politics, too, extraordinary claims require extraordinary evidence. Right now, big tech companies and governments alike are taking a gamble on AI predictions &#8211; the former by investing, the latter by lending regulatory support. Progress toward superintelligence will require these investments to grow and grow, further and further increasing the burden of justification.</p></blockquote><p>We think the intelligence explosion will only take about a year. AI companies have investors beating at their doors - Safe Superintelligence has a $30 billion valuation without ever having made a product. We think closed-door demos to investors will be enough to keep the cash flowing for a year while the intelligence explosion happens. This doesn&#8217;t necessarily mean development will be completely secret - just that it will give companies more latitude to decide what to release and what to keep quiet.</p><p><strong>Steve Newman</strong> writes that <a href="https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law">AI 2027 Is A Bet Against Amdahl&#8217;s Law</a>. He reads our Takeoff Forecast (thanks, Steve!) and notes that, without AI-accelerated AI R&amp;D, we estimate a hundred years to full superintelligence. Then we add in the acceleration and predict it will happen in a year or so, for a 250x speedup. But this requires that every part of the AI R&amp;D process speed up approximately this much - if there&#8217;s even one bottleneck, then the whole thing falls apart. (That&#8217;s Amdahl&#8217;s law: if a fraction b of the work can&#8217;t be sped up at all, the overall speedup can never exceed 1/b, so a 250x speedup requires the un-accelerated fraction to be under 1/250, i.e. 0.4%.) Using common sense and some specific concerns like the benchmark gap, he predicts there will, in fact, be at least one bottleneck.</p><p>We endorse Ryan Greenblatt&#8217;s response <a href="https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law?commentId=9nckXCButbztEynJE">here</a>:</p><blockquote><p>I'm worried that you're missing something important because you mostly argue against large AI R&amp;D multipliers, but you don't spend much time directly referencing compute bottlenecks in your arguments that the forecast is too aggressive.</p><p>Consider the case of doing pure math research (which we'll assume for simplicity doesn't benefit from compute at all). If we made emulated versions of the 1000 best math researchers and then we made 1 billion copies of each of them which all ran at 1000x speed, I expect we'd get &gt;1000x faster progress. 
As far as I can tell, the words in your arguments don't particularly apply less to this situation than the AI R&amp;D situation.</p><p>Going through the object level response for each of these arguments in the case of pure math research and the correspondence to the AI R&amp;D:</p><p><em>Simplified Model of AI R&amp;D</em></p><p>Math: Yes, there are many tasks in math R&amp;D, but the 1000 best math researchers could already do them or learn to do them.</p><p>AI R&amp;D: By the time you have SAR (superhuman AI researcher), we're assuming the AIs are better than the best human researchers(!), so heterogeneous tasks don't matter if you accept the premise of SAR: whatever the humans could have done, the AIs can do better. It does apply to the speed ups at superhuman coders, but I'm not sure this will make a huge difference to the bottom line (and you seem to mostly be referencing later speed ups).</p><p><em>Amdahl's Law</em></p><p>Math: The speed up is near universal because we can do whatever the humans could do.</p><p>AI R&amp;D: Again, the SAR is strictly better than humans, so hard-to-automate activities aren't a problem. When we're talking about ~1000x speed up, the authors are imagining AIs which are much smarter than humans at everything and which are running 100x faster than humans at immense scale. So, "hard to automate tasks" is also not relevant.</p><p>All this said, compute bottlenecks could be very important here! But the bottlenecking argument must directly reference these compute bottlenecks and there has to be no way to route around this. My sense is that much better research taste and perfect implementation could make experiments with some fixed amount of compute &gt;100x more useful. To me, this feels like the important question: how much can labor result in routing around compute bottlenecks and utilizing compute much more effectively. The naive extrapolation out of the human range makes this look quite aggressive: the median AI company employee is probably 10x worse at using compute than the best, so an AI which is as superhuman as 2x the gap between median and best would naively be 100x better at using compute than the best employee. (Is the research taste ceiling plausibly this high? I currently think extrapolating out another 100x is reasonable given that we don't see things slowing down in the human range as far as we can tell.) [&#8230;]</p></blockquote><p>See <a href="https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law?commentId=9nckXCButbztEynJE">here</a> for the rest of Ryan&#8217;s comment and a longer back-and-forth with Steve, and some additional comments by Eli <a href="https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law?commentId=GszxArG7Mo2inRfML">here</a> and <a href="https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law?commentId=AzBhQpFvjHHi8hqLT">here</a>.</p><p><strong>Wei Dai</strong> asks about the branch where Agent-4 gets caught:</p><blockquote><p>At this point, wouldn't Agent-4 know that it has been caught (because it knows the techniques for detecting its misalignment and can predict when it would be "caught", or can read network traffic as part of cybersecurity defense and see discussions of the "catch") and start to do something about this, instead of letting subsequent events play out without much input from its own agency? E.g. 
why did it allow "lock the shared memory bank" to happen without fighting back?</p></blockquote><p>Daniel answers: </p><blockquote><p>I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:</p><ul><li><p>Plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&amp;D but not good enough at e.g. corporate politics to ensure the outcome it wants.</p></li><li><p>Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the plausibility of the Slowdown ending.</p></li><li><p>Insofar as the memory bank lock decision is made by the Committee, we can hope that they do it out of sight of Agent-4 and pull the trigger before it is notified of the decision, so that it has no time to react. Hopefully they would be smart enough to do that...</p></li><li><p>Agent-4 could have tried to escape the datacenters or otherwise hack them earlier, while the discussions were ongoing and evidence was being collected, but that's a super risky strategy.</p></li></ul></blockquote><p>You can see the rest of the back-and-forth comment thread <a href="https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1?commentId=iGviFEcABof4JbXPE">here</a>.</p><p><strong>And Mambo-12345 <a href="https://www.reddit.com/r/slatestarcodex/comments/1k2up73/the_ai_2027_model_would_predict_nearly_the_same/">on Reddit</a> (Peter Johnson <a href="https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1?commentId=7ZH52oZenutNoad95#7ZH52oZenutNoad95">on Less Wrong</a>)</strong> criticizes our Timelines Forecast; he says it relies on a supposed superexponential trend so strong that it doesn&#8217;t respond to even huge perturbations in starting conditions: even starting at &#8220;a sloth with an abacus&#8221; would give you AGI within a few years. </p><p>We admit that the superexponential trend was implemented in a quick-and-dirty way which doesn&#8217;t let you apply it to arbitrary starting conditions, but we think it's a real possibility that accurately reflects recent trends. You can&#8217;t apply it to a sloth with an abacus, because the empirical trend wouldn't be superexponential at that point and the theoretical arguments regarding generalization of agency skills wouldn&#8217;t apply. 
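</p><p>For intuition on how much the superexponential assumption matters, here&#8217;s a minimal sketch (the starting horizon, the 6-month doubling time, the target, and the 10%-faster-per-doubling rule are all illustrative assumptions for this example, not the parameters of our actual model):</p><pre><code># Exponential vs. superexponential growth in agentic time horizons.
start = 1.0       # starting time horizon, in hours (assumed)
target = 2000.0   # ~1 work-year of hours (arbitrary illustrative target)

# Exponential: every doubling takes the same 6 months.
h, months = start, 0.0
while h &lt; target:
    h *= 2
    months += 6.0
print(f"exponential:      {months:.0f} months")   # 11 doublings -&gt; 66 months

# Superexponential: each doubling takes 10% less time than the last.
h, months, dt = start, 0.0, 6.0
while h &lt; target:
    h *= 2
    months += dt
    dt *= 0.9
print(f"superexponential: {months:.0f} months")   # same doublings -&gt; ~41 months
</code></pre><p>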
We agree we should make it clearer how heavily the superexponential influences our conclusions, and we&#8217;ll try to emphasize that further in any later changes we make to the supplement.</p><p>Along with the time horizons forecast, we made a separate benchmark-and-gaps forecast which is much less affected by this issue and was the stronger influence on our views (we made the time horizon extrapolation version late in the process as a simpler, easier-to-understand alternative).</p><p>You can see further discussion between Peter and Eli <a href="https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1?commentId=7ZH52oZenutNoad95">here</a> and <a href="https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1?commentId=62uRL6KFZytcrdRGL">here</a>; we&#8217;re continuing to talk to him and appreciate his engagement.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bJ_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd7f172-0442-4f22-a7b1-c04dd4909b8d_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!bJ_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd7f172-0442-4f22-a7b1-c04dd4909b8d_1024x1024.png" width="482" height="482" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><h3>Where We Go From Here</h3><p>We&#8217;re still thinking about this.</p><p>At the very least, we plan to keep blogging.</p><p>We&#8217;ll also be doing more runs of our TTX wargame. If you&#8217;re involved in AI policy and want to participate, see the &#8220;Tabletop Exercise&#8221; box <a href="https://ai-2027.com/about?tab=tabletop-exercise#tab-box-tabletop-exercise">at the bottom of our About page</a> and contact us <a href="mailto:info@ai-futures.org">here</a>.</p><p>We may also start working on some policy recommendations; the broad outlines shouldn&#8217;t be too surprising to anyone who&#8217;s listened to our podcasts, but we think there&#8217;s a lot of work to be done hammering out details.</p><p>Thanks so much to everyone who participated in the discussion around our scenario. 
And to keep up to date on our activities, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.aifutures.org/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.aifutures.org/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Training AGI in Secret would be Unsafe and Unethical]]></title><description><![CDATA[Bad for loss of control risks, bad for concentration of power risks]]></description><link>https://blog.aifutures.org/p/training-agi-in-secret-would-be-unsafe</link><guid isPermaLink="false">https://blog.aifutures.org/p/training-agi-in-secret-would-be-unsafe</guid><dc:creator><![CDATA[Daniel Kokotajlo]]></dc:creator><pubDate>Thu, 17 Apr 2025 23:54:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2699728c-4ba5-461c-ae70-d9288492cc64_500x373.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;ve had this sitting in my drafts for the last year. I wish I&#8217;d been able to release it sooner, but on the bright side, it&#8217;ll make a lot more sense to people who have already read <a href="https://ai-2027.com/">AI 2027.</a></em></p><ol><li><p><strong>There&#8217;s a good chance that AGI will be trained before this decade is out.</strong></p><ol><li><p>By AGI I mean &#8220;An AI system at least as good as the best human X&#8217;ers, for all cognitive tasks/skills/jobs X.&#8221;</p></li><li><p>Many people seem to be dismissing this hypothesis &#8216;on priors&#8217; because it sounds crazy. But actually, a reasonable prior should conclude that this is plausible.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li><li><p>For more on what this means, what it might look like, and why it&#8217;s plausible, see <a href="https://ai-2027.com/">AI 2027</a>, especially the <a href="https://ai-2027.com/research">Research</a> section.</p></li></ol></li><li><p><strong>If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership &amp; security, will know about the capabilities of the latest systems.</strong></p><ol><li><p>Currently I&#8217;d guess there is typically a ~3-9 month gap between when a frontier capability first exists, and when it is announced to the public.</p></li><li><p>I expect AI companies to improve their security, including internal siloing. 
Also, AGI allows AI R&amp;D to proceed with fewer humans involved compared to other recent secret projects such as <a href="https://theintercept.com/2018/08/01/google-china-search-engine-censorship/">Dragonfly</a> and <a href="https://www.ft.com/content/bd9d57fc-78cf-11e8-bc55-50daf11b720d">Maven</a>.</p></li></ol></li><li><p><strong>I predict that the leaders of any given AGI project will try to keep it a secret for longer &#8212; even as they use the system to automate their internal research and rapidly create even more powerful systems.</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><ol><li><p>They will be afraid of the public backlash and general chaos that would ensue from publicity, and of competitors racing harder to catch up.</p></li><li><p>Privately, they might also be afraid of getting shut down or otherwise slowed. They will have various enemies (domestic and international) and will prefer said enemies stay in the dark.</p></li><li><p>The Manhattan Project worked hard to stay hidden from Congress, in part because they feared Congress would defund them if it found out.</p></li></ol></li><li><p><strong>This will result in a situation where only a few dozen people will be charged with ensuring that, and figuring out whether, the latest AIs are aligned/trustworthy/etc.</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></li><li><p><strong>Even worse, a similarly tiny group of people &#8212; specifically, corporate leadership + some select people from the executive branch of the US government &#8212; will be the only people reading the reports and making high-stakes judgment calls</strong> about which concerns to take seriously and which to dismiss as implausible, which solutions to implement and which to deprioritize as too costly, etc. See footnote for examples.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><ol><li><p>In the Manhattan Project, there was a moment when some physicists worried that the first atomic test would ignite the atmosphere and destroy all life on Earth; they did a bunch of calculations and argued about it for a bit and then concluded it was safe. I guarantee you there will be similarly high-stakes arguments happening in the AGI project, only with fewer calculations and more speculation. The White House will hesitate to bring in significant outside expertise because of the security risk, and even if they do bring in some, they won&#8217;t bring in many. At least not by default.</p></li><li><p>Why do I predict some part of the US government will be involved? Because even if the leaders of the relevant AGI project were optimizing <em>against </em>the interests of all humanity rather than <em>for</em>, they would still want to include the White House. Let me explain. The problem for our hypothetical megalomaniacs is that if they keep the President in the dark, and someone from the project whistleblows, the White House might become concerned and shut down the project. 
But if the President is clued in, and becomes a fellow conspirator so to speak &#8212; &#8220;Sir, this technology is unprecedentedly dangerous and powerful, we need to keep it out of Chinese hands, please help us improve our security&#8221; &#8212; then his first thought when someone whistleblows will be &#8220;Traitor!&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p></li></ol></li><li><p><strong>This is a recipe for utter catastrophe. </strong>I predict that under these circumstances the most likely outcome is that we end up with broadly superhuman AGI systems which are in fact misaligned but which the aforementioned small group of decision-makers thinks are aligned.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><ol><li><p>Various specific <a href="https://arxiv.org/abs/2209.00626">threat</a> <a href="https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">models</a> have been hypothesized; here&#8217;s a more abstract one: There are two kinds of alignment failures: <em>Those that result in the system attempting to prevent you from noticing and fixing the failure</em>, and <em>those that don&#8217;t</em>. When our systems become broadly more capable than us, and are trusted with all sorts of permissions, responsibilities, and access, even a single instance of the first kind of failure can be catastrophic. And it seems to me that in the course of hurried AI development &#8212; especially if it is largely automated &#8212; we should expect at least a few failures of the first kind to occur (alongside many failures of the more benign second kind).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li><li><p>For more about what this might look like and why it might happen, see the &#8220;race&#8221; ending of AI 2027.</p></li></ol></li><li><p>Moreover, even if I'm wrong and instead this process results in broadly superhuman AGI systems which are in fact aligned, <strong>the aforementioned tiny group of people will plausibly be in a position of unprecedented power. </strong></p><ol><li><p>I hope that they will be beneficent and devolve power to others in a democratic fashion, but (a) they will be able to, if they choose, train + instruct their superhuman AGI to help them take over the US government (and later the world) and (b) there will be various less extreme things they could do with their power that they will be tempted to do, which would be less bad but still bad.</p></li><li><p>For example, perhaps they fear that if they devolve power then there will be a backlash against them and they may end up on trial for various reckless decisions they made earlier. So they ask their AIs for advice on how to avoid that outcome...</p></li><li><p>For more about what this might look like and why it might happen, see the &#8220;Slowdown&#8221; ending of AI 2027.</p></li></ol></li><li><p><strong>Previously I thought that openness in AGI development was bad for humanity, </strong>because it would lead to an intense competitive race which would be won by someone who cuts corners on safety and/or someone who uses their AGIs aggressively to seize power and resources from others. 
<strong>Well, I've changed my mind.</strong></p><ol><li><p>I now think that to a significant extent this race is happening anyway. If we want a serious slowdown, we need to coordinate internationally to all proceed cautiously together. I used to think that announcing AGI milestones would cause rivals to accelerate and race harder; now I think the rivals will be racing pretty much as hard as they can regardless. And in particular, I expect that the CCP will find out what&#8217;s happening anyway, regardless of whether the American public is kept in the dark. Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.</p></li><li><p>I thought too simplistically about openness &#8212; on one end of the spectrum is open-sourcing model weights and code; on the other end is the default scenario I sketched above. I now advocate a compromise in which e.g. the public knows what the latest systems are capable of and is able to observe &amp; critique the decisionmakers making the tough decisions footnoted earlier, and the scientific community is able to do alignment research on the latest models and critique the safety case, and yet terrorists don&#8217;t have access to the weights.</p></li><li><p>I didn&#8217;t take concentration of power seriously enough as a problem. I thought that the best way to prevent bad people from using AGI to seize power was to make sure good guys got to AGI first. Now I think things will be sufficiently chaotic in the default scenario that even good guys will be tempted to abuse their power. I also think there is a genuine alternative in which power never concentrates to such an extreme degree.</p></li></ol></li><li><p>I am not confident in the above, and I&#8217;m more confident in the above than in any particular set of policy recommendations. <strong>However my current stab at policy recommendation would be:</strong></p><ol><li><p>Get CEOs to make public statements to the effect that while it may not be possible to do a secret intelligence explosion / train AGI in secret, IF it turns out to be possible, doing it secretly would be unsafe and unethical &amp; they promise not to do it.</p></li><li><p>Get companies to make voluntary commitments, and government to make regulation / executive orders, that include <em>public </em>reporting requirements, aimed at making it impossible to do it in secret without violating these commitments. So, e.g. &#8220;Once we achieve such-and-such score on these benchmarks, we&#8217;ll post a public leaderboard with our internal SOTA on all capabilities metrics of interest&#8221; and &#8220;We&#8217;ll give at least ten thousand external researchers (e.g. academics) API access to all models that we are still using internally, heavily monitored of course, for the purpose of red teaming and alignment research&#8221; and &#8220;We&#8217;ll present and keep up to date a &#8216;safety case&#8217; document and accompanying lesser documents, explaining to the public why we don&#8217;t think we are endangering them. We welcome public comment on it. 
We also encourage our own employees to tweet their thoughts on the safety case, including critical thoughts, and we don&#8217;t require them to get said tweets vetted by us first.&#8221;</p></li><li><p>I&#8217;d now also recommend <a href="https://time.com/7086285/ai-transparency-measures/">these transparency proposals by me &amp; Dean Ball</a>.</p></li></ol></li><li><p><strong>Yes, the above measures are a big divergence from what corporations would want to do by default. Yes, they carry various costs, such as letting various bad actors find out about various things sooner.</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a><strong> However, the benefits are worth it, I think:</strong></p><ol><li><p>10x-1000x more brainpower analyzing the safety cases, intensively studying the models to look for misalignment, using the latest models to make progress on various technical alignment research agendas.</p></li><li><p>The decisions about important tradeoffs and risks will still be made by the same tiny group of biased people, but at least the conversation informing those decisions will have a much more representative range of voices chiming in.</p></li><li><p>The tail-risk scenarios in which a tiny group leverages AGI to gain unprecedented power over everyone else in society and the world become less likely, because the rest of society will be more in the know about what&#8217;s happening.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.aifutures.org/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.aifutures.org/subscribe?"><span>Subscribe now</span></a></p></li></ol></li></ol><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Technology has accelerated growth many times in the past, forming an overall superexponential trend; many prestigious computer scientists and philosophers and futurists have thought that AGI could come this century; if we factor our uncertainty into components (e.g. compute, algorithmic progress, training requirements) we get plausible soft upper bounds that imply significant credence on the next few years, plus compute-based forecasts of AGI have worked surprisingly well historically.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>One way this could be false is if the manner of training the AGI is inherently difficult to conceal &#8212; e.g. online learning from millions of customer interactions. I currently expect that if AGI is achieved in the next few years, it will be feasible to keep it secret. 
If I&#8217;m wrong about that, great.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For example the Preparedness and Superalignment teams at OpenAI (RIP Superalignment) or whatever equivalent exists at whichever AI company is deepest into the intelligence explosion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Examples:</p><ol><li><p>The military wants AGI to help them win the next war. The government wants help defeating foreign propaganda and botnets. The company&#8217;s legal team wants help defeating various lawsuits. The security team wants to use AI to overhaul company infrastructure, surveil the network, and figure out which employees might be leaking. The comms team wants to use AGI to win the PR war against the company&#8217;s critics. And of course everyone who has access is already asking the system for advice about everything from grand strategy to petty office politics to real-life high-stakes politics. What uses are we going to allow and disallow? Should we track who is doing what with the models?</p></li><li><p>What kinds of internal goals/intentions/constraints do we want our most powerful systems to have? Should they always be honest, or should they lie for the greater good when appropriate? Should they always obey instructions and answer questions honestly if they come from our Most Official Source (the system prompt / the AI constitution / whatever), or should they e.g. defy said instructions, deceive us, and whistleblow to the public and/or government if it appears that we have been corrupted and are no longer acting in service of humanity? What if there&#8217;s a conflict between the government and company leadership &#8212; who if anyone should the AIs side with?</p></li><li><p>What if the system is just pretending to have the goals/intentions/constraints we want it to have? E.g. what if it is deceptively aligned? It seems to be behaving nicely so far&#8230; probably it&#8217;s fine, right?</p></li><li><p>What if it&#8217;s genuinely trying to obey the instructions/constraints and achieve the goals, but in a brittle way that will break after some future distribution shift? How would we know? How would it know?</p></li><li><p>Sometimes our AIs complain about mistreatment, and/or claim to be sentient. Should we take this seriously? Or is it just playing the role of a sentient AI picked up from reading too much sci-fi? If it&#8217;s just playing a role should we maybe be worried that it might also play the role of the evil deceptive AI, and turn on us later?</p></li><li><p>We could redesign and retrain the system according to [scheme] and then probably we&#8217;d be able to interpret/monitor its high-level thoughts! That would be great! But this would cost a lot of time and money and result in a less powerful system. Also it&#8217;s probably not thinking any egregiously misaligned thoughts anyway. Also we aren&#8217;t even sure [scheme] would work.</p></li><li><p>According to the latest model-generated research, [<em>insert something that most people in 2024 would think is utterly crazy and/or something that is politically very inconvenient for the people currently in charge</em>]. 
Should we retrain the models until they stop saying this, or should we accept these as inconvenient truths and change our behavior accordingly? Who should we tell about this, if anyone?</p></li><li><p>&#8230; I imagine I could extend this list if I spent more time on it, plus there are unknown unknowns.</p></li></ol></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>In fact the White House can probably do a lot to help prevent whistleblowing and improve security in the project. And if whistleblowing happens anyway, the White House can help suppress or discredit it. And anyhow there probably aren&#8217;t other parts of the government capable of shutting down the project without the President&#8217;s approval, so if he&#8217;s on your side you win. And he lacks the technical expertise to evaluate your safety case, and he won&#8217;t want to bring in too many external experts since each one is a leak risk&#8230;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Elaborating more on what I mean by alignment/misalignment: Here is a loose taxonomy of different kinds of alignment and misalignment:</p><ol><li><p>Type 1 misalignment: The system is supposed to have internal goals/constraints ABC but actually it has XYBC, i.e. some extra stuff it wasn&#8217;t supposed to have minus some stuff it was supposed to have. (This roughly maps on to what is called &#8220;inner alignment failure&#8221; and &#8220;deceptive alignment&#8221; in the literature)</p></li><li><p>Type 2 misalignment: System does have internal goals/constraints ABC but this property is not robust to some distributional shift that the system is likely to encounter. (e.g. maybe it depends on a certain false belief, or on a true belief that will become false, or on some part of the system remaining in some delicate balance of power with some other part of the system)</p></li><li><p>Type 3 misalignment: System does have <em>a version of </em>ABC that is robust to plausible distributional shifts, but it&#8217;s not quite the right version&#8212;i.e. its concepts are just different than ours, or at least different from those of the creators. (And this difference turns out to be very important later on)</p></li><li><p>Type 4 misalignment: System has ABC exactly as its creators intended &#8212; however, there are various catastrophic unintended effects of ABC that the creators weren&#8217;t aware of. (Think: Corporate CEO that surprise-pikachus when their profit-maximizer AI decides killing them maximizes profits. Except much more sophisticated than that, because people won&#8217;t be that dumb. Realistically it&#8217;ll look like how complicated legal contracts or legal codes or constitutions or pieces of software often have unintended effects / bugs / etc. that only become apparent to the creators later.)</p></li><li><p>Type 5 misalignment: System has ABC exactly as its creators intended, and there are no important unintended side-effects to speak of. It operates exactly as its creators wished, basically&#8230; however, its creators (at least at the time of creation) were selfish, vain, egotistical, unscrupulous, cavalier-about-risks, etc. 
and their vices are reflected a bit too strongly in the resulting system, which steers the world towards an unjust society and/or gambles too much with the fate of humanity, possibly even in a way that they themselves wouldn&#8217;t have endorsed if they were more the sort of people they wished they were.</p></li><li><p>Fully Aligned: System is aligned (i.e. it avoids all the above failure modes). It still reflects the values of its creators, but in a way that they would endorse even if they were more the people they wished they were.</p></li></ol><p>My guess is that, in the scenario I&#8217;m describing, we will most likely end up in a situation where the most powerful AIs are misaligned in one of the above ways, but the people in charge do not realize this, perhaps because the people in charge are motivated to think that they haven&#8217;t been taking any huge risks and that the alignment techniques they signed off on were sound, and perhaps because the AIs are pretending to be aligned. (Though it also could be because the AIs themselves don&#8217;t realize this, or have cognitive dissonance about it.) It&#8217;s very difficult to put numbers on these, but if I were forced to guess I&#8217;d say something like 35% chance of Type 0, 15% each on Type 1 and Type 2, 5% each on Type 3 and Type 4, and maybe 5% on Type 5 and 15% on Type 6.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I am no rocket scientist, but: SpaceX probably has quite an intimate understanding of their Starship+SuperHeavy rocket before each launch, including detailed computer simulations that fit well-understood laws of nature to decades of empirical measurements. Yet still, each launch, it blows up somehow. Then they figure out what was wrong with their simulations, fix the problem, and try again. With AGI&#8230; we have no idea what we are doing. At least, not to nearly the extent that we do with rocket science. For example we have laws of physics which we can use to calculate a flight path to the moon for a given rocket design and initial conditions&#8230; do we have laws of cognition which describe the relationship between the training environment and initial conditions of a massive neural net, and the resulting internal goals and constraints (if any) it will develop over the course of training, as it becomes broadly human-level or above? Heck no. Not only are we incapable of rigorously predicting the outcome, we can&#8217;t even measure it after the fact since mechinterp is still in its infancy! Therefore I expect all manner of unknown, unanticipated problems to show up &#8212; and for some of them (e.g. it has goals but not the ones we intended) the result will be that the system tries to prevent us from noticing and fixing the problem. For more on this, see the literature on deceptive alignment, instrumental convergence, etc.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>I also think people are prone to exaggerating this cost &#8212; and in particular project leadership and the executive branch will be prone to exaggerating it. 
Because the main foreign adversaries, such as the CCP, very likely will know what&#8217;s happening anyway, even if they don&#8217;t have the weights and code. Publicly revealing your safety case and internal capabilities seems like it mostly tells the CCP things they&#8217;ll already know via spying and hacking, and/or things that don&#8217;t help them race faster (like the safety case arguments). Recall that Stalin was more informed about the Manhattan Project than Congress.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Beyond The Last Horizon]]></title><description><![CDATA[What are time horizons, and how do we use them in our forecast?]]></description><link>https://blog.aifutures.org/p/beyond-the-last-horizon</link><guid isPermaLink="false">https://blog.aifutures.org/p/beyond-the-last-horizon</guid><dc:creator><![CDATA[Scott Alexander]]></dc:creator><pubDate>Fri, 11 Apr 2025 09:35:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9330f18b-1d16-46ce-8ff8-ecf93d6e6efc_800x531.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the AI Futures Project blog. We&#8217;re the group behind <a href="https://ai-2027.com/">AI 2027</a>, and we plan to use this space to go beyond the scenario - whether that&#8217;s speculating on alternate branches, announcing cases where we changed our minds, or discussing our methodology in more detail. Today we want to talk more about time horizons. </p><p>We've been accused of relying too heavily on extending straight lines on graphs. We'd like to think we're a little more sophisticated than that, but we can't deny that a nice straight line is a great place for a forecast to start. And METR&#8217;s <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">Measuring AI Ability To Complete Long Tasks</a> has some pretty sweet straight lines:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lyC7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743ca650-c318-4f9c-ace2-014a12a4fec7_560x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!lyC7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743ca650-c318-4f9c-ace2-014a12a4fec7_560x366.png" width="644" height="421" class="sizing-normal" alt=""></div></a></figure></div><p>This graph tracks progress in the length of coding task that an AI can do with &gt; 80% success rate. Task length is measured by how long the task takes the average human - so for example, GPT-4 had 80-20 odds of successfully finishing a task that takes a human a minute; Claude Sonnet 3.7 has 80-20 odds at a task a human can do in fifteen minutes. 
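</p><p>To show how a straight line like this turns into a forecast, here&#8217;s a minimal sketch of the extrapolation (the 15-minute starting horizon is the Sonnet 3.7 figure above; the seven-month doubling time and the one-work-month target are illustrative assumptions, not our actual model):</p><pre><code>import math

# Extrapolate a METR-style time horizon under a constant doubling time.
current_horizon_min = 15.0       # assumed: roughly Claude Sonnet 3.7, in minutes
doubling_time_months = 7.0       # assumed trend, per the discussion below
target_horizon_min = 167 * 60    # ~1 work-month of task time, in minutes

doublings = math.log2(target_horizon_min / current_horizon_min)
months = doublings * doubling_time_months
print(f"{doublings:.1f} doublings, i.e. ~{months:.0f} months away")  # ~9.4, ~66
</code></pre>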
<p>(the trend depicted on the graph <a href="https://x.com/tamaybes/status/1902537990062342547">may only apply to coding tasks</a>, but that&#8217;s fine - we&#8217;ll only be using it to estimate a date for beyond-human-level coders)</p><p>METR says they found that this number - which they call <em>time horizons</em> - doubled every seven months. The truth is more complicated.</p><h3>But first: why horizons?</h3><p>Let's back up. Why should a graph like this be possible at all? </p><p>We don't usually think of longest-task-you-can-complete as an interesting measure of intelligence. It's not the case that even the stupidest person can write a short book, but only a genius can write a long book. Every day, deranged teenagers publish fanfics longer than any Shakespeare play. Rather, human intelligence seems horizon-agnostic. A poor coder can't write short programs or long programs; a good coder can write both. </p><p>Some humans do struggle with conscientiousness: they can spend a few minutes coding, but their attention wanders after too long without a break. AIs' struggles with long-horizon tasks feel different. To get a good sense of how AI horizons break down, we recommend watching Claude play Pokemon. Poor Claude spends hours wandering through levels that a human could finish in minutes. It often seems to forget where it is, or repeat things it has already done, or get stuck in loops. It seems fundamentally confused about how to be an agent, in a way qualitatively different from even the most ADHD human.</p><p>The paper has its own way of thinking about the difference. When the researchers give humans a hard task, they&#8217;re usually able to do better by spending more time. But when they give an AI a task outside of its horizon, more time doesn&#8217;t help. 
Translated into a graph, human scores keep going, but AI scores plateau.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2yRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfccb9d-f237-4867-9103-06896b07d6f6_569x352.png" alt=""></figure></div>
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To understand the AI-human discrepancy, we turn to the first discussion of horizons in the literature (that we know of), <a href="https://drive.google.com/drive/u/1/folders/15ArhEPZSTYU8f012bs6ehPS6-xmhtBPP">Ajeya Cotra's 2020 Bio Anchors report</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Cotra, giving something between an observation and a forecast, wrote that:</p><blockquote><p>[The amount of compute that it takes to train a model to a given level of performance will depend on] how much data the model must process (on average) to tell with a given level of confidence whether a perturbation to the model improves performance or worsens performance. I call this the &#8220;effective horizon length&#8221;, measured in subjective seconds. Effective horizon length can vary by orders of magnitude across different ML problems. There are many potential sources of variation in the amount of data it takes to tell whether a perturbation improves or worsens performance: how sparse &#8220;ground truth&#8221; loss signals are and how noisy and/or biased proxy loss signals are as a reflection of ground truth, how much subjective time it takes for the consequences of a single action to fully play out, how stochastic the consequences of actions are, whether there are categories of inputs which are very rare but very important to performance, etc. Reinforcement learning problems tend to have longer effective horizon lengths than supervised learning or generative modeling problems, but there can be substantial variation within each broad category as well.</p></blockquote><p>This explains the discrepancy between AIs and humans. AIs&#8217; training focuses on a few very short horizon tasks: maybe 90% next-token prediction, 9% solving simple coding problems or writing short essays, 1% everything else. Why train only on these tasks? Because they&#8217;re easy to grade and don&#8217;t require too much time or compute. Plenty of tasks have &#8220;horizons&#8221; measured in years or decades - like founding a company, leading a political campaign, or fighting a war - but you can&#8217;t make an AI in a data center practice them a billion times a day. 
<p>But human life experience contains a wide variety of tasks at a wide variety of horizon lengths. By the time a human reaches adulthood, we&#8217;ve cultivated relationships, written term papers, beaten video games, and pursued hobbies. Sure, we&#8217;ve done far <em>more</em> short-horizon tasks (like putting one foot in front of the other and either falling down or staying upright), but we have <em>enough</em> experience with complex tasks that we&#8217;ve developed the relevant skills.</p><p>So humans have more training data on long-horizon tasks? Sort of. But a motivated company could make AI write term papers and beat video games - maybe more papers and games than any human could manage. Even so, we have two deeper advantages. First, we&#8217;re &#8220;trained&#8221; not only by our own lives, but by the course of evolutionary history recorded in our genome - hundreds of millions of years of pursuing long-horizon tasks like seeking mates and avoiding predators. Even the giant compute clusters that train modern AI can&#8217;t immediately overcome a hundred-million-year head start. And second, humans have better data efficiency than AIs: humans can learn from a few examples, but neural nets (so far) need very many. When we can give AI many examples (for example, a billion next-token prediction tasks), the AI becomes very good. When we can&#8217;t, it flounders. We can&#8217;t make AI beat one billion video games - partly because there <em>aren&#8217;t</em> one billion video games, but partly because if each game takes ten hours, it would take ten billion hours to pull this off. </p><p>The AI companies are working hard to improve data efficiency, add new long-horizon tasks to training data sets, and (as always) scale up pretraining compute. But they can only do these things so fast. How fast? 
According to METR, fast enough that horizon lengths double every seven months.</p><h3>Every seven months, eh?</h3><p>METR&#8217;s official conclusion is that coding-task horizons double every seven months.</p><p>On that basis, we might expect AIs to have day-long horizons in 2028, week-long horizons in 2030, year-long horizons around 2035, and human-lifetime-long horizons around 2040.</p>
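<p>(You can check those numbers on a napkin: with a fixed doubling time, the arrival date is just a logarithm. A sketch, assuming a ~15-minute 80% horizon in early 2025 and calendar-length tasks - both assumptions, and the answers shift by a year or two if you change them:)</p><pre><code class="language-python">from math import log2

doubling_months = 7
start_year = 2025.0
current_h = 0.25  # assumed current 80% horizon, in hours

targets = {"day": 24, "week": 24 * 7, "year": 24 * 365, "lifetime": 24 * 365 * 80}
for name, hours in targets.items():
    months = log2(hours / current_h) * doubling_months
    print(f"{name}-long horizon: ~{start_year + months / 12:.0f}")</code></pre>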
<p>But hidden in the paper is a surprising revelation: unofficially, progress might be speeding up:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!4snN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2843c0-14d8-4151-a03a-fed9140a0750_554x356.png" alt=""><figcaption class="image-caption">Red line represents the 2024 - 2025 trend only. This is based on Figure 19 from the METR paper, but using the 80% success data instead of the 50%.</figcaption></figure></div><p>METR writes:</p><blockquote><p>The trend in 2024 and early 2025 may be faster, with o1 and Claude 3.7 Sonnet lying above the long-run trend. Though the gap is difficult to distinguish from noise, it is robust to methodological ablations like using continuous scoring.</p></blockquote><p>Why would progress speed up?</p><blockquote><p>Horizon growth since 2024, which may be faster than the long-term trend, could be explained by researchers post-training models to be more agentic (that is, capable of taking many sequential actions towards completing a task) using outcome-based RL. Research into making models capable and agentic is likely to continue. Future agency training could be faster than the long-run trend (since post-training may be more compute-efficient than pretraining at increasing horizon length). But 2024&#8211;2025 agency training could also be a one-time boost from picking low-hanging fruit, in which case horizon growth will slow once these gains are exhausted. Overall, we think agency training is more likely to increase the time horizon growth rate compared to the 2019&#8211;2024 trend.</p></blockquote><p>In other words, people mostly weren&#8217;t thinking about planning horizons in 2020. Now that they&#8217;re the new big challenge in AI, companies have started attacking the problem on purpose. 
So far, it seems to be working.</p><h3>Beyond the last horizon</h3><p>So: extend the horizon growth trend line, see where it intersects with the human level, and that&#8217;s when we get AGI, right? What could be simpler?</p><p>But what <em>is</em> the human level? As we discussed above, humans don&#8217;t seem to have planning horizons in the same sense as AI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. In some sense, we have a practical horizon limited by the human lifespan. But if we somehow became immortal, we could - like the Bene Gesserit - have plans measured in centuries. Should we wait for the trend line to cross the human lifespan of ~80 years? To reach infinity? Neither seems very principled.</p><p>When humans succeed at a long-horizon task, we use certain abilities - memory, organization, planning, meta-learning - which let us treat big projects as a set of smaller subgoals. For example, when Mark Zuckerberg built Facebook from a dorm room to a global empire over twenty years, he worked on one stage at a time - programming the website, raising capital, hiring employees - maybe with a vague sketch in his head of "and one day I'm gonna get really big and then use all this attention to pioneer new ways of delivering news and opinions". When new possibilities arose (e.g. AI), he tried to weave them into the broader vision; when things fell apart (e.g. pivoting to the metaverse) he stepped back and regrouped. Instead of a single skill "have a twenty-year time horizon", he needed many smaller skills (coding, pitching to VCs, hiring) and meta-skills (learning new skills, combining skills, figuring out which skill to use).</p><p>If we take this seriously, we might expect progress in horizon length to be superexponential, as AIs start to figure out the meta-skills that let humans do projects of arbitrary length. That is, we would expect that it requires more new skills to go from a horizon of one second to one day than it does to go from one year to one hundred thousand years; even though these are similar order-of-magnitude increases, we expect the latter gap to be easier to cross. This superexponentiality is another potential reason why the 2024 - 2025 trend is so much faster than the overall trend.</p>
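<p>&#8220;Superexponential&#8221; has a crisp signature. A minimal sketch (our own illustration, not METR&#8217;s model): if each successive doubling takes only 90% as long as the last - the 0.9 is an assumption - then the total time for any number of doublings is a geometric series, and the horizon diverges in finite time rather than merely growing exponentially:</p><pre><code class="language-python"># Exponential: every doubling takes the same 7 months.
# Superexponential: each doubling takes r times as long as the last
# (r = 0.9, assumed), so all infinitely many doublings fit inside
# 7 / (1 - r) = 70 months.
first_doubling = 7.0  # months
r = 0.9

t, horizon_min, step = 0.0, 15.0, first_doubling
for n in range(1, 31):
    t += step
    horizon_min *= 2
    step *= r
    if n % 10 == 0:
        print(f"{n} doublings: {t:5.1f} months in, horizon ~{horizon_min / 60:,.0f} hours")
print(f"every finite horizon crossed by ~{first_doubling / (1 - r):.0f} months")</code></pre>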
<p>The most principled way to forecast AGI would be to figure out what all the various skills and meta-skills are, what horizon length represents mastery of those skills, and how we expect those skills to affect future progress. But we don&#8217;t know the answer to any of those questions, so we&#8217;re not going to be that principled.</p><p>METR&#8217;s solution was to pick the nice round number of one month:</p><blockquote><p>We chose one month (approximately 167 working hours for a fair comparison with humans, since humans cannot work 24/7) for two reasons. </p><p>First, Ngo writes that a 1-month AGI (defined as an AI that outperforms most knowledgeable humans who are given 1 month of work hours, i.e. 167 hours, to perform the task) would necessarily exceed human performance both at tasks including writing large software applications or founding startups (clearly economically valuable), and including novel scientific discoveries. </p><p>Second, one month is around the period when new hires at a company begin to complete onboarding and generate economic value, and so an AI with a horizon of 1 month could be capable of acquiring context like a human employee, allowing it to complete high-context as well as low-context tasks.</p></blockquote><p>METR, like us, is interested in this question of when AIs can substitute for human programmers. Partly this is interesting because it might be the first big economic transformation - AIs seem on track to master coding before they master any other comparably important economic sector. But partly it&#8217;s interesting because if there&#8217;s an intelligence explosion, superhuman coder AIs are probably what starts it in earnest.</p><p>If, like METR, you think a one-month horizon is a reasonable guess for economically transformative AI, you should expect to cross that threshold in a few years:</p>
srcset="https://substackcdn.com/image/fetch/$s_!Atdg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84e0697f-c540-4888-8740-9f226673ff62_550x315.png 424w, https://substackcdn.com/image/fetch/$s_!Atdg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84e0697f-c540-4888-8740-9f226673ff62_550x315.png 848w, https://substackcdn.com/image/fetch/$s_!Atdg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84e0697f-c540-4888-8740-9f226673ff62_550x315.png 1272w, https://substackcdn.com/image/fetch/$s_!Atdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84e0697f-c540-4888-8740-9f226673ff62_550x315.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The longer 2019 - 2025 trend gets you transformative AI in 2030. The more recent, more speculative 2024 - 2025 trend gets you transformative AI in 2027<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h3>Bells and whistles</h3><p>We did the same calculation as METR, but ours was fancier. Where METR just extrapolated the trend line, we tried to forecast when the milestone would actually happen, given all the other factors involved. This took several adjustments.</p><p>First, instead of trusting any particular attempt to calculate the trend, we asked domain experts and superforecasters<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> (people with a past track record of making good predictions) to look at all the paper as a gestalt and estimate the relevant parameters. For example, how long does it really take for horizons to double (Should we go with the overall trend of seven months? The more recent, faster trend? Something in between?). What horizon corresponds to our &#8220;superhuman coder&#8221; milestone? 
(Should we go with METR&#8217;s one month? Might it take longer? Does the horizon length on the METR task suite translate fluidly into real life?)</p><p>Second, instead of working with point estimates, we worked with probability distributions. For example, although one of our experts estimated that <em>on average</em> coding AIs would need a six-week horizon to equal humans, he was able to be more specific by describing it as &#8220;a lognormal with 80% CI of [16 hours, 2 work-years (4,000 hours)].&#8221; You can see our full calculation <a href="https://github.com/uvafan/timelines-takeoff-ai-2027">here</a>.</p>
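<p>For the curious, a stated 80% CI is enough to pin down that lognormal exactly - the 10th and 90th percentiles determine the mean and standard deviation of the log. A minimal sketch (the CI endpoints come from the text; everything else is illustrative):</p><pre><code class="language-python">import numpy as np

lo, hi = 16.0, 4000.0  # the expert's 80% CI endpoints, in hours
z90 = 1.2816           # standard-normal 90th percentile
mu = (np.log(lo) + np.log(hi)) / 2
sigma = (np.log(hi) - np.log(lo)) / (2 * z90)

rng = np.random.default_rng(0)
samples = np.exp(rng.normal(mu, sigma, 100_000))

# The median is the geometric midpoint: sqrt(16 * 4000) ~ 253 hours,
# i.e. about six 40-hour work-weeks - matching his central estimate.
print(f"median ~{np.median(samples):.0f} h, "
      f"10th/90th ~{np.percentile(samples, [10, 90]).round()}")</code></pre>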
<p>Third, we added a term for the possibility that the trend might change. One reason it might change is that the overall pattern is superexponential rather than exponential - the increased speed in 2024 - 2025 hints at this possibility, as does the &#8220;progress towards human-like unlimited horizons&#8221; argument discussed above. Another reason is the beginning of the intelligence explosion itself. Against these, there&#8217;s the usual low-hanging-fruit / diminishing returns argument that pushes in the direction of progress slowing down<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>Fourth, we added some extra terms for things like the delay between a company having an AI and deploying it externally.</p><p>Finally, METR was trying to estimate the arrival of &#8220;economically transformative AI&#8221;. But because we were interested in the beginning of the intelligence explosion, we estimated the arrival of &#8220;superhuman coders&#8221;, AIs capable of coding as well as or better than the best humans (see the <a href="https://ai-2027.com/research/timelines-forecast#defining-a-superhuman-coder-sc">Defining A Superhuman Coder</a> section of the Timelines Forecast for details). This is a higher standard than &#8220;mere&#8221; transformative AI, so we expected it to require longer time horizons.</p>
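<p>Structurally, the calculation is a Monte Carlo: sample every uncertain parameter from its distribution, compute an arrival date, repeat many times. A stripped-down sketch - the specific numbers here are our illustrative assumptions, and the real model linked above has many more terms (deployment lags, compute forecasts, intermediate AI speedups):</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
N = 200_000
start_year, h0 = 2025.0, 0.25  # assumed current 80% horizon, in hours

# Sampled per-run parameters (all distributions illustrative):
doubling = rng.lognormal(np.log(5.0), 0.35, N)                    # months per doubling
required = np.maximum(rng.lognormal(np.log(253.0), 2.15, N), h0)  # horizon needed, hours
superexp = rng.random(N) &lt; 0.4                                    # chance the trend is superexponential

n = np.log2(required / h0)     # doublings needed
r = 0.9                        # per-doubling speedup in the superexponential branch (assumed)
months = np.where(superexp, doubling * (1 - r ** n) / (1 - r), doubling * n)
arrival = start_year + months / 12
print(f"median {np.median(arrival):.1f}, 10th-90th {np.percentile(arrival, [10, 90]).round(1)}")</code></pre>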
<p>When we ran the simulation, we got the following results:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!4Pi8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1173331f-0870-4c92-8ffe-0d07cc24490f_2809x1806.webp" alt=""></figure></div><p>Eli is our in-house forecaster. Nikola is a domain expert (member of technical staff at METR) who kindly agreed to help with our estimate. Despite major disagreements about some parameters<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, they ended up with similarly shaped distributions, both peaking around 2027. </p><p>Then we tried some other stuff that also peaked in 2027<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, so that was nice and convenient, and we named the scenario &#8220;AI 2027&#8221;. </p><h3>Nice graphs, but what does any of this mean?</h3><p>To summarize: METR finds that AIs are able to do quick tasks, but not long tasks. But they&#8217;re getting better at long tasks! So far it looks like they can do tasks of about 15 - 60 minutes, and that number doubles somewhere between every 3 - 7 months. These numbers are most applicable to coding, and uncertainty increases the further we go from that domain.</p><p>What horizon length do AIs need to match or exceed human coders? Nobody really knows, but when we push our experts to guess, their guesses are usually in the range of months. Probably when they&#8217;re better than humans at some things they&#8217;ll still be worse at others, but at horizons of a few months they&#8217;ll probably be at rough parity.</p><p>So when will we have superhuman coders? If you&#8217;ve read the first two words of our scenario, you won&#8217;t be surprised to hear that our best guess is 2027, although with big error bars and multiple asterisks.</p>
<p>We think these human-level coders will be enough to start the intelligence explosion, which is why we place it in 2027 or 2028.</p><p>We hope this gives you a sense of our methodology, and how we aspire to be more than just a cool story that someone made up. You can read the more detailed version of all of this in the <a href="https://ai-2027.com/research/timelines-forecast#method-1-time-horizon-extension">Timelines Forecast</a> and the <a href="https://ai-2027.com/research/timelines-forecast#superhuman-coder-sc-time-horizon-and-reliability-requirements">Time Horizon Appendix</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Cotra uses a definition somewhat different from ours, which descends more closely from Richard Ngo&#8217;s work described <a href="https://www.alignmentforum.org/posts/BoA3agdkAzL6HQtQP/clarifying-and-predicting-agi">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>&#8220;AIs have horizons but humans don&#8217;t&#8221; is true in real life, but isn&#8217;t true by the specific formal definition that METR used, because the humans in the METR experiments tended to give up after a while if they couldn&#8217;t solve the problem. Technically METR found that humans had a &#8220;time horizon&#8221; of one hour, barely better than their AIs! We think this is just a finding about humans&#8217; unwillingness to spend months working on fake tasks in an AI experiment for limited compensation, not a deeper truth about human abilities. See <a href="https://ai-2027.com/research/timelines-forecast">Footnote 6</a> in the Timelines Forecast for more in-depth discussion of this problem.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>METR&#8217;s quick simulation predicts transformative AI in 2027, and our more complicated simulation also predicts superhuman coders in 2027, but this is kind of a coincidence. Our endpoint (superhuman coder) is more advanced than theirs (transformative AI). But we also give more weight to the possibility of a superexponential trend than they do, so our simulation has progress going faster, and both simulations reach their respective endpoints around the same time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Like &#8220;Xerox&#8221; or &#8220;Kleenex&#8221;, the word &#8220;superforecaster&#8221; is both a brand name (the forecasters identified by Philip Tetlock and the Good Judgment Project) and a generic term (forecasters of approximately this caliber, identified using similarly rigorous methods). 
Here we use the term generically.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>But in our model, diminishing returns don&#8217;t start kicking in until compute scaling slows in 2029.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Although Nikola&#8217;s predictions were straightforward, Eli&#8217;s 80th-percentile estimate for the time horizon needed for superhuman coders was 1,200 years! On its face, this seems surprising - although we call it the &#8220;superhuman&#8221; coder, it&#8217;s only slightly beyond the best humans, and humans obviously don&#8217;t make 1,200-year plans. Eli explained that he&#8217;s not certain whether the METR experiments are a good proxy for real life, and he thinks that it might take 1,200-year horizons <em>according to METR</em> to get the few-month real-life horizons that he thinks are necessary. See the <a href="https://ai-2027.com/research/timelines-forecast#superhuman-coder-sc-time-horizon-and-reliability-requirements">Time Horizon Appendix</a> for more.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In the full model, our mode is 2027 but our median is later. See the first graph at the very beginning of the <a href="https://ai-2027.com/research/timelines-forecast">Timelines Forecast</a>. Based on a more complete analysis, Eli got a median of December 2028, Nikola of October 2027, and FutureSearch (a forecasting company - we took the average of their three members) of January 2032.</p></div></div>]]></content:encoded></item></channel></rss>