34 Comments
Johan Falk's avatar

Just wanted to say that the work you do is important and appreciated.

Girish Sastry's avatar

What accounts for the difference between Eli and Daniel? Are there a few key disagreements (and how large are those disagreements)?

Eli Lifland's avatar

Regarding explicit model parameters, the most important differences for AC timelines are:

1. Daniel's median present doubling time being 4 months rather than 4.5 months. My sense is that this is Daniel giving a bit less weight than me to a reversion to something in between the current trend and the older, slower trend.

If we make this change, it moves AC with Daniel's median parameters back from 06/2028 to 10/2028 (see https://www.aifuturesmodel.com/p?base=daniel-04-02-26&pdt=0.3748905706229262)

2. Daniel has a median estimate of 1 year for the 80% time horizon required for AC, while I have 125.

Adjusting this moves AC from 10/2028 to 12/2029: https://www.aifuturesmodel.com/p?base=daniel-04-02-26&pdt=0.3748905706229262&acth=7.197271450951031. Which is nearly exactly the same as my AC with median parameters of 11/2029 (https://www.aifuturesmodel.com/p?base=eli-04-02-26)

My sense is that this disagreement is mostly driven by (a) Daniel thinking that lower reliability/success rate is required for AC than I do, and (b) Daniel thinking that the relevant tasks that AC needs to automate take less time than I do, in part because of differences around how to count time for tasks that are done by many humans in parallel. Also, Daniel has some reasoning about the AC only needing to automate long enough tasks to reach escape velocity, where it's getting better fast enough that the tasks it needs to do are always shorter than its time horizon, which I don't feel like I fully understand.

You can see each of our rationales at https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.ga01t71wyiv7
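As a rough illustration of why both parameters above matter, here is a minimal sketch that assumes plain exponential growth in time horizon with a constant doubling time. It is not the AI Futures Model, which has many more moving parts. The function name and the 8-hour and 2000-hour figures are hypothetical placeholders chosen only to show the arithmetic.

```python
import math


def months_to_reach(current_horizon_hours, target_horizon_hours, doubling_time_months):
    """Months until the time horizon reaches the target, assuming a constant doubling time."""
    doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
    return doublings_needed * doubling_time_months


# Hypothetical inputs (assumptions, not values taken from the model):
current = 8.0    # current 80% time horizon, in hours
target = 2000.0  # required time horizon, roughly one work-year in hours

fast = months_to_reach(current, target, 4.0)  # 4-month doubling time
slow = months_to_reach(current, target, 4.5)  # 4.5-month doubling time
print(f"4.0-month doubling: {fast:.1f} months")
print(f"4.5-month doubling: {slow:.1f} months")
print(f"gap: {slow - fast:.1f} months")
```

Under this simplification the gap between two doubling times grows with the number of doublings still needed, so a higher required time horizon both pushes the date out and amplifies any disagreement about doubling time.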

Regarding other AC timelines considerations that creep into our all-things-considered view, my guess is that the biggest difference is a pure vibes-level difference where it feels intuitively to Daniel like extrapolating progress in coding agents leads to AC in 2027 or maybe 2028, while my vibes say it will take longer.

Regarding takeoff post-AC (especially post-SAR), the biggest driver of differences is that Daniel has a median automated research taste slope (on the homepage, "How quickly AIs improve at research taste") of 3.0 while I have 2.1. As you can see in https://www.aifuturesmodel.com/analysis this is by far the most important parameter for takeoff. And if I only adjust that parameter of Daniel's to match mine, AC->ASI takeoff goes from 1 to 1.75 years. I have a long writeup/spreadsheet which explains where I got 2.1, and Daniel wrote a short blurb about why he sets it higher: https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.y0yy6iou4a4q

Pranay Agrawal's avatar

Love the intellectual honesty 🙏.

Will Kiely's avatar

> The AC milestone is the point at which an AGI company would rather lay off all of their human software engineers than stop using AIs for software engineering.

This operationalization seems unideally ambiguous to me, such that I'd be surprised if this was the best operationalization of "AC" to focus your forecasting efforts on.[1]

I'm sure you've considered this, but wanted to flag this anyways.

[1] It's a purely hypothetical decision that no company actually faces, and if there's a dispute about whether AC has been reached or not (even between Daniel and Eli), you can't even say "well let's imagine what would happen if the AGI company made each choice" as a way to resolve the dispute, since presumably the company's development would suffer significantly and it would fall behind and go out of business regardless of which choice it makes--unless you're *way past* the AC milestone, in which case sure, laying off all the human SEs is fine, but the point is for the milestone to be able to measure the threshold, not the point where you're so far past it that obviously you'd rather lay off all the human SEs. Wait, or is this what you meant by the milestone? The point where the company would *obviously* rather lay off the human SEs because they aren't really needed anymore? If so, I still think it's a little too ambiguous. There's a spectrum from "laying off all our human SEs would be catastrophically harmful to us as a company" to "it'd be a big setback, but we could manage it" to "we fired them already because paying them to do software engineering wasn't worth the money". And unless companies go through that whole spectrum in a short period of time, then what point is meant by your AC milestone is still unclear to me.

Daniel Kokotajlo's avatar

I acknowledge those weaknesses in the milestone; got a better idea? We considered various alternatives which all seemed worse.

1123581321's avatar

Did you consider "when AI_spend reaches a certain fraction of the company payroll" as a milestone? Layoffs are a terrible metric because of all the labor laws, etc., but when you see AI token spend approaching SE salary spend you know the company values "AI coders" as much as human ones.

Will Kiely's avatar

Haven't thought of a better idea yet, but I did just see that Ajeya Cotra just published a very relevant post after yours -- your AC milestone is "parity" in hers, except you only remove software engineers instead of all technical staff: https://plannedobs.substack.com/p/six-milestones-for-ai-automation?utm_source=share&utm_medium=android&r=lsiv

Nebu Pookins's avatar

Perhaps low priority and you guys have more important things to do, but it sure would be swell if you could update https://forecast2026.ai/ with the Q1 updated current values so we could see how our predictions were faring so far.

Zeb Camp's avatar

Is there a planned publication date for the next scenario that y’all would be able to share?

Daniel Kokotajlo's avatar

We have a draft but it's unclear how long it'll take to be ready to publish, sorry!

Zeb Camp's avatar

All good. I’m grateful to be able to read you guys’ hard work for free. I wish y’all luck.

Alexander Goncharov's avatar

What's the plot of the scenario? Since you have a draft already you can say at least that.

Will Kiely's avatar

> We operationalize the term "AGI" as TED-AI, which is defined as: AIs at least as good as top human experts at virtually all cognitive tasks.

Is cost part of this definition? E.g. Would an AI that can complete any cognitive task at least as well as a top human expert but at a cost that is often much more expensive than just hiring the human (say it takes $1M for the AI to do a task that takes the human expert a day) still qualify as TED-AI?

Daniel Kokotajlo's avatar

Yes, if it's as much or more expensive to use the AI than the human, it shouldn't qualify, I think.

Woody Zen's avatar

6 AI leaders' AGI timeline predictions vs your updated forecast:

Musk: 2026 (25% accuracy, FADE)

Altman: AI researcher by Mar 2028 (73%)

Amodei: virtual colleague 2026 (58%)

Jensen: 5 years = 2030 (61%)

Hassabis: 5-10 years

LeCun: skeptic (62%)

Your mid-2028 AC median aligns closest with Altman, who has the highest accuracy in this group.

Data: ClaimClock

Max's avatar

How did you get this data? I am unable to find the claimclock website.

Woody Zen's avatar

Thanks for asking! This is a side project I built to track the prediction accuracy of public figures, mainly how often their forecasts actually come true or false. Right now it covers 27 figures across AI/tech, geopolitics, economics, and crypto, and updates every day.

No public website yet, still in early development. But I share data through a Discord bot and publish articles almost every week. Happy to share more if you're interested.

Matthew Hutson's avatar

In predicting the arrival of AGI, you operationalize AGI as “AIs at least as good as top human experts at virtually all cognitive tasks.” Without operationalizing “virtually,” this definition is useless.

It’s like predicting when spaceships can go “almost” the speed of light. Energy required approaches infinity; 99% is much harder than 98%.

Daniel Kokotajlo's avatar

How about if we deleted "virtually?" Just "All cognitive tasks." It's cleaner.

The intention was for it to be "all" cognitive tasks, but then to make exceptions for obvious "gotcha" exceptions, such as "knowing what it's like to be human" or "behaving exactly like a human would" etc. Maybe we could say "All cognitive tasks of interest" and then the protocol is, if we want a task to count as an exception, we have to argue that it isn't of interest / isn't relevant to the broader points we are making, and if we fail to do so, then it doesn't count as an exception and it's included in the definition of TED-AI.

Thoughts?

Matthew Hutson's avatar

Thanks for your reply. For me, “all cognitive tasks” doesn’t work, because it’s not a reachable goal. Humans don’t even beat insects at all cognitive tasks (for example, those involving aspects of coordination, navigation, and communication). People also disagree on what counts as “cognitive.” “All cognitive tasks of interest” raises the additional question of what’s “of interest.” I think to name a specific AGI date, you need a concrete AGI benchmark. See: https://spectrum.ieee.org/agi-benchmark

Daniel Kokotajlo's avatar

Thanks for your thoughtful engagement!

I think it's important to distinguish between definitions of capabilities milestones that we care about (such as AGI, ASI, TED-AI, AC, ...) and definitions of easily-observable/measurable proxies for those milestones (e.g. benchmarks, resolution conditions, tests....)

What we are doing here is the former, not the latter. Once we have decided what capabilities milestones we care about, and made our forecasts for when they will occur, we then turn to the task of figuring out how to measure/test whether and when those milestones occur, by constructing proxies / operationalizations such as suites of benchmarks.

Re: "of interest" Yeah, that's why I suggested a protocol above for answering the question about what's "of interest."

Matthew Hutson's avatar

I think you're doing it backwards. I don't know how you can put a specific date to a milestone before working out what exactly the milestone entails.

It's like predicting on what date a growing pile of stuff will become "big"—without deciding how to measure its size (height? mass? volume?) and what the threshold on that metric is.

Daniel Kokotajlo's avatar

I'm not doing that. Step one is working out what exactly the milestone entails -- defining the capability of interest. That's what the definition of TED-AI is trying to do. Step two is thinking about what proxies/evidence/etc. we could gather that would let us know whether the milestone has been reached. That's what various benchmarks etc. might be useful for.

Matthew Hutson's avatar

Could you clarify? It seems like you *are* doing it backwards. You've already predicted a date for reaching the milestone, but you're still working out what the milestone entails. Sorry if I'm misunderstanding you.

Jobbley's avatar

I agree here. The definition is too ambiguous. It would definitely be appreciated to understand what "in virtually all tasks" actually means. This problem is also present in the ASI definition of "2x the gap between top expert and median professional", where there is no real quantified definition of the gap between a top expert and median professional. Is this from a pure problem solving perspective as in, the ability to manipulate existing physics equations for example? Or in conceptual understanding of why the equations work which would indeed make the ASI much more advanced in creativity as well compared to a human?

We might need to develop techniques beyond the current generative neural networks to truly even get to ASI which uses cognition in a comparable way to an advanced biological organism.

Also, there will most likely be a shift for human labor away from automated tasks that an AGI or ASI might replace into cognitive tasks it can't for some time, including new human supervisory roles.

I admire and hugely respect the work you guys do regardless, but would like to see some more rigour in these.

QXR's avatar
Apr 20 (edited)

Per https://www.tobyord.com/writing/hourly-costs-for-ai-agents , the cost to reach the advertised time horizon of an AI agent has been growing, and it's possible this is growing faster than the rate at which time horizons are growing (i.e. cost ratio of AI to human at the time horizon could be increasing). Where, if it is, is this accounted for in the AI Futures Model?

My first guess would've been that it could be added as a correction to how quickly inference efficiency is improving, so I looked at the section for coding automation efficiency improvement factor (https://docs.google.com/document/d/1ru6Okbxb6XuH18Cz8439sdQJazMV39hNxsWDokh97r0/edit?tab=t.0#heading=h.endn64m955cs ), but I didn't find anything there.

I also checked the time horizon extension model (https://ai-2027.com/research/timelines-forecast#method-1-time-horizon-extension ), but this doesn't seem to have a term accounting for increasing costs of AI agents to reach the time horizon. There's a term that adds a few months to account for the need to drive down the cost of AI agents to 30x cheaper than humans, but it seems to assume that the relative cost of an AI agent and a human stays at the original value of 5~10x cheaper than humans.

(Incidentally, the time horizon extension model cites Figure 13 in the METR paper, but the current version of the paper seems to have moved it to Figure 14.)
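To make the concern above concrete, here is a minimal sketch of how the AI-to-human cost ratio at the time horizon would evolve if both the AI's cost to reach its horizon and the horizon itself grew exponentially. All names and numbers are hypothetical assumptions for illustration; none of them come from the AI Futures Model, the time horizon extension model, or the linked post.

```python
def cost_ratio(months, ai_cost0, cost_doubling_months,
               horizon0_hours, horizon_doubling_months, human_hourly_rate):
    """AI cost divided by human cost for a task at the current time horizon.

    Assumes the AI's cost per horizon-length task and the horizon itself
    each double on their own fixed schedule (a simplifying assumption).
    """
    ai_cost = ai_cost0 * 2 ** (months / cost_doubling_months)
    horizon_hours = horizon0_hours * 2 ** (months / horizon_doubling_months)
    human_cost = human_hourly_rate * horizon_hours
    return ai_cost / human_cost


# Hypothetical scenario: AI cost doubles every 3 months, horizon every 4 months.
for month in (0, 12, 24):
    ratio = cost_ratio(month, ai_cost0=10.0, cost_doubling_months=3.0,
                       horizon0_hours=4.0, horizon_doubling_months=4.0,
                       human_hourly_rate=100.0)
    print(f"month {month:2d}: AI/human cost ratio = {ratio:.3f}")
```

In this toy setup the ratio rises whenever the cost doubling time is shorter than the horizon doubling time and stays flat when the two are equal, which is the case a constant relative-cost assumption would correspond to.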

Peter Som de Cerff's avatar

I like your surpassing-skill-coding metric. Not only is it a lot less fuzzy than AGI as a term, it's directly applicable to a lot of the real world (where SW eats the world now, and AI should lower the cost for a truly amazing Jevons paradox blast). Most importantly, it will usher in self-improving AI, as the AI companies will be using the pre-release best version to build the next AI and its better version... and once the self-improving flywheel spins, all the rest of AGI is likely to come along soon enough anyway.

I just had my multi-agent personal tool refactor a small project I built with AI a year ago. This time a lot of bugs got fixed (some I hadn't noticed), some features were added, and it all just worked without me having to look at any code at all (which wasn't the case a year ago). I'd say we're getting pretty close already for non-mission-critical personal apps.

Flip side is that I still had to outline the functionality I desired. You can argue that this is the whole point (it is for me), but when AIs brainstorm the whole project feature set, the world will become, uh, more interesting, and fast!

Max's avatar
Apr 6 (edited)

Great post! Very informative. I did have some questions and concerns.

(1). You both mention that “… some AI company researchers that we respect continue to say that automated AI R&D is coming soon; sooner, in fact, than we ourselves think. Rather than walking back their predictions, they are doubling down, both in public and in private discussions. While we don’t put too much weight on such claims, noting that many other researchers have longer timelines, it does count for something.” I am wondering why you both don’t put too much weight on these claims?

(2). I noticed on the interactive AI Futures Model: Forecast page (https://www.aifuturesmodel.com/forecast/daniel-04-02-26?show=atc) that Daniel’s modal dates are October 2027 for the superhuman coder, August 2027 for the superhuman AI researcher, November 2027 for the superintelligent AI researcher, November 2027 for TED-AI, and January 2028 for ASI. Daniel, doesn’t the superhuman AI researcher come after the superhuman coder? And doesn’t the superintelligent AI researcher come after TED-AI?

(3). Where do both of you stand in regard to the likelihood of “a drop in remote worker”?

Kevin E Levin's avatar

The shift from a 5.5-month doubling time to 4 months is the detail that keeps pulling me back here. That's not a small recalibration; the trend itself is accelerating. The Claude Code revenue number ($2.5B annualized in 9 months) is useful real-world evidence to pair with the METR data. I appreciate the quarterly updates, which make it much easier to track how the evidence is actually moving.

**********'s avatar

Terrifying and substantial, but not total (as you guys are aware). Excuse me while I obsess over this for the next week, observing everything I can and discussing with others.