My Predictions for AI 2027: Metaculus Forecasting Questions
The AI 2027 Tournament on Metaculus poses 16 questions about the near-term future of AI. They are derived from the AI 2027 scenario and cover predictions about technological, economic, political and societal developments. I made predictions for 6 of the questions and want to share my reasoning. I believe there is virtue in making public predictions, opening my reasoning up to criticism, and contributing to the discourse.
I spent 2-3 hours per question. That means I didn’t have time to research many subquestions, likely missed important considerations, and had to be lazy in my modelling. I allocated roughly 30% of that time to understanding the question and doing background reasoning, 15% to developing a modelling strategy or breaking down the question, and 55% to researching and estimating specific sub-questions. I made these predictions between June 17 and July 17, 2025, so by the time you’re reading this they might be outdated.
Here are the six questions I forecasted, with links to jump to each section:
- What percentage of Americans will consider AI the most important problem in January 2028? - My forecast: 3%
- What will be the highest average score on VideoGameBench on January 1, 2028? - My forecast: 80%
- When will an AI model using neuralese recurrence be first released to the public? - My forecast: October 2026
- Will a paper with an AI as an author be published at NeurIPS, ICML, or ICLR before 2028? - My forecast: 84%
- When will OpenAI first report AI Self-improvement risk levels? - My forecast: May 2027 (High), July 2032 (Critical)
- Will the US and China reach a formal agreement to limit frontier AI training or deployment before 2029? - My forecast: 4%
1. What percentage of Americans will consider AI the most important problem in January 2028?
My forecast: Median 3%, community: 19%
Question Details
This question resolves as the percentage of Americans mentioning AI, artificial intelligence, or “advancement of computers/technology” in response to Gallup’s long-running poll: “What do you think is the most important problem facing the country today?”.
Currently, “advancement of computers/technology” registers at less than 0.5% of responses and there is no category for AI. For context, climate change typically polls at 1%, and in May 2025, healthcare, education, and ethics/moral decline each registered around 2%.
My Reasoning
To answer this, I set a baseline based on the current value and similar issues, and then estimated the expected change caused by some potential future events.
Prior: ~2%
Currently, the value for advancement of computers/technology is below 0.5%. Additionally, we can look at related issues or issues that could be large-scale catastrophes:
- Climate change 1%
- Pandemics 1%
- Jobs 3%
- Natural disasters <0.5%
Looking at these numbers I believe a 1% base rate seems reasonable. However, I’d like to account for another ~1% increase from slow and steady developments and further deployment of AIs over the next 2 years.
This leaves me with a prior of 2%.
Potential Updates: +0.25%
What kind of events could cause people to consider AI as a larger problem? What events could happen and what drives public attention?
Event | P(event) until 1.1.2028 | Update from event | Comment |
---|---|---|---|
Large scale accident with human lives lost | 7% | +2% | This might wake some people up and would give advocates something to rally around |
Huge increase in activism on the scale of Climate change 2018 | 5% | +3% | A large scale social movement seems possible, but not likely to materialize in the next 2 years |
Large job loss because of AI (eg over 5% of people laid off because of this) | 8% | +3% | A lot of this would also go into the “employment/jobs” category |
Clearly superhuman general capabilities | 20% | +3% | IMO most people don’t follow technological developments closely |
US-China explicitly enter into an AI Race, this becomes a central policy discussion point, president talks about this a lot | 10% | +2% | |
Together, this makes for an increase of +0.16 percentage points.
This list is incomplete, and I think I could come up with more reasons to get this to +0.3. However, I can also come up with some reasons to go down, maybe -0.05. This leaves me at +0.25.
Takeaways
If we take a baseline of 1% for similar types of risk, add 1% default growth and 0.25% from stuff I can enumerate, we get to ~2.25%.
I seem to disagree a lot with the community here. I don’t think this is related to the speed of AI progress, as I try to take strong progress towards AGI within that timeframe seriously. Instead, I think the difference comes from noticing how many large issues score very low in this poll: “High cost of living/Inflation” was the biggest issue in the US election, but only gets 6%.
Biggest uncertainties:
- Will AI cause significant job loss in the next 2 years?
- Will there be a large social movement against AI?
- General speed of development of AI
2. What will be the highest average score on VideoGameBench on January 1, 2028?
My forecast: 80% (median)
Forecast | lower 25% | median | upper 75% |
---|---|---|---|
Community | 34% | 58% | 83% |
My prediction | 38% | 80% | 95% |
Question Details
This question resolves as the average of the best scores across six classic video games from the 1990s that make up the VideoGameBench benchmark:
- Civilization I
- The Incredible Machine
- Pokemon Crystal
- Doom II
- Kirby’s Dream Land
- Link’s Awakening (DX)
AIs receive only raw visual inputs and high-level descriptions of objectives and controls—no game-specific scaffolding or auxiliary information. The current best score is 0.95% by gpt-4o-2024-08-06 (0.9% in Pokemon Crystal, 4.8% in Kirby’s Dream Land, 0% in the other four games).
Some background
The paper notes that models perform much worse on VideoGameBench than on game-specific implementations like “Claude Plays Pokemon” or “Gemini Plays Pokemon”—this is because VideoGameBench prohibits game-specific scaffolding. This constraint will slow progress but not prevent it.
I’m guessing that more advanced models like o1/o3 could already achieve higher performance if tested. I’d estimate ~3%.
My Reasoning
To forecast this, I looked at three approaches: (1) baseline from how similar benchmarks develop over time, (2) an inside view of the specific bottlenecks and whether they’ll be solved, and (3) likelihood of specialized training on video games.
Baseline from benchmark progressions: ~90%
I selected benchmarks from EpochAI’s benchmark Dashboard that started below 5% and recorded their progression over 1-2 years:
Name | Duration (in years) | First score | Last score |
---|---|---|---|
FrontierMath | 1 | 1% | 19% |
WeirdML | 1 | 5% | 61% |
OTIS | 2 | 3% | 84% |
AIDER | 1 | 3.6% | 80% |
Average improvement is ~48 percentage points in the first year for benchmarks starting below 5%. Overall, jumping to 19-84% seems reasonable. However, there’s selection bias here: we’re more likely to track benchmarks that show dramatic improvement. I’ll discount by 10 percentage points for this bias, giving ~38 points of expected improvement.
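As a quick sanity check, here is a minimal sketch of that average (the per-year figures come from the table above; annualizing OTIS by simple division is my own simplification):

```python
# Annualized improvement for benchmarks that started below 5%.
benchmarks = {
    "FrontierMath": {"years": 1, "first": 1.0, "last": 19.0},
    "WeirdML":      {"years": 1, "first": 5.0, "last": 61.0},
    "OTIS":         {"years": 2, "first": 3.0, "last": 84.0},
    "AIDER":        {"years": 1, "first": 3.6, "last": 80.0},
}

per_year = [(b["last"] - b["first"]) / b["years"] for b in benchmarks.values()]
avg = sum(per_year) / len(per_year)   # ~48 percentage points per year
discounted = avg - 10                 # crude discount for selection bias
print(f"average: {avg:.1f} pp/year, after discount: {discounted:.1f} pp/year")
```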
Taking this at face value as a linear increase, we would get from 3% to 100% in under 3 years. But this rarely happens with benchmarks, and a sigmoid shape is more plausible. It’s a common pattern that capabilities increase quickly at first as key problems are solved, but it takes longer to iron out the remaining issues and reach reliable performance.
I’ll take 90% as the baseline that comes out of this, but will hold it lightly.
Key bottlenecks: 57%
Key bottlenecks and likelihood of solutions:
Bottleneck | P(solved by 2028) | Impact if solved | Contribution |
---|---|---|---|
Multimodal information processing | 65% | +40% | +26% |
Reasoning in unknown environments | 40% | +10% | +4% |
Planning and memory management | 80% | +30% | +24% |
This gives an expected improvement of 54 percentage points; plus the current 3%, that’s 57%.
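For transparency, the expected-value sum behind that number (a minimal sketch; the probabilities and impacts are exactly those in the table):

```python
# Expected score gain from the bottleneck table: sum of P(solved) * impact.
bottlenecks = [
    ("Multimodal information processing", 0.65, 40),
    ("Reasoning in unknown environments", 0.40, 10),
    ("Planning and memory management",    0.80, 30),
]
expected_gain = sum(p * impact for _, p, impact in bottlenecks)  # 26 + 4 + 24 = 54 points
inside_view = 3 + expected_gain  # current ~3% score plus the expected gain = 57%
print(inside_view)
```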
Will LLMs be trained on video games? +23%
Two key factors will likely boost performance beyond the bottleneck analysis:
General video game training (70% likely): Frontier labs are increasingly using video games as showcase environments for agentic behavior (e.g. Claude Plays Pokemon) and will thus compete on pushing performance. Furthermore, video games make for a good training ground for long-term agency. Expected reduction in the error rate if this happens: 30%. This improves performance from 57% → 66%.
Direct VideoGameBench training (40% likely overall):
- P(labs train on video games) = 70%
- P(they specifically include VideoGameBench | they train on games) = 75%
- P(significant improvement | they train on it) = 70%
- Overall: 0.7 × 0.75 × 0.7 ≈ 37% chance of major improvement
This would further reduce error, increasing performance from 66% → 72%.
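Here is a hedged sketch of how I compose these error-rate reductions. The ~50% error cut in the direct-training case is an assumption I back out to reproduce the 66% → 72% step; it is not stated explicitly above.

```python
def apply_error_reduction(score, p_event, error_cut):
    """Cut the remaining error (distance to 100) by error_cut,
    weighted by the probability that the event happens."""
    return score + p_event * error_cut * (100 - score)

score = 57.0                                          # inside view after bottlenecks
score = apply_error_reduction(score, 0.70, 0.30)      # general game training -> ~66
p_direct = 0.70 * 0.75 * 0.70                         # ~0.37
score = apply_error_reduction(score, p_direct, 0.50)  # assumed 50% error cut -> ~72
print(round(score, 1))
```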
Takeaways
Taking roughly the middle between my outside view (90%) and my inside view (72%), I arrive at 80%.
The community prediction is significantly lower than my estimate. The difference may come from my belief that LLMs will likely be trained specifically on video games. Furthermore, I might find it more likely than others that unlocking key capabilities leads to large jumps on the benchmark.
Biggest uncertainties:
- Will labs specifically train on VideoGameBench or similar game benchmarks?
- Is solving the key bottlenecks sufficient for stronger performance?
- Does progress happen all at once?
3. When will an AI model using neuralese recurrence be first released to the public?
My forecast: October 2026 (median)
Question Details
“Neuralese recurrence” means a model can carry its full or partial latent state across distinct forward passes, without encoding them into tokens. The model must be general-purpose (like GPT/Claude), at least GPT-4 level, and released to public users.
Some background
Currently, models lose information when encoding thoughts into tokens between forward passes (tokens carry ~16.6 bits vs thousands of bits in residual streams).
Meta published a paper implementing this idea in 2024, showing that models could maintain richer internal representations. There are also other groups working on this (eg Geiping et al). However, the technique hasn’t been adopted in frontier models yet, likely because it reduces training efficiency—you can’t parallelize token prediction when each token depends on the previous forward pass’s latent state.
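To make the idea concrete, here is a toy PyTorch sketch of the difference between token-based chain-of-thought and neuralese recurrence, loosely in the spirit of Meta’s Coconut-style setup. The model, dimensions, and wiring are made up for illustration and are not any lab’s actual implementation.

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 100
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
unembed = nn.Linear(d_model, vocab)

x = embed(torch.randint(0, vocab, (1, 8)))  # embeddings of a toy prompt
h = backbone(x)

# Standard autoregressive reasoning: the latent state is squeezed through a
# discrete token id (~log2(vocab) bits), discarding most of the information.
token = unembed(h[:, -1]).argmax(-1)
x_token = torch.cat([x, embed(token)[:, None]], dim=1)

# Neuralese recurrence: the final hidden state itself is appended as the next
# input position, so the full latent vector carries over to the next pass.
x_neuralese = torch.cat([x, h[:, -1:, :]], dim=1)
```

In the second variant, each new position’s input depends on a computed hidden state rather than on a ground-truth token, which is why training can no longer be parallelized across positions in the usual way.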
My Reasoning
To forecast this, I combined (1) a base rate of how often AI innovations actually get implemented, (2) technical reasons why this specific innovation might be adopted faster or slower, and (3) external factors that could affect timeline.
Outside view likelihood of AI innovations being adopted: 50% by August 2026
I asked Claude for 20 AI innovations that researchers predicted in 2021 would advance LLMs and judged that about 65% had been implemented in released models within 4 years. To account for the fact that Claude is more likely to name innovations that actually did get adopted, I reduce this to 55%. However:
- Median implementation time was ~2 years (not 4)
- We’re already 6 months after neuralese techniques were first used
- AI development is accelerating (more techniques tried per year)
Based on these updates I’ll use a 60% chance by January 2027 and thus conclude with a 50/50 chance by August 2026.
Inside view: Technical factors
Factor | Impact | Time adjustment | Reasoning |
---|---|---|---|
Information bottleneck | +40% likely | -1 year | Tokens only carry 16.6 bits vs thousands in latent states |
Enables new algorithms | +10% likely | -2 months | E.g., breadth-first search not possible with standard transformers |
Only useful for RL/post-training | Neutral | 0 | Doesn’t help with pre-training efficiency |
Other factors
Factor | Probability | Time adjustment | Reasoning |
---|---|---|---|
Safety regulation against it | 15% | +3 months | Could mandate token-based reasoning for interpretability |
Internal safety advocates delay | 10% | +1 month | Would need to happen at every lab |
Paradigm shift away from Autoregressive Transformers | 10% | +6 months | Question might become irrelevant in that case eg if there is a shift to Diffusion based LLMs |
Labs already working on it | +20% likely | -3 months | Meta published, others likely experimenting with it atm |
Another, larger point is that it will become significantly cheaper to train GPT-4 level models. Thus more actors (potentially including academic labs) could experiment with new techniques for training GPT-4 level models. There are currently 20 actors that have trained GPT-4 level models (relative to the earliest GPT-4 version; a total of 38 unique models), and ~5 of them are pushing new research agendas. I think it could be many more by ~2027. => -2 months
From baseline to final forecast
Starting from August 2026 baseline and applying the adjustments I arrive at April 2026.
However, this feels too aggressive to me. The point that neuralese doesn’t make sense for pre-training is a significant barrier that I underweighted. Adding 6 months for conservatism → October 2026
Takeaways
My median of October 2026 is relatively aggressive compared to the community’s December 2027. This reflects my belief that many more techniques will be tried at GPT-4 level as capabilities become cheaper and that the information bottleneck problem is important.
Biggest uncertainties:
- How easy will it be to try things on GPT-4 level models in the future?
- Is the information bottleneck problem important?
- How much of future training will be RL/post-training vs pre-training?
- Will safety concerns successfully delay deployment?
- Could a different architecture achieve similar benefits without the “neuralese” label?
4. Will a paper with an AI as an author be published at NeurIPS, ICML, or ICLR before 2028?
My forecast: 84%
Question Details
The question resolves Yes if a paper is published at one of these top ML conferences with an AI as author, OR if authors state that an AI performed at least 25% of the cognitive work.
“Cognitive work” includes idea generation, experiment design, analysis, etc.—not just code or text writing.
Some background
AI Scientist, a system developed by Sakana AI, has already achieved a 1/3 acceptance rate at ICLR workshops. The complete research loop of idea generation, literature search, experiment design, analysis, and writing is already functional in prototype systems.
The conferences all prohibit listing AI as a formal author but would allow disclosure that AI performed significant cognitive work. The key question is whether models will become capable enough and whether authors will actually use and disclose such AI assistance.
My Reasoning
To forecast this, I (1) made an intuitive guess, (2) estimated how much AI could contribute to different aspects of research by 2028, and (3) considered meta-factors like disclosure incentives and lab training priorities.
Current baseline: From workshop to conference papers
With AI Scientist already achieving 33% acceptance at ICLR workshops, the gap to conference papers seems surmountable. Based on vibes, I’d estimate:
- 75% chance by end of 2026
- 85% chance by end of 2027
- 95% chance by end of 2028
Breaking down research tasks: Can AI do 25% of cognitive work?
I analyzed each component of empirical ML research, estimating: (a) what percentage of total cognitive work it represents, (b) how much of it AI could handle by 2028, and (c) the likelihood that the highest-AI-use paper would leverage this. Since one paper is sufficient, I will forecast the probability that the paper with the highest AI use will have >25% of its work done by AI (this is equivalent to the original question).
Task | % of work | AI capability by 2028 (median) | P(used in top paper) | Contribution |
---|---|---|---|---|
Idea generation | 15% | 85% | 60% | 7.7% |
Literature review | 10% | 85% | 70% | 6.0% |
Experiment design | 10% | 50% | 70% | 3.5% |
Engineering/Architecture | 40% | 70% | 90% | 25.2% |
Experiment analysis | 10% | 40% | 70% | 2.8% |
Writing/Visualization | 15% | 70% | 90% | 9.5% |
Total expected AI contribution: 54.7% of the cognitive work. To get to P(contribution > 25%), I estimated normal distributions for each task and arrived at P(contribution > 25%) = 97%. This ignores that these capabilities are correlated, so I reduce the chance to 87%. This is clearly still too high, so I will apply several discounts.
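Since I did not spell out the distributions above, here is a rough Monte Carlo sketch of the approach. The means come from the table; the spreads (SD of half the mean, clipped to each task’s share) are placeholder assumptions of mine, and the run treats tasks as independent, i.e. it reproduces the 97%-style figure before the correlation discount.

```python
import numpy as np

rng = np.random.default_rng(0)

# (share of total work, AI capability by 2028, P(used in top paper)) from the table.
tasks = [
    ("Idea generation",          0.15, 0.85, 0.60),
    ("Literature review",        0.10, 0.85, 0.70),
    ("Experiment design",        0.10, 0.50, 0.70),
    ("Engineering/Architecture", 0.40, 0.70, 0.90),
    ("Experiment analysis",      0.10, 0.40, 0.70),
    ("Writing/Visualization",    0.15, 0.70, 0.90),
]

n = 100_000
total = np.zeros(n)
for _, share, capability, p_used in tasks:
    mean = share * capability * p_used
    # Assumed spread, not from the post: SD of half the mean, clipped to [0, share].
    total += rng.normal(mean, 0.5 * mean, n).clip(0, share)

print("expected AI contribution:", total.mean())         # roughly 0.54
print("P(contribution > 25%):", (total > 0.25).mean())   # high, ~0.97-0.99 under these assumptions
```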
Adjustments and meta-factors
Factor | Impact on probability | Reasoning |
---|---|---|
Coding might not count as cognitive | -9% | Resolution criteria ambiguous about engineering |
Papers might be rejected | -5% | Conferences may resist AI-heavy papers |
Underreporting due to stigma | -8% | Authors want credit, community skepticism |
Conservative resolution | -6% | Metaculus might interpret strictly |
Theoretical papers easier to automate | +9% | AIs are already making novel proofs |
Position/survey papers easier to automate | +7% | Easiest to automate, though only 2% of papers |
Labs training for this capability | +5% | OpenAI’s RepliBench, clear commercial interest |
Net adjustment: -7% from the 87% baseline → 80%
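The net adjustment is just the sum of the table entries; a one-liner for completeness:

```python
adjustments = {"coding may not count": -9, "papers rejected": -5, "underreporting": -8,
               "conservative resolution": -6, "theory papers": +9, "position/survey papers": +7,
               "labs training for this": +5}
print(87 + sum(adjustments.values()))  # 87 - 7 = 80
```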
But given the rapid progress and the fact that only ONE paper needs to meet the threshold, I’ll be slightly more optimistic → 84%
Takeaways
My 84% forecast is notably higher than the community median of ~70%. This difference likely stems from:
- My belief that theory-heavy papers and survey papers are likely to be cracked first
- The fact that it’s enough for 25% of the work to be done by AI; the question doesn’t require automating the full research process
Biggest uncertainties:
- Will conferences explicitly ban or discourage AI contributions?
- How strictly will “cognitive work” be interpreted vs. mechanical tasks?
- Will authors honestly disclose AI usage given career incentives?
5. When will OpenAI first report AI Self-improvement risk levels?
My forecasts:
- High threshold: May 2027
- Critical threshold: Jul 2032
Question Details
OpenAI’s Preparedness Framework v2 defines risk thresholds for AI Self-improvement capability:
- High threshold: “The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant”
- Critical threshold: “The model is capable of recursively self improving… either a superhuman research scientist agent OR causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time”
The question resolves when OpenAI officially reports reaching these thresholds, or if two credible sources confirm it.
Some background
OpenAI previously reported that o3-mini and deep research achieved the “Medium” risk threshold on their older “Model Autonomy” category. The High and Critical thresholds represent significant jumps in capability—from useful assistant to transformative research acceleration.
METR’s evaluations suggest steady progress on coding tasks. Their trends imply that by November 2026, models will handle 2:40h ML engineering tasks with 80% success. This trajectory points toward the capabilities needed for these thresholds.
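The exact METR numbers depend on which success threshold and model you anchor on, so here is only an illustrative extrapolation. The starting horizon (25 minutes in March 2025) and the ~7-month doubling time are placeholder assumptions of mine, chosen to be in the rough spirit of METR’s published trend rather than their exact figures.

```python
from datetime import date, timedelta
import math

start_date = date(2025, 3, 1)   # assumed anchor point
start_horizon_min = 25          # assumed task-length horizon at that date
doubling_months = 7             # assumed doubling time

target_min = 160                # a 2:40h ML engineering task
doublings = math.log2(target_min / start_horizon_min)
eta = start_date + timedelta(days=round(30.4 * doubling_months * doublings))
print(eta)  # lands in autumn 2026 under these assumptions
```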
My Reasoning
To forecast this, I (1) estimated what capabilities would meet each threshold, (2) projected when those capabilities will exist, and (3) adjusted for OpenAI’s reporting incentives and delays.
High Threshold: What capabilities are needed?
A “highly performant mid-career research engineer assistant” would roughly:
- Double researcher productivity
- Enable tackling more technically difficult projects
- Handle multi-hour coding tasks reliably
Capability | Current state | Timeline to threshold |
---|---|---|
90% reliable 2-hour coding tasks | Extrapolating METR task length trends | October 2026 |
Significant acceleration on key tasks, e.g. using new libraries, quickly finding relevant papers, generally speeding up coding, improving idea generation | I already get a ~1.5x speedup from LLMs now, so a 2x speedup within one year seems plausible | August 2026 |
Averaging these estimates → September 2026 for capability achievement
Political and reporting adjustments for High threshold
OpenAI has an incentive to trigger this threshold early to feed an “AGI is coming” narrative and thus increase investment and usage (-1 month). However, triggering the threshold would also be costly, since it means they have to employ expensive mitigations, possibly delay launches, and could face political backlash (+4 months). Furthermore, I believe OpenAI will be more conservative than me when interpreting these criteria (+3 months). Lastly, once the threshold is met there might be a delay until it’s publicly announced (+2 months).
Net adjustment: +8 months from September 2026 → May 2027
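In sum, the reporting adjustments stack like this (a trivial tally, but it keeps me honest):

```python
adjustment_months = {
    "incentive to trigger early (AGI narrative)": -1,
    "costly mitigations, delays, political backlash": +4,
    "OpenAI interprets the criteria more conservatively than me": +3,
    "lag between meeting the threshold and announcing it": +2,
}
print(sum(adjustment_months.values()))  # +8 months: September 2026 -> May 2027
```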
Critical Threshold: What capabilities are needed?
The critical threshold has two ways to qualify:
- Superhuman research scientist agent
- 5x speedup on generational improvements (o1→o3 in 4 weeks)
The 5x speedup seems like the lower bar: parallelization and speed advantages could achieve this without full autonomy. For example, it could be achieved by a 4x speedup in implementation combined with a 1.5x speedup in research taste.
Finally, I ran out of time and just made a guess, modelling it as a bimodal distribution. The first mode represents a world in which scaling up ~current methods gets us to the specified capabilities; the second represents a world where new breakthroughs or paradigms are needed.
Takeaways
For the High threshold (May 2027), I’m relatively confident based on clear capability trajectories. The critical threshold is much more uncertain—it might come surprisingly quickly if current scaling continues, or might require fundamental breakthroughs.
My estimates are earlier than the community median, likely because I already experience significant speedups in my research process from LLMs and think a 2x boost in productivity is not far off. It’s also notable that the community predicts less time between the two thresholds than I do. This difference might reflect the “boom or bust” model I have in my head, where AGI-level progress either happens by scaling up current methods in the next 5 years or takes much longer because other breakthroughs are needed.
Biggest uncertainties:
- How strictly will OpenAI interpret “equivalent to mid-career engineer”?
- Will scaling up ~current techniques lead to superhuman coders?
6. Will the US and China reach a formal agreement to limit frontier AI training or deployment before 2029?
My forecast: 4%, Community: 10%
Question Details
The question resolves Yes if the US and China reach a formal mutual agreement that limits AI training or deployment above a certain capability level or compute threshold. The agreement must impose specific limitations on AI above a threshold, not just prohibit certain uses of all AI.
Some background
Comparable international agreements exist for other dual-use technologies. Nuclear arms control treaties, naval tonnage limitations, and even a US-China agreement on cyber-enabled IP theft show that adversaries can reach agreements when mutual vulnerability exists. However, AI verification is harder than monitoring nuclear weapons or battleships.
Currently, there is a US-China dialogue on artificial intelligence where officials have discussed AI risks, plus track-2 diplomatic dialogues. This provides some foundation, though far from treaty negotiations.
My Reasoning
To forecast this, I (1) established a base rate from similar treaties and important success factors for treaties, (2) answered some questions I found relevant, and (3) attempted to squish things into a probabilistic model.
Base rate from comparable treaties: 25%
Looking at historical precedents:
- Nuclear arms control has seen multiple treaties. It’s relevant as it poses a similar existential risk
- Other weapons control treaties include restricting the tonnage of battleships or ballistic missiles
- There is a China-US agreement against cyber-enabled IP theft
According to Claude, important factors for the success of a treaty are:
- Mutual vulnerability drives deals: treaties were most successful when both sides feared the technology. This could become true for AI after significant advances, but is currently not the case.
- Verification is crucial but flexible: “national technical means” allowed verification while preserving sovereignty
- AI is harder to enforce than physical military technology. Chip security features are a possible avenue
- Quantitative thresholds work: specific numbers (warheads, tonnage) are easier than qualitative limits
- Compute thresholds or clear allowances for what AI can and can’t be used for could provide clear red lines, but are less measurable and less directly relevant than physical limits
- Crisis catalyst effect: Cuban Missile Crisis → Test Ban Treaty
- There might be a clear inflection point (40%)
Informed by this I will make up a 25% prior.
Will negotiations be attempted?: 10%
For serious negotiations to occur, several factors must align:
Factor | Probability | Reasoning |
---|---|---|
AI becomes clearly dangerous | 30% | Requires visible near-misses or accidents |
AI salient to policymakers | 50% | Currently low but could change rapidly |
Both prefer cooperation over racing | 40% | Depends on which internal narratives win out |
These factors are correlated. I estimate 25% chance they align sufficiently. Given alignment, 40% chance they attempt formal negotiations. Combined: 25% × 40% = 10% chance negotiations are attempted
Will negotiations succeed?: 40%
If negotiations are attempted:
Factor | Impact on success | Assessment |
---|---|---|
Historical treaty success rate | Baseline 45% | |
US willingness | 60% | |
China willingness | 70% | China is likely going to be behind in AI development and thus would benefit from a slowdown |
Both US and China willing | 45% | |
Trump presidency | -10% | Less likely to make deals |
Clear US lead via chip controls | -10% | Reduces US incentive |
Adjusted probability of success: 40%
Final calculation
P(treaty) = P(negotiations attempted) × P(success | attempted) = 10% × 40% = 4%
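Written out as a tiny chain of conditional probabilities (the numbers are the ones derived above):

```python
p_factors_align = 0.25            # dangerous + salient + cooperation preferred
p_attempt_given_align = 0.40
p_negotiations = p_factors_align * p_attempt_given_align   # 0.10

p_success_given_attempt = 0.40
p_treaty = p_negotiations * p_success_given_attempt        # 0.04
print(f"P(treaty) ≈ {p_treaty:.0%}")
```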
Takeaways
My 4% forecast is significantly lower than the community’s 10%. This reflects my skepticism that policymakers will be sufficiently aware of AI risk by then, and the poor US-China relationship, which is unlikely to improve during a Trump presidency.
Biggest uncertainties:
- Will there be a catalyzing “AI crisis” before 2029?
- How will the US-China tech competition evolve?
- Can technical solutions enable credible monitoring?
Final Thoughts on the Value of Forecasting AI Developments
I think forecasting AI developments is very educational and a prosocial activity. While I don’t think any of my numbers should be used as the basis for important decisions, I have learned a lot. I picked questions I find important, and having predictions about them informs my worldview. But more importantly, forecasting forces you to develop deep mental models about a question, and these mental models can be reused whenever you think about AI developments. They make you smarter at thinking about AI overall!
It was especially useful to make forecasts in parallel with a friend and discuss our approaches afterwards. This gave me quick feedback on my methodology and pointed out considerations I had missed. I highly recommend it! Thank you to Charles for being my forecasting partner.
Additionally, by contributing to the aggregate Metaculus predictions and publishing my reasoning here, I hope to help improve the information on which others base their decisions and to contribute to the broader conversation about AI development.