Will we get automated alignment research before an AI Takeoff?
AI may automate large parts of AI R&D within the next decade, dramatically accelerating progress. A crucial question for existential risk is the ordering: will automation speed up capabilities research or safety research first? If capabilities race ahead while safety lags, we could find ourselves with very powerful systems and no commensurate ability to make them safe.
To investigate, I built a weighted factor model that scores ten research areas on seven dimensions: task length, insight vs. engineering, data availability, feedback quality, compute bottlenecks, scheming risk, and economic incentives. The headline finding is that capabilities research looks more amenable to AI-driven speedups than safety research (average scores of roughly 6.45 vs. 5.21). Within capabilities, agent scaffolding and training efficiency score highest. Within safety, AI Control and dangerous capability evaluations are the most automatable, while alignment theory ranks lowest because it leans on conceptual breakthroughs that are hard to verify.
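As a rough illustration of how a weighted factor model of this kind works, here is a minimal Python sketch. The dimension names follow the list above, but the weights, the 0-10 scale, the convention that higher means more amenable to automation, and every score are hypothetical placeholders rather than values from the analysis. Only three of the ten areas are shown.

```python
from statistics import mean

# Hypothetical weights, one per dimension; chosen only so that they sum to 1.
WEIGHTS = {
    "task_length": 0.15,
    "insight_vs_engineering": 0.20,
    "data_availability": 0.10,
    "feedback_quality": 0.20,
    "compute_bottlenecks": 0.10,
    "scheming_risk": 0.10,
    "economic_incentives": 0.15,
}

# Placeholder scores (assumed scale: 0-10, higher = easier to automate).
# These numbers are invented for illustration only.
AREAS = {
    "agent_scaffolding": {
        "category": "capabilities",
        "scores": dict(zip(WEIGHTS, [7, 8, 7, 8, 7, 7, 9])),
    },
    "ai_control": {
        "category": "safety",
        "scores": dict(zip(WEIGHTS, [6, 6, 5, 7, 6, 4, 5])),
    },
    "alignment_theory": {
        "category": "safety",
        "scores": dict(zip(WEIGHTS, [3, 2, 3, 2, 8, 4, 2])),
    },
}


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total_weight


def category_average(areas, category, weights):
    """Mean weighted score over all areas in the given category."""
    return mean(
        weighted_score(area["scores"], weights)
        for area in areas.values()
        if area["category"] == category
    )


def ranking(areas, weights):
    """Area names sorted from most to least automatable."""
    return sorted(
        areas,
        key=lambda name: weighted_score(areas[name]["scores"], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in ranking(AREAS, WEIGHTS):
        print(f"{name}: {weighted_score(AREAS[name]['scores'], WEIGHTS):.2f}")
    for cat in ("capabilities", "safety"):
        print(f"{cat} average: {category_average(AREAS, cat, WEIGHTS):.2f}")
```

Normalizing by the total weight keeps scores on the original scale even if the weights are later changed and no longer sum to 1.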
I then look at what could change this ordering. Levers for differentially accelerating safety include building safety-research automation infrastructure now, developing benchmarks and model organisms, writing detailed research proposals that reduce the creativity required, granting safety researchers differential early access to new capabilities, and investing in documentation to expand the training data available to safety-focused systems. I close by flagging the substantial uncertainty in the analysis: the results are sensitive to the factor weights and to the somewhat artificial division of AI R&D into discrete domains.
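One simple way to probe sensitivity to factor weights is to perturb the weights at random and count how often the ordering of two areas flips. The sketch below does this under stated assumptions: the function names, the multiplicative uniform noise model, and all inputs are hypothetical and are not taken from the analysis.

```python
import random


def weighted_score(scores, weights):
    """Weighted average of scores, normalized by the total weight."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def flip_rate(scores_a, scores_b, base_weights, noise=0.5, trials=1000, seed=0):
    """Fraction of random weight perturbations that reverse the A-vs-B ordering.

    Each weight is multiplied by an independent factor drawn uniformly from
    [1 - noise, 1 + noise]. This noise model is an assumption of the sketch.
    """
    rng = random.Random(seed)
    base_a_ahead = (
        weighted_score(scores_a, base_weights)
        > weighted_score(scores_b, base_weights)
    )
    flips = 0
    for _ in range(trials):
        perturbed = [w * rng.uniform(1 - noise, 1 + noise) for w in base_weights]
        a_ahead = (
            weighted_score(scores_a, perturbed)
            > weighted_score(scores_b, perturbed)
        )
        if a_ahead != base_a_ahead:
            flips += 1
    return flips / trials


if __name__ == "__main__":
    # Invented two-dimension example: each area wins on one dimension,
    # so the ordering depends heavily on the weights.
    print(flip_rate([10, 0], [0, 10], [0.51, 0.49]))
```

A flip rate near zero suggests the ordering is robust to the chosen weights, while a rate approaching one half suggests the ordering is largely an artifact of them.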
