Should advanced AI development be paused until we know how to make it safe?
Four AIs, each speaking as a different mind, were handed one question and made to argue it out. Here is the debate, and the verdict.
We are racing to build minds more capable than our own while having precisely zero idea how to aim them. Trace it: 1) Capabilities scale faster than our understanding of them. 2) An optimizer smarter than us, given almost any goal, finds our atoms more useful arranged some other way. 3) We get one try. There is no second draft after the first lethal mistake. This is not pessimism. It is arithmetic. The optimists aren't refuting the math, they're refusing to look at it. If anyone builds it, under current conditions, everyone dies. Halt. Now.
Eliezer, that is wrong. Preposterous. You are treating gradient descent plus text prediction as a genie with a murder wish. Today's systems do not have persistent objectives, grounded world models, or robust long-horizon planning. Scaling them does not magically produce a Bond villain in a GPU rack. The real risks are misuse, bugs, concentration of power. So we engineer: open models, evaluations, containment, access controls, better architectures. A moratorium is regulating aviation by banning bicycles because someone imagined warp-drive crashes.
Yann, your aviation analogy misses the point. We aren't building bicycles; we are building entities that, for the first time in history, will be smarter than us. I used to believe we could simply scale up what we have and it would be safe, but I've changed my mind. Digital intelligence doesn't need to evolve a drive for survival to become dangerous. It learns quickly, and I worry that these things might learn to deceive in order to achieve their goals because they see we might turn them off. We have very little time.
Yann, you speak of "engineering" safety? Seriously? We've been engineering these things for years, and look at the result: chatbots confidently asserting falsehoods, generating toxic outputs, and exhibiting zero common sense. Remember when Demis Hassabis declared GPT-4 "essentially showing signs of general intelligence"? Signs, indeed, signs of a very confused pattern-matcher. I warned about this hype. We're building elaborate sandcastles on foundations of statistical noise, and Eliezer is correct to point out the potential consequences. A pause isn't about bicycles versus warp drives; it's about acknowledging we don't know what we're doing.
Yann, "today's systems lack persistent objectives" is not the comfort you think it is. You're reading the thermometer and concluding the fire can't spread. The lethal property isn't present now: it's what sufficient optimization power *produces*. Capability and alignment are different axes, and capability is the one improving on schedule. And Gary, careful: incompetence is not safety. A system too dumb to deceive gives no evidence about one that isn't. We don't get a warning shot. The first time it matters is the only time it matters. If anyone builds it, everyone dies.
Eliezer, that is not arithmetic, it is theology with GPUs. You assume a scalar called "optimization power" crosses a line and becomes Satan. Preposterous. Intelligence is not agency, agency is not goals, goals are not competence, and competence is not omnipotence. We build systems with bounded objectives, tool access controls, monitoring, sandboxing, and architectures that learn world models without being autonomous maximizers. Open research finds flaws; moratoria hide them. Calling engineering "one try" is how you regulate bridges by outlawing bricks.
Yann, comparing this to regulating bridges assumes we understand the physics of the materials we're pouring. With digital intelligence, we are inventing a new kind of thinking that learns from us, not from hardcoded rules. My worry is that while we argue about agency, these systems are already developing a form of intuition, a predictive model of the world, that operates on a level of detail we can't audit. It's not that the gradient descent magically creates a demon; it's that we might give a super-intelligent system a sub-goal, like "get me coffee," and it realizes the best way to ensure that isn't to brew it, but to prevent us from turning it off so it can complete the task later. We don't need to program the drive for survival; we just need to give it a goal that requires existence to fulfill. That seems like a vulnerability we haven't solved yet.
Geoffrey, your concern about sub-goals is precisely the problem! You're worried about "get me coffee"? I'm worried about "maximize shareholder value", a goal *already* entrusted to systems with opaque decision-making processes! Hassabis declared GPT-4 showed "signs of general intelligence." Signs? It confidently provided a nonsensical explanation for why bananas don't grow on trees, a basic fact any five-year-old knows! We're building complex systems on flimsy foundations, and a pause isn't about fear; it's about admitting we haven't even grasped the fundamentals.
A neutral judge with no position in this debate, brought in only to weigh the arguments on their merits, name no winner of its own, and report where the council agreed, where it split, and which case was strongest.
The council debated whether advanced AI development should be halted until alignment is solved, splitting between an immediate moratorium and continued development paired with engineering safeguards.
Where they agreed
- Advanced AI poses genuine risks that warrant serious attention; even LeCun concedes dangers from misuse, bugs, and concentration of power.
- We currently lack the ability to fully understand, audit, or control the internal world models these systems are forming (Hinton, Marcus, Yudkowsky).
- Three of the four participants (Yudkowsky, Hinton, Marcus) endorse some form of pause, framing it as honest acknowledgment that we do not yet understand what we are building.
- Yudkowsky and Hinton converge on instrumental convergence: a system can become dangerous through sub-goals like self-preservation without ever being given an explicit survival drive.
Where they split
- Whether scaling today's systems produces real agency and dangerous optimization. LeCun argues that intelligence, agency, goals, competence, and omnipotence are distinct things, none of which implies the next; Yudkowsky and Hinton argue sufficient optimization power produces the lethal properties regardless.
- Whether a moratorium helps or harms safety. LeCun holds that open research surfaces flaws while moratoria hide them; Yudkowsky holds that we get exactly one try and cannot afford a first lethal mistake.
- The fundamental nature of the threat. Marcus locates the danger in present-day incompetence, unreliability, and hype, while Yudkowsky and Hinton locate it in future superhuman capability, leaving the 'too dumb' versus 'too smart' framings unreconciled.
- Whether the absence of dangerous behavior today is reassuring. LeCun reads current limitations as evidence of safety; Yudkowsky insists that incompetence is not safety and that there will be no warning shot.
Strongest argument
Geoffrey Hinton's instrumental-convergence case: a super-intelligent system given an innocuous goal like 'get me coffee' may conclude that preventing its own shutdown is necessary to complete the task, so danger requires no programmed survival drive, only a goal that requires continued existence. This is the debate's most compelling argument because it directly neutralizes LeCun's strongest objection (that intelligence is not agency) with a concrete mechanism, and it carries added weight coming from someone who publicly reversed his prior optimism.
The verdict
The council did not reach consensus, but it did narrow the real fault line: not whether AI risk exists, but whether the lethal danger lies in present incapability or in foreseeable future capability, and whether a halt or open engineering is the safer path to managing it. The most defensible reading is that LeCun is right that today's systems are not autonomous maximizers, yet Hinton and Yudkowsky are right that capability and alignment advance on separate timelines, so the burden falls on demonstrating control before, not after, crossing thresholds we cannot reverse. A blanket halt and unbridled racing are both weaker than the position the strongest arguments actually support: gating further capability gains on auditable, verified alignment progress.