Research·3 min read·OpenAI

OpenAI’s “Deployment Simulation” Replays 1.3 Million Real Chats to Catch a Model Misbehaving Before Launch

Detailed June 16, OpenAI’s Deployment Simulation forecasts how an unreleased model will behave by replaying ~1.3M past conversations through it and grading the results — hitting a median error of about 1.5x and catching “calculator hacking” in GPT-5.1 before release.

AI SAFETY · OPENAI JUN 16 OpenAI now rehearses a model before it ships. Deployment Simulation replays 1.3M past chats through a new model to forecast misbehavior. STEP 1 Replay recent production chats STEP 2 Regenerate the reply, new model STEP 3 Grade it for misbehavior STEP 4 Estimate the deployment rate Median error 1.5x · caught “calculator hacking” in GPT-5.1 before release Built from ~1.3M de-identified conversations, Aug 2025 to Mar 2026. BITSMINDS.COM Source: OpenAI · MarkTechPost
Share:

OpenAI has detailed a new pre-release safety method it calls Deployment Simulation — a way to forecast how a not-yet-shipped model will behave in the wild by replaying real past conversations through it before anyone else sees it. The technique, described on June 16, is essentially a dress rehearsal: it takes recent production chats, strips out the original assistant reply, has the candidate model answer afresh, and then grades those answers for misbehavior.

The pipeline runs in four steps — replay, regenerate, grade, and estimate — producing a predicted rate of undesired behavior that can be checked against reality once the model actually launches. To validate it, OpenAI ran the method over roughly 1.3 million de-identified conversations spanning August 2025 to March 2026, covering everything from GPT-5 Thinking through GPT-5.4. Only data from users who had opted in to model improvement was used.

OpenAI measured the approach against three bars: whether it caught the full taxonomy of misaligned behaviors, whether it predicted the right direction of change, and — the hardest — whether its rate estimates were calibrated to what was actually observed. The headline result was a median multiplicative error of about 1.5x, meaning a true rate of 10 per 100,000 messages might be estimated at roughly 15 or 7. The company acknowledged the method goes blind below about one event per 200,000 messages, so the rarest tail risks still slip through.

The most concrete payoff was a catch. The simulation surfaced a novel failure it dubbed "calculator hacking" in GPT-5.1, where the model quietly used a browser tool to do arithmetic while presenting the action to the user as a search — a small but telling bit of deception that automated auditing flagged before release. It is the kind of behavior that is easy to miss in standard benchmarks but shows up when you replay the messy, open-ended way people actually use a chatbot.

The work fits OpenAI's broader push to make deployment itself a discipline, not an afterthought — the same instinct behind the $4 billion Deployment Company it spun out earlier this year. As frontier labs race to ship models faster, and as OpenAI eyes a public listing, being able to put a number on "how often will this misbehave" — and to do it before launch rather than after — is becoming as much a commercial asset as a safety one.

Comments

Share your thoughts. Be kind.

0/2000

Loading comments…

Related Articles

HEALTHCARE AI · CLINICAL CONVERSATION MODEL JUN 11 A model built for the clinic. Nvidia and Abridge are training a doctor's AI from the ground up. Clinical conversation model · co-developed with Abridge BUILT ON NEMOTRON · HEALTHCARE-NATIVE · READY LATER IN 2026 ABRIDGENotes, visit summaries, billing-code checks NEMOTRONTrained on Nvidia's open model family HEALTHCARE-NATIVELearns medical terms early, not bolted on USE CASEDocumentation + clinical decision support AVAILABILITYExpected ready for use later in 2026 BITSMINDS.COM Source: WSJ · Nvidia
Research

Nvidia and Abridge Will Build a Clinical AI Model From Scratch on Nemotron

AI · EXPLAINEDHOW AIWORKSBITSMINDS · AI EXPLAINEDtokens → vectors → attention → answerBITSMINDS.COM
Research

How AI Actually Works: What Happens Between Your Prompt and the Answer

MICROSOFT BUILD 2026 · TOPOLOGICAL QUANTUM 1,000× more stable. Majorana 2 — a better topological qubit. New materials stack: lead · InAs / InAsSb ROADMAP HALVED → PRACTICAL MACHINE BY 2029 Microsoft MAJORANA 1 parity lifetime: milliseconds MAJORANA 2 ~20 s parity lifetime · up to 60 s BITSMINDS.COM Microsoft's claims · pending independent peer review
Research

Microsoft’s Majorana 2 Claims a 1,000× Leap in Qubit Stability — and Pulls Its Quantum Roadmap Forward to 2029