Research·3 min read·OpenAI

OpenAI’s “AI Chemist” Improved a Reaction Drug Makers Had Nearly Given Up On

In a project detailed June 17, OpenAI and Molecule.one paired GPT-5.4 with an autonomous lab to improve a stubborn Chan-Lam coupling, lifting yields for 88% of boronic acids and 83% of sulfonamides tested, with 8 of 14 validated reactions more than doubling.

OpenAI and the chemistry-automation startup Molecule.one say they have run a research project in which an AI system did most of the work of a medicinal chemist — reading the literature, dreaming up experiments, ranking them, and then driving the lab robots that carried them out. Detailed on June 17, the collaboration paired OpenAI's GPT-5.4 with Molecule.one's autonomous lab platform, nicknamed Maria, to tackle a reaction that has frustrated drug chemists for years.

The target was a stubborn version of the Chan-Lam coupling, a workhorse method for stitching together pharmaceutically relevant molecules. The specific variant, the coupling of primary sulfonamides with boronic acids, has historically delivered such low yields that chemists often avoid it, even though it would open up useful chemical space. Improving it is exactly the kind of tedious, many-variable optimization problem where it is hard to know in advance which knob to turn.

According to the writeup, GPT-5.4 reviewed prior studies, generated and scored a slate of research proposals, helped design the experiments, interpreted the data coming back from the bench, and suggested follow-ups — while human chemists stayed in the loop to choose which proposals to test and to validate the final result. The combined system worked through the problem over roughly two and a half months, with another half month for the human team to write everything up.
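The division of labor described above (the model proposes and ranks, chemists choose, the lab runs, results feed back in) can be sketched as a simple loop. This is an illustrative sketch only, not OpenAI's or Molecule.one's code: every name here (`Proposal`, `run_campaign`, the `toy_*` helpers) is hypothetical, and the scores and yields are placeholder values, not data from the project.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Proposal:
    name: str     # label for a set of reaction conditions (hypothetical)
    score: float  # model-assigned priority; higher means more promising


def run_campaign(
    propose: Callable[[Dict[str, float]], List[Proposal]],
    approve: Callable[[Proposal], bool],
    run_experiment: Callable[[Proposal], float],
    rounds: int,
    per_round: int,
) -> Dict[str, float]:
    """Propose, rank, get human approval, run, then feed results back."""
    history: Dict[str, float] = {}  # proposal name -> measured yield (%)
    for _ in range(rounds):
        candidates = propose(history)  # the model sees earlier results
        ranked = sorted(candidates, key=lambda p: p.score, reverse=True)
        # Chemists stay in the loop: only approved, untested proposals run.
        chosen = [p for p in ranked if p.name not in history and approve(p)]
        chosen = chosen[:per_round]
        if not chosen:
            break
        for p in chosen:
            history[p.name] = run_experiment(p)  # stands in for the lab
    return history


# Toy stand-ins with placeholder numbers, for illustration only.
TOY_YIELDS = {"condition_A": 10.0, "condition_B": 35.0,
              "condition_C": 20.0, "condition_D": 5.0}


def toy_propose(history: Dict[str, float]) -> List[Proposal]:
    return [Proposal("condition_A", 0.2), Proposal("condition_B", 0.9),
            Proposal("condition_C", 0.6), Proposal("condition_D", 0.8)]


def toy_approve(p: Proposal) -> bool:
    return p.name != "condition_D"  # a chemist vetoes one proposal


def toy_run(p: Proposal) -> float:
    return TOY_YIELDS[p.name]


history = run_campaign(toy_propose, toy_approve, toy_run, rounds=2, per_round=1)
best = max(history, key=history.get)
print(history, best)
```

In this toy run the highest-scored proposal is tested first, the vetoed one never reaches the lab, and the second round moves on to the next approved candidate. A real campaign would replace each stand-in with a model call, a human review step, and a robotic run.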

The numbers were encouraging. Across the broader screen, yields improved for 88% of the boronic acids and 83% of the sulfonamides tested. When human chemists hand-validated 14 representative reactions, 11 came back with higher yields — and 8 of those more than doubled. For a reaction that medicinal chemists had largely written off as unreliable, that is a meaningful jump, and the kind of incremental win that quietly widens what is synthesizable.
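For readers who want the validation counts as shares, the arithmetic is short. The snippet below uses only the counts reported above (14 validated, 11 improved, 8 more than doubled); the helper name `share` is just for illustration.

```python
def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


validated = 14  # reactions hand-validated by chemists
improved = 11   # of those, reactions with higher yields
doubled = 8     # of those, reactions whose yield more than doubled

print(f"improved:              {share(improved, validated)}% of validated")
print(f"more than doubled:     {share(doubled, validated)}% of validated")
print(f"doubled, among improved: {share(doubled, improved)}%")
```

So roughly four in five of the validated reactions improved, and a little over half of all 14 more than doubled.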

Not everyone was dazzled. On Hacker News, working chemists noted that the setup looks a lot like classic high-throughput screening with a smart optimization engine bolted on, and pushed back on the "AI chemist" branding — pointing out that the model proposes and ranks but does not, on its own, understand chemistry the way a trained scientist does. Even granting the skepticism, the project is a concrete data point in a larger story: frontier labs are trying to fold their models into the full scientific loop — hypothesis, experiment, analysis, iteration — rather than treating them as glorified search boxes.

It also lands as OpenAI leans harder into science as a proving ground for its models, a theme running through its recent work on life-sciences benchmarks and lab automation. The pitch is no longer just that a model can pass a chemistry exam, but that it can sit in a real lab, run a real campaign, and leave behind a result a human chemist would be glad to publish.
