OpenAI's LifeSciBench Puts AI Through a 750-Task Life-Science Exam — Top Model Passes Just 36%
OpenAI's new 750-task benchmark grades AI on real life-science research judgment. The best model, GPT-Rosalind, passed only 36% — and 22.8% of tasks stumped every model tested.
OpenAI on June 17 released LifeSciBench, a sprawling new benchmark that grades AI systems on the kind of messy, judgment-heavy work that real life-science research actually demands — and the early scores suggest the most advanced models still have a long way to go. The benchmark comprises 750 expert-authored tasks, and even the strongest model tested passed only about a third of them.
Unlike most biology benchmarks, which lean on narrow, fact-based questions with clean answers, LifeSciBench is built around free-response problems written the way one scientist would brief a colleague. The 750 tasks span seven workflows — evidence handling, analysis, design and optimization, scientific reasoning, validation, translation, and scientific communication — across seven biological domains including genomics, medicinal chemistry, and clinical science. Roughly 79% of the tasks require multiple reasoning steps, averaging four steps each, and many come bundled with real artifacts: the set ships with 1,062 attached sequences, figures, tables, PDFs, and chemical structures.
Scoring is rubric-based rather than multiple-choice. The benchmark defines 19,020 individual criteria, about 25 per task, each rewarding a specific fact, reasoning step, or numeric answer. Results are summarized two ways: a normalized rubric score that awards partial credit, and a stricter task pass rate that counts only tasks on which a model's rubric score clears 70%. To build the benchmark, OpenAI enlisted a cohort of 173 Ph.D.-holding scientists to author tasks and 453 reviewers (97% with doctorates) to validate them, reaching 96% agreement on relevance and usefulness.
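To make the difference between the two summary numbers concrete, here is a minimal Python sketch of how a partial-credit score and a thresholded pass rate could be computed from rubric results. It is an illustration, not OpenAI's grading code: the `Criterion` class, the per-criterion `points` weights, the choice to average task scores across the benchmark, and treating exactly 70% as a pass are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item. `points` is a hypothetical weight, not from the source."""
    points: float
    met: bool


def normalized_score(criteria: list[Criterion]) -> float:
    """Fraction of available rubric points earned on one task (partial credit)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("a task needs at least one criterion with positive points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def summarize(tasks: list[list[Criterion]], pass_threshold: float = 0.70) -> tuple[float, float]:
    """Return (mean normalized score, task pass rate).

    Assumption: a task passes when its normalized score is at least
    `pass_threshold`; the article only says tasks must "clear 70%".
    """
    scores = [normalized_score(task) for task in tasks]
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(score >= pass_threshold for score in scores) / len(scores)
    return mean_score, pass_rate


# Two toy tasks with four equally weighted criteria each.
task_a = [Criterion(1, True), Criterion(1, True), Criterion(1, True), Criterion(1, False)]
task_b = [Criterion(1, True), Criterion(1, True), Criterion(1, False), Criterion(1, False)]

mean_score, pass_rate = summarize([task_a, task_b])
print(mean_score, pass_rate)
```

In this toy case the first task earns 75% of its rubric and passes, while the second earns 50% and fails, so the mean score is 0.625 but the pass rate is only 0.5. The same gap explains how a model can post a normalized score above 0.5 while passing far fewer than half of the tasks.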
On the leaderboard, OpenAI's research-focused GPT-Rosalind led with a 0.576 normalized score and a 36.1% pass rate, ahead of GPT-5.5 (0.519, 25.7%), Google's Gemini 3.1 Pro (0.515, 23.6%), GPT-5.4 (0.479, 20.7%), and xAI's Grok 4.3 (0.399, 13.0%). The cracks showed most clearly when artifacts were involved: GPT-Rosalind's accuracy fell from 45.1% on text-only tasks to 28.1% when it had to reason over an attached figure or dataset, a drop of 17 percentage points.
Perhaps the most telling number is that 22.8% of the tasks were failed by every model tested — a stark reminder of how much headroom remains between today's frontier systems and the reasoning a working scientist takes for granted. The release lands alongside a companion OpenAI demonstration in which a near-autonomous AI chemist using GPT-5.4 improved a challenging reaction in medicinal chemistry, signaling the lab's growing push to position its models as genuine research collaborators rather than just question-answering tools.