Alibaba's Qwen3.7-Max Lands With a 1M-Token Context, AA Index 56.6, and a 35-Hour Agent Run
Qwen3.7-Max-Preview ships with a 1M-token context, extended-thinking mode, and benchmark gains across CritPt, Humanity's Last Exam, and Terminal-Bench Hard — clearing Gemini 3.5 Flash on the AA Intelligence Index.
Alibaba's Qwen team formally unveiled Qwen3.7-Max-Preview at the Alibaba Cloud Summit on May 20, 2026, two days after it landed on the company's API platform and roughly a week after it stealth-debuted on the LM Arena leaderboard. The proprietary reasoning model doubles the prior generation's context window to 1 million tokens, ships an extended-thinking mode that generated roughly 97 million reasoning tokens during evaluation, and is pitched explicitly as an agentic workhorse capable of sustaining "hundreds or even thousands" of tool calls in a single run.
On the Artificial Analysis Intelligence Index v4.0, Qwen3.7-Max scores 56.6 — a 4.8-point jump over Qwen3.6 Max Preview (51.8), and the highest mark ever posted by a Chinese model on that leaderboard. It sits fifth overall, ahead of Google's Gemini 3.5 Flash (55.3) and within striking distance of Gemini 3.1 Pro Preview (57.2) and Anthropic's Claude Opus 4.7 (57.3). OpenAI's GPT-5.5 still leads the pack at 60.2.

The gains are concentrated in the benchmarks most relevant to autonomous workflows. CritPt climbed 9.7 percentage points (3.7% to 13.4%), Humanity's Last Exam jumped 9.2 points (28.9% to 38.1%), and Terminal-Bench Hard, a demanding proxy for shell-driven agent reliability, rose 6.9 points (43.9% to 50.8%). On coding-specific benchmarks, Alibaba reports Terminal Bench 2.0-Terminus at 69.7, SWE-Verified at 80.4, and SWE-Multilingual at 78.3, scores that put the model inside the frontier band on the measures most relevant to a software-engineering agent.

Agentic capability is the headline pitch. Qwen3.7-Max scores 60.8 on MCP-Mark, 76.4 on MCP-Atlas, and 87.0 on SpreadSheetBench-v1. Alibaba's own demos lean even harder: a 35-hour autonomous kernel-optimization run reportedly produced a 10x inference speedup, and the model is said to have sustained more than 1,000 sequential tool calls without losing track of its task. Those numbers come from internal testing and have not been independently verified, but the trajectory matches what frontier US labs have been claiming since GPT-5 and Claude Opus 4 shipped.

Reasoning is the other place where Qwen3.7-Max distinguishes itself from earlier open-weight flagships out of China. GPQA Diamond hits 92.4, ahead of Claude Opus 4.6 Max's 91.3, and on the Apex reasoning benchmark Qwen3.7-Max posts 44.5 against DeepSeek V4 Pro's 38.3. HMMT 2026 Feb climbs to 97.1. The trade-off, flagged by Artificial Analysis, is that the model now abstains more often on the AA-Omniscience knowledge battery: its attempt rate fell from 67.3% to 48.0%, consistent with a model tuned to say "I don't know" rather than confidently hallucinate.

Access is rolling out via Alibaba Cloud Model Studio (DashScope) with OpenAI- and Anthropic-compatible endpoints, and the chat interface is live at chat.qwen.ai. Public pricing on OpenRouter currently lists $2.50 per million input tokens and $7.50 per million output tokens. Compared with Qwen3.6 Max Preview ($1.30 / $7.80), that nearly doubles the input price while trimming the output price slightly, and it remains well below the Western frontier. The closed-weights "Preview" tag means terms and behavior may shift, and there is no open release planned. Even so, Qwen3.7-Max is the clearest signal yet that the gap between US and Chinese frontier labs is now measured in months and tenths of an Index point, not generations.
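
To make the pricing comparison concrete, here is a minimal Python sketch that turns the per-million-token prices quoted above into a cost per run. The model keys, the `run_cost` helper, and the 800,000-input / 200,000-output token workload are illustrative assumptions, not figures from Alibaba or OpenRouter.

```python
# USD per million tokens (input, output), as quoted above from OpenRouter.
# The dictionary keys are labels for this sketch, not official model IDs.
PRICES = {
    "qwen3.7-max-preview": (2.50, 7.50),
    "qwen3.6-max-preview": (1.30, 7.80),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agent run: 800k input tokens, 200k output tokens.
for model in PRICES:
    print(f"{model}: ${run_cost(model, 800_000, 200_000):.2f}")
```

On that assumed workload the newer model costs about $3.50 against $2.60 for its predecessor. At these list prices the two only break even when output tokens reach four times the input tokens, so input-heavy agent runs bear most of the increase.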