Xiaomi Pushes a 1-Trillion-Parameter Model Past 1,000 Tokens a Second — on Eight Off-the-Shelf GPUs

MiMo-V2.5-Pro-UltraSpeed broke the 1,000 tokens-per-second barrier on a one-trillion-parameter model using a single eight-GPU commodity node, roughly 15x faster than GPT-5.5 or Claude Opus and without any custom silicon.

Xiaomi, working with the inference-systems startup TileRT, has run a one-trillion-parameter language model at more than 1,000 tokens per second, the first time anyone has cleared that barrier at trillion-parameter scale. The new MiMo-V2.5-Pro-UltraSpeed, unveiled on June 8 and opened for an API trial on June 9, peaks near 1,200 tokens a second, and it does so on a single standard eight-GPU commodity node, not a rack of exotic accelerators.

The speed gap over today's flagship chatbots is stark. By Xiaomi's own framing, GPT-5.5 generates roughly 68 tokens per second, Claude Opus 4.6 about 71, and even Google's snappy Gemini Flash around 192. At 1,000 tokens per second, UltraSpeed is therefore roughly 14 to 15 times faster than the leading reasoning models from OpenAI and Anthropic, and about five times faster than Gemini Flash (closer to six at its 1,200-token peak). What makes the claim notable is not just the number but the hardware: rivals such as Cerebras hit comparable figures with wafer-scale custom silicon, and Groq leans on its own SRAM-heavy chip design. Xiaomi insists it reached the milestone on GPUs you could rent from a cloud provider tonight.
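The ratios are simple division, and a few lines of Python make them easy to check. This is a minimal sketch: the rates are the ones quoted above, attributed to Xiaomi, while the function names are illustrative and not from any Xiaomi tooling.

```python
# Decode rates in tokens per second, as quoted in the article (Xiaomi's framing).
RATES_TPS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster fast_tps is than slow_tps."""
    return fast_tps / slow_tps


def speedup_table(ultraspeed_tps: float) -> dict[str, float]:
    """Speedup of UltraSpeed over each quoted model, rounded to one decimal."""
    return {
        name: round(speedup(ultraspeed_tps, tps), 1)
        for name, tps in RATES_TPS.items()
    }


if __name__ == "__main__":
    # 1,000 tok/s is the headline figure; 1,200 tok/s is the reported peak.
    for label, tps in (("headline", 1000), ("peak", 1200)):
        print(label, speedup_table(tps))
```

Run against the 1,000-token headline figure, the table gives 14.7x for GPT-5.5, 14.1x for Claude Opus 4.6 and 5.2x for Gemini Flash.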

The trick is what Xiaomi calls "extreme model-system codesign," stacked in three layers. First, it selectively quantizes only the Mixture-of-Experts weights to four-bit precision (the MXFP4 format) while keeping the rest of the network at higher precision, using quantization-aware training so accuracy barely moves. Second, a speculative-decoding scheme it calls DFlash predicts whole masked blocks of tokens in a single forward pass; Xiaomi reports acceptance lengths, the number of drafted tokens kept per verification step, of 6.3 tokens on coding tasks and 5.6 on math and reasoning. Third, a runtime engine called TileRT keeps a persistent kernel resident on the GPU and uses warp specialization to drive operators at microsecond scale. Together they lift throughput roughly tenfold over the standard MiMo-V2.5-Pro baseline.
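To see why acceptance length matters so much, it helps to look at a generic draft-and-verify loop. The sketch below is plain greedy speculative decoding on toy integer tokens, not DFlash itself, whose internals the announcement does not spell out. The `toy_target` and `toy_draft` functions are hypothetical stand-ins, and a real system would verify the whole block in one forward pass rather than position by position.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]              # hypothetical: target model's greedy next token
DraftBlock = Callable[[List[int], int], List[int]]  # hypothetical: drafter proposing k tokens


def speculative_step(prefix: List[int], draft: DraftBlock,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step."""
    emitted: List[int] = []
    for token in draft(prefix, k):
        expected = target(prefix + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, stop
            return emitted
        emitted.append(token)         # draft token matches the target: accept it
    emitted.append(target(prefix + emitted))  # whole block accepted: one bonus token
    return emitted


def generate(prompt: List[int], draft: DraftBlock, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; return them with the number of verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[len(prompt):len(prompt) + max_new], steps


def toy_target(context: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int], k: int) -> List[int]:
    """Toy drafter: follows the same rule but gets every multiple of 5 wrong."""
    block, cur = [], context[-1]
    for _ in range(k):
        cur = (cur + 1) % 10
        block.append(cur if cur % 5 != 0 else -1)
    return block


if __name__ == "__main__":
    tokens, steps = generate([0], toy_draft, toy_target, k=8, max_new=20)
    print(tokens, steps, len(tokens) / steps)
```

The output is identical to what the target would produce one token at a time; the drafter only changes how many target passes are needed. Each verification pass here yields several tokens instead of one, which is the effect that reported acceptance lengths of about six are meant to capture. Definitions differ on whether the bonus token counts toward acceptance length.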

Raw speed at this level changes what an AI system can do, not just how fast it feels. At 1,000 tokens per second a model can explore dozens of reasoning paths in parallel — best-of-N sampling or tree search — and quietly verify and self-correct its own work in the background, rather than committing to a single answer. Xiaomi pitches the mode at latency-sensitive workloads like real-time agent loops, fraud detection, and trading-signal generation. UltraSpeed is priced at three times the standard MiMo-V2.5-Pro rate for roughly ten times the output speed, and it is API-only, with a gated, application-based trial running from June 9 to June 23.
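A rough way to see the difference is to count how many complete answers fit inside a fixed latency budget and then pick the best one. The sketch below makes two assumptions that are not in Xiaomi's announcement: it treats each quoted rate as the throughput of a single stream, and the answer length, budget and scoring function are made-up illustrative values.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def candidates_in_budget(tokens_per_second: float, tokens_per_answer: int,
                         budget_seconds: float) -> int:
    """Number of complete answers that can be generated within the time budget."""
    return int(tokens_per_second * budget_seconds // tokens_per_answer)


def best_of_n(candidates: Iterable[T], score: Callable[[T], float]) -> T:
    """Best-of-N selection: return the candidate with the highest score."""
    return max(candidates, key=score)


if __name__ == "__main__":
    # Illustrative workload: 500-token answers under a 10-second budget.
    for name, tps in (("UltraSpeed", 1000), ("GPT-5.5", 68)):
        print(name, candidates_in_budget(tps, 500, 10))

    # Toy scorer (hypothetical): prefer the candidate closest to 8.
    print(best_of_n([3, 9, 4], score=lambda x: -abs(x - 8)))
```

Under those assumptions a 1,000-token-per-second stream finishes 20 candidate answers in the time a 68-token-per-second stream finishes one, which is the room that best-of-N sampling and background verification need.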

Xiaomi has paired the announcement with an open release: a checkpoint named MiMo-V2.5-Pro-FP4-DFlash is on Hugging Face, and the TileRT runtime modules are posted on GitHub. The caveats are real, though. Access at launch is time-limited and approval-based, independent third-party verification of the speed figures is not yet public, and the company concedes that DFlash's acceptance length, and therefore the headline throughput, is lower in open-ended, conversational use than in the structured coding and reasoning tasks where it shines.

Why It Matters

For the inference-hardware race: Cerebras and Groq built their pitch on the premise that breakneck speed requires custom silicon. A trillion-parameter model hitting 1,000 tokens per second on commodity GPUs undercuts that premise and hands the same lever to anyone with eight cards and the right software stack.

For agentic AI: When a single answer is effectively free in wall-clock terms, the economical move is to generate many and pick the best. UltraSpeed pushes the field further toward parallel search and self-verification as the default, rather than one-shot generation.

For China's AI ambitions: A consumer-electronics company most people associate with phones and rice cookers just leapfrogged specialist inference firms on a systems-engineering problem — a reminder that the edge in this race is increasingly about software-hardware codesign, not only model size.
