Models·3 min read·MarkTechPost

Xiaomi Pushes a 1-Trillion-Parameter Model Past 1,000 Tokens a Second — on Eight Off-the-Shelf GPUs

MiMo-V2.5-Pro-UltraSpeed broke the 1,000 tokens-per-second barrier on a one-trillion-parameter model using a single eight-GPU commodity node — roughly 15x faster than GPT-5.5 or Claude Opus, and without a line of custom silicon.

Xiaomi Pushes a 1-Trillion-Parameter Model Past 1,000 Tokens a Second — on Eight Off-the-Shelf GPUs
Share:

Xiaomi, working with the inference-systems startup TileRT, has done something no one had managed before: it ran a one-trillion-parameter language model at more than 1,000 tokens per second. The new MiMo-V2.5-Pro-UltraSpeed, unveiled on June 8 and opened for an API trial on June 9, decodes at over 1,000 tokens a second and peaks near 1,200 — and it does so on a single standard eight-GPU commodity node, not a rack of exotic accelerators. It is the first time anyone has cleared the 1,000-token-per-second barrier at trillion-parameter scale.

The speed gap over today's flagship chatbots is stark. By Xiaomi's own framing, GPT-5.5 generates roughly 68 tokens per second, Claude Opus 4.6 about 71, and even Google's snappy Gemini Flash around 192. UltraSpeed is therefore roughly 14 to 15 times faster than the leading reasoning models from OpenAI and Anthropic, and about six times faster than Gemini Flash. What makes the claim notable is not just the number but the hardware: rivals such as Cerebras hit comparable figures with wafer-scale custom silicon, and Groq leans on its own SRAM-heavy chip design. Xiaomi insists it reached the milestone on GPUs you could rent from a cloud provider tonight.

The trick is what Xiaomi calls "extreme model-system codesign," stacked in three layers. First, it selectively quantizes only the Mixture-of-Experts weights to four-bit precision (the MXFP4 format) while keeping the rest of the network at higher precision, using quantization-aware training so accuracy barely moves. Second, a speculative-decoding scheme it calls DFlash predicts whole masked blocks of tokens in a single forward pass, reporting acceptance lengths of 6.3 tokens on coding tasks and 5.6 on math and reasoning. Third, a runtime engine called TileRT keeps a persistent kernel resident on the GPU and uses warp specialization to drive operators at microsecond scale. Together they lift throughput roughly tenfold over the standard MiMo-V2.5-Pro baseline.

Raw speed at this level changes what an AI system can do, not just how fast it feels. At 1,000 tokens per second a model can explore dozens of reasoning paths in parallel — best-of-N sampling or tree search — and quietly verify and self-correct its own work in the background, rather than committing to a single answer. Xiaomi pitches the mode at latency-sensitive workloads like real-time agent loops, fraud detection, and trading-signal generation. UltraSpeed is priced at three times the standard MiMo-V2.5-Pro rate for roughly ten times the output speed, and it is API-only, with a gated, application-based trial running from June 9 to June 23.

Xiaomi has paired the announcement with an open release: a checkpoint named MiMo-V2.5-Pro-FP4-DFlash is on Hugging Face, and the TileRT runtime modules are posted on GitHub. The caveats are real, though. Access at launch is short and approval-based, independent third-party verification of the speed figures is not yet public, and the company concedes that DFlash's acceptance length — and therefore the headline throughput — drops in open-ended, conversational use rather than the structured coding and reasoning tasks where it shines.

Comments

Share your thoughts. Be kind.

0/2000

Loading comments…

Related Articles

RUMOREDCLAUDEFABLE 5PUBLIC MYTHOS?claude-fable-5 · rumoredUNVERIFIEDMODEL CARD · LEAKEDclaude-fable-5ArchitectureClaude 5Long contextextendedMulti-turnimprovedCyber toolingExploit devUNLOCKINGwith guardrailsBITSMINDS.COMReporting: Alex Heath
Models

Claude Fable 5: The Internet Is Sure It Ships Today. Anthropic Hasn't Said a Word.

OPEN MODELS · CODING · 1M CONTEXTMINIMAX M3 · JUNE 1MiniMax M3Open-weight · native multimodal · MSA attention~1/20 compute at 1M ctx · 9× faster prefill · 15× decodeBENCHMARK SCORECARD · MINIMAX-REPORTEDSWE-Bench Pro59%Terminal-Bench 2.166%SWE-fficiency34.8%BrowseComp83.5BROWSECOMPBeats Opus 4.783.5 vs 79.3Weights + technical report promised within 10 days · scores not yet independently verifiedBITSMINDS.COMSource: The Decoder · MiniMax
Models

MiniMax M3 Lands as an Open-Weight, Million-Token Coding Model That Claims to Edge Out GPT-5.5

REPORTEDLY LEAKEDCLAUDEOCEANUSANTHROPIC · MYTHOS LINEclaude-oceanus-v1-p · surfaced June 3 · unconfirmedBITSMINDS.COMSource: Cyber Security News
Models

Anthropic's Unreleased 'Claude Oceanus' Reportedly Leaked Through a Chinese Proxy Hours After Reaching Red Teamers