Google TurboQuant Slashes LLM Memory 6x with Zero Accuracy Loss
Google Research unveils TurboQuant, a training-free algorithm that compresses the KV cache of large language models by up to 6x at 3-bit precision — with no retraining, no calibration data, and no measurable quality drop.
Google Research has unveiled TurboQuant, a breakthrough vector quantization algorithm that compresses the Key-Value (KV) cache of large language models down to 3–4 bits per element, achieving a 6x memory reduction with virtually zero accuracy loss. The paper (arXiv:2504.19874) was accepted at ICLR 2026, where it will be formally presented in Rio de Janeiro on April 25.
The KV cache is one of the most stubborn bottlenecks in LLM inference: it grows linearly with sequence length and can consume tens of gigabytes of GPU memory for long-context workloads. Most prior compression approaches require fine-tuning or calibration data, limiting their practicality. TurboQuant sidesteps both requirements through a two-stage pipeline. First, a random orthogonal rotation is applied to each KV vector, spreading its energy evenly across all coordinates; after the rotation, every coordinate follows a predictable, approximately Gaussian distribution that is known before any data is seen. That known distribution is what lets the second stage work without calibration: each coordinate is encoded using quantization levels precomputed offline with the Lloyd-Max algorithm, which minimizes mean squared error for a given distribution, so values can be represented accurately even at very low bit widths.
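To make the two stages concrete, the snippet below is a minimal NumPy sketch of that general recipe: a random orthogonal rotation followed by snapping each coordinate to Lloyd-Max levels precomputed for a standard Gaussian, with one floating-point scale stored per vector. The function names, the dense rotation matrix, and the per-vector scale are illustrative assumptions, not the paper's optimized implementation.

```python
import numpy as np

def lloyd_max_levels(num_levels, num_samples=200_000, iters=50, seed=0):
    """Precompute MSE-optimal quantization levels for a standard Gaussian
    via Lloyd's algorithm (1-D k-means on Gaussian samples). This runs once,
    offline -- no calibration data from the model is involved."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(num_samples)
    # Start from evenly spaced Gaussian quantiles, then alternate between
    # assigning samples to their nearest level and recentring each level.
    levels = np.quantile(samples, (np.arange(num_levels) + 0.5) / num_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels

def quantize_vector(x, rotation, levels):
    """Rotate the vector, normalize it, and snap each coordinate to the
    nearest precomputed level. Returns integer codes plus one scale."""
    d = x.shape[0]
    y = rotation @ x                         # spread energy across coordinates
    scale = np.linalg.norm(x) / np.sqrt(d)   # rotated coords ~ N(0, scale^2)
    codes = np.abs(y[:, None] / scale - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_vector(codes, scale, rotation, levels):
    """Invert the pipeline: look up levels, rescale, undo the rotation."""
    return rotation.T @ (levels[codes] * scale)

# Toy demo: a 128-dim "key" vector quantized to 3 bits (8 levels) per coordinate.
d = 128
rng = np.random.default_rng(1)
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
levels = lloyd_max_levels(num_levels=8)                   # 3-bit codebook
x = rng.standard_normal(d)
codes, scale = quantize_vector(x, rotation, levels)
x_hat = dequantize_vector(codes, scale, rotation, levels)
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```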
The results are striking. At 3.5-bit precision, TurboQuant matches full 32-bit floating-point quality on standard benchmarks with no measurable drop. At 4-bit precision, it delivers up to an 8x speedup in attention logit computation on an H100 GPU compared to 32-bit keys, while shrinking the KV cache memory footprint by 6x. The algorithm requires no training data, no calibration, and no model-specific tuning, making it compatible with any transformer architecture out of the box.
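For a sense of scale, the back-of-envelope calculation below sizes a KV cache at different bit widths. The model shape (32 layers, 8 KV heads of dimension 128) and the 128,000-token context are hypothetical values chosen for illustration, not figures from the paper, and the nominal bit width ignores the small overhead of per-vector scales.

```python
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits_per_elem=16):
    """Total KV cache size in GiB: keys + values across all layers and heads."""
    elems = 2 * n_layers * seq_len * n_kv_heads * head_dim
    return elems * bits_per_elem / 8 / 2**30

# Hypothetical 128k-token context on a hypothetical 32-layer model.
for bits in (32, 16, 3.5):
    size = kv_cache_gib(seq_len=128_000, bits_per_elem=bits)
    print(f"{bits:>4} bits/elem -> {size:5.1f} GiB")
```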
The broader implications extend well beyond raw inference speed. A cheaper KV cache makes longer context windows economically viable on commodity hardware, lowers inference costs for cloud providers, and shortens time-to-first-token for end users. TechCrunch noted that developers are already comparing TurboQuant to Pied Piper, the fictional compression algorithm from Silicon Valley, and the comparison has stuck in the community. Multiple open-source PyTorch and Triton implementations have appeared on GitHub ahead of Google's official code release, expected alongside the ICLR presentation later this month.