Together AI OSCAR KV cache quantization
Together AI open-sourced OSCAR, an INT2 KV cache quantization method that uses attention-aware covariance structure to compress key-value caches in long-context LLM serving. At 2.28 bits per element, OSCAR achieves an 8× KV memory reduction and up to 3× decode speedup, while keeping the accuracy gap versus BF16 to 1.42–3.78 points on Qwen3 models at 100K context length.
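To make the memory figures concrete, here is a back-of-the-envelope sketch of KV cache size at 100K tokens. The layer count, KV head count, and head dimension are hypothetical values chosen for illustration and are not taken from the OSCAR release. The function `kv_cache_bytes` is likewise an illustrative helper, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2 covers both the key tensor and the value tensor.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


# Hypothetical model shape (assumption, for illustration only).
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=100_000)

bf16_bytes = kv_cache_bytes(**cfg, bits_per_element=16)
low_bit_bytes = kv_cache_bytes(**cfg, bits_per_element=2.28)

print(f"BF16 cache:     {bf16_bytes / 1e9:.2f} GB")
print(f"2.28-bit cache: {low_bit_bytes / 1e9:.2f} GB")
print(f"ratio at 2.28 bits: {bf16_bytes / low_bit_bytes:.2f}x")
print(f"ratio at a nominal 2 bits: {16 / 2:.0f}x")
```

The ratio depends only on the bit widths, not on the model shape. The 8× figure appears to correspond to the nominal 16-bit to 2-bit ratio; plugging in the stated 2.28 bits per element gives roughly 7×.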
- OSCAR derives separate rotations for keys and values from offline attention-aware covariance, unlike prior data-oblivious Hadamard transforms
- Achieves 2.28 bits per KV element with INT2 quantization
- 8× KV memory reduction and up to 3× decode speedup at 100K context length
- Accuracy gap versus BF16 of 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B
- Released as open source by Together AI in May 2026
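The general pattern behind the points above (rotate with a data-derived orthogonal matrix, then quantize to 2 bits) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not OSCAR's actual algorithm: the rotation here is simply the eigenbasis of a plain sample covariance (a stand-in for the attention-aware covariance described above), the quantizer is a generic per-group asymmetric INT2 scheme, and all function names, the group size, and the synthetic calibration data are hypothetical.

```python
import numpy as np


def covariance_rotation(samples):
    """Orthogonal matrix from the eigenvectors of the sample covariance.

    Stand-in for a data-derived rotation; NOT OSCAR's derivation.
    """
    cov = np.cov(samples, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # columns are orthonormal
    return eigvecs


def quantize_int2(x, group_size=32):
    """Asymmetric 2-bit quantization: 4 levels per group, with a scale and offset."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)  # 3 steps between 4 levels
    codes = np.clip(np.round((groups - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize_int2(codes, scale, lo, shape):
    """Map 2-bit codes back to floats and restore the original shape."""
    return (codes * scale + lo).reshape(shape)


rng = np.random.default_rng(0)
head_dim = 64
mix = rng.standard_normal((head_dim, head_dim))

# Keys and values each get their own calibration set and their own rotation.
results = {}
for name in ("keys", "values"):
    calib = rng.standard_normal((2048, head_dim)) @ mix  # offline calibration data
    cache = rng.standard_normal((256, head_dim)) @ mix   # tensor to compress
    rot = covariance_rotation(calib)
    rotated = cache @ rot                                # rotate, then quantize
    codes, scale, lo = quantize_int2(rotated)
    approx_rotated = dequantize_int2(codes, scale, lo, rotated.shape)
    recon = approx_rotated @ rot.T                       # undo the rotation
    results[name] = (rot, cache, rotated, codes, scale, approx_rotated, recon)
    print(name, "relative error:",
          np.linalg.norm(recon - cache) / np.linalg.norm(cache))
```

Because the rotation is orthogonal, it can be undone exactly with its transpose, so all reconstruction error comes from the 2-bit rounding step. In this sketch the per-group scale and offset are stored alongside the codes, which is one reason an effective bit rate can exceed 2 bits per element.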