Together AI OSCAR KV cache quantization

2 items · 2 sources · updated 22d ago · trend 0

Together AI open-sourced OSCAR, an INT2 KV cache quantization method that uses attention-aware covariance structure to compress key-value caches in long-context LLM serving. At 2.28 bits per KV element, OSCAR achieves 8× KV memory reduction and up to 3× decode speedup while keeping the accuracy gap relative to BF16 to 1.42–3.78 points on Qwen3 models at 100K context length.
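To put the memory figures in concrete terms, here is a rough back-of-the-envelope sketch of KV cache size at 100K tokens. The model shape (36 layers, 8 KV heads, head dimension 128) and the helper names `kv_cache_bytes` and `compression_ratio` are illustrative assumptions, not taken from the OSCAR release; only the 16-bit baseline, the 2.28 bits per element, and the 100K context come from the summary above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Bytes needed to hold keys and values for one sequence across all layers."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return elements * bits_per_element / 8


def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the cache is than the baseline."""
    return baseline_bits / quantized_bits


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=100_000)

bf16_gb = kv_cache_bytes(**shape, bits_per_element=16) / 1e9
oscar_gb = kv_cache_bytes(**shape, bits_per_element=2.28) / 1e9

print(f"BF16 cache:         {bf16_gb:.2f} GB")
print(f"2.28-bit cache:     {oscar_gb:.2f} GB")
print(f"ratio at 2.28 bits: {compression_ratio(16, 2.28):.2f}x")
print(f"ratio at 2.00 bits: {compression_ratio(16, 2.0):.2f}x")
```

Under these assumptions the cache shrinks from roughly 14.7 GB to roughly 2.1 GB. Note that 16 / 2.28 is about 7.0, while 16 / 2 is exactly 8, so the 8× figure matches the nominal 2-bit payload; how the reduction is counted is not stated here.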

  • OSCAR derives separate rotations for keys and values from offline attention-aware covariance, unlike prior data-oblivious Hadamard transforms
  • Achieves 2.28 bits per KV element with INT2 quantization
  • 8× KV memory reduction and up to 3× decode speedup at 100K context length
  • Accuracy gap relative to the BF16 baseline: 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B
  • Released as open source by Together AI in May 2026
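The rotate-then-quantize idea in the first two bullets can be sketched with the data-oblivious baseline they mention: a fixed Hadamard rotation followed by 2-bit quantization. This is not OSCAR's method. OSCAR is described as replacing the fixed rotation with separate key and value rotations derived from attention-aware covariance, and that derivation is not given here. The function names, the single-group min/max quantizer, and the synthetic key with one outlier channel are all assumptions made for illustration.

```python
import math
import random


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform its own inverse
    return [x * scale for x in out]


def quantize_int2(vec):
    """Asymmetric 2-bit quantization of one group: codes in 0..3 plus step and minimum."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0  # 3 = 2**2 - 1 intervals; fall back if the group is constant
    codes = [min(3, max(0, round((x - lo) / step))) for x in vec]
    return codes, step, lo


def dequantize_int2(codes, step, lo):
    return [c * step + lo for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
dim = 64
key = [random.gauss(0.0, 1.0) for _ in range(dim)]
key[5] = 25.0  # synthetic outlier channel that inflates the quantization range
query = [random.gauss(0.0, 1.0) for _ in range(dim)]

# Rotating query and key with the same orthonormal matrix preserves the attention logit.
rot_key = hadamard_rotate(key)
rot_query = hadamard_rotate(query)
exact_logit = dot(query, key)
rotated_logit = dot(rot_query, rot_key)

# Quantize the key with and without the rotation and compare step sizes.
plain_codes, plain_step, plain_lo = quantize_int2(key)
rot_codes, rot_step, rot_lo = quantize_int2(rot_key)
rot_hat = dequantize_int2(rot_codes, rot_step, rot_lo)
key_hat = hadamard_rotate(rot_hat)  # undo the rotation to recover an approximate key

print(f"logit before/after rotation: {exact_logit:.6f} / {rotated_logit:.6f}")
print(f"quantization step without rotation: {plain_step:.3f}")
print(f"quantization step with rotation:    {rot_step:.3f}")
```

The rotation spreads the outlier's energy across all channels, so the four INT2 levels cover a narrower range and the step size drops. A rotation fitted to the data, as OSCAR's is described, aims to do better than this fixed one.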