Together AI OSCAR KV cache quantization

2 items · 2 sources · updated 22d ago · trend 0

┌─ summary ─────────────────────────────┐

Together AI open-sourced OSCAR, an INT2 KV cache quantization method that uses attention-aware covariance structures to compress key-value caches in long-context LLM serving. At 2.28 bits per element, OSCAR reports an 8× KV memory reduction and up to 3× decode speedup at 100K context length, with accuracy gaps relative to BF16 of 1.42 points on Qwen3-8B and 3.78 points on Qwen3-4B-Thinking-2507.
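To make the memory figures concrete, here is a minimal back-of-the-envelope sketch of KV cache size as a function of bits per element. The function names and the model shape (36 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from the source. Note that 16-bit to 2-bit payloads give exactly 8×; if the 2.28 bits per element figure includes quantization metadata (an assumption, the source does not say), the ratio against BF16 would be closer to 7×.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bits_per_element):
    """Approximate KV cache size in bytes for one sequence."""
    # Factor of 2 accounts for storing both keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * context_len
    return elements * bits_per_element / 8


def compression_ratio(baseline_bits, quantized_bits):
    """Memory reduction factor when moving from baseline_bits to quantized_bits."""
    return baseline_bits / quantized_bits


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128, context_len=100_000)

bf16_bytes = kv_cache_bytes(bits_per_element=16, **shape)
int2_payload_bytes = kv_cache_bytes(bits_per_element=2, **shape)
effective_bytes = kv_cache_bytes(bits_per_element=2.28, **shape)

print(f"BF16:            {bf16_bytes / 1e9:.2f} GB")
print(f"INT2 payload:    {int2_payload_bytes / 1e9:.2f} GB "
      f"({compression_ratio(16, 2):.2f}x)")
print(f"2.28 bits/elem:  {effective_bytes / 1e9:.2f} GB "
      f"({compression_ratio(16, 2.28):.2f}x)")
```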

┌─ key points ──────────────────────────┐

OSCAR derives separate rotations for keys and values from offline attention-aware covariance, unlike prior data-oblivious Hadamard transforms
Achieves 2.28 bits per KV element with INT2 quantization
8× KV memory reduction and up to 3× decode speedup at 100K context length
Accuracy gap relative to the BF16 baseline: 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B
Released as open-source by Together AI in May 2026
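
The rotate-then-quantize idea behind the key points above can be sketched as follows. This is not OSCAR's implementation: it uses a fixed 4×4 normalized Hadamard matrix, i.e. the data-oblivious baseline the summary contrasts OSCAR with, because the source does not describe how the attention-aware rotations are computed. All function names and the toy vector are hypothetical. The sketch only shows why an orthogonal rotation before 2-bit quantization can help when one channel is an outlier.

```python
# 4x4 normalized Hadamard matrix: orthogonal, so its transpose is its inverse.
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def quantize_int2(v):
    """Asymmetric uniform quantization to 4 levels (codes 0..3)."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 3 or 1.0  # avoid division by zero for constant vectors
    codes = [min(3, max(0, round((x - lo) / scale))) for x in v]
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    return [c * scale + lo for c in codes]


def roundtrip_sq_error(v, rotation=None):
    """Squared error after (optional rotate) -> INT2 -> dequantize -> rotate back."""
    rotated = matvec(rotation, v) if rotation else v
    restored = dequantize_int2(*quantize_int2(rotated))
    if rotation:
        restored = matvec(transpose(rotation), restored)
    return sum((a - b) ** 2 for a, b in zip(v, restored))


# Toy vector with one outlier channel (hypothetical data).
x = [8.0, 0.1, -0.2, 0.3]
err_plain = roundtrip_sq_error(x)
err_rotated = roundtrip_sq_error(x, rotation=H4)
print(f"squared error without rotation: {err_plain:.4f}")
print(f"squared error with Hadamard rotation: {err_rotated:.4f}")
```

On this toy input the rotation spreads the outlier across all four channels, so the four quantization levels cover a much narrower range and the reconstruction error drops. A data-derived rotation, as OSCAR is described as using, would replace `H4` with matrices computed offline, separately for keys and for values.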

┌─ items (2) ───────────────────────────┐

[HN] hacker news (1)

DeepSeek Sparse Attention

HN: LLM · eigenBasis · ▲2 · 23d

[BLG] blog/rss (1)

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

MarkTechPost · Asif Razzaq · 22d