Sakana AI ICML 2026 sparse transformer paper

Sakana AI and NVIDIA presented a paper at ICML 2026 that introduces GPU kernels and data formats for faster inference and training of sparse transformer language models. The work builds on NVIDIA's Star Elastic method, which embeds multiple nested reasoning models (30B, 23B, and 12B parameters) in a single checkpoint and is reported to use 360× fewer training tokens than training the models separately.
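To make "nested models in a single checkpoint" concrete, here is a minimal sketch of one common way to nest sub-models: each smaller model reuses a leading block of a shared weight matrix. This is an illustration only, not Star Elastic's actual mechanism, which the summary does not describe. The function names and the toy widths (8, 6, 4) are hypothetical.

```python
import random

def make_shared_weights(d_out, d_in, seed=0):
    """Build one toy weight matrix standing in for the single checkpoint."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

def nested_view(weights, d_out, d_in):
    """Return the top-left d_out x d_in block, i.e. a smaller model's weights."""
    return [row[:d_in] for row in weights[:d_out]]

def matvec(weights, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Hypothetical widths standing in for the large, medium, and small variants.
shared = make_shared_weights(8, 8)
widths = {"large": 8, "medium": 6, "small": 4}
views = {name: nested_view(shared, w, w) for name, w in widths.items()}

# Every variant is runnable on its own, using only its slice of the weights.
small_out = matvec(views["small"], [1.0] * widths["small"])
```

Because each smaller view is a prefix of the larger one, the three variants share storage: the small model's weights are contained in the medium model's, which are contained in the large model's.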

  • Paper title: 'Sparser, Faster, Lighter Transformer Language Models' (arxiv.org/abs/2603.23198)
  • Star Elastic embeds three model sizes (30B, 23B, 12B) in one checkpoint via post-training
  • Single 160B-token training run replaces separate pretraining for each variant
  • 360× token reduction versus training each model independently from scratch
  • Open-source GPU kernels and data formats target faster inference and training of sparse transformers
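
The token figures above imply a baseline that the summary does not state directly. As a rough check, assuming the 360× figure is the ratio of total from-scratch training tokens (across all variants) to the single 160B-token run, the implied baseline works out as follows. The variable names are illustrative.

```python
# Figures taken from the summary above.
elastic_run_tokens = 160 * 10**9   # single 160B-token training run
reduction_factor = 360             # reported 360x token reduction

# Assumption: the 360x ratio compares total baseline tokens to the single run.
implied_baseline_tokens = elastic_run_tokens * reduction_factor
implied_baseline_trillions = implied_baseline_tokens / 10**12

print(f"Implied from-scratch budget: {implied_baseline_trillions:.1f}T tokens")
```

Under that assumption, the comparison is against roughly 57.6 trillion tokens of separate pretraining in total. How that total is split across the three model sizes is not given in the summary.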