New Delhi: AI research company Kimi has introduced Muon, a new scalable optimizer designed to improve the efficiency of training large language models (LLMs). The company claims that Muon can significantly reduce computational costs while improving performance, potentially transforming how companies manage the intensive computing demands of AI training.

Why Muon? Addressing LLM Training Challenges

Training large language models requires enormous computing power, often relying on traditional optimizers like AdamW. While effective, these optimizers run into scalability issues as models grow to massive size.

Key Problems with Traditional Optimizers:

  • High computational overhead
  • Inefficient weight updates
  • Extensive hyperparameter tuning required

Muon tackles these challenges by introducing advanced matrix orthogonalization techniques and weight decay adjustments, delivering nearly twice the computational efficiency of AdamW, according to Kimi’s technical report.
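The article does not spell out the orthogonalization step, but the publicly released Muon implementations approximately orthogonalize each 2D gradient (momentum) matrix with a Newton-Schulz iteration. The PyTorch sketch below illustrates that idea; the quintic coefficients and step count are assumptions taken from the open-source code, not figures quoted in Kimi's report.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix via a Newton-Schulz iteration.

    Illustrative sketch only: the quintic coefficients below come from the
    open-source Muon implementation and are assumptions here, not values
    quoted in Kimi's report.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)               # keep the spectral norm roughly <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```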

How Muon Improves LLM Training

Muon introduces two major innovations that enhance scalability and stability:

Weight Decay Integration

  • Prevents parameters from growing too large, stabilizing training.

Optimized Per-Parameter Updates

  • Dynamically adjusts the update scale for each parameter, ensuring smoother training with less hyperparameter tuning (see the sketch after this list).
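
Taken together, a single Muon-style update for one weight matrix might look like the sketch below, which reuses newton_schulz_orthogonalize from the earlier sketch. The decoupled weight decay and the RMS-matching scale factor reflect the two innovations described above; the specific constants and hyperparameter defaults are illustrative assumptions, not the values used to train Moonlight.

```python
import torch

def muon_step(param: torch.Tensor,
              grad: torch.Tensor,
              momentum_buf: torch.Tensor,
              lr: float = 0.02,
              momentum: float = 0.95,
              weight_decay: float = 0.1) -> None:
    """One illustrative Muon-style update for a 2D weight matrix, in place.

    Hyperparameter defaults and the 0.2 * sqrt(max_dim) scale factor are
    assumptions for illustration, not the exact Moonlight training values.
    """
    # momentum accumulation on the raw gradient
    momentum_buf.mul_(momentum).add_(grad)
    # orthogonalize the momentum matrix (newton_schulz_orthogonalize from the sketch above)
    update = newton_schulz_orthogonalize(momentum_buf)
    # scale the update so its RMS is comparable to AdamW's, keeping one shared learning rate usable
    update = update * (0.2 * max(param.shape) ** 0.5)
    # decoupled weight decay: shrink the parameter before applying the update
    param.mul_(1.0 - lr * weight_decay)
    param.add_(update, alpha=-lr)
```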

Performance Breakthrough:
Kimi tested Muon on Moonlight, a 16B-parameter Mixture-of-Experts (MoE) model trained on 5.7 trillion tokens.

Results:

  • Moonlight outperformed existing open models of comparable scale while using significantly fewer training FLOPs.
  • Muon required only 52% of the FLOPs needed by AdamW to achieve similar performance.

Open-Source & Future Applications

Kimi has made Muon open-source, allowing AI researchers and developers to integrate it into their models.
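
As a rough illustration of how an optimizer like this is typically wired into a training loop, the sketch below applies the Muon-style step from the previous sketches to 2D weight matrices and falls back to AdamW for biases and other parameters. The model, data, and this hybrid split are hypothetical illustrations, not Kimi's released code.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model standing in for a real LLM block.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Muon-style updates for the 2D weight matrices; AdamW for biases and the rest.
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
adamw = torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.1)
momentum_bufs = [torch.zeros_like(p) for p in matrix_params]

def training_step(batch_x: torch.Tensor, batch_y: torch.Tensor) -> float:
    loss = nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()
    with torch.no_grad():
        for p, buf in zip(matrix_params, momentum_bufs):
            muon_step(p, p.grad, buf)   # Muon-style sketch from the previous section
    adamw.step()
    model.zero_grad(set_to_none=True)
    return loss.item()

# Example call with random data:
# training_step(torch.randn(8, 512), torch.randn(8, 512))
```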

Why Muon Stands Out:

  • Memory-efficient design
  • Optimized for distributed AI training
  • Reduces overall infrastructure costs

Why Muon Matters for the Future of AI

As concerns grow over AI infrastructure oversupply, efficient optimizers like Muon could play a crucial role in reducing computational waste while still pushing the boundaries of AI model performance.