Manifold-constrained hyper-connections: stabilizing the future of large-scale model training

CommunicationNewsPerformance

Scaling large models with stable multi-stream connectivity

Manifold-constrained hyper-connections (mHC) introduce a mathematical framework to stabilize the training of large-scale neural networks by constraining residual connections to doubly stochastic matrices. This approach rectifies the inherent instability and memory inefficiencies of standard hyper-connections, enabling stable scaling for models up to 27B parameters. By integrating kernel fusion and optimized communication schedules, the system achieves superior performance with minimal computational overhead.

Points clés

  • Hyper-connections (HC) expand the width of the residual stream by a factor of $n$, but suffer from unbounded signal amplification.
  • Researchers identified that unconstrained HC leads to gradient gain spikes of up to three orders of magnitude in 27B models.
  • mHC constrains the residual mapping space to the Birkhoff polytope of doubly stochastic matrices to restore training stability.
  • The Sinkhorn-Knopp algorithm is utilized to perform manifold projection by rescaling matrix rows and columns.
  • System optimizations include specialized TileLang kernels that fuse operations like RMSNorm and matrix multiplication.
  • An extended DualPipe schedule was implemented to overlap the increased communication traffic of $n$-stream residuals with active computation.
  • Experimental results show mHC eliminates loss surges and outperforms standard baselines on benchmarks like BBH, MATH, and MMLU.
  • Scaling tests across 3B, 9B, and 27B parameter MoE models demonstrate robust stability and compute-optimal efficiency.
  • The proposed architectural improvements incur a marginal training overhead of only 6.7% despite the increased complexity.
  • Models were trained on up to 1.05T tokens using the AdamW optimizer to validate long-duration stability.

À retenir

If you have been wondering why your 27B parameter model keeps throwing a tantrum and exploding its gradients, it turns out that “unconstrained freedom” is as bad for neural networks as it is for a toddler in a candy store. By forcing your data to walk a very specific mathematical tightrope—the doubly stochastic manifold—you can finally stop staring at loss surges and start enjoying the benefits of wide-stream training. Sure, there is a 6.7% “tax” on your hardware efficiency, but that is a small price to pay for a model that actually finishes training without a digital nervous breakdown. Just follow the math, and maybe your silicon won’t cry so much.

Sources

Quiz sur le document: 10 questions

Loading