Scaling large models with stable multi-stream connectivity
Manifold-constrained hyper-connections (mHC) introduce a mathematical framework to stabilize the training of large-scale neural networks by constraining residual connections to doubly stochastic matrices. This approach rectifies the inherent instability and memory inefficiencies of standard hyper-connections, enabling stable scaling for models up to 27B parameters. By integrating kernel fusion and optimized communication schedules, the system achieves superior performance with minimal computational overhead.
Points clés
- Hyper-connections (HC) expand the width of the residual stream by a factor of $n$, but suffer from unbounded signal amplification.
- Researchers identified that unconstrained HC leads to gradient gain spikes of up to three orders of magnitude in 27B models.
- mHC constrains the residual mapping space to the Birkhoff polytope of doubly stochastic matrices to restore training stability.
- The Sinkhorn-Knopp algorithm is utilized to perform manifold projection by rescaling matrix rows and columns.
- System optimizations include specialized TileLang kernels that fuse operations like RMSNorm and matrix multiplication.
- An extended DualPipe schedule was implemented to overlap the increased communication traffic of $n$-stream residuals with active computation.
- Experimental results show mHC eliminates loss surges and outperforms standard baselines on benchmarks like BBH, MATH, and MMLU.
- Scaling tests across 3B, 9B, and 27B parameter MoE models demonstrate robust stability and compute-optimal efficiency.
- The proposed architectural improvements incur a marginal training overhead of only 6.7% despite the increased complexity.
- Models were trained on up to 1.05T tokens using the AdamW optimizer to validate long-duration stability.
À retenir
If you have been wondering why your 27B parameter model keeps throwing a tantrum and exploding its gradients, it turns out that “unconstrained freedom” is as bad for neural networks as it is for a toddler in a candy store. By forcing your data to walk a very specific mathematical tightrope—the doubly stochastic manifold—you can finally stop staring at loss surges and start enjoying the benefits of wide-stream training. Sure, there is a 6.7% “tax” on your hardware efficiency, but that is a small price to pay for a model that actually finishes training without a digital nervous breakdown. Just follow the math, and maybe your silicon won’t cry so much.
Sources
Quiz sur le document: 10 questions






