When to use what Schatten-$p$ norm in deep learning?
A new research paper examines which Schatten-$p$ norm to use in deep learning, particularly for optimizers such as Muon. The study shows that the best choice depends on the regime: smaller Schatten-$p$ geometries are optimal in low-dimensional settings, including those relevant to Chinchilla scaling. The analysis also offers an explanation for why Muon-like methods favor large batches and gives a scaling rule for batch size across different values of $p$.
AI IMPACT: Provides theoretical guidance on optimizing deep learning models, potentially improving training efficiency and performance.
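
For readers unfamiliar with the term, the Schatten-$p$ norm of a matrix is the ordinary vector $p$-norm applied to its singular values: $p=1$ gives the nuclear norm, $p=2$ the Frobenius norm, and $p=\infty$ the spectral norm. The following is a minimal sketch of that standard definition only. The function name `schatten_norm` and the example singular values are illustrative assumptions, not taken from the paper, and the singular values are assumed to have been computed beforehand by an SVD routine.

```python
import math
from typing import Sequence


def schatten_norm(singular_values: Sequence[float], p: float) -> float:
    """Schatten-p norm of a matrix, given its singular values (hypothetical helper).

    Standard definition: (sum_i sigma_i ** p) ** (1 / p) for 1 <= p < inf,
    and max_i sigma_i for p = inf.
    """
    if p < 1:
        raise ValueError("p must be >= 1 for this to be a norm")
    sigmas = [abs(s) for s in singular_values]
    if not sigmas:
        return 0.0
    top = max(sigmas)
    if math.isinf(p) or top == 0.0:
        return top
    # Factor out the largest singular value so large p does not overflow.
    return top * sum((s / top) ** p for s in sigmas) ** (1.0 / p)


# Illustrative singular values (an assumption, not data from the paper).
sigmas = [3.0, 4.0]
nuclear = schatten_norm(sigmas, 1)           # sum of singular values
frobenius = schatten_norm(sigmas, 2)         # root of the sum of squares
spectral = schatten_norm(sigmas, math.inf)   # largest singular value
print(nuclear, frobenius, spectral)
```

For a fixed matrix the value is non-increasing in $p$, so the nuclear norm is the largest of the three and the spectral norm the smallest.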