Final Public Oral Examination

Jun 17, 2025
11 am - 12 pm
Fine Hall 314

Foundations of Deep Learning: Optimization and Representation Learning

Advisor: Jason Lee

Deep learning's remarkable success stems from the ability of neural networks to automatically discover meaningful representations from raw data during optimization. Part 1 of this thesis studies this representation learning process in simple models, and Part 2 studies the optimization dynamics in more realistic deep learning settings.

Part 1 (Representation Learning) explores how neural networks trained with gradient descent learn meaningful representations that adapt to low-dimensional structure in the data. We focus on Gaussian multi-index models, in which the labels depend only on the projection of the inputs onto an unknown low-dimensional subspace. We show that generically (under a non-degeneracy assumption), neural networks adapt to this low-dimensional structure after a single step of gradient descent. We also probe this non-degeneracy assumption in the single-index setting and prove matching upper and lower bounds for learning a general Gaussian single-index model under computational constraints (i.e., in the statistical query and low-degree polynomial frameworks).
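As a concrete illustration of this one-step phenomenon, the following sketch (a simplified simulation, not the thesis's construction; the width, sample size, and step size are arbitrary choices) trains a two-layer ReLU network for a single large gradient step on a Gaussian single-index model and measures how well the first-layer weights align with the hidden direction w*:

```python
# Minimal sketch: one-step feature learning on a Gaussian single-index model
# y = g(<w*, x>) with g = ReLU. All hyperparameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 64, 8192, 128                   # input dim, samples, hidden width

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # unknown low-dimensional direction
X = rng.standard_normal((n, d))           # Gaussian inputs
y = np.maximum(X @ w_star, 0.0)           # labels depend only on <w*, x>

W = rng.standard_normal((m, d)) / np.sqrt(d)       # first layer (row norm ~ 1)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # second layer, kept fixed

def alignment(W):
    """Mean |cosine similarity| between neuron weights and w*."""
    return np.abs((W @ w_star) / np.linalg.norm(W, axis=1)).mean()

before = alignment(W)

# One full-batch gradient step on W for the squared loss (second layer frozen).
pre = X @ W.T                             # (n, m) pre-activations
resid = np.maximum(pre, 0.0) @ a - y      # f(x) - y
grad_W = ((resid[:, None] * a) * (pre > 0)).T @ X / n
eta = 50.0                                # one large step, as in one-step analyses
W = W - eta * grad_W

print(f"alignment at init:      {before:.2f}")
print(f"alignment after 1 step: {alignment(W):.2f}")
```

At random initialization the average alignment is of order 1/√d; in runs of this sketch it increases markedly after the single step, which is the sense in which the network adapts to the hidden one-dimensional structure.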

Part 2 (Optimization) examines how hyperparameters influence the optimization process in realistic deep learning settings. We first analyze the implicit regularization effects of stochastic gradient descent, proving that SGD with label noise converges to stationary points of a regularized objective L(θ) + λR(θ), where the regularizer R(θ) penalizes sharp regions of the loss landscape. This work provides theoretical justification for the empirically observed benefits of large learning rates and small batch sizes. Next, we revisit the simplest optimizer, deterministic (i.e., full-batch) gradient descent. Cohen et al. (2021) showed that gradient descent on realistic neural networks often operates in an oscillatory regime dubbed the "edge of stability," which is not captured by existing analyses of gradient descent. We introduce the concept of "self-stabilization," a negative feedback mechanism implicit in gradient descent that allows it to maintain stability in challenging loss landscapes. Our theory provides the first explanation for the convergence of gradient descent in realistic deep learning settings, and we run extensive experiments demonstrating that it makes accurate predictions across a variety of architectures and datasets.
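To make the flatness-seeking effect concrete, here is a minimal sketch (a standard toy example in the spirit of this line of work, not the thesis's analysis) on the loss L(u, v) = (uv - 1)^2 / 2. Its global minima form the curve uv = 1, along which tr(∇²L) = u² + v² is smallest at the "flat" point |u| = |v| = 1; the learning rate, noise level, and initialization below are arbitrary choices.

```python
# Minimal sketch: label-noise SGD drifts from a sharp global minimum toward
# a flat one on L(u, v) = (uv - 1)^2 / 2. Hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
eta, sigma, steps = 0.05, 0.5, 50_000

u, v = 3.0, 1.0 / 3.0                 # start at a sharp global minimum (uv = 1)
for _ in range(steps):
    target = 1.0 + sigma * rng.standard_normal()   # fresh label noise each step
    r = u * v - target
    u, v = u - eta * r * v, v - eta * r * u        # one SGD step on both factors

print(f"after label-noise SGD: u = {u:.2f}, v = {v:.2f}, uv = {u * v:.3f}")
print(f"tr(Hessian) = u^2 + v^2 = {u * u + v * v:.2f}  (flat minimum: 2.0)")
```

Full-batch gradient descent initialized at the sharp minimum (3, 1/3) would never move, since the gradient vanishes there; the label noise is what drives the drift along the minimum manifold, and runs of this sketch typically end near the flat point |u| = |v| = 1.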
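The edge-of-stability regime itself can be observed by tracking sharpness, the largest Hessian eigenvalue, during full-batch gradient descent and comparing it to the classical stability threshold 2/η. The sketch below is illustrative rather than a reproduction of the thesis's experiments: the tiny tanh network, synthetic data, and step size are assumptions, and sharpness is estimated by power iteration with finite-difference Hessian-vector products.

```python
# Minimal sketch: sharpness vs. the 2/eta threshold during full-batch GD.
# The network, data, and step size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = np.sin(X @ rng.standard_normal(4))            # synthetic regression targets
m = 16                                            # hidden width
theta = 0.5 * rng.standard_normal(4 * m + m)      # flattened [W, a]

def unpack(theta):
    return theta[:4 * m].reshape(4, m), theta[4 * m:]

def loss(theta):
    W, a = unpack(theta)
    return 0.5 * np.mean((np.tanh(X @ W) @ a - y) ** 2)

def grad(theta):
    W, a = unpack(theta)
    H = np.tanh(X @ W)                            # hidden activations
    r = (H @ a - y) / len(y)                      # scaled residuals
    gW = X.T @ ((r[:, None] * a) * (1.0 - H ** 2))
    return np.concatenate([gW.ravel(), H.T @ r])

def sharpness(theta, iters=50, eps=1e-4):
    """Top Hessian eigenvalue via power iteration on finite-difference HVPs."""
    v = rng.standard_normal(theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        Hv = (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)
        lam, v = v @ Hv, Hv / (np.linalg.norm(Hv) + 1e-12)
    return lam

eta = 0.05
for t in range(3001):
    theta -= eta * grad(theta)
    if t % 500 == 0:
        print(f"step {t:4d}  loss {loss(theta):.4f}  "
              f"sharpness {sharpness(theta):6.2f}  (2/eta = {2 / eta:.0f})")
# In many such runs sharpness rises ("progressive sharpening") until it
# hovers near 2/eta rather than diverging: the edge of stability.
```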