Mode Connectivity and Convergence of Gradient Descent for (Not So) Over-parameterized Deep Neural Networks
Marco Mondelli, IST Austria
Training a neural network is a non-convex optimization problem whose loss landscape exhibits spurious and disconnected local minima. Yet, in practice, neural networks with millions of parameters are successfully optimized using gradient descent methods. In this talk, I will give some theoretical insights into why this is possible. In the first part, I will focus on the problem of finding low-loss paths between the solutions found by gradient descent. First, using mean-field techniques, I will prove that, as the number of neurons grows, gradient descent solutions are approximately dropout-stable and, hence, connected by low-loss paths. Then, I will present a mild condition that still guarantees connectivity while trading off the amount of overparameterization against the quality of the learned features. In the second part, I will describe some tools for proving convergence of gradient descent to global optimality: the displacement convexity of a related Wasserstein gradient flow, and bounds on the smallest eigenvalue of neural tangent kernel matrices.
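As a rough illustration of the two ingredients above, here is a sketch in my own notation (the symbols below are my assumptions, not taken verbatim from the talk or the underlying papers). Consider a two-layer network with N neurons,

\[
  f_{\theta}(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} a_i\,\sigma(\langle w_i, x\rangle),
  \qquad \theta = (a_i, w_i)_{i=1}^{N}.
\]

A solution $\theta$ is $\varepsilon$-dropout stable if zeroing out half of the neurons (and rescaling the remaining ones by $2$) changes the training loss by at most $\varepsilon$, i.e. $|L(\theta_{\mathrm{drop}}) - L(\theta)| \le \varepsilon$. If $\theta^{(1)}$ and $\theta^{(2)}$ are both $\varepsilon$-dropout stable, one can construct a piecewise-linear path $\gamma:[0,1]\to\mathbb{R}^{d}$ with $\gamma(0)=\theta^{(1)}$, $\gamma(1)=\theta^{(2)}$ and

\[
  \max_{t\in[0,1]} L(\gamma(t)) \;\le\; \max\bigl\{L(\theta^{(1)}),\, L(\theta^{(2)})\bigr\} + O(\varepsilon).
\]

For the convergence part, let $K_t \in \mathbb{R}^{n\times n}$ with $(K_t)_{ij} = \langle \nabla_\theta f_{\theta_t}(x_i), \nabla_\theta f_{\theta_t}(x_j)\rangle$ denote the neural tangent kernel Gram matrix along the gradient-flow trajectory on the squared loss. A uniform lower bound $\lambda_{\min}(K_t) \ge \lambda_0 > 0$ yields linear convergence, since

\[
  \frac{d}{dt}\,\|f(\theta_t)-y\|_2^2
  \;=\; -2\,(f(\theta_t)-y)^{\top} K_t\, (f(\theta_t)-y)
  \;\le\; -2\lambda_0\,\|f(\theta_t)-y\|_2^2,
\]

and hence $\|f(\theta_t)-y\|_2^2 \le e^{-2\lambda_0 t}\,\|f(\theta_0)-y\|_2^2$.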
[Based on joint works with Pierre Brechet, Adel Javanmard, Andrea Montanari, Guido Montufar, Quynh Nguyen, and Alexander Shevchenko]