How and why are we succeeding in training huge non-convex deep networks? How can deep neural networks with billions of parameters generalize well, despite having enough capacity to overfit any data? What is the true inductive bias of deep learning? And does it all just boil down to a big fancy kernel machine? In this talk I will highlight the central role that optimization geometry and optimization dynamics play in determining the inductive bias of deep learning, and how we might understand that bias in function space. I will present the view we have been developing over the past five years, and then discuss some issues we are currently grappling with.