How should we go about creating a science of deep learning? One might be tempted to focus on replicability, reproducibility, and careful statistics, but I will argue that these are often overemphasized and never sufficient. Instead, we should search for robust phenomena, aim to understand those phenomena in context, and design better measurement tools. I will show how to do this in the context of two questions: (1) What accounts for the double-descent phenomenon in deep learning? and (2) What makes neural networks more robust on out-of-distribution data? In both cases, we will see how better measurement led to answers to these questions that make robustly correct predictions in a variety of settings.