This post is about a recent paper out of DeepMind [Wil22F].
Generalization, i.e. the ability to perform well on previously unseen data, is of fundamental importance for virtually all applications of machine learning. When studying a model’s ability to generalize, one often assumes that the training data and the test data are generated from the same distribution (in other words, by the same mechanism). However, in many, if not most, applications the data-generating mechanism at inference time is not exactly the same as the one used for collecting the training data - a situation called distribution shift.
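To make this concrete, here is a minimal sketch (my own illustration, not taken from the paper): a simple threshold classifier is fitted on one distribution and then evaluated both on held-out data from the same distribution and on data whose class means have been shifted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(mean_0, mean_1, n=1000):
    """Draw a balanced binary dataset: class 0 ~ N(mean_0, 1), class 1 ~ N(mean_1, 1)."""
    x = np.concatenate([rng.normal(mean_0, 1.0, n), rng.normal(mean_1, 1.0, n)])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return x, y

# Training distribution: class means at 0 and 2.
x_train, y_train = sample(0.0, 2.0)
threshold = (x_train[y_train == 0].mean() + x_train[y_train == 1].mean()) / 2

def accuracy(x, y):
    return ((x > threshold) == y).mean()

# In-distribution test set vs. a shifted one (both class means moved by -1).
x_iid, y_iid = sample(0.0, 2.0)
x_shift, y_shift = sample(-1.0, 1.0)
print(f"i.i.d. accuracy:  {accuracy(x_iid, y_iid):.2f}")
print(f"shifted accuracy: {accuracy(x_shift, y_shift):.2f}")  # noticeably lower
```

The classifier itself is unchanged between the two evaluations; only the test distribution moved, which is exactly the setting the paper studies at scale.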
It is not easy to analyze how well a model behaves under distribution shift. One reason is that the shift is typically not known - after all, if one knew its nature, one could have adjusted the training set accordingly. Moreover, an ML model has no a priori reason to behave well under distribution shift, at least not under the standard empirical risk minimization procedure that is typically used for training.
The paper mentioned above introduces a framework for generating distribution shifts in a controlled way and analyzes how various popular strategies for making models robust to them (such as pre-training, augmentation, weighted resampling, using a GAN, and others) perform. In a significant engineering effort, 19 different methods were tested across different distribution shifts, datasets and models; in total, over 85k models were trained. The analysis is restricted to computer vision, but the ideas are more general and can be applied in other areas of ML as well. The code was open-sourced and should be valuable for practitioners who have at least some idea of how the data at inference time might differ from their training data.
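To illustrate one of these strategies, weighted resampling, here is a hedged sketch (my own, not from the paper’s codebase): training examples are resampled with importance weights so that the label distribution matches an assumed inference-time distribution. The target distribution is a guess the practitioner has to supply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training labels: class 1 is heavily under-represented.
y_train = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])

# Assumed label distribution at inference time (an assumption, not known in general).
target_dist = {0: 0.5, 1: 0.5}

# Per-example weights proportional to target frequency / empirical frequency.
counts = np.bincount(y_train)
empirical = counts / counts.sum()
weights = np.array([target_dist[int(y)] / empirical[y] for y in y_train])
weights /= weights.sum()

# Resample the training set with these weights; the result is roughly balanced.
idx = rng.choice(len(y_train), size=len(y_train), replace=True, p=weights)
y_resampled = y_train[idx]
print(np.bincount(y_resampled) / len(y_resampled))  # roughly [0.5, 0.5]
```

The same weights could instead be used directly in a weighted loss; resampling and loss reweighting are two implementations of the same idea.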
Apart from introducing the framework for distribution shifts (which in itself is based on a rather simple factorization of the data’s generative distribution), the paper’s main contribution is an extensive and detailed experiment report. Several practical strategies for dealing with distribution shift are given. The main takeaway for me personally is the following: there is no single method that works best; it always depends on the data, the model and the kind of distribution shift at hand (see image below). While certain tendencies can be identified, in the end an experimental evaluation will always be necessary to study the behavior of one’s model under distribution shift. The newly released codebase should help with that - I can imagine using it in the near future.
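The idea behind such a factorization can be illustrated generically (this is the textbook view, not necessarily the paper’s exact formulation): once the joint distribution is written as a product of factors, a distribution shift can be generated in a controlled way by changing one factor while keeping the others fixed.

```latex
% Generic factorization of the data distribution:
p(x, y) = p(x \mid y)\, p(y)
% Label shift: only the label marginal changes between train and test,
%   p_{\text{train}}(y) \neq p_{\text{test}}(y), with p(x \mid y) fixed.
% Covariate shift: only the input marginal changes,
%   p_{\text{train}}(x) \neq p_{\text{test}}(x), with p(y \mid x) fixed.
```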
For details about the framework, the experimental results and the efficacy of existing methods, I recommend reading the paper. Being an experimental report, it is a pleasant read that does not require much theoretical background and has wide applicability.