High-quality datasets are essential for successfully training supervised models. However, no matter how much attention is given to data cleaning, some errors are bound to remain, and these can cause poor performance even on relatively simple tasks.
In recent years, influence functions have re-emerged as a useful tool for estimating the impact of individual data samples on a model's predictions, thanks to works such as [Koh17U]. First introduced in the 1970s as a tool for robust statistics (see e.g. [Ham74I]), they estimate the influence of each training point by differentiating the loss at a test point with respect to an infinitesimal up-weighting of that training sample.
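As a rough illustration, here is a minimal sketch of such an influence score in PyTorch, with hypothetical helper names. It uses the crudest possible approximation of the inverse Hessian, H ≈ I; real implementations such as the one in [Koh17U] approximate the inverse Hessian-vector product with LiSSA or conjugate gradients.

```python
import torch

def influence_score(model, loss_fn, z_train, z_val):
    """Hypothetical helper: first-order influence of one training point on
    the loss at one validation point.

    [Koh17U] defines I_up,loss(z, z_val) = -grad L(z_val)^T H^-1 grad L(z),
    with H the Hessian of the training loss. This sketch replaces H with the
    identity and flips the sign so that a *negative* score flags a harmful
    point, matching the convention used below.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    x, y = z_train
    g_train = torch.autograd.grad(loss_fn(model(x), y), params)

    x, y = z_val
    g_val = torch.autograd.grad(loss_fn(model(x), y), params)

    # Gradient inner product: negative => up-weighting z_train is estimated
    # to increase the validation loss, i.e. the point is harmful.
    return sum((gv * gt).sum() for gv, gt in zip(g_val, g_train)).item()
```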
Similarly, data resampling is a widely used strategy for dealing with harmful samples. It re-weights input data based on training loss: samples with higher losses are assumed to have corrupted labels, and it can therefore be beneficial to down-weight them during training. Nevertheless, such loss-based resampling methods have known limitations, e.g. instability when a large fraction of the data is mislabelled (for details see [Zha16U]).
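To make the idea concrete, here is a deliberately crude sketch of loss-based re-weighting. The quantile threshold and the hard 0/1 weights are simplifying assumptions for illustration; actual methods use softer re-weighting schemes.

```python
import torch

def loss_based_weights(losses: torch.Tensor, q: float = 0.9) -> torch.Tensor:
    """Hypothetical helper: hard loss-based resampling. Samples above the
    q-th loss quantile are assumed to carry corrupted labels and receive
    weight 0; all others keep weight 1. Under heavy label noise the quantile
    itself is estimated from corrupted losses, which hints at the
    instability discussed above.
    """
    threshold = torch.quantile(losses, q)
    return (losses <= threshold).float()
```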
To address these limitations, influence functions have recently been used in place of training losses in the resampling scheme. Inspired by the success of such approaches, [Kon22R] goes one step further and re-labels the harmful data points (instead of merely down-weighting them) based on the results of the influence analysis. The new approach is named RDIA, and its implementation can be found on GitHub.
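The core move can be sketched as follows. This is not the paper's actual relabelling rule, just an illustration for binary labels, reusing the sign convention from the sketch above.

```python
import torch

def rdia_style_relabel(y_train: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: relabelling in the spirit of RDIA, for binary
    labels. Harmful points (negative influence score, in the convention
    above) are kept in the training set but have their label flipped,
    instead of being down-weighted or removed. RDIA derives its actual
    relabelling rule from the influence analysis; thresholding at zero is
    a simplification for this sketch.
    """
    harmful = scores < 0
    y_new = y_train.clone()
    y_new[harmful] = 1 - y_new[harmful]  # flip 0 <-> 1
    return y_new
```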
But why would this be better than simply removing the bad samples? In Figure 1, four datasets are progressively corrupted with increasing amounts of label noise (reported on the x-axis). Besides RDIA, the methods compared are:
- ERM: standard empirical risk minimization, i.e. the usual training on the full dataset.
- Random: randomly selects training samples and changes their labels.
- UIDS: the influence-based subsampling scheme introduced in [Wan20L].
- Dropout: removes all training data with negative influence on the validation set.
RDIA outperforms all the other methods and maintains good results even under very high noise. This is somewhat expected, since it makes better use of the information coming from the validation data, using it to correct labels in the training set rather than merely discarding samples.
While the applicability of this approach may be limited in practice, moving towards a data-centric development of ML models could yield many benefits. Model training and data cleaning, typically separate steps of the MLOps pipeline, are slowly becoming parts of a single iterative process, in which the errors of one lead to the refinement of the other. The paper shows that this can yield surprisingly good improvements in model accuracy.