Data valuation

Attributions of value to training samples can be used to examine data, improve data acquisition, debug and improve models or compensate data providers. Recent developments in the field enable principled and useful definitions of value which overcome the computational cost of previous approaches.

Focusing on data

The core idea of so-called data-centric machine learning is that any effort spent on improving the quality of the data used to train a model is probably better spent than on improving the model itself. This tested rule of thumb is particularly relevant for applications where data is scarce, expensive to acquire or difficult to annotate.

Concepts of the usefulness of a datum or its influence on the outcome of a prediction have a long history in statistics and ML, in particular through the notion of the influence function. However, it has only been recently that rigorous and practical notions of value for data, and in particular data-sets, have appeared in the ML literature. The core idea is to look at data points known to be “useful” in some sense — for instance in that they substantially contribute to the final performance of a model — and focus acquisition or labelling efforts around similar ones, while eliminating or “cleaning” the less useful ones.

In a nutshell, data valuation for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance of some model trained on it. Depending on whether the value of a training sample depends on a specific model of interest or not, we talk of model-dependent or model-agnostic data valuation.

The TransferLab maintains pyDVL: the python Data Valuation Library. It provides reference implementations of most popular methods for data valuation and influence analysis, with an emphasis on robustness, simple parallelization and ease of use.

The classical model-dependent view

Up until very recently, most methods fell under the model-dependent category, in which value is not an intrinsic property of the element of interest, but typically a function of three factors:

The dataset. Logically, the value of samples must be related to the dataset. For instance, duplicate points should be assigned no value, and prototypical ones, which provide little information, should arguably have a low value. Reciprocally, a high one could be an indication of being atypical (but in-distribution) and hence informative. More generally, the value of samples is understood to depend on the distribution they were sampled from,1 1 Value can be taken to be the (expected) contribution of a data point to any random set $D$ of fixed size sampled from the same distribution, and not to one in particular [Gho20D]. and a possibly distinct target distribution.

The algorithm. If a data valuation method is to be used to improve the quality of a model, intuitively the former cannot be independent of the model class and of the training algorithm.2 2 The algorithm is a function $A$ which maps the data $D$ to some estimator $A (D)$ in a model class $\mathcal{F}$. E.g. MSE minimization to find the parameters of a linear model or a neural network.

The performance metric. Finally, when value is tied to a model, it must be measured in some way which uses it, e.g. the $R^2$ score or the negative MSE over a test set, or whatever the performance metric of interest $u$ for the problem is. This metric will be computed over a held-out valuation set.

Beyond the classical setting

The three assumptions or dependencies above belong to a “classical”, model-dependent view of valuation [Koh17U, Gho19D]. There is a growing amount of research on intrinsic definitions of value which only use properties of the data, like the optimal transport distance from one dataset to another [Jus23L], or measures of complexity. This is interesting in applications where users do not want to rely on fixed models to measure the data’s worth.

As mentioned, valuation methods can be divided into model dependent or model agnostic. Without pretending to be exhaustive, in the first category we find approaches rooted in game theory, in the influence function, in generalization bounds and in training dynamics. On the other hand, model agnostic methods are mostly based on measure-theoretic properties of the data, like volume, or optimal transport distance. Below we quickly review a few of the above.3 3 We refer to the review [Sim22D] for an in-depth analysis of the field and its challenges, although more recent methods are missing from the review.

Marginal contribution methods

The first class of model-dependent techniques uses game-theoretic concepts, and has as main contenders Shapley values [Gho19D, Kwo21E, Sch22C], their generalization to so-called semi-values [Kwo22B, Wan22D] and the Core [Yan21I]. A notable related approach for classification problems is Class-Wise Shapley [Sch22C]. All are based on so-called marginal contributions of data points to performance, for instance the difference in accuracy obtained when training with or without the point which is being valuated.

The simplest instance of such a method is Leave-One-Out (LOO) valuation, which defines the value of point $x_{i}$ in training set $D$ as

\[ v_{\operatorname{loo}} (x_{i}) := u (D) - u (D \setminus \lbrace x_{i} \rbrace), \]

where the utility $u (S) = u (M, S, D_{\operatorname{val}})$ is the performance (e.g. accuracy) of model $M$ trained on $S \subseteq D$, measured over a valuation set $D_{\operatorname{val}}$. This is an approximation of the expected performance over unseen data.4 4 It is worth noting that this way of measuring utility has the strong drawback of depending on how representative $D_{\operatorname{val}}$ is of the test-time distribution. This weakness is inherent to any method based on utility computations over a fixed set.

The marginal contribution of $x_{i}$ to the full dataset $D$ is, however, too weak a signal in most cases. Instead one can consider marginal contributions to every $S \subseteq D$ and aggregate them in some way, which is the basis of Shapley-based game-theoretic methods. The problem is framed as a cooperative game in which data points are players and the outcome of the game is the performance of the model when trained on subsets – coalitions – of the data, measured on $D_{\operatorname{val}}$ (the utility $u$). Different solution concepts from game theory lead to different techniques. The main questions addressed by practical methods are how to aggregate the marginal contributions and how to compute the solution efficiently and with low variance.

Computational complexity is an issue because the marginal contribution of a point would ideally be computed for every subset of the training set. An exact computation is then $\mathcal{O} (n 2^{n - 1})$ in the number of samples $n$, with each iteration requiring a full re-fitting of the model using a coalition as training set. Consequently, most methods involve Monte Carlo approximations, and sometimes approximate utilities which are faster to compute, e.g. proxy models [Wan22I] or constant-cost approximations like Neural Tangent Kernels [Wu22D]. For models exhibiting a certain local structure, like KNN, there are methods which exploit this locality to achieve even linear runtimes [Jia19aE].

In the general case, Monte Carlo approximations [Mal14B, Gho19D] and clever sampling strategies [Wu23V, Cas09P], can reduce sample complexities to polynomial times under some assumptions, but dataset size remains a major limitation of all game-theoretic methods.

A standard justification for the game theoretical framework is that in order to be useful, the value function $v$ is usually required to fulfill certain requirements of consistency and “fairness”. For instance, in most ML applications value should not depend on the order in which data are considered,5 5 Note that in some applications, like data markets (cf. Section 5) this property can be undesirable. or it should be equal for samples that contribute equally to any subset of the data of equal size. When considering aggregated value for subsets of data there are additional desiderata, like having a value function that does not increase with repeated samples. Game-theoretic methods guarantee some but not all of these properties, as well as others which are often touted as interesting for ML. However, despite their usefulness, none of them are either necessary or sufficient for all applications.6 6 For instance, Shapley values try to equitably distribute all value among all samples, failing to identify repeated ones as unnecessary, with e.g. a zero value.

Influence of training points

Instead of aggregating marginal contributions, there exists a more local concept, which is that of the influence that single training points have. Roughly speaking, an influence function encodes how much a given function of the training set would change if training data were slightly perturbed. Such functions include estimators of model parameters, or the empirical risk, in particular allowing to compute the influence that each training sample has over the loss on single test points. This makes influence functions naturally geared towards model interpretation, but they share several of the applications of valuation functions, and are sometimes employed together.

Alas, the influence function relies on some assumptions – like local invertibility of the Hessian – that can make their application unreliable. Yet another drawback is that they require the computation of the inverse of the Hessian of the model w.r.t. its parameters, which is intractable for large models like deep neural networks. Much of the recent research tackles this issue with approximations, like a truncated Neuman series [Aga17S], or a low-rank approximation built with dominant eigenspaces of the Hessian [Aga17S, Sch21S], which achieves much better results. More recently, K-FAC, a fast approximation to the inverse of Fisher’s Information Matrix has been used to enable scaling to even larger models [Gro23S].

A related technique, TracIn [Zha22R], follows the dynamics of training of the model to provide a first order approximation to the influence of a sample. The method is much cheaper than any approximation to the influence function, but it obtains mixed results.

Other model-dependent methods

Computing marginal contributions does not scale well with data set size, and influence functions are limited by the number of parameters of the model, due to the need to (approximately) invert the Hessian of the loss.

An efficient and effective method is Data-OOB [Kwo23D], which uses a simple estimate of the out-of-bag generalization error of a bagging estimator for the values. The estimator is built with a user-provided model of interest and can prove useful for identifying low-quality samples in a dataset. Another example is CGS, the complexity-gap score of [Noh23D], which defines a data complexity measure based on the activations of ReLU neural network.

Some model-agnostic techniques

There are some attempts to transfer model-dependent values between models without fundamentally changing anything about the definition of value, cf. [Yon21W, Sch22C], but this shows at best mixed results. A whole body of research looks instead at model-agnostic or fully data-centric notions of value.

The first work to detach value from the model uses an idea of volume related to the determinant of the covariance matrix, but the few guarantees it provides are tied to very strong assumptions on model structure like linearity, and experiments where these are violated show rather disappointing results [Xu21V].

A much more promising alternative uses optimal transport (OT) distance to a reference validation dataset to define the value of a training set [Jus23L]. Value of individual instances is computed as the gradient of this distance.7 7 Optimal transport has an $\mathcal{O} (n^3)$ time complexity, but thanks to the convexification provided by the Sinkhorn estimator of Wasserstein distance, computational cost is kept well below this. The method, dubbed lava, comes with a bound that ensures low validation error if the data generating processes for training and validation sets is stable in the sense that it assigns the same labels to points with similar features. Additionally, its reduced computational cost enables scaling to tens of thousands of samples with complex models, a regime unaccessible to most GT methods.

Check all of our work

References