# Focusing on data

The core idea of so-called **data-centric machine learning** is that effort
spent improving the quality of the data used to train a model is often better
spent than effort improving the model itself. This well-tested rule of thumb is
particularly relevant for applications where data is scarce, expensive to
acquire, or difficult to annotate.

Concepts of the usefulness of a datum, or of its influence on the outcome of a
prediction, have a long history in statistics and ML, in particular through the
notion of the *influence function*. However, only recently have rigorous and
practical notions of *value* for data, be it individual points or whole
data-*sets*, appeared in the ML literature. A core idea is to look at data
points known to be “useful” in some sense – for instance because they
substantially contribute to the final performance of a model – and to focus
acquisition or labelling efforts around similar ones, while eliminating or
“cleaning” the less useful ones. But there are many other applications, as we
will see below.

If we focus on individual points, **data valuation for machine learning** is
the task of assigning a scalar to each element of a training set which reflects
its contribution to the final performance of some model trained on it.

An emerging corpus of work focuses instead on whole sets, a practice which
we will refer to as **dataset valuation**. Downstream applications of
dataset valuation revolve around *data markets*, where data is bought and
sold according to its value, and **data acquisition**, where the goal is to
acquire the most valuable data for a given task.

The TransferLab maintains pyDVL, the Python Data Valuation Library. It provides
reference implementations of the most popular methods for data valuation and
influence analysis, with an emphasis on robustness, simple parallelization, and
ease of use.

## The model-centric view

Following [Gho19D], the first approaches to data valuation defined value not as
an intrinsic property of the element of interest, but as a function of three
factors:

**The dataset.** Logically, the value of a sample must be related to the
dataset. For instance, duplicate points should be assigned no value, and
prototypical ones, which provide little new information, should arguably have a
low value. Conversely, a high value can indicate an atypical (but
in-distribution), and hence informative, sample. More generally, the value of
samples is understood to depend on the distribution they were sampled
from,[^1] and on a possibly distinct target distribution.

[^1]: Value can be taken to be the (expected) contribution of a data point to
*any* random set $D$ of fixed size sampled from the same distribution, rather
than to one particular set [Gho20D].

**The model and algorithm.** If a data valuation method is to be used to
improve the quality of a model, then intuitively the values cannot be
independent of the model class and of the training algorithm.[^2]

[^2]: The algorithm is a function $A$ which maps the data $D$ to some estimator
$A(D)$ in a model class $\mathcal{F}$, e.g. MSE minimization to find the
parameters of a linear model or a neural network.

**The performance metric.** Finally, when value is tied to a model, it must be
measured with a metric that uses the model, e.g. the $R^2$ score or the
negative MSE over a test set, or whatever performance metric $u$ is of interest
for the problem. This metric is computed over a held-out **valuation set**.

## Beyond the classical setting

The three assumptions or dependencies above belong to a “classical”, model-centric view of valuation [Koh17U, Gho19D]. There is a growing body of research on intrinsic definitions of value which only use properties of the data, like the optimal transport distance from one dataset to another [Jus23L], or measures of data complexity. This is interesting in applications where users do not want to rely on fixed models to measure the data’s worth.

Valuation methods can be classified along many dimensions: whether they require
a model, whether a specific model class is required, or whether a reference
data distribution is available. Without pretending to be exhaustive, in the
model-centric category we find approaches rooted in game theory, in the
influence function, in generalization bounds and in training dynamics. These
are all tied to machine learning models in one way or another. Model-free
methods, on the other hand, are mostly based on measure-theoretic properties of
the data, like volume, or on optimal transport distances. Below we quickly
discuss a few of these.[^3]

[^3]: We refer to the review [Sim22D] for an in-depth analysis of the field and
its challenges, although recent methods are missing from it.

## Marginal contribution methods

The first class of model-centric techniques uses game-theoretic concepts, and
has as main contenders *Shapley values* [Gho19D, Kwo21E, Sch22C], their generalization to
so-called *semi-values* [Kwo22B, Wan23D] and the
*Core* [Yan21I]. A notable related approach for
classification problems is Class-Wise Shapley [Sch22C]. All are based on so-called **marginal
contributions** of data points to performance, for instance the difference in
accuracy obtained when training with or without the point which is being
valuated.

The simplest instance of such a method is Leave-One-Out (LOO) valuation, which defines the value of point $x_{i}$ in training set $D$ as

\[ v_{\operatorname{loo}} (x_{i}) := u (D) - u (D \setminus \lbrace x_{i} \rbrace), \]

where the **utility** $u (S) = u (M, S, D_{\operatorname{val}})$ is the
performance (e.g. accuracy) of model $M$ trained on $S \subseteq D$, measured
over a valuation set $D_{\operatorname{val}}$. This is an approximation of the
expected performance over unseen data.[^4]

[^4]: This way of measuring utility has the strong drawback of depending on how
representative $D_{\operatorname{val}}$ is of the test-time distribution. The
weakness is inherent to any method based on utility computations over a fixed
set.
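As a minimal illustration, LOO values can be computed with one refit per
training point. This sketch assumes a scikit-learn-style model (here passed in
as a factory `make_model`, a name of our choosing) and accuracy as the utility:

```python
import numpy as np

def utility(model, X_train, y_train, X_val, y_val):
    """u(S): performance on the valuation set of the model trained on S."""
    return model.fit(X_train, y_train).score(X_val, y_val)

def loo_values(X, y, X_val, y_val, make_model):
    """v_loo(x_i) = u(D) - u(D minus {x_i}): one refit per training point."""
    full = utility(make_model(), X, y, X_val, y_val)
    values = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # drop point i from D
        values[i] = full - utility(make_model(), X[mask], y[mask], X_val, y_val)
    return values
```

For instance, `loo_values(X, y, X_val, y_val, lambda: LogisticRegression())`
returns one scalar per training point, positive for points whose removal hurts
validation accuracy.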

The marginal contribution of $x_{i}$ to the full dataset $D$ is, however, too
weak a signal in most cases. Instead one can consider marginal contributions to
every $S \subseteq D$ and aggregate them in some way, which is the basis of
Shapley-based game-theoretic methods. The problem is framed as a cooperative
game in which data points are players and the outcome of the game is the
performance of the model when trained on subsets – *coalitions* – of the data,
measured on $D_{\operatorname{val}}$ (the utility $u$). Different solution
concepts from game theory lead to different techniques. The main questions
addressed by practical methods are how to aggregate the marginal contributions
and how to compute the solution efficiently and with low variance.

Computational complexity is an issue because the marginal contribution of a
point would ideally be computed for *every* subset of the training set. An
exact computation is then $\mathcal{O} (n 2^{n - 1})$ in the number of samples
$n$, with each iteration requiring a full re-fitting of the model using a
coalition as training set. Consequently, most methods involve Monte Carlo
approximations, and sometimes approximate utilities which are faster to
compute, e.g. proxy models [Wan22I] or
constant-cost approximations like Neural Tangent Kernels [Wu22D]. For models exhibiting a certain local structure, like
KNN, there are methods which exploit this locality to achieve even linear
runtimes [Jia19aE].

In the general case, Monte Carlo approximations [Mal14B, Gho19D] and clever sampling strategies [Wu23V, Cas09P] can reduce the sample complexity to polynomial under some assumptions, but dataset size remains a major limitation of all game-theoretic methods.
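The simplest such estimator samples random permutations of the data and
accumulates one marginal contribution per point and permutation. This is a
simplified sketch, without the truncation and convergence tricks used by
practical implementations:

```python
import numpy as np

def permutation_shapley(n, utility, n_permutations=100, rng=None):
    """Monte Carlo Shapley values via random permutations.

    `utility` maps a list of indices (a coalition S) to the scalar u(S).
    Each permutation yields one marginal contribution per point:
    u(S + {i}) - u(S), where S contains the points preceding i.
    """
    rng = np.random.default_rng(rng)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev = utility([])
        for k, i in enumerate(perm):
            curr = utility(list(perm[: k + 1]))  # refit on the coalition
            values[i] += curr - prev
            prev = curr
    return values / n_permutations
```

Note that each inner step hides a full model refit, which is where the cost of
game-theoretic methods comes from.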

A standard justification for the game theoretical framework is that in order to
be useful, the **value function** $v$ is usually required to fulfill certain
requirements of consistency and “fairness”. For instance, in most ML
applications value should not depend on the order in which data are
considered,[^5] or it should be equal for samples that contribute equally to
any subset of the data of equal size. When considering aggregated value for
subsets of data there are additional desiderata, like having a value function
that does not increase with repeated samples. Game-theoretic methods guarantee
some but not all of these properties, as well as others which are often touted
as interesting for ML. However, despite their usefulness, none of them is
either necessary or sufficient for all applications.[^6]

[^5]: Note that in some applications, like data markets (cf. Section 5), this
property can be undesirable.

[^6]: For instance, Shapley values try to distribute all value equitably among
all samples, failing to identify repeated ones as unnecessary, e.g. with a
zero value.

## Influence of training points

Instead of aggregating marginal contributions, there exists a more local
concept, which is that of the *influence* that single training points have.
Roughly speaking, an **influence function** encodes how much a given function
of the training set would change if training data were slightly perturbed. Such
functions include estimators of model parameters, or the empirical risk, in
particular allowing to compute the influence that each training sample has over
the loss on single test points. This makes influence functions naturally geared
towards model interpretation, but they share several of the applications of
valuation functions, and are sometimes employed together.

Alas, influence functions rely on assumptions – like local invertibility of the Hessian – that can make their application unreliable. Another drawback is that they require the inverse of the Hessian of the loss w.r.t. the model parameters, which is intractable for large models like deep neural networks. Much recent research tackles this issue with approximations, like a truncated Neumann series [Aga17S], or a low-rank approximation built from the dominant eigenspaces of the Hessian [Aga17S, Sch22S], which achieves much better results. More recently, K-FAC, a fast approximation to the inverse of the Fisher information matrix, has been used to enable scaling to even larger models [Gro23S].
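For models small enough to invert the Hessian exactly, no approximation is
needed. As an illustration, assuming a ridge-regularized least-squares model
(so the Hessian is available in closed form and invertible), the influence of
each training point on a test loss can be sketched as:

```python
import numpy as np

def influence_linear(X, y, x_test, y_test, reg=1e-3):
    """Influence of each training point on a test loss for ridge-regularized
    least squares, with per-sample loss l(z) = (x^T theta - y)^2:

        I(z_i, z_test) = -grad l(z_test)^T  H^{-1}  grad l(z_i).
    """
    n, d = X.shape
    # Closed-form ridge fit: (X^T X / n + reg I) theta = X^T y / n.
    theta = np.linalg.solve(X.T @ X / n + reg * np.eye(d), X.T @ y / n)
    H = 2 * (X.T @ X / n + reg * np.eye(d))        # Hessian of the risk
    grad_test = 2 * (x_test @ theta - y_test) * x_test
    grads_train = 2 * (X @ theta - y)[:, None] * X  # one gradient per row
    return -grads_train @ np.linalg.solve(H, grad_test)
```

Negative values indicate points whose upweighting would decrease the test
loss. For deep networks the same formula applies, but `H` must be approximated
as discussed above.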

A related technique, TracIn [Zha22R], follows the training dynamics of the model to provide a first-order approximation to the influence of a sample. The method is much cheaper than any approximation to the influence function, but it obtains mixed results.
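The core of TracIn is a sum of gradient dot products over saved checkpoints,
weighted by the learning rates. A sketch for a hypothetical logistic model:

```python
import numpy as np

def grad_logistic(theta, x, y):
    """Gradient of the logistic loss l(theta; x, y), with y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return (p - y) * x

def tracin_scores(checkpoints, lrs, X_train, y_train, x_test, y_test):
    """TracIn: sum over checkpoints of lr * <grad(z_train), grad(z_test)>."""
    scores = np.zeros(len(X_train))
    for theta, lr in zip(checkpoints, lrs):
        g_test = grad_logistic(theta, x_test, y_test)
        for i in range(len(X_train)):
            scores[i] += lr * grad_logistic(theta, X_train[i], y_train[i]) @ g_test
    return scores
```

A training point whose gradients consistently align with the test point's
gradient receives a high score: it "pushed" the parameters in a direction that
reduced the test loss.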

## Other model-centric methods

Computing marginal contributions does not scale well with data set size, and influence functions are limited by the number of parameters of the model, due to the need to (approximately) invert the Hessian of the loss.

An efficient and effective method is Data-OOB [Kwo23D], which derives values
from a simple estimate of the out-of-bag generalization error of a bagging
estimator. The estimator is built with a user-provided model of interest and
can prove useful for identifying low-quality samples in a dataset. Another
example is CGS, the *complexity-gap score* of [Noh23D], which defines a data
complexity measure based on the activations of a ReLU neural network.
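A rough sketch of the Data-OOB idea (not a faithful reimplementation of
[Kwo23D]): bag weak learners over bootstrap samples, and score each point by
the average accuracy on it of the learners for which it was out-of-bag.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def data_oob_values(X, y, n_estimators=50, rng=0):
    """Score each point by the average OOB accuracy of bagged weak learners."""
    rng = np.random.default_rng(rng)
    n = len(X)
    correct = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with repl.)
        oob = np.setdiff1d(np.arange(n), idx)     # points left out of the bag
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        correct[oob] += tree.predict(X[oob]) == y[oob]
        counts[oob] += 1
    return np.divide(correct, counts, out=np.zeros(n), where=counts > 0)
```

Points with low scores are consistently misclassified by models that never saw
them, flagging likely label noise or outliers. Note that only one ensemble is
trained, which is why the method is so much cheaper than marginal-contribution
schemes.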

## Some model-free techniques

There are some attempts to transfer model-centric values between models
without fundamentally changing anything about the definition of value, cf.
[Yon21W, Sch22C], but this shows
at best mixed results. A whole body of research looks instead at model-agnostic
or fully **data-centric** notions of value.

The first work to detach value from the model uses an idea of volume related to the determinant of the covariance matrix, but the few guarantees it provides are tied to very strong assumptions on model structure like linearity, and experiments where these are violated show rather disappointing results [Xu21V].

A much more promising alternative uses the optimal transport (OT) distance to a
reference validation dataset to define the value of a training set [Jus23L].
The value of individual instances is computed as the gradient of this
distance.[^7] The method, dubbed LAVA, comes with a bound that ensures low
validation error if the data-generating processes for the training and
validation sets are stable, in the sense that they assign the same labels to
points with similar features. Additionally, its reduced computational cost
enables scaling to tens of thousands of samples with complex models, a regime
inaccessible to most game-theoretic methods.

[^7]: Optimal transport has $\mathcal{O}(n^3)$ time complexity, but thanks to
the convexification provided by the Sinkhorn estimator of the Wasserstein
distance, the computational cost is kept well below this.
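LAVA itself is more involved (it uses a label-aware cost and differentiates
through the dual), but the Sinkhorn iteration it builds on can be sketched in
plain NumPy for two uniform empirical distributions:

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=1.0, n_iters=200):
    """Entropy-regularized OT cost between two uniform empirical
    distributions, via Sinkhorn iterations on a squared-distance cost."""
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    K = np.exp(-C / reg)                                # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)     # uniform marginals
    v = np.ones(m)
    for _ in range(n_iters):                            # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                     # transport plan
    return float((P * C).sum())
```

Each iteration is a pair of matrix-vector products, which is what keeps the
cost well below the $\mathcal{O}(n^3)$ of exact OT solvers and makes
gradient-based valuation of individual points feasible.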