Data valuation is the process of assigning a value to data points representing
their contribution to the performance of a given model wrt. some metric, as
part of a fixed dataset. Popular techniques are based on cooperative game
theory concepts like Shapley
value or the least
core which, despite their
interesting theoretical properties, have high computational cost: To obtain the
value of one training point $x \in D_{\operatorname{train}}$, one must retrain
for each of the $2^n$ subsets $S \in \mathcal{P} (D_{\operatorname{train}})$
and evaluate the performance of the obtained model (on a fixed validation set).
This performance is called the *utility of $S$*. Typically, some kind of Monte
Carlo approximation is used to avoid retraining $2^n$ times. We provide an
introduction to and survey of key developments in the field in our series page on data valuation.

In [Wan22I] the authors propose retraining
only a number $m_{\operatorname{train}} \ll 2^n$ of times. Each subset
$S_{i}$ seen is identified with a binary 1-hot encoded vector $b_{i} \in
\lbrace 0, 1\rbrace^n$ of the indices present in $S_{i}$. After computing the
utility for each of the $S_{i}$, one has a collection of *utility samples*,
which are the tuples $\mathcal{S}_{\operatorname{train}} := \lbrace (b_{i}, u
(b_{i})) : i = 1, \ldots, m_{\operatorname{train}} \rbrace$. These can be
used to learn a surrogate utility function $\hat{u}$ which is then evaluated on
unseen subsets of $D_{\operatorname{train}}$ to obtain
$\mathcal{S}_{\operatorname{pred}} := \lbrace (b_{i}, \hat{u} (b_{i})) : i
= 1, \ldots, m_{\operatorname{pred}} \rbrace$. The value is then estimated
using $\mathcal{S}_{\operatorname{train}} \cup
\mathcal{S}_{\operatorname{pred}}$. Since each evaluation of $\hat{u}$ is
typically much cheaper than a full retraining, the speed-ups can be
considerable.

The authors also prove some average and worst case bounds for the error approximating $u$ (i.e. when the noise $\hat{u} - u$ is smooth and when it is adversarially chosen, respectively), and demonstrate empirically that using the estimated utility samples $\mathcal{S}_{\operatorname{pred}}$ in Monte Carlo methods results in decent approximations to the true value at a reduced cost. Check Theorems 1 and 2 in the paper for the details.

We will be sure to include this technique in our upcoming pyDVL library for everybody to try out in their projects.