Improving cooperative game theory-based data valuation via data utility learning

Data valuation is typically limited by the exponential complexity of retraining for each possible subset of training points. This paper proposes to instead learn the map from indices to utility.

Data valuation is the process of assigning a value to data points representing their contribution to the performance of a given model wrt. some metric, as part of a fixed dataset. Popular techniques are based on cooperative game theory concepts like Shapley value or the least core which, despite their interesting theoretical properties, have high computational cost: To obtain the value of one training point $x \in D_{\operatorname{train}}$, one must retrain for each of the $2^n$ subsets $S \in \mathcal{P} (D_{\operatorname{train}})$ and evaluate the performance of the obtained model (on a fixed validation set). This performance is called the utility of $S$. Typically, some kind of Monte Carlo approximation is used to avoid retraining $2^n$ times. We provide an introduction to and survey of key developments in the field in our series page on data valuation.

In [Wan22I] the authors propose retraining only a number $m_{\operatorname{train}} \ll 2^n$ of times. Each subset $S_{i}$ seen is identified with a binary 1-hot encoded vector $b_{i} \in \lbrace 0, 1\rbrace^n$ of the indices present in $S_{i}$. After computing the utility for each of the $S_{i}$, one has a collection of utility samples, which are the tuples $\mathcal{S}_{\operatorname{train}} := \lbrace (b_{i}, u (b_{i})) : i = 1, \ldots, m_{\operatorname{train}} \rbrace$. These can be used to learn a surrogate utility function $\hat{u}$ which is then evaluated on unseen subsets of $D_{\operatorname{train}}$ to obtain $\mathcal{S}_{\operatorname{pred}} := \lbrace (b_{i}, \hat{u} (b_{i})) : i = 1, \ldots, m_{\operatorname{pred}} \rbrace$. The value is then estimated using $\mathcal{S}_{\operatorname{train}} \cup \mathcal{S}_{\operatorname{pred}}$. Since each evaluation of $\hat{u}$ is typically much cheaper than a full retraining, the speed-ups can be considerable.

The authors also prove some average and worst case bounds for the error approximating $u$ (i.e. when the noise $\hat{u} - u$ is smooth and when it is adversarially chosen, respectively), and demonstrate empirically that using the estimated utility samples $\mathcal{S}_{\operatorname{pred}}$ in Monte Carlo methods results in decent approximations to the true value at a reduced cost. Check Theorems 1 and 2 in the paper for the details.

We will be sure to include this technique in our upcoming pyDVL library for everybody to try out in their projects.


In this series