The main limitation of most data valuation methods is computational cost. The popular family based on the marginal contribution of samples to subsets of training data typically suffers from an exponential cost (see Data valuation for a quick summary of the different approaches).1 1 In the case of Leave-One-Out, the complexity is instead $\mathcal{O}(n)$ but the contribution of a single sample to the rest of the dataset can be small enough that the signal is lost in the noise, making the method unreliable (although surprisingly performant in some situations).

Despite attempts to reduce variance and improve convergence of Monte Carlo
estimates, like stratified sampling [Mal14B, Cas17I, Wu23V], or different approximation
strategies, like Lasso (introduced in *AME*, [Lin22M]),
this family of methods struggles to scale to larger datasets. Several forms of
antithetic sampling, which draw correlated instead of independent samples for
the Monte Carlo estimate, have also been proposed, with mixed results.2
2 The reason is a *No-Free-Lunch theorem* for
antithetics in [Ren19A]: for every fixed sampling
strategy, there is a utility function for which variance is not reduced.

These issues motivate looking for alternative definitions of value: **Data-OOB**
[Kwo23D] addresses them with an out-of-bag test-error
estimate. When the model of interest for valuation is a bagging estimator, this
enables very fast valuation of large datasets by reusing the already trained
weak learners (of course one can do the bagging with another model). Even faster
than KNN-Shapley [Jia19aE], which exploits the local
structure of KNN to provide a closed formula for the Shapley values.

Say one has trained weak learners $f_1, …, f_B$ and wants to compute a value for a training sample $z_i = (x_i, y_i)$. A subset $ \{f_{j}: j \in I_i\}$ of the learners won’t have seen this sample in their bootstrap datasets. Data-OOB defines the value to be

$$v(z_i) := \frac{1}{|I_i|} \sum_{j \in I_i} s(f_j(x_i), y_i),$$

for some scoring function $s$, like 1-0 or negative MSE. Note that the classical
OOB error estimate is the mean of the $v(z_i)$. Data-OOB is also statistically
interpretable; under certain assumptions, it identifies important data points
consistently with the *infinitesimal jackknife influence function*, see
Proposition 3.1 of [Kwo23D].3
3 Like the
jackknife, its
infinitesimal variant is used to reduce bias and estimate variance, but does so
using a Taylor expansion to approximate the effect of removing samples. The
*(empirical) influence function* is essentially the first order term in this
expansion: a directional derivative of the statistic in the direction of a
sample of interest. This is however a more fundamental concept than the
influence function as popularised in the ML literature by [Koh17U], where the chain rule is used to compute influence
over test points.

In their experiments, the authors show Data-OOB to vastly outperform AME and KNN-Shap in speed, and KNN-Shapley, Data Shapley and Beta Shapley in mislabeled data detection. Results are not so dramatic for the data removal task, where points of lowest value are successively removed from the dataset and test performance is measured each time, but the method is at least on par with the rest while being much faster (for a pre-trained bagged model).

The method performs best for the datasets depicted in Figure 1. For the remaining 8 datasets, seen in Figure 2 and Figure 3, the comparison is not so clear-cut, and at times a bit difficult to interpret. Still these are impressive results that warrant thorough reproduction! You can try the authors’ code (link in the references) or use TransferLab’s pyDVL: the python Data Valuation Library, where you will also find many other methods and examples. For Data-OOB you can use any model for the bagging.