Data valuation

Attributions of value to training samples can be used to examine data, improve data acquisition, debug and improve models or compensate data providers. Recent developments in the field enable principled and useful definitions of value which overcome the computational cost of previous approaches.

Focusing on data

The core idea of so-called data-centric machine learning is that any effort spent on improving the quality of the data used to train a model is probably better spent than on improving the model itself. This tested rule of thumb is particularly relevant for applications where data is scarce, expensive to acquire or difficult to annotate.

Concepts of the usefulness of a datum or its influence on the outcome of a prediction have a long history in statistics and ML, in particular through the notion of the influence function. However, rigorous and practical notions of value for data, be it individual points or datasets, have only recently appeared in the ML literature. A core idea is to look at data points known to be “useful” in some sense, for instance because they substantially contribute to the final performance of a model, and to focus acquisition or labelling efforts around similar ones while eliminating or “cleaning” the less useful ones. There are many other applications, as we will see below.

If we focus on individual points, data valuation for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance of some model trained on it.

An emerging corpus of work focuses instead on whole sets, a practice which we will refer to as dataset valuation. Downstream applications of dataset valuation revolve around data markets, where data is bought and sold according to its value, and data acquisition, where the goal is to acquire the most valuable data for a given task.

The TransferLab maintains pyDVL: the Python Data Valuation Library. It provides reference implementations of the most popular methods for data valuation and influence analysis, with an emphasis on robustness, simple parallelization and ease of use.

The model-centric view

Following [Gho19D], the first approaches to data valuation defined value not as an intrinsic property of the element of interest, but typically as a function of three factors:

The dataset. Logically, the value of samples must be related to the dataset. For instance, duplicate points should be assigned no value, and prototypical ones, which provide little information, should arguably have a low value. Conversely, a high value could indicate that a point is atypical (but in-distribution) and hence informative. More generally, the value of samples is understood to depend on the distribution they were sampled from and on a possibly distinct target distribution. (Value can be taken to be the (expected) contribution of a data point to any random set $D$ of fixed size sampled from the same distribution, and not to one in particular [Gho20D].)

The model and algorithm. If a data valuation method is to be used to improve the quality of a model, intuitively the former cannot be independent of the model class and of the training algorithm. (The algorithm is a function $A$ which maps the data $D$ to some estimator $A (D)$ in a model class $\mathcal{F}$, e.g. MSE minimization to find the parameters of a linear model or a neural network.)

The performance metric. Finally, when value is tied to a model, it must be measured in some way which uses it, e.g. the $R^2$ score or the negative MSE over a test set, or whatever the performance metric of interest $u$ for the problem is. This metric will be computed over a held-out valuation set.

Beyond the classical setting

The three assumptions or dependencies above belong to a “classical”, model-centric view of valuation [Koh17U, Gho19D]. There is a growing amount of research on intrinsic definitions of value which only use properties of the data, like the optimal transport distance from one dataset to another [Jus23L], or measures of data complexity. This is interesting in applications where users do not want to rely on fixed models to measure the data’s worth.

Valuation methods can be classified along many dimensions, for example whether they require a model or not, if a specific class of models is required, or whether a reference data distribution is available or not. Without pretending to be exhaustive, in the model-centric category we find approaches rooted in game theory, in the influence function, in generalization bounds and in training dynamics. These are all tied to machine learning models in one way or another. On the other hand, model-free methods are mostly based on measure-theoretic properties of the data, like volume, or optimal transport distance. Below we quickly discuss a few of the above. (We refer to the review [Sim22D] for an in-depth analysis of the field and its challenges, although recent methods are missing.)

Marginal contribution methods

The first class of model-centric techniques uses game-theoretic concepts, and has as main contenders Shapley values [Gho19D, Kwo21E, Sch22C], their generalization to so-called semi-values [Kwo22B, Wan23D] and the Core [Yan21I]. A notable related approach for classification problems is Class-Wise Shapley [Sch22C]. All are based on so-called marginal contributions of data points to performance, for instance the difference in accuracy obtained when training with or without the point which is being valuated.

The simplest instance of such a method is Leave-One-Out (LOO) valuation, which defines the value of point $x_{i}$ in training set $D$ as

\[ v_{\operatorname{loo}} (x_{i}) := u (D) - u (D \setminus \lbrace x_{i} \rbrace), \]

where the utility $u (S) = u (M, S, D_{\operatorname{val}})$ is the performance (e.g. accuracy) of model $M$ trained on $S \subseteq D$, measured over a valuation set $D_{\operatorname{val}}$. This is an approximation of the expected performance over unseen data. (It is worth noting that this way of measuring utility has the strong drawback of depending on how representative $D_{\operatorname{val}}$ is of the test-time distribution. This weakness is inherent to any method based on utility computations over a fixed set.)
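As a minimal sketch of the definition above, the following pure-Python snippet computes LOO values for a toy one-dimensional classification problem. The nearest-centroid classifier, the names `centroid_accuracy` and `loo_values`, and the data are illustrative assumptions and not part of any library; any model and metric could play the role of $u$.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], int]  # (features, label)


def centroid_accuracy(train: Sequence[Sample], val: Sequence[Sample]) -> float:
    """Utility u(S): validation accuracy of a nearest-centroid classifier fit on S."""
    if not train:
        return 0.0  # convention chosen here for the empty training set
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, xj in enumerate(x):
            acc[j] += xj
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x: Tuple[float, ...]) -> int:
        # Label of the closest class centroid (squared Euclidean distance).
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return sum(predict(x) == y for x, y in val) / len(val)


def loo_values(train, val, utility=centroid_accuracy) -> List[float]:
    """v_loo(x_i) = u(D) - u(D without x_i), one model refit per training point."""
    train = list(train)
    full = utility(train, val)
    return [full - utility(train[:i] + train[i + 1:], val) for i in range(len(train))]


# Toy data (hypothetical): the last training point is deliberately mislabelled.
train = [((0.0,), 0), ((1.0,), 0), ((4.0,), 1), ((5.0,), 1), ((0.5,), 1)]
val = [((0.2,), 0), ((0.8,), 0), ((2.0,), 0), ((4.5,), 1), ((3.5,), 1)]
values = loo_values(train, val)
```

In this toy run the mislabelled last point receives a negative value, since removing it raises validation accuracy. The correctly labelled first point also receives a negative value, and the remaining three receive exactly zero, which shows how coarse the signal from a single removal can be.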

The marginal contribution of $x_{i}$ to the full dataset $D$ is, however, too weak a signal in most cases. Instead one can consider marginal contributions to every $S \subseteq D$ and aggregate them in some way, which is the basis of Shapley-based game-theoretic methods. The problem is framed as a cooperative game in which data points are players and the outcome of the game is the performance of the model when trained on subsets – coalitions – of the data, measured on $D_{\operatorname{val}}$ (the utility $u$). Different solution concepts from game theory lead to different techniques. The main questions addressed by practical methods are how to aggregate the marginal contributions and how to compute the solution efficiently and with low variance.
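For reference, the best-known aggregation is the Shapley value. In the notation used above and with $n = | D |$, it averages the marginal contributions of $x_{i}$ over all subsets, weighting each subset size equally:

\[ v_{\operatorname{shap}} (x_{i}) := \frac{1}{n} \sum_{S \subseteq D \setminus \lbrace x_{i} \rbrace} \binom{n - 1}{| S |}^{- 1} [u (S \cup \lbrace x_{i} \rbrace) - u (S)] . \]

Semi-values keep the same sum of marginal contributions but replace the coefficient $\frac{1}{n} \binom{n - 1}{| S |}^{- 1}$ with other weights that depend only on $| S |$.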

Computational complexity is an issue because the marginal contribution of a point would ideally be computed for every subset of the training set. An exact computation is then $\mathcal{O} (n 2^{n - 1})$ in the number of samples $n$, with each iteration requiring a full re-fitting of the model using a coalition as training set. Consequently, most methods involve Monte Carlo approximations, and sometimes approximate utilities which are faster to compute, e.g. proxy models [Wan22I] or constant-cost approximations like Neural Tangent Kernels [Wu22D]. For models exhibiting a certain local structure, like KNN, there are methods which exploit this locality to achieve even linear runtimes [Jia19aE].
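To make the cost concrete, here is a small sketch that computes exact Shapley values by enumerating every subset. The function `exact_shapley` and the toy utility `veto_game` are hypothetical names introduced for illustration; in data valuation each call to `utility` would be a full model fit on the coalition, which is what makes this approach infeasible beyond a handful of points.

```python
from itertools import combinations
from math import comb
from typing import Callable, FrozenSet, List


def exact_shapley(n: int, utility: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for players 0..n-1.

    Evaluates n * 2**(n - 1) marginal contributions in total.
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # size of the coalition that player i joins
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def veto_game(s: FrozenSet[int]) -> float:
    """Toy utility: player 0 is required, players 1 and 2 are interchangeable."""
    return 1.0 if 0 in s and (1 in s or 2 in s) else 0.0


shapley = exact_shapley(3, veto_game)  # approximately [2/3, 1/6, 1/6]
```

Note that the two interchangeable players split their share equally instead of one of them being flagged as redundant.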

In the general case, Monte Carlo approximations [Mal14B, Gho19D] and clever sampling strategies [Wu23V, Cas09P] can reduce the sample complexity to polynomial under some assumptions, but dataset size remains a major limitation of all game-theoretic methods.
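A common Monte Carlo scheme samples random permutations of the data and credits each point with its marginal contribution to the points that precede it. The sketch below is a plain, untruncated version of that idea under simplifying assumptions: `permutation_shapley` and `saturating_utility` are hypothetical names, and the utility is a cheap set function standing in for "train on the coalition and score on the valuation set".

```python
import random
from typing import Callable, FrozenSet, List


def permutation_shapley(
    n: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo Shapley estimate from uniformly sampled permutations."""
    rng = random.Random(seed)
    totals = [0.0] * n
    order = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = utility(frozenset(coalition))
        for i in order:
            coalition.add(i)
            current = utility(frozenset(coalition))
            totals[i] += current - previous  # marginal contribution of i
            previous = current
    return [t / n_permutations for t in totals]


def saturating_utility(s: FrozenSet[int]) -> float:
    """Toy utility: performance saturates once two points are available."""
    return min(len(s), 2) / 2


estimates = permutation_shapley(5, saturating_utility)
```

Each permutation costs $n$ utility evaluations, so the total cost is linear in the number of permutations. Because the marginal contributions along a permutation telescope, the estimates always sum to $u (D) - u (\emptyset)$, whatever the number of samples.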

A standard justification for the game theoretical framework is that in order to be useful, the value function $v$ is usually required to fulfill certain requirements of consistency and “fairness”. For instance, in most ML applications value should not depend on the order in which data are considered (although in some applications, like data markets, cf. Section 5, this property can be undesirable), or it should be equal for samples that contribute equally to any subset of the data of equal size. When considering aggregated value for subsets of data there are additional desiderata, like having a value function that does not increase with repeated samples. Game-theoretic methods guarantee some but not all of these properties, as well as others which are often touted as interesting for ML. However, despite their usefulness, none of them are either necessary or sufficient for all applications. (For instance, Shapley values try to equitably distribute all value among all samples, failing to identify repeated ones as unnecessary, with e.g. a zero value.)

Influence of training points

Instead of aggregating marginal contributions, there exists a more local concept, which is that of the influence that single training points have. Roughly speaking, an influence function encodes how much a given function of the training set would change if training data were slightly perturbed. Such functions include estimators of model parameters, or the empirical risk, which in particular makes it possible to compute the influence that each training sample has over the loss on single test points. This makes influence functions naturally geared towards model interpretation, but they share several of the applications of valuation functions, and are sometimes employed together.
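In the formulation of [Koh17U], and writing $\ell$ for the loss, $\hat{\theta}$ for the fitted parameters and $z_{1}, \ldots, z_{n}$ for the training points (notation introduced here for illustration), the influence of a training point $z$ on the loss at a test point $z_{\operatorname{test}}$ is

\[ \mathcal{I} (z, z_{\operatorname{test}}) := - \nabla_{\theta} \ell (z_{\operatorname{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{- 1} \nabla_{\theta} \ell (z, \hat{\theta}), \quad H_{\hat{\theta}} := \frac{1}{n} \sum_{i = 1}^{n} \nabla^2_{\theta} \ell (z_{i}, \hat{\theta}) . \]

It is a first-order estimate of how the test loss changes when $z$ is infinitesimally up-weighted in the empirical risk.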

Alas, the influence function relies on some assumptions, like local invertibility of the Hessian, that can make its application unreliable. Yet another drawback is that it requires computing the inverse of the Hessian of the loss w.r.t. the model parameters, which is intractable for large models like deep neural networks. Much of the recent research tackles this issue with approximations, like a truncated Neumann series [Aga17S], or a low-rank approximation built with dominant eigenspaces of the Hessian [Aga17S, Sch22S], which achieves much better results. More recently, K-FAC, a fast approximation to the inverse of the Fisher information matrix, has been used to enable scaling to even larger models [Gro23S].
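The truncated Neumann series rests on the identity $H^{- 1} v = \sum_{k \geq 0} (I - H)^k v$, which holds when the eigenvalues of $H$ lie in $(0, 2)$. The sketch below applies it to a small explicit matrix; `neumann_ihvp` and its `scale` parameter are illustrative assumptions. For a real model one would replace the explicit `matvec` with Hessian-vector products, so that the Hessian is never formed.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(H: Matrix, x: Vector) -> Vector:
    return [sum(h * xj for h, xj in zip(row, x)) for row in H]


def neumann_ihvp(H: Matrix, v: Vector, n_terms: int = 200, scale: float = 1.0) -> Vector:
    """Approximate H^{-1} v with a truncated Neumann series.

    `scale` must be chosen so that the eigenvalues of H / scale lie in (0, 2);
    otherwise the series diverges.
    """
    h = list(v)
    for _ in range(n_terms):
        Hh = matvec(H, h)
        # Recursion h <- v + (I - H / scale) h, whose fixed point is scale * H^{-1} v.
        h = [vi + hi - Hhi / scale for vi, hi, Hhi in zip(v, h, Hh)]
    return [hi / scale for hi in h]


H = [[2.0, 0.5], [0.5, 1.0]]  # symmetric positive definite toy "Hessian"
v = [1.0, 1.0]
approx = neumann_ihvp(H, v, scale=3.0)  # exact answer is [2/7, 6/7]
```

The number of terms needed grows as the smallest eigenvalue of the scaled matrix approaches zero, which is one reason why the approach can be fragile for ill-conditioned Hessians.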

A related technique, TracIn [Zha22R], follows the dynamics of training of the model to provide a first order approximation to the influence of a sample. The method is much cheaper than any approximation to the influence function, but it obtains mixed results.
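In its checkpoint-based form, this first-order approximation is usually written as a sum of gradient inner products over $K$ checkpoints $\theta_{1}, \ldots, \theta_{K}$ saved during training, with $\eta_{k}$ the learning rate in use at checkpoint $k$ (symbols introduced here for illustration):

\[ \operatorname{TracIn} (z, z') \approx \sum_{k = 1}^{K} \eta_{k} \nabla_{\theta} \ell (\theta_{k}, z) \cdot \nabla_{\theta} \ell (\theta_{k}, z') . \]

No Hessian is involved, which explains the much lower cost.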

Other model-centric methods

Computing marginal contributions does not scale well with data set size, and influence functions are limited by the number of parameters of the model, due to the need to (approximately) invert the Hessian of the loss.

An efficient and effective method is Data-OOB [Kwo23D], which uses a simple estimate of the out-of-bag generalization error of a bagging estimator for the values. The estimator is built with a user-provided model of interest and can prove useful for identifying low-quality samples in a dataset. Another example is CGS, the complexity-gap score of [Noh23D], which defines a data complexity measure based on the activations of a ReLU neural network.
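The out-of-bag idea can be sketched in a few lines: fit many weak learners on bootstrap samples, and score each training point only with the learners that did not see it. The snippet below is a simplified illustration, not the reference implementation. It assumes a 1-nearest-neighbour weak learner and 0/1 correctness as the score, and `data_oob` and `one_nn_predict` are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], int]  # (features, label)


def one_nn_predict(train: Sequence[Sample], x: Tuple[float, ...]) -> int:
    """Label of the training point closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]


def data_oob(data: Sequence[Sample], n_estimators: int = 200, seed: int = 0) -> List[float]:
    """Average out-of-bag correctness of each point over bootstrap learners."""
    rng = random.Random(seed)
    n = len(data)
    hits = [0] * n
    counts = [0] * n
    for _ in range(n_estimators):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        in_bag = set(bag)
        sample = [data[i] for i in bag]
        for i in range(n):
            if i not in in_bag:  # point i is out-of-bag for this learner
                counts[i] += 1
                hits[i] += one_nn_predict(sample, data[i][0]) == data[i][1]
    return [h / c if c else float("nan") for h, c in zip(hits, counts)]


# Toy data (hypothetical): two clusters, plus one mislabelled point at index 0.
data = (
    [((-3.0,), 1)]
    + [((float(k),), 0) for k in range(5)]
    + [((10.0 + k,), 1) for k in range(5)]
)
oob_values = data_oob(data)
```

The cost is one model fit per bootstrap sample, regardless of the size of the dataset. In this toy setup the mislabelled point is almost never predicted correctly by learners that did not see it, so it ends up with the lowest value.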

Some model-free techniques

There are some attempts to transfer model-centric values between models without fundamentally changing anything about the definition of value, cf. [Yon21W, Sch22C], but this shows at best mixed results. A whole body of research looks instead at model-agnostic or fully data-centric notions of value.

The first work to detach value from the model uses an idea of volume related to the determinant of the covariance matrix, but the few guarantees it provides are tied to very strong assumptions on model structure like linearity, and experiments where these are violated show rather disappointing results [Xu21V].

A much more promising alternative uses optimal transport (OT) distance to a reference validation dataset to define the value of a training set [Jus23L]. Value of individual instances is computed as the gradient of this distance. (Optimal transport has an $\mathcal{O} (n^3)$ time complexity, but thanks to the convexification provided by the Sinkhorn estimator of Wasserstein distance, computational cost is kept well below this.) The method, dubbed LAVA, comes with a bound that ensures low validation error if the data generating process for training and validation sets is stable, in the sense that it assigns the same labels to points with similar features. Additionally, its reduced computational cost enables scaling to tens of thousands of samples with complex models, a regime inaccessible to most game-theoretic methods.
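To give a flavour of the Sinkhorn estimator mentioned above, the sketch below computes an entropy-regularized transport cost between two small point sets with uniform weights. It is a deliberately simplified illustration: it compares features only, ignores labels, and does not compute the gradients from which per-point values are derived. The name `sinkhorn_cost` and the parameters `reg` and `n_iters` are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sinkhorn_cost(
    xs: Sequence[Point], ys: Sequence[Point], reg: float = 0.5, n_iters: int = 200
) -> Tuple[float, List[List[float]]]:
    """Entropy-regularized OT between uniform measures on xs and ys.

    Returns the transport cost under the regularized plan, and the plan itself.
    Costs are squared Euclidean distances; very large costs relative to `reg`
    would underflow in this naive (non log-domain) version.
    """
    n, m = len(xs), len(ys)
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


validation = [(0.0,), (1.0,)]
cost_close, _ = sinkhorn_cost([(0.1,), (0.9,)], validation)
cost_far, plan_far = sinkhorn_cost([(2.0,), (3.0,)], validation)
```

A training set that lies close to the validation set obtains a much smaller cost than a shifted one, which is the intuition behind using the distance as a model-free proxy for dataset value.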


References

[Aga17S]

Second-Order Stochastic Optimization for Machine Learning in Linear Time, Naman Agarwal, Brian Bullins, Elad Hazan.

2017

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine …

[Cas09P]

Polynomial calculation of the Shapley value based on sampling, Javier Castro, Daniel Gómez, Juan Tejada.

May 2009

In this paper we develop a polynomial method based on sampling theory that can be used to estimate the Shapley value (or any semivalue) for cooperative games. Besides analyzing the complexity problem, we examine some desirable statistical properties of the proposed approach and provide some computational results.


[Gho19D]

Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou.

May 2019

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, …

[Gho20D]

A Distributional Framework For Data Valuation, Amirata Ghorbani, Michael Kim, James Zou.

Nov 2020

Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for …

[Gro23S]

Studying Large Language Model Generalization with Influence Functions, Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman.

Aug 2023

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? …

[Jia19aE]

Efficient task-specific data valuation for nearest neighbor algorithms, Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, Dawn Song.

Jul 2019

Given a data set D containing millions of data points and a data consumer who is willing to pay \$X to train a machine learning (ML) model over D, how should we distribute this \$X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, …

[Jus23L]

LAVA: Data Valuation without Pre-Specified Learning Algorithms, Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia.

Feb 2023

Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a …

[Koh17U]

Understanding Black-box Predictions via Influence Functions, Pang Wei Koh, Percy Liang.

Jul 2017

How can we explain the predictions of a black-box model? In this paper, we use influence functions — a classic technique from robust statistics — to trace a model’s prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a …

[Kwo22B]

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning, Yongchan Kwon, James Zou.

Jan 2022

Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, …

[Kwo23D]

Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value, Yongchan Kwon, James Zou.

Jul 2023

Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as …

[Kwo21E]

Efficient Computation and Analysis of Distributional Shapley Values, Yongchan Kwon, Manuel A. Rivas, James Zou.

Mar 2021

Distributional data Shapley value (DShapley) has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. DShapley develops the founda...


[Mal14B]

Bounding the Estimation Error of Sampling-based Shapley Value Approximation, Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, Alex Rogers.

Feb 2014

The Shapley value is arguably the most central normative solution concept in cooperative game theory. It specifies a unique way in which the reward from cooperation can be "fairly" divided among players. While it has a wide range of real world applications, its use is in many cases hampered by the hardness of its computation. A number of researchers have tackled this problem by (i) focusing on …

[Noh23D]

Data Valuation Without Training of a Model, Ki Nohyun, Hoyong Choi, Hye Won Chung.

Feb 2023

Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model. Such attempts reveal characteristics and importance of individual instances, which may provide useful information in diagnosing and improving deep learning. However, most of the existing works on data valuation require actual training of a …

[Sch22S]

Scaling Up Influence Functions, Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov.

Jun 2022

We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) …

[Sch22C]

CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification, Stephanie Schoch, Haifeng Xu, Yangfeng Ji.

Oct 2022

Data valuation, or the valuation of individual datum contributions, has seen growing interest in machine learning due to its demonstrable efficacy for tasks such as noisy label detection. In particular, due to the desirable axiomatic properties, several Shapley value approximations have been proposed. In these methods, the value function is usually defined as the predictive accuracy over the …

[Sim22D]

Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges, Rachael Hwee Ling Sim, Xinyi Xu, Bryan Kian Hsiang Low.

Jul 2022

Electronic proceedings of IJCAI 2022


[Wan23D]

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, Jiachen T. Wang, Ruoxi Jia.

Apr 2023

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the …

[Wan22I]

Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning, Tianhao Wang, Yu Yang, Ruoxi Jia.

Apr 2022

The Shapley value (SV) and Least core (LC) are classic methods in cooperative game theory for cost/profit sharing problems. Both methods have recently been proposed as a principled solution for data valuation tasks, i.e., quantifying the contribution of individual datum in machine learning. However, both SV and LC suffer computational challenges due to the need for retraining models on …

[Wu22D]

DAVINZ: Data Valuation using Deep Neural Networks at Initialization, Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low.

Jun 2022

Recent years have witnessed a surge of interest in developing trustworthy methods to evaluate the value of data in many real-world applications (e.g., collaborative machine learning, data marketplaces). Existing data valuation methods typically valuate data using the generalization performance of converged machine learning models after their long-term model training, hence making data valuation on …

[Wu23V]

Variance reduced Shapley value estimation for trustworthy data valuation, Mengmeng Wu, Ruoxi Jia, Changle Lin, Wei Huang, Xiangyu Chang.

Nov 2023

Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley and approximate it by means of the permutation sampling algorithm. To make up for the large estimation variance of the permutation sampling that hinders the development of the data marketplace, …

[Xu21V]

Validation Free and Replication Robust Volume-based Data Valuation, Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, Bryan Kian Hsiang Low.

2021

Data valuation arises as a non-trivial challenge in real-world use cases such as collaborative machine learning, federated learning, trusted data sharing, data marketplaces. The value of data is often associated with the learning performance (e.g., validation accuracy) of a model trained on the data, which introduces a close coupling between data valuation and validation. However, a validation …


[Yan21I]

If You Like Shapley Then You’ll Love the Core, Tom Yan, Ariel D. Procaccia.

May 2021

The prevalent approach to problems of credit assignment in machine learning — such as feature and data valuation— is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community’s current …

[Yon21W]

Who's Responsible? Jointly Quantifying the Contribution of the Learning Algorithm and Data, Gal Yona, Amirata Ghorbani, James Zou.

Jul 2021

A learning algorithm A trained on a dataset D is revealed to have poor performance on some subpopulation at test time. Where should the responsibility for this lay? It can be argued that the data is responsible, if for example training A on a more representative dataset D' would have improved the performance. But it can similarly be argued that A itself is at fault, if training a different variant …


[Zha22R]

Rethinking Influence Functions of Neural Networks in the Over-Parameterized Regime, Rui Zhang, Shihua Zhang.

Jun 2022

Understanding the black-box prediction for neural networks is challenging. To achieve this, early studies have designed influence function (IF) to measure the effect of removing a single training point on neural networks. However, the classic implicit Hessian-vector product (IHVP) method for calculating IF is fragile, and theoretical analysis of IF in the context of neural networks is still …

