Rather than better models or more data, *good* data is very often the key to a
successful application of machine learning. Sophisticated models can only go so
far and, almost invariably for real business applications, improvements in data
acquisition, annotation and cleaning are a much better investment of resources
than researching complex models. As part of our mission to help practitioners
make the most of their time and their data, we have developed **pyDVL**, the
Python Data Valuation Library.

As of **version 0.8.1**, pyDVL provides robust, parallel implementations of most
popular methods for data valuation. We are also developing a framework for the
computation of influence functions, with lazy evaluation of influence factors,
out-of-core computation and parallelization, which makes it possible to compute
influence functions for large models and datasets.

Finally, we are also working on a benchmarking suite to compare all methods. In the documentation, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples for most of them.

To get started, run `pip install pydvl`, or check out the documentation.

## Methods for data valuation

- Leave One Out
- Data Shapley [Gho19D] values with different sampling methods
- Truncated Monte Carlo Shapley [Gho19D]
- Exact Data Shapley for KNN [Jia19aE]
- Owen sampling [Okh21M]
- Group testing Shapley [Jia19aE]
- Least Core [Yan21I]
- Data Utility Learning [Wan22I]
- Data Banzhaf [Wan22D]
- Beta Shapley [Kwo22B]
- Generalized semi-values, subsuming Shapley, Banzhaf and Beta Shapley in one framework
- Data-OOB [Kwo23D]
- Class-Wise Shapley [Sch22C]
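
To make the first entries of the list above more concrete, here is a minimal, dependency-free sketch of Leave One Out and of a permutation-based Monte Carlo estimate of Data Shapley values. It does not use pyDVL's API: the function names (`mean_model_utility`, `leave_one_out`, `permutation_shapley`) and the toy utility are assumptions made purely for illustration.

```python
import random
from typing import Callable, List, Sequence

Utility = Callable[[Sequence[float]], float]


def mean_model_utility(subset: Sequence[float], validation: Sequence[float]) -> float:
    """Toy utility (an assumption): negative validation MSE of a model that
    always predicts the mean of its training subset (0.0 if the subset is empty)."""
    prediction = sum(subset) / len(subset) if subset else 0.0
    return -sum((v - prediction) ** 2 for v in validation) / len(validation)


def leave_one_out(train: Sequence[float], u: Utility) -> List[float]:
    """Value of point i = utility of the full set minus utility without i."""
    train = list(train)
    full = u(train)
    return [full - u(train[:i] + train[i + 1:]) for i in range(len(train))]


def permutation_shapley(
    train: Sequence[float], u: Utility, n_permutations: int, seed: int = 0
) -> List[float]:
    """Monte Carlo Shapley: average the marginal contribution of each point
    over random orderings of the training set."""
    rng = random.Random(seed)
    indices = list(range(len(train)))
    values = [0.0] * len(train)
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset: List[float] = []
        previous = u(subset)
        for idx in indices:
            subset.append(train[idx])
            current = u(subset)
            values[idx] += current - previous
            previous = current
    return [v / n_permutations for v in values]


if __name__ == "__main__":
    train = [1.0, 1.1, 0.9, 5.0]  # the last point is an outlier
    validation = [1.0, 1.05, 0.95]
    u = lambda s: mean_model_utility(s, validation)
    print("LOO values:    ", leave_one_out(train, u))
    print("Shapley values:", permutation_shapley(train, u, n_permutations=200))
```

In this toy setting both methods assign the lowest value to the outlier. The Shapley estimate also satisfies the efficiency property: the values sum to the utility of the full set minus the utility of the empty set, because the marginal contributions telescope within each permutation.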

## Methods for influence functions

- Exact computation
- Conjugate Gradient [Koh17U]
- Linear (time) Stochastic Second-Order Algorithm (LiSSA) [Aga17S]
- Arnoldi iteration [Sch21S]
- Kronecker-factored Approximate Curvature [Mar15O]
- Eigenvalue-corrected Kronecker-Factored Approximate Curvature [Geo18F, Gro23S]
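
As a rough illustration of what the methods above compute, the sketch below evaluates the influence of a training point on a test loss for a tiny ridge regression, using the up-weighting convention $\mathcal{I}(z, z_{\text{test}}) = -\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z)$ of [Koh17U] and a plain conjugate gradient solver for the inverse-Hessian-vector product. It is not pyDVL's implementation: all names are hypothetical, the model is deliberately small, and sign conventions differ between references.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(A: Matrix, v: Sequence[float]) -> Vector:
    return [dot(row, v) for row in A]


def conjugate_gradient(A: Matrix, b: Vector, tol: float = 1e-12, max_iter: int = 100) -> Vector:
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def ridge_hessian(X: Matrix, lam: float) -> Matrix:
    """Hessian of (1/n) * sum_i 0.5 * (theta.x_i - y_i)^2 + 0.5 * lam * |theta|^2."""
    n, d = len(X), len(X[0])
    H = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x in X:
        for i in range(d):
            for j in range(d):
                H[i][j] += x[i] * x[j] / n
    return H


def fit_ridge(X: Matrix, y: Vector, lam: float) -> Vector:
    """Minimizer of the ridge objective: solves H theta = X^T y / n."""
    n, d = len(X), len(X[0])
    rhs = [sum(x[i] * t for x, t in zip(X, y)) / n for i in range(d)]
    return conjugate_gradient(ridge_hessian(X, lam), rhs)


def point_gradient(theta: Vector, x: Vector, y: float) -> Vector:
    """Gradient of the per-point loss 0.5 * (theta.x - y)^2 w.r.t. theta."""
    residual = dot(theta, x) - y
    return [residual * xi for xi in x]


def influence(theta: Vector, H: Matrix, train_point: Tuple[Vector, float],
              test_point: Tuple[Vector, float]) -> float:
    """-grad L(test)^T H^{-1} grad L(train), with H^{-1} grad obtained via CG."""
    ihvp = conjugate_gradient(H, point_gradient(theta, *train_point))
    return -dot(point_gradient(theta, *test_point), ihvp)


if __name__ == "__main__":
    X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
    y = [0.1, 1.0, 2.1, 6.0]  # the last label is off the trend
    lam = 0.1
    theta = fit_ridge(X, y, lam)
    H = ridge_hessian(X, lam)
    test_point = ([1.0, 1.5], 1.5)
    for x_i, y_i in zip(X, y):
        print(x_i, y_i, influence(theta, H, (x_i, y_i), test_point))
```

Solving a linear system with the Hessian instead of inverting it is what makes this approach scale: for larger models only Hessian-vector products are needed, which is the idea behind the iterative methods in the list above.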

## Roadmap

We are currently implementing or plan to implement:

- Standardized benchmarks for all valuation methods (v0.10)
- Improved parallelization strategies (v0.9)
- LAVA [Jus23L]
- $\delta$-Shapley [Wat23A]
- Neural Tangent Kernel scorer [Wu22D]
- Variance-reduced sampling methods for Shapley values
- (Approximate) Maximum Influence Perturbation [Bro21A]

To see what new methods, features and improvements are coming, check out the issues on GitHub.