Rather than better models or more data, good data is very often the key to a successful application of machine learning. Sophisticated models can only go so far and, almost invariably in real business applications, improvements in data acquisition, annotation and cleaning are a much better investment of resources than research into complex models. As part of our mission to help practitioners make the most of their time and their data, we have developed pyDVL, the Python Data Valuation Library.
As of version 0.8.1, pyDVL provides robust, parallel implementations of the most popular methods for data valuation. We are also developing a framework for the computation of influence functions, with lazy evaluation of influence factors and out-of-core, parallelized computation, enabling influence functions for large models and datasets.
Finally, we are building a benchmarking suite to compare all methods. In the documentation, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples for most of them.
Install pyDVL with `pip install pydvl`, or check out the documentation.

Methods for data valuation
- Leave One Out
- Data Shapley [Gho19D] values with different sampling methods
- Truncated Monte Carlo Shapley [Gho19D]
- Exact Data Shapley for KNN [Jia19aE]
- Owen sampling [Okh21M]
- Group testing Shapley [Jia19aE]
- Least Core [Yan21I]
- Data Utility Learning [Wan22I]
- Data Banzhaf [Wan23D]
- Beta Shapley [Kwo22B]
- Generalized semi-values, subsuming Shapley, Banzhaf and Beta-Shapley into one framework.
- Data-OOB [Kwo23D]
- Class-Wise Shapley [Sch22C]
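To make the basic idea behind these methods concrete, here is a from-scratch sketch of the simplest one, Leave-One-Out valuation (this is illustrative code, not pyDVL's API): the value of a data point is the drop in utility when it is removed from the training set. The toy `utility` function below is a hypothetical stand-in for a real model score such as validation accuracy.

```python
from typing import Callable, FrozenSet, List

def leave_one_out_values(
    n: int, utility: Callable[[FrozenSet[int]], float]
) -> List[float]:
    """Value of point i = u(D) - u(D without i)."""
    full = frozenset(range(n))
    u_full = utility(full)
    return [u_full - utility(full - {i}) for i in range(n)]

# Toy additive utility over three points; point 2 contributes nothing,
# so its value should be zero.
scores = {0: 3.0, 1: 2.0, 2: 0.0}
utility = lambda s: sum(scores[i] for i in s)

values = leave_one_out_values(3, utility)
print(values)  # [3.0, 2.0, 0.0]
```

Shapley-based methods refine this idea by averaging the marginal contribution of a point over all subsets of the data rather than only the full set, which is why they require Monte Carlo sampling to remain tractable.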
Methods for influence functions
- Exact computation
- Conjugate Gradient [Koh17U]
- Linear (time) Stochastic Second-Order Algorithm (LiSSA) [Aga17S]
- Arnoldi iteration [Sch22S]
- Kronecker-factored Approximate Curvature [Mar15O]
- Eigenvalue-corrected Kronecker-Factored Approximate Curvature [Geo18F, Gro23S]
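To illustrate the quantity these methods approximate, here is a worked example (again illustrative, not pyDVL code) of exact influence computation for a one-parameter least-squares model, where the Hessian is a scalar and can be inverted directly. The approximate methods above (CG, LiSSA, Arnoldi, K-FAC) exist precisely because this inverse-Hessian-vector product is infeasible to compute directly for large models.

```python
# One-parameter model y = theta * x with squared loss.
train = [(1.0, 1.0), (2.0, 2.0), (3.0, 0.0)]  # third point is an outlier

# Least-squares fit: theta = sum(x*y) / sum(x*x)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
theta = sxy / sxx

def grad(x, y):
    """Gradient of the per-point loss 0.5*(theta*x - y)**2 w.r.t. theta."""
    return (theta * x - y) * x

hessian = sxx  # second derivative of the total loss w.r.t. theta

# Influence of up-weighting training point z on the loss at z_test:
#   I(z, z_test) = -grad(z_test) * H^{-1} * grad(z)
x_t, y_t = 1.5, 1.5  # test point on the clean trend y = x
influences = [-grad(x_t, y_t) * grad(x, y) / hessian for x, y in train]
```

Here the outlier receives a positive influence (up-weighting it increases the test loss), while the two clean points receive negative influences, which is exactly how influence functions flag harmful training data.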
Roadmap
We are currently implementing or plan to implement:
- Standardized benchmarks for all valuation methods (v0.10)
- Improved parallelization strategies (v0.9)
- LAVA [Jus23L]
- $\delta$-Shapley [Wat23A]
- Neural Tangent Kernel scorer [Wu22D]
- Variance-reduced sampling methods for Shapley values
- (Approximate) Maximum Influence Perturbation [Bro21A]
To see what new methods, features and improvements are coming, check out the issues on GitHub.