In two of our previous paper pills about Data Valuation methods, namely Beta-Shapley and Data Banzhaf, we discussed how to make the computed values more robust to noise in the utility evaluation.
Whereas [Kwo22B] and [Wan23D] address the noise in stochastic utilities, in the paper we study here, [Wu23V], the authors improve the quality of the values by tackling a different problem. They reduce the variance of the Monte Carlo estimate using stratified sampling, similarly to previous work in [Mal14B] and [Cas17I]. They call their method Variance Reduce Data Shapley (VRDS).
Starting with the combinatorial formulation of Shapley values, one can divide the coalitions into strata according to their size when calculating the marginal contributions:
$$\begin{align} v(x_i) &= \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}, |S| = k} \binom{n-1}{k}^{-1} \left[u(S \cup \{x_i\}) − u(S)\right]\\ &= \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \subseteq D \setminus \{x_i\}, |S| = k} \left[u(S \cup \{x_i\}) − u(S)\right]. \end{align}$$
Grouping by size the coalitions which do not contain the data point $i,$ we have $n$ strata $S_0, S_1, \dots , S_{n−1}$ such that $S_k = \{S \subseteq N \ {i}, |S| = k\}$ contains all the coalitions with size $k.$
The marginal contribution of data point $i$ within stratum $S_k$ is:
$$ v^k(x_i) = \binom{n-1}{k}^{-1} \sum_{S \subseteq D \setminus \{x_i\}, |S| = k} \left[u(S \cup \{x_i\}) − u(S)\right], $$
and with it, the Shapley value of point $i$ can be written as:
$$ v(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} v^k(x_i). $$
A standard Monte Carlo approximation (to reduce the number of evaluations of $u$) is then:
$$ \hat{v}(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} \hat{v}^k(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} \frac{1}{m_k} \sum_{j=1}^{m_k} \left[u(S^k_j \cup \{x_i\}) − u(S^k_j)\right], $$
where $S^k_j$ is the $j$th sample coalition in stratum $k$ and $m_k$ is the number of samples in the stratum. Grouping coalitions by size into strata allows us to optimise the number of samples taken from each stratum.
In Theorem 4.1 of [Wu23V], the authors show that the choice of $m_k$ which optimally reduces variance for a given number of samples $m$ is proportional to $\sigma_{i,k},$ the standard deviation of $\hat{v}^k(x_i)$:
$$ \tilde{m}_k = m\frac{\sigma_{i,k}}{\sum_{j=0}^{n−1} \sigma_{i,j}}, \quad k = 0, \dots , n − 1. $$
Because the variance of the estimators $\hat{v}^k(x_i)$ is unknown, the authors suggest substituting a general $f(k)$ for $\sigma_{i,k}$ and propose a few different possibilities. They pick
$$ f(k) = (k+1)^a, \quad a < 0, $$
a choice that, when plugged into the formula above, is a generalization of the optimal value obtained in [Mal14B]. 1 1 Contrary to VRDS, this is derived from a Hoeffding bound on the quality of the estimator, after solving a simple minimization problem, leading to a specific choice ofexponent $a = -\frac{2}{3},$ cf [Mal14B] Eq. (14).
Taking into account that sample size should be an integer, we can set the value of $m_k = \min(1, \lfloor \tilde{m}_k \rfloor).$ However, this implies that some samples may be left unused as $\sum_{k=0}^{n-1} m_k \le m.$ In that case, the authors sequentially increase the value of $m_k$ from $k = 0$ to $n − 1$ until the sum exceeds $m.$
We note that previously, [Cas17I] used estimated standard deviations $\tilde{\sigma}_{i,k}$ instead:
$$ \tilde{m}_{i,k} = m \frac{\tilde{\sigma}_{i,k}^2}{ \sum_{j=0}^{n-1} \sum_{l=0}^{n-1} \tilde{\sigma}_{j,l}^2}, \quad k = 0, \dots , n − 1. $$
Sadly, all of these approaches assume that the number of samples is known beforehand which is almost never the case in practice.
To empirically demonstrate that the stratified sampling algorithm for VRDS can estimate the exact data Shapley values with a variance smaller than that obtained by the permutation sampling algorithm, the authors then run a few different experiments on a plethora of datasets:
Comparison of variance of VRDS and Permutation sampling:
From Figure 2, we can clearly see that VRDS’s variance is lower than Permutation sampling. However, in many data valuation applications we are more interested in the order or ranking than in the actual value, and it is not clear whether the rank of VRDS values remains truly stable across experiments. The authors should have added an experiment to evaluate the ranking stability of VRDS, similar to what has been done in Data Banzhaf [Wan23D].
Performance of VRDS using different exponents $a:$
Figure 3 shows that VRDS consistently achieves lower variance than permutation sampling, and that $a=-1$ (black, bottom line) seems to yield the best result. It would have been interesting to see how the choice in [Mal14B] of $a=-\frac{2}{3}$ would have performed.
Removal of best and worst data groups:
The plots in Figure 4 don’t clearly show that VRDS performs better than permutation sampling for point removal tasks, and, crucially, lack confidence intervals. The similar performance could be due to the fact that even though VRDS’ variance is lower it doesn’t necessarily converge to correct values, which may indicate that the estimation is biased. It would also have been beneficial to compare it to other methods such as Data Banzhaf.
Alas, only the first experiment includes confidence intervals. Without more details and a study of the quality of the approximation, possibly using rank stability statistics, it is hard to decide on the interest of VRDS in practice. The experiments in general lack details and no code was provided to reproduce their results.
However, the use of stratified sampling seems like a worthwhile addition to the data Shapley methods as the authors wrote in their conclusion: it would be interesting to apply it to other data valuations methods that rely on sampling to estimate their values (e.g. Beta Shapley, Data Banzhaf, Truncated Monte Carlo Shapley).
If you are interested in testing out data valuation algorithms consider
using the TransferLab’s pyDVL package for data
valuation. To try it out, install with pip install pydvl
and
read the documentation.