Towards a statistical theory of data selection under weak supervision

30 Oct 24 16:00 UTC

Pulkit Tandon is a Research Engineer at Granica, where he focuses on data selection, compression, and optimization for AI applications. He also serves as an Adjunct Lecturer at Stanford University, teaching courses on data compression. He holds a Ph.D. in Electrical Engineering from Stanford, with prior experience at Netflix working on video encoding technologies. His expertise lies in statistical learning, machine learning, and optimization in both academia and industry.

Pulkit Tandon, research engineer at Granica, will present his work on data selection, showing how using surrogate models to select subsamples of a data set for labeling can improve training efficiency and performance.

Abstract

Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume that we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and that we have access to a "surrogate model" that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) data selection can be very effective, in particular beating training on the full sample in some cases; (ii) certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
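The pipeline described in the abstract (score unlabeled samples with a surrogate, select a subset $G$, acquire labels, fit by regularized empirical risk minimization) can be sketched as follows. This is a minimal illustration and not the authors' method: the synthetic logistic data, the noisy-copy surrogate `theta_surr`, the uncertainty-based sampling rule with exponent `alpha`, and the helper names `select_subset` and `fit_logistic_ridge` are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_ridge(X, y, lam=1e-2, weights=None, steps=500, lr=0.5):
    """Regularized ERM: weighted logistic loss + (lam / 2) * ||theta||^2,
    minimized by plain gradient descent (hypothetical helper)."""
    n, d = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the loss is a weighted average
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (w * (sigmoid(X @ theta) - y)) + lam * theta
        theta -= lr * grad
    return theta


def select_subset(X, theta_surrogate, n, rng, alpha=1.0):
    """Sample n indices without replacement, with probability proportional to
    (p * (1 - p)) ** alpha, where p is the surrogate's predicted probability.
    alpha = 0 gives uniform subsampling; larger alpha favors uncertain points.
    This scoring rule is an assumption chosen for illustration."""
    p = sigmoid(X @ theta_surrogate)
    scores = (p * (1.0 - p) + 1e-12) ** alpha
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=n, replace=False, p=probs)
    return idx, probs[idx]


rng = np.random.default_rng(0)
N, d, n = 2000, 10, 200

# Synthetic data: labels exist but are treated as hidden until acquired.
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (rng.random(N) < sigmoid(X @ theta_true)).astype(float)

# Assumed surrogate: a noisy copy of the true parameters (better than random).
theta_surr = theta_true + rng.normal(scale=1.0, size=d)

# Select G with |G| = n < N, "acquire" labels only for G, then train.
G, probs_G = select_subset(X, theta_surr, n, rng, alpha=1.0)
theta_hat = fit_logistic_ridge(X[G], y[G], lam=1e-2)

# Evaluate on fresh samples drawn from the same synthetic distribution.
X_test = rng.normal(size=(5000, d))
y_test = (rng.random(5000) < sigmoid(X_test @ theta_true)).astype(float)
accuracy = np.mean((sigmoid(X_test @ theta_hat) > 0.5) == (y_test > 0.5))
print(f"test accuracy with n={n} selected samples: {accuracy:.3f}")
```

Setting `alpha=0` reduces the selection to uniform subsampling, and passing `weights=1.0 / probs_G` to `fit_logistic_ridge` gives an approximately reweighted variant in the spirit of the "unbiased reweighted subsampling" the abstract mentions. Comparing these choices on the same data is one way to explore the abstract's claim that such reweighting can be suboptimal, although this toy setup is not expected to reproduce the paper's results.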

References

[Kol23S] Germain Kolossov, Andrea Montanari, Pulkit Tandon. Towards a statistical theory of data selection under weak supervision. Oct 2023.
