In the paper “A new coefficient of correlation” [Cha21N], the well-known mathematician Sourav Chatterjee proposes a new measure of dependence between random variables that goes beyond the linear dependence captured by the usual correlation coefficient. This measure has many desirable properties, among them simplicity, interpretability, the existence of efficient estimators, and suitability for statistical tests. It should be of interest to practitioners, since quantifying the degree to which one quantity can be viewed as a noiseless function of another is a ubiquitous task.
The main text is rather long but very well written, providing a lot of intuition, analysis, and experiments. I highly recommend reading it, at least partially. However, the main ideas can be summarized in a few lines (as the author also does in the first section). The only minor “weakness” of the paper is that some properties that are obvious to seasoned statisticians are stated without explanation. Therefore, in this paper pill I add a few sentences about them.
Given scalar random variables $X$ and $Y$, the new coefficient is given by
$$ \xi(X,Y) := \frac{\int \text{Var} \left( \mathbb{E} (\mathbb{1}_{ \{ Y>t \} } \mid X) \right) d\mu(t)} {\int \text{Var} \left( \mathbb{1}_{ \{ Y>t \} } \right) d\mu(t) } , $$
where $\mu$ is the law of $Y$.
Now note the following properties of any pair of random variables $Z$ and $X$:
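- If $Z$ is independent of $X$, then $\mathbb{E}(Z \mid X) = \mathbb{E}(Z)$, i.e. the conditional expectation is a constant random variable taking the value $\mathbb{E}(Z)$. Therefore, $\text{Var}(\mathbb{E}(Z \mid X)) = 0$.
- If $Z$ is a deterministic function of $X$, i.e. there is some $f$ with $Z = f(X)$, then $\mathbb{E}(Z \mid X) = f(X) = Z$, and hence $\text{Var}(\mathbb{E}(Z \mid X)) = \text{Var}(Z)$.
- The law of total variance states: $\text{Var}(Z) = \mathbb{E}(\text{Var}(Z \mid X)) + \text{Var}(\mathbb{E}(Z \mid X)) \geq \text{Var}(\mathbb{E}(Z \mid X))$. Equality holds iff the conditional variance $\text{Var}(Z \mid X)$ vanishes almost surely, i.e. iff $Z$ is (almost surely) a deterministic function of $X$.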
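The decomposition in the last item can be checked numerically when $X$ takes only a few values. The following is a minimal sketch, assuming NumPy is available; the helper name `total_variance_parts` and the toy data are made up for illustration and do not come from the paper. For the empirical distribution of a sample the identity holds exactly, up to floating-point error.

```python
import numpy as np

def total_variance_parts(x, z):
    """Return (Var(Z), E[Var(Z|X)], Var(E[Z|X])) for the empirical law of (x, z).

    Hypothetical helper for illustration; x is assumed to be discrete.
    """
    x = np.asarray(x)
    z = np.asarray(z, dtype=float)
    # Group the sample by the distinct values of x.
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    weights = counts / x.size                                # P(X = value)
    cond_mean = np.bincount(inverse, weights=z) / counts     # E[Z | X = value]
    cond_sq = np.bincount(inverse, weights=z**2) / counts    # E[Z^2 | X = value]
    cond_var = cond_sq - cond_mean**2                        # Var(Z | X = value)
    within = np.sum(weights * cond_var)                      # E[Var(Z | X)]
    overall_mean = np.sum(weights * cond_mean)
    between = np.sum(weights * (cond_mean - overall_mean) ** 2)  # Var(E[Z | X])
    return np.var(z), within, between

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=10_000)

cases = {
    "noisy function of X": x**2 + rng.normal(size=x.size),
    "deterministic function of X": x**2,
    "independent of X": rng.normal(size=x.size),
}
for name, z in cases.items():
    total, within, between = total_variance_parts(x, z)
    print(f"{name}: Var(Z)={total:.4f}  E[Var(Z|X)]={within:.4f}  Var(E[Z|X])={between:.4f}")
```

In the deterministic case the within-group term should vanish, and in the independent case the between-group term should be close to zero (not exactly zero, because the sample is finite).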
Applying these facts with $Z := \mathbb{1}_{\{Y>t\}}$ for each $t$ and integrating over $t$, we see that:
- If $Y$ is independent of $X$, the Chatterjee coefficient is zero.
- If $Y$ is a deterministic function of $X$, the coefficient becomes 1.
- The new coefficient takes values in the interval $[0,1]$.
The converses of the first two statements are also true, as proved in the paper. Chatterjee gives a simple and efficient estimator of $\xi$ from finite samples (computable in $O(n \log n)$ time) and analyzes its asymptotic theory and general properties. Moreover, various extensions (e.g. to multivariate random variables) and applications are studied. As a somewhat surprising side result, the analysis based on $\xi$ sheds new light on one of the first datasets in statistics: Galton’s peas data.
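To make the estimator concrete, here is a minimal sketch of it in Python, assuming NumPy and assuming there are no ties in the data. As I recall it from [Cha21N], one sorts the pairs by $X$, lets $r_i$ be the rank of the $i$-th $Y$ in that order, and computes $\xi_n = 1 - 3 \sum_{i=1}^{n-1} |r_{i+1} - r_i| / (n^2 - 1)$; the paper also gives a more general formula for data with ties, which is not implemented here. The function name `xi_n` is my own choice, and this is not the author's reference implementation.

```python
import numpy as np

def xi_n(x, y):
    """Rank-based estimate of xi(X, Y), using the no-ties formula.

    Illustrative sketch only: assumes x and y are 1-D, equally long,
    and contain no repeated values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if y.size != n or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    # Sort the pairs by x; the sort dominates the cost, hence O(n log n).
    y_sorted = y[np.argsort(x, kind="stable")]
    # ranks[i] = rank (1..n) of the i-th y after sorting by x.
    ranks = np.empty(n, dtype=np.int64)
    ranks[np.argsort(y_sorted, kind="stable")] = np.arange(1, n + 1)
    # Small jumps between consecutive ranks indicate that y is close to a function of x.
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)

print("y = x^2          :", xi_n(x, x**2))                     # non-monotone but noiseless
print("Pearson, y = x^2 :", np.corrcoef(x, x**2)[0, 1])        # linear correlation for comparison
print("independent y    :", xi_n(x, rng.normal(size=x.size)))
print("x from x^2       :", xi_n(x**2, x))                     # the coefficient is not symmetric
```

For a noiseless relationship the estimate should be close to 1 even when the Pearson correlation is near 0, and for independent samples it should be close to 0. Note that the estimate can be slightly negative in finite samples, and that swapping the arguments generally changes the value.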
Software-wise, there is an implementation in R by the author and a somewhat abandoned Python package. Maybe in the future the coefficient will make it into scikit-learn or other established libraries.