Robust Estimation of a Location Parameter, Peter J. Huber. The Annals of Mathematical Statistics (1964)


This paper contains a new approach toward a theory of robust estimation; it treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators--intermediaries between sample mean and sample median--that are asymptotically most robust (in a sense to be specified) among all translation invariant estimators. For the general background, see Tukey (1960) (p. 448 ff.). Let $x_1, \cdots, x_n$ be independent random variables with common distribution function F(t - ξ). The problem is to estimate the location parameter ξ, but with the complication that the prototype distribution F(t) is only approximately known. I shall primarily be concerned with the model of indeterminacy F = (1 - ε)Φ + εH, where $0 \leqq \epsilon < 1$ is a known number, $\Phi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp(-\frac{1}{2}s^2)\,ds$ is the standard normal cumulative and H is an unknown contaminating distribution. This model arises for instance if the observations are assumed to be normal with variance 1, but a fraction ε of them is affected by gross errors. Later on, I shall also consider other models of indeterminacy, e.g., $\sup_t |F(t) - \Phi(t)| \leqq \epsilon$. Some inconvenience is caused by the fact that location and scale parameters are not uniquely determined: in general, for fixed ε, there will be several values of ξ and σ such that $\sup_t|F(t) - \Phi((t - \xi)/\sigma)| \leqq \epsilon$, and similarly for the contaminated case. Although this inherent and unavoidable indeterminacy is small if ε is small and is rather irrelevant for practical purposes, it poses awkward problems for the theory, especially for optimality questions.
To remove this difficulty, one may either (i) restrict attention to symmetric distributions, and estimate the location of the center of symmetry (this works for ξ but not for σ); or (ii) one may define the parameter to be estimated in terms of the estimator itself, namely by its asymptotic value for sample size n → ∞; or (iii) one may define the parameters by arbitrarily chosen functionals of the distribution (e.g., by the expectation, or the median of F). All three possibilities have unsatisfactory aspects, and I shall usually choose the variant which is mathematically most convenient. It is interesting to look back to the very origin of the theory of estimation, namely to Gauss and his theory of least squares. Gauss was fully aware that his main reason for assuming an underlying normal distribution and a quadratic loss function was mathematical, i.e., computational, convenience. In later times, this was often forgotten, partly because of the central limit theorem. However, if one wants to be honest, the central limit theorem can at most explain why many distributions occurring in practice are approximately normal. The stress is on the word "approximately." This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): What happens if the true distribution deviates slightly from the assumed normal one? As is now well known, the sample mean then may have a catastrophically bad performance: seemingly quite mild deviations may already explode its variance. Tukey and others proposed several more robust substitutes--trimmed means, Winsorized means, etc.--and explored their performance for a few typical violations of normality. A general theory of robust estimation is still lacking; it is hoped that the present paper will furnish the first few steps toward such a theory. 
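The remark that mild contamination can inflate the variance of the sample mean can be checked numerically. The sketch below is only an illustration under assumptions of my own: the contaminating distribution H is taken to be N(0, contam_sd²), and the values ε = 0.05, contam_sd = 10, n = 101 are arbitrary; none of these choices come from the paper. For such a symmetric H, the variance of a single observation is (1 - ε) + ε·contam_sd², which is 5.95 for the default values, against 1 for the uncontaminated normal.

```python
import random
import statistics


def contaminated_sample(n, eps, contam_sd, rng):
    """Draw n points from (1 - eps) * N(0, 1) + eps * N(0, contam_sd**2).

    The normal contaminant is an illustrative assumption; the paper
    leaves the contaminating distribution H unknown.
    """
    return [
        rng.gauss(0.0, contam_sd) if rng.random() < eps else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


def mixture_variance(eps, contam_sd):
    """Variance of one observation: (1 - eps) * 1 + eps * contam_sd**2."""
    return (1.0 - eps) + eps * contam_sd ** 2


def compare_mean_and_median(n=101, eps=0.05, contam_sd=10.0, reps=2000, seed=0):
    """Monte Carlo variances of the sample mean and sample median, times n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = contaminated_sample(n, eps, contam_sd, rng)
        means.append(sum(xs) / n)
        medians.append(statistics.median(xs))
    return n * statistics.pvariance(means), n * statistics.pvariance(medians)


if __name__ == "__main__":
    mean_var, median_var = compare_mean_and_median()
    print("theoretical n * Var(mean):", mixture_variance(0.05, 10.0))
    print("simulated   n * Var(mean):", mean_var)
    print("simulated   n * Var(median):", median_var)
```

In this setup the scaled variance of the mean should come out several times larger than that of the median, even though only 5% of the observations are contaminated.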
At the core of the method of least squares lies the idea to minimize the sum of the squared "errors," that is, to adjust the unknown parameters such that the sum of the squares of the differences between observed and computed values is minimized. In the simplest case, with which we are concerned here, namely the estimation of a location parameter, one has to minimize the expression $\sum_i (x_i - T)^2$; this is of course achieved by the sample mean $T = \sum_i x_i / n$. I should like to emphasize that no loss function is involved here; I am only describing how the least squares estimator is defined, and neither the underlying family of distributions nor the true value of the parameter to be estimated enters so far. It is quite natural to ask whether one can obtain more robustness by minimizing another function of the errors than the sum of their squares. We shall therefore concentrate our attention to estimators that can be defined by a minimum principle of the form (for a location parameter): \begin{equation*} \tag{M} T = T_n(x_1, \cdots, x_n) \text{ minimizes } \sum_i \rho(x_i - T), \end{equation*} where $\rho$ is a non-constant function. Of course, this definition generalizes at once to more general least squares type problems, where several parameters have to be determined. This class of estimators contains in particular (i) the sample mean ($\rho(t) = t^2$), (ii) the sample median ($\rho(t) = |t|$), and more generally, (iii) all maximum likelihood estimators ($\rho(t) = -\log f(t)$, where f is the assumed density of the untranslated distribution). These (M)-estimators, as I shall call them for short, have rather pleasant asymptotic properties; sufficient conditions for asymptotic normality and an explicit expression for their asymptotic variance will be given. How should one judge the robustness of an estimator $T_n(x) = T_n(x_1, \cdots, x_n)$? Since ill effects from contamination are mainly felt for large sample sizes, it seems that one should primarily optimize large sample robustness properties.
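The minimum principle (M) can be tried out directly. The following is a minimal sketch, assuming that ρ is convex with its minimum at 0, so that the objective is unimodal and a minimizer lies between the smallest and largest observation. The function name `m_estimate` and the use of golden-section search are my own choices for illustration, not the paper's computational method.

```python
def m_estimate(xs, rho, tol=1e-8):
    """Minimize sum(rho(x - T)) over T by golden-section search.

    Assumes rho is convex with minimum at 0, so the objective is
    unimodal and has a minimizer in [min(xs), max(xs)].
    """
    def objective(t):
        return sum(rho(x - t) for x in xs)

    lo, hi = min(xs), max(xs)
    r = (5 ** 0.5 - 1) / 2  # inverse golden ratio, about 0.618
    a = hi - r * (hi - lo)
    b = lo + r * (hi - lo)
    fa, fb = objective(a), objective(b)
    while hi - lo > tol:
        if fa <= fb:
            # A minimizer lies in [lo, b]; the old a becomes the new b.
            hi, b, fb = b, a, fa
            a = hi - r * (hi - lo)
            fa = objective(a)
        else:
            # A minimizer lies in [a, hi]; the old b becomes the new a.
            lo, a, fa = a, b, fb
            b = lo + r * (hi - lo)
            fb = objective(b)
    return (lo + hi) / 2


if __name__ == "__main__":
    data = [2.1, 3.4, 1.9, 2.8, 9.5]
    print(m_estimate(data, lambda t: t * t))  # close to the sample mean
    print(m_estimate(data, abs))              # close to the sample median
```

With ρ(t) = t² the search returns the sample mean up to numerical tolerance, and with ρ(t) = |t| it returns the sample median, matching cases (i) and (ii) above.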
Therefore, a convenient measure of robustness for asymptotically normal estimators seems to be the supremum of the asymptotic variance (n → ∞) when F ranges over some suitable set of underlying distributions, in particular over the set of all F = (1 - ε)Φ + εH for fixed ε and symmetric H. On second thought, it turns out that the asymptotic variance is not only easier to handle, but that even for moderate values of n it is a better measure of performance than the actual variance, because (i) the actual variance of an estimator depends very much on the behavior of the tails of H, and the supremum of the actual variance is infinite for any estimator whose value is always contained in the convex hull of the observations. (ii) If an estimator is asymptotically normal, then the important central part of its distribution and confidence intervals for moderate confidence levels can better be approximated in terms of the asymptotic variance than in terms of the actual variance. If we adopt this measure of robustness, and if we restrict attention to (M)-estimators, then it will be shown that the most robust estimator is uniquely determined and corresponds to the following ρ: $\rho(t) = \frac{1}{2}t^2$ for $|t| < k$, $\rho(t) = k|t| - \frac{1}{2}k^2$ for $|t| \geq k$, with k depending on ε. This estimator is most robust even among all translation invariant estimators. Sample mean (k = ∞) and sample median (k = 0) are limiting cases corresponding to ε = 0 and ε = 1, respectively, and the estimator is closely related and asymptotically equivalent to Winsorizing. I recall the definition of Winsorizing: assume that the observations have been ordered, $x_1 \leq x_2 \leq \cdots \leq x_n$, then the statistic $T = n^{-1}(g x_{g+1} + x_{g+1} + x_{g+2} + \cdots + x_{n-h} + h x_{n-h})$ is called the Winsorized mean, obtained by Winsorizing the g leftmost and the h rightmost observations.
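The piecewise ρ and the Winsorized mean defined above can be written out as a short sketch. The function names are hypothetical, and the fixed-point iteration in `huber_location` is my own simple choice for solving $\sum_i \psi(x_i - T) = 0$ with ψ = ρ'; it is not taken from the paper.

```python
def huber_rho(t, k):
    """rho(t) = t^2 / 2 for |t| < k, and k|t| - k^2 / 2 for |t| >= k."""
    a = abs(t)
    return 0.5 * t * t if a < k else k * a - 0.5 * k * k


def huber_psi(t, k):
    """Derivative of huber_rho: t clipped to the interval [-k, k]."""
    return max(-k, min(k, t))


def winsorized_mean(xs, g, h):
    """Replace the g smallest values by x_(g+1) and the h largest by
    x_(n-h), then average all n values."""
    s = sorted(xs)
    n = len(s)
    if g < 0 or h < 0 or g + h >= n:
        raise ValueError("need g, h >= 0 and g + h < n")
    middle = s[g:n - h]  # x_(g+1), ..., x_(n-h)
    return (g * middle[0] + sum(middle) + h * middle[-1]) / n


def huber_location(xs, k, tol=1e-12, max_iter=10_000):
    """Solve sum(psi(x - T)) = 0 by repeating T += mean(psi(x - T))."""
    s = sorted(xs)
    t = s[len(s) // 2]  # start from a median-like value
    for _ in range(max_iter):
        step = sum(huber_psi(x - t, k) for x in xs) / len(xs)
        t += step
        if abs(step) < tol:
            break
    return t


if __name__ == "__main__":
    data = [100, 1, 3, 2, 4]
    print(winsorized_mean(data, 1, 1))   # (2 + 2 + 3 + 4 + 4) / 5 = 3.0
    print(huber_location(data, 1.5))     # the outlier's pull is capped at k
    print(huber_location(data, 1000.0))  # large k reproduces the sample mean
```

For the sample 1, 2, 3, 4, 100, a very large k gives the sample mean 22, while k = 1.5 gives 3, the same value as Winsorizing one observation on each side.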
The above most robust (M)-estimators can be described by the same formula, except that in the first and in the last summand, the factors $x_{g+1}$ and $x_{n-h}$ have to be replaced by some numbers u, v satisfying $x_g \leq u \leq x_{g+1}$ and $x_{n-h} \leq v \leq x_{n-h+1}$, respectively; g, h, u and v depend on the sample. In fact, this (M)-estimator is the maximum likelihood estimator corresponding to a unique least favorable distribution $F_0$ with density $f_0(t) = (1 - \epsilon)(2\pi)^{-1/2} e^{-\rho(t)}$. This $f_0$ behaves like a normal density for small t, like an exponential density for large t. At least for me, this was rather surprising--I would have expected an $f_0$ with much heavier tails. This result is a particular case of a more general one that can be stated roughly as follows: Assume that F belongs to some convex set C of distribution functions. Then the most robust (M)-estimator for the set C coincides with the maximum likelihood estimator for the unique $F_0 \in C$ which has the smallest Fisher information number $I(F) = \int (f'/f)^2 f\,dt$ among all $F \in C$. Miscellaneous related problems will also be treated: the case of non-symmetric contaminating distributions; the most robust estimator for the model of indeterminacy $\sup_t|F(t) - \Phi(t)| \leqq \epsilon$; robust estimation of a scale parameter; how to estimate location, if scale and ε are unknown; numerical computation of the estimators; more general estimators, e.g., minimizing $\sum_{i < j} \rho(x_i - T, x_j - T)$, where ρ is a function of two arguments. Questions of small sample size theory will not be touched in this paper.
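The dependence of k on ε can be made concrete from the density $f_0$ alone. Since $f_0$ must integrate to one, integrating the normal center over $|t| < k$ and the two exponential tails gives the condition $(1 - \epsilon)\,(2\Phi(k) - 1 + 2\varphi(k)/k) = 1$, where $\varphi = \Phi'$ is the standard normal density. This relation is worked out here from the stated form of $f_0$ and ρ, as a sketch rather than a quotation of the paper's own equation; the function names and the bisection bounds are assumptions for illustration.

```python
import math


def std_normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def unnormalized_mass(k):
    """Integral of (2*pi)**-0.5 * exp(-rho(t)) over the real line.

    The central part contributes 2*Phi(k) - 1 and the two exponential
    tails contribute 2*phi(k)/k in total.
    """
    return 2.0 * std_normal_cdf(k) - 1.0 + 2.0 * std_normal_pdf(k) / k


def k_from_eps(eps, lo=1e-9, hi=40.0, iters=200):
    """Solve (1 - eps) * unnormalized_mass(k) = 1 for k by bisection."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    target = 1.0 / (1.0 - eps)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if unnormalized_mass(mid) > target:  # the mass decreases as k grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def least_favorable_density(t, eps, k):
    """f0(t) = (1 - eps) * (2*pi)**-0.5 * exp(-rho(t))."""
    a = abs(t)
    rho = 0.5 * t * t if a < k else k * a - 0.5 * k * k
    return (1.0 - eps) * math.exp(-rho) / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    for eps in (0.01, 0.05, 0.1, 0.2):
        print(eps, k_from_eps(eps))
```

Under this relation k decreases as ε increases, consistent with the limiting cases k = ∞ for ε = 0 and k = 0 for ε = 1; for ε = 0.05 the solution is about 1.4.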