Variational minimax estimation of discrete distributions under KL loss

Liam Paninski

To appear,
Neural Information Processing Systems 2004

We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number $m$ of points, given $N$ i.i.d.~samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are based on Bayesian averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as $c = N/m \to \infty$ (no matter how slowly), and shows that no estimator is consistent for $c$ bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as $c\to 0$ or $\infty$, and are shown numerically to be tight within a factor of two for all $c$. Finally, in the sparse-data limit $c\to 0$, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like $-c\log(c)$ optimizes both the upper and lower bounds, suggesting an optimal choice of the ``add-constant'' parameter in this regime.
Reprint (90K, pdf)  |  Related work on estimation of information-theoretic quantities  |  Liam Paninski's research page