Specifying a distribution from the mean and quantiles, or, just in case you thought this blog was nothing but square footage and Starbucks

David Kane writes,

What is the best way to simulate from a distribution for which you know only the 5th, 50th and 95th percentile along with the mean? In particular, I want to estimate the value for a different percentile (usually around the 40th) and associated confidence interval. I assume that the distribution is “smooth” and unimodal. For background, see here.

If you don’t want to read all that, the short version is that I want to see if socioeconomic diversity has increased at Williams College over the last decade. (You may be interested in the same thing about Columbia.) It isn’t easy to measure “inequality,” of course, so for starters I just want to estimate what has happened at the 20th percentile. Williams has about 2000 students. So, I want to estimate the family income of the 400th poorest family.

Williams only has data on students who request financial aid. But that covers almost all the families in the bottom 1/3 of the distribution. Williams, like most colleges, does not want to give out much data. However, recent debate in Congress has resulted in Williams and other rich schools publishing some relevant data. Unfortunately, it isn’t exactly what I want, hence my question.

To be concrete, Williams tells us, for each year since 1998, how many students are on aid and what the mean and the 5th, 50th and 95th percentiles of family income are for those students. But the number of students on aid has increased, so the 20th percentile of the entire student body (not just those on aid) falls at a different point in the aided students' distribution each year.

My reply:

If you were given only two quantiles, I’d recommend that you just pick a reasonable 2-parameter distributional family, solve for the two parameters, and go from there (and do a sensitivity analysis considering other families). With 3 quantiles to fit, I’d say to take a 3-parameter skewed family (although I’m not quite sure what I’d actually use). But 3 quantiles and a mean . . . fitting to a 4-parameter family seems silly, and fitting to a 2-parameter or 3-parameter family using least squares doesn’t sound quite right either.
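As a minimal sketch of the two-quantile case, assume a lognormal family (one reasonable choice among several, and exactly the sort of thing the sensitivity analysis should vary). Its log-quantiles are linear in the standard normal quantile, so two quantiles pin down both parameters in closed form. The dollar figures below are made up for illustration; they are not Williams data.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for z-scores


def fit_lognormal_two_quantiles(p1, q1, p2, q2):
    """Solve log(q) = mu + sigma * z_p for (mu, sigma) from two quantiles."""
    z1 = STD_NORMAL.inv_cdf(p1)
    z2 = STD_NORMAL.inv_cdf(p2)
    sigma = (log(q2) - log(q1)) / (z2 - z1)
    mu = log(q1) - sigma * z1
    return mu, sigma


def lognormal_quantile(p, mu, sigma):
    """Quantile function of the fitted lognormal."""
    return exp(mu + sigma * STD_NORMAL.inv_cdf(p))


# Hypothetical inputs: 5th and 95th percentiles of family income.
mu, sigma = fit_lognormal_two_quantiles(0.05, 20_000, 0.95, 150_000)

q40 = lognormal_quantile(0.40, mu, sigma)             # the percentile of interest
implied_median = lognormal_quantile(0.50, mu, sigma)  # compare with a reported median
implied_mean = exp(mu + sigma ** 2 / 2)               # compare with a reported mean
```

Comparing `implied_median` and `implied_mean` with the reported values gives a quick check on whether the assumed family is plausible at all.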

The right thing to do, I think, is to have some model over distribution space, probably centered on some reasonable three-parameter family but with error. I’m not quite sure of the best way to do this; maybe work with the cdf and transform the uniform. I wouldn’t be surprised if there’s a reasonable solution out there; it seems like a fun problem to work on.

Or, if I wanted an answer and was in a hurry, I’d try various curves that go thru the 5th, 50th, 95th and then play around until they match the mean correctly.
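Here is a rough sketch of that quick-and-dirty approach. The curve family is my own illustrative assumption, not anything from the data: a quantile function that bends the standard normal z-score by a power k on each side of the median, scaled so that it passes through the 5th, 50th and 95th percentiles for every k. With k = 1 it is two half-normals with different spreads joined at the median. "Playing around" then means trying a grid of k values and keeping the one whose mean comes closest. The input numbers are made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z95 = STD_NORMAL.inv_cdf(0.95)


def curve_quantile(p, q05, q50, q95, k):
    """Hypothetical curve through the three quantiles for any shape k > 0."""
    z = STD_NORMAL.inv_cdf(p)
    scale = (abs(z) / Z95) ** k  # equals 1 at p = 0.05 and p = 0.95
    if z >= 0:
        return q50 + (q95 - q50) * scale
    return q50 - (q50 - q05) * scale


def curve_mean(q05, q50, q95, k, n=4000):
    """Mean = integral of the quantile function over (0, 1), midpoint rule."""
    total = sum(curve_quantile((i + 0.5) / n, q05, q50, q95, k) for i in range(n))
    return total / n


# Made-up summary statistics (not Williams data).
q05, q50, q95, target_mean = 20_000, 60_000, 150_000, 72_000

# Try a grid of shapes and keep the one whose mean is closest to the target.
shapes = [0.5 + 0.1 * i for i in range(21)]  # 0.5, 0.6, ..., 2.5
best_k = min(shapes, key=lambda k: abs(curve_mean(q05, q50, q95, k) - target_mean))
best_mean = curve_mean(q05, q50, q95, best_k)
q40 = curve_quantile(0.40, q05, q50, q95, best_k)
```

Two caveats about this particular family: it can only reach a limited range of means, so if `best_mean` is still far from the target the family is simply wrong for the data, and its lower tail can dip below zero, which makes no sense for incomes. Repeating the exercise with other curve families shows how much the 40th percentile depends on the arbitrary choice.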

6 thoughts on “Specifying a distribution from the mean and quantiles, or, just in case you thought this blog was nothing but square footage and Starbucks”

  1. It seems like you should be able to make use of the fact that E(X) = int_0^1 F^{-1}(p) dp, where F is the CDF of the random variable X and F^{-1} is its quantile function. Thus you have estimates of F^{-1}(p) for three values of p, and the mean. Use a sum to approximate the integral (yeah, I know there are only 3 quantiles, but it might work), then maybe solve for a fourth quantile? Once you have the fourth quantile, maybe fit a CDF to the four quantiles instead of to the mean and three quantiles? Seems like that would be easier.

  2. David might be able to take advantage of the fact that incomes are often modeled with Pareto distributions, so he may just need to adjust parameters to fit his statistics. Fun problem, yes.

  3. John Cook:

    Thanks for the link to this program! This looks interesting.

    (modesty probably prevented you from noting that you were the person who implemented the algorithms in Visual C++, so I'll do it for you)

  4. Interesting suggestions all around. I took the approach Andrew criticized, fitting a two-parameter family via least squares. After looking at the data, it's actually exceedingly hard to find any distributional family that works. The lognormal doesn't, nor does the Pareto, nor any gamma, nor an extreme-value distribution… they don't even come close! The best I could find was a truncated normal. I put up some code to do it at:

    http://www.politistats.com/code/dave.R
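The suggestion in the first comment, approximating the integral of F^{-1}(p) with a sum and solving for a fourth quantile, could be sketched as follows. The assumptions here are mine: the quantile function is treated as piecewise linear between knots, the bottom knot (p = 0) is fixed at a floor of zero income, and the unknown "fourth quantile" is placed at p = 1. The input numbers are made up.

```python
def solve_top_knot(mean, q05, q50, q95, floor=0.0):
    """Trapezoid-rule version of E(X) = integral of F^{-1}(p) dp, solved for F^{-1}(1)."""
    # Areas of the three segments whose endpoints are all known.
    known = (
        0.05 * (floor + q05) / 2
        + 0.45 * (q05 + q50) / 2
        + 0.45 * (q50 + q95) / 2
    )
    # The last segment, 0.05 * (q95 + top) / 2, must supply the rest of the mean.
    top = 2 * (mean - known) / 0.05 - q95
    if top < q95:
        raise ValueError("mean is too small for a piecewise-linear quantile curve")
    return top


def piecewise_quantile(p, knots):
    """Linear interpolation of the quantile function between (p, value) knots."""
    for (p0, x0), (p1, x1) in zip(knots, knots[1:]):
        if p0 <= p <= p1:
            return x0 + (x1 - x0) * (p - p0) / (p1 - p0)
    raise ValueError("p must lie in [0, 1]")


# Made-up summary statistics (not Williams data).
mean, q05, q50, q95 = 76_000, 20_000, 60_000, 150_000
top = solve_top_knot(mean, q05, q50, q95)
knots = [(0.0, 0.0), (0.05, q05), (0.5, q50), (0.95, q95), (1.0, top)]
q40 = piecewise_quantile(0.40, knots)
```

One thing this makes plain: with linear interpolation the 40th percentile depends only on the 5th and 50th percentiles, so under these assumptions the mean tells you about the upper tail and nothing about the region of interest. A smoother interpolant would let the mean influence the middle of the curve, at the cost of more arbitrary choices.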
