Frequencies of first and last names: a good example to illustrate the assumption of statistical independence and its problems

Gueorgi pointed us to this website on name frequencies. Gueorgi writes,

Given a first and last name, it estimates the number of people in the US with the same name. They take the data from the 1990 Census and make the assumption that the first and last name are uncorrelated. There is a brief section on accuracy here. It might be a bit silly, but at least it provides an easy way to look up Census name frequencies (assuming their scripts work correctly). >From a research perspective, if such a website proves popular, perhaps one could use the same basic idea and produce better estimates by including first x last name correlation, and maybe add the functionality to collect user data like basic demographics, etc. to use with “how many x’s you know” surveys.

Wow, the “>From” in his email really takes me back . . .

Anyway, for first names, I prefer the Baby Name Voyager, which has time series data and cool pink-and-blue graphics, but it is convenient to have the last names too. By assuming independence, I think this will overestimate the number of people named “John Smith” and underestimate the number of people named “Kevin O’Donnell.” (I once looked up John Smith in the white pages and found that, indeed, it’s less common than you’d expect from independence. Which makes sense, since if you’re named Smith, you’ll probably avoid the obvious “John.” Unless it’s a family name, or unless you have a sense of humor, I suppose.)
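
To make the independence assumption concrete, here is a minimal Python sketch of the kind of calculation such a site presumably performs: multiply the share of people with the first name by the share with the last name, then scale by the population. The function name, the round population figure, and the two name shares are placeholders invented for illustration; they are not actual Census values, and the site's real method may differ.

```python
def expected_count(p_first, p_last, population):
    """Expected number of people with a given full name, assuming the
    first name and last name are statistically independent."""
    return population * p_first * p_last


# Hypothetical inputs, NOT real Census figures.
population = 250_000_000  # assumed round number for the population
p_first = 0.03            # assumed share of people with the first name
p_last = 0.01             # assumed share of people with the last name

estimate = expected_count(p_first, p_last, population)
print(round(estimate))
```

If first and last names are positively associated (as with Kevin O’Donnell), the true count will sit above this product; if parents avoid the combination (as perhaps with John Smith), it will sit below.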

But Matt comments:

Also, I think this might be a good tool for teaching undergrads. In my class we just covered the basic rules of probability and I tried to get across the idea of independence of events. A name like Jose Cruz provides a good example of things that are not independent.

I’m down with that. And it could be a cool class project to do some checking of phone directories. The violation of independence is reminiscent of the dentists named Dennis.
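
For a class project along these lines, one rough way to check independence is to compare how often a full name actually appears in a directory sample with how often it would appear if first and last names were independent. The sketch below does this in Python; the function name and the tiny sample are invented for illustration, and a real check would need a far larger list of names.

```python
from collections import Counter


def independence_ratio(names, first, last):
    """Observed count of (first, last) divided by the count expected
    under independence, using the sample's own marginal frequencies."""
    n = len(names)
    firsts = Counter(f for f, _ in names)
    lasts = Counter(l for _, l in names)
    pairs = Counter(names)
    expected = firsts[first] * lasts[last] / n
    if expected == 0:
        raise ValueError("first or last name does not appear in the sample")
    return pairs[(first, last)] / expected


# Toy sample, made up for illustration only.
sample = [
    ("Jose", "Cruz"), ("Jose", "Cruz"), ("Jose", "Garcia"),
    ("John", "Smith"), ("Mary", "Smith"), ("Kevin", "Smith"),
    ("Kevin", "O'Donnell"), ("Mary", "Cruz"),
]

ratio = independence_ratio(sample, "Jose", "Cruz")
print(f"observed / expected = {ratio:.2f}")
```

A ratio well above 1 means the combination is more common than independence predicts, and a ratio well below 1 means it is rarer.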

2 thoughts on “Frequencies of first and last names: a good example to illustrate the assumption of statistical independence and its problems”

  1. It would be interesting to come up with a combination that gave an obviously wrong answer: one where both the first name and last name are very common, but the combination is very rare. For instance, if the estimate were 50,000 people named Juan Huang, you would know that was wrong. But it says there are 29 people with that name, and I could almost believe that.
