What does it mean to define context-free clustering rules?

Hal Daume writes some reasonable things here, mocking some silly rules that have been proposed for evaluating clustering procedures.

What’s interesting to me is that such a wrongheaded idea (not Hal’s, but the stuff that he’s criticizing) could be taken so seriously in the first place.

Perhaps it’s a problem with mathematics, that it takes people to a level of abstraction where they forget their original goals. I’ve seen this a lot in statistics, for example when people devise extremely elaborate procedures to calculate p-values that don’t correspond to any actual data collection procedure. (Here I’m thinking of calculations of the distributions of contingency tables with fixed margins.)

P.S. Scroll to the end of the comments to see that Hal’s a better person than I am, in that he doesn’t waste his time cleaning out the spam from his blog comments.

5 thoughts on “What does it mean to define context-free clustering rules?”

  1. I think you've got it somewhat backwards. Jon Kleinberg didn't propose rules for evaluating clustering procedures. He showed that a certain set of rules, which on their face look reasonable, are actually impossible to satisfy when taken together.

    Hal's arguing that the rules don't even look reasonable on their face. Which to me is a somewhat less interesting question to debate, just because it seems like mostly a matter of opinion.

    The moral of Kleinberg's story is: be careful when you're choosing your rules. Which I think is exactly what you're trying to say here too.

  2. This is very close to the arguments made about political redistricting schemes, such as "compactness" criteria: even if you have a bunch of different standards, it's still a bit silly to automate the procedure without accounting for the original goal of solving the problem.

  3. While this particular effort looks wrong-headed, I do think there is room for someone bright to further develop the clustering foundations.

    I like to ask interview candidates how they validate cluster analysis results – it leads to a great conversation since there is no one correct way to do it, and it's easy to figure out if someone ever tried to validate their results first-hand.

  4. Ken: Exactly. I'm agreeing with Hal that the rules don't look reasonable on their face, and I'm also doubting the relevance of any rules about clustering that could be stated in such generality.
