March 13, 2007
Interactions, new variables, and market segmentation
Most academic research takes basic variables, such as income or state of residence, and correlates them with outcomes of interest or regresses those outcomes on them. Using such variables we can claim that, on average, urban dwellers vote for Democrats and rural dwellers vote for Republicans. Moreover, we can claim that the rich tend to vote for Republicans and the poor for Democrats. We also know that urban dwellers on average make more money than rural dwellers. This creates a dilemma: for a rich urban voter, say, the two tendencies pull in opposite directions, and area and income may well interact. One way to resolve the dilemma is to form a categorical variable that combines the information from both area and income, thus avoiding the need to model the interaction explicitly.
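As a minimal sketch of such a combined variable, the snippet below crosses area with a two-level income band and tabulates the Democratic share per segment. The respondents, the 50,000 cutoff, and the function names are all invented for illustration; they are not data or code from this post.

```python
from collections import defaultdict

# Hypothetical survey rows: (area, annual income in USD, vote). Invented data.
respondents = [
    ("urban", 95_000, "R"),
    ("urban", 28_000, "D"),
    ("urban", 31_000, "D"),
    ("urban", 120_000, "D"),
    ("rural", 24_000, "R"),
    ("rural", 88_000, "R"),
    ("rural", 30_000, "D"),
    ("rural", 105_000, "R"),
]

INCOME_CUTOFF = 50_000  # assumed threshold separating "poor" from "rich"


def segment(area, income, cutoff=INCOME_CUTOFF):
    """Combine area and income into a single categorical label."""
    level = "rich" if income >= cutoff else "poor"
    return f"{area}_{level}"


def democratic_share_by_segment(rows):
    """Return {segment: fraction voting D} for the given rows."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [D votes, total]
    for area, income, vote in rows:
        key = segment(area, income)
        counts[key][0] += vote == "D"
        counts[key][1] += 1
    return {key: d / n for key, (d, n) in counts.items()}


shares = democratic_share_by_segment(respondents)
for key in sorted(shares):
    print(key, round(shares[key], 2))
```

Each of the four segments gets its own estimate, so no interaction term is needed. The price is that the income cutoff has to be chosen somehow.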
While one could devise ad hoc variables such as urban poor/rich and rural poor/rich, a better answer is to use segmentations developed primarily for marketing purposes. Here is a visualization of the Mosaic demographic groups and types:

The groups are: A - symbols of success, B - happy families, C - suburban comfort, D - ties of community, E - urban intelligence, F - welfare borderline, G - municipal dependency, H - blue collar enterprise, I - twilight subsistence, J - grey perspectives, K - rural isolation. The types are described in much detail in an MS Word document - sociologists, do take a look. Another interesting segmentation is PRIZM, which is also linked to ZIP codes in the US.
Some people might scoff at replacing a continuous variable with a categorical one, but remember that we can easily include the original continuous variables alongside the categorical segment. Interactions between continuous variables are usually messy in most regression models anyway.
I really wonder what one would get by correlating these descriptors with outcomes of interest, perhaps political preferences. Since these typologies are used by the media and marketing, we can expect to see opinion clustering in the political spectrum too.
Posted by Aleks at March 13, 2007 10:59 AM
Comments
We need to separate the two issues you raised: one is the "binning" of continuous variables; the other is disaggregation as a strategy to deal with interaction effects.
Binning may or may not be useful. For instance, if there is inherent measurement error in the continuous variables, then binning may reduce the noise. But binning also may reduce information. How to set up bins is a difficult problem.
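To illustrate why setting up bins is hard, here is a small sketch comparing two common unsupervised choices, equal-width and equal-frequency bins, on a skewed variable. The income values and helper names are invented for illustration.

```python
import statistics


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points at sample quantiles, so bins hold similar counts."""
    return statistics.quantiles(values, n=n_bins)


def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the cut points."""
    counts = [0] * (len(edges) + 1)
    for x in values:
        counts[sum(x >= e for e in edges)] += 1  # bin index = cut points passed
    return counts


# Hypothetical right-skewed incomes (thousands of USD); invented data.
incomes = [18, 22, 25, 27, 30, 33, 36, 41, 48, 55, 70, 250]

print("equal width    ", bin_counts(incomes, equal_width_edges(incomes, 4)))
print("equal frequency", bin_counts(incomes, equal_frequency_edges(incomes, 4)))
```

On this toy data a single outlier makes the equal-width bins nearly useless (eleven values land in the first bin), while the quantile bins stay balanced but hide how far the top value sits from the rest.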
Theoretically, if we know which variables are interacting, we can disaggregate to the lowest level and look at those subgroups separately. In practice, this strategy is limited by the number of samples available (which will be split due to disaggregation) and the number of potentially interacting variables (which determines the number of subgroups).
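A back-of-the-envelope sketch of that limit: the number of subgroups is the product of the number of levels of each variable, so the average subgroup shrinks quickly. The sample size and level counts below are hypothetical.

```python
from math import prod


def cells_and_average_size(n_samples, levels_per_variable):
    """Number of subgroups under full disaggregation and mean samples per subgroup."""
    n_cells = prod(levels_per_variable)
    return n_cells, n_samples / n_cells


# Hypothetical survey of 2,000 respondents; the level counts are invented.
n = 2000
variables = {"area": 2, "income band": 4, "age band": 5, "education": 4, "region": 9}

levels = []
for name, k in variables.items():
    levels.append(k)
    cells, avg = cells_and_average_size(n, levels)
    print(f"+ {name:12s} -> {cells:5d} subgroups, about {avg:7.1f} respondents each")
```

With all five variables crossed there are 1,440 subgroups for 2,000 respondents, so most cells would hold one or two people, or none.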
In terms of marketing segmentations, they are typically built from cluster analysis, which is an unsupervised method. It's very difficult to evaluate this methodology as one can easily derive multiple different clusterings from the same data set.
Posted by: Kaiser at March 14, 2007 12:50 AM.
Kaiser, I like binning because it is the simplest way of dealing with nonlinearity, and it's OK if there is a lot of data. Just as a histogram is the first thing we look at after the mean and variance, binning is the simplest thing after a linear model.
As for evaluating cluster analysis: there were also attempts to create supervised clusters. In particular, you can interpret the leaves of a CART model as supervised clusters. But unsupervised clustering is fine when you don't know what to expect as the outcome. Moreover, I have "evaluated" the segmentations with respect to how well they were able to describe their segments.
Posted by: Aleks at March 14, 2007 10:23 AM.
The Prizm clusters and a few other such schemes have proven their worth, having survived to this day. I use them in my work.
As for binning, do you have a preference for a supervised or unsupervised method to determine bins?
Posted by: Kaiser at March 14, 2007 7:31 PM.
Regarding binning, I like to use the good old Fayyad-Irani method, which is supervised. Imagine a univariate CART model, and convert each leaf into a bin.
Unfortunately, its univariate nature doesn't retain potential interactions.
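For readers who have not seen it, here is a rough sketch of entropy-based discretization with an MDL stopping rule in the spirit of Fayyad and Irani: pick the cut that minimizes the weighted class entropy, keep it only if the information gain exceeds a coding-cost threshold, and recurse on each side. The function names and the toy data are assumptions made for illustration, not code from any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Best boundary in sorted (value, label) pairs: lowest weighted child entropy.

    Returns (index, weighted_entropy), or None when no boundary exists.
    """
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or w < best[1]:
            best = (i, w)
    return best


def mdl_accepts(pairs, i, weighted):
    """MDL stopping rule: keep the cut only if the gain pays for its coding cost."""
    labels = [lab for _, lab in pairs]
    left, right = labels[:i], labels[i:]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    gain = entropy(labels) - weighted
    delta = log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (log2(n - 1) + delta) / n


def mdlp_cut_points(values, labels):
    """Recursively split one continuous variable; return the sorted cut points."""
    def recurse(pairs):
        found = best_split(pairs)
        if found is None or not mdl_accepts(pairs, *found):
            return []
        i = found[0]
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        return recurse(pairs[:i]) + [cut] + recurse(pairs[i:])

    return recurse(sorted(zip(values, labels)))


# Hypothetical toy data: the outcome is "D" at low and high values, "R" in between.
values = list(range(1, 61))
labels = ["D"] * 20 + ["R"] * 20 + ["D"] * 20
print(mdlp_cut_points(values, labels))  # [20.5, 40.5]
```

Each resulting interval plays the role of a leaf in a univariate tree. Because only one predictor is examined at a time, a cut that matters only in combination with another variable will not be found.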
Posted by: Aleks at March 15, 2007 9:47 AM.