Categorizing continuous variables

Posted on November 22, 2006 7:01 AM by Andrew

Jose von Roth writes,

I am working on a project where I have to create some logistic models.

I am trying to categorize continuous variables in the most efficient way. Do you know of some rule of thumb I could apply based on the total n?

My sample size is relatively small: 230 cases, each with 7 variables, 3 of them continuous (which have to be categorized). Because of the sample size, I found a recommendation to bootstrap the data, let’s say 1000 times, run logistic models for each of those 1000 samples, compute the average beta coefficients based on the average P values, and use those means as my final beta coefficients. Is that a valid approach to improve the correctness of the models?

Once I have my logistic models, can I rely on the accuracy rates if I test them on, let’s say, another 1000 bootstrapped samples (with replacement)? Or is there any other validation technique which you would suggest?

Finally, the idea of this project is to classify cases into 5 categories (calculating the probability that case “x” belongs to category “y”). I was considering using multicategory logistic models, but I am not sure whether it would be better to fit 5 separate dichotomous logistic models instead. Which approach is statistically more accurate? I get different overall accuracy rates, with the multicategory model always much higher than the 5 dichotomous models.

My response: first, it’s not clear whether you’re trying to predict these 7 variables or use them as predictors. In either case, I don’t see why you want to divide the continuous variables into categories. I’d recommend just keeping them as continuous. Or, if you do divide a predicted variable into 5 categories, I’d start with simple linear regression, only using the more complicated ordered categorical models if you really have to (as revealed, for example, by residual plots).
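One way to read the suggestion above, as a minimal sketch on simulated data (the variable names, sample values, and coefficients are assumptions for illustration, not taken from the question): code a 5-category outcome as 1 to 5, fit an ordinary linear regression, and look at the residuals before reaching for an ordered categorical model.

```r
# Simulated illustration: every name and number below is an assumption.
set.seed(2)
n <- 230
x <- rnorm(n)                       # one continuous predictor, left continuous
latent <- x + rnorm(n)              # unobserved continuous score

# Split the latent score at its quintiles and code the outcome as 1..5
cuts <- quantile(latent, probs = seq(0, 1, by = 0.2))
y <- as.integer(cut(latent, breaks = cuts, include.lowest = TRUE))

# Simple linear regression on the 1..5 coding
fit <- lm(y ~ x)

# Residual plot: look for curvature or fanning that a linear model misses
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Numeric companion to the plot: mean residual within quartiles of the fit
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)
binned <- tapply(resid(fit), bins, mean)
print(round(binned, 3))
```

If the binned residual means show a clear pattern (for example, positive at both ends and negative in the middle), that would be a hint that the linear fit on the 1 to 5 coding is not adequate and an ordered categorical model may be worth the extra complexity.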

3 thoughts on “Categorizing continuous variables”

Gregor on November 22, 2006 at 7:57 pm said:

I guess that the following says a lot about categorizing continuous variables.

## install.packages("fortunes")  # run once if the package is not installed
library(fortunes)
fortune("murder")  # print a fortune whose text matches "murder"

Besides that, it is not simple to categorize. Which threshold shall I take? Is it 50 if x is between 0 and 100? And later your estimates cannot be as "precise" with a logistic model, since all values of x between 0 and 50 are treated as the same.
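The point about values between 0 and 50 being treated as the same can be sketched on simulated data. Everything here is an assumption for illustration (the sample size echoes the post, the true coefficients and the names `x_high`, `fit_cont`, and `fit_cut` are made up): a logistic model with x kept continuous is compared with one where x is split at 50.

```r
# Simulated data; all names and parameter values are illustrative assumptions.
set.seed(1)
n <- 230                                  # same order as the sample size in the post
x <- runif(n, 0, 100)                     # continuous predictor on 0..100
y <- rbinom(n, 1, plogis(-2 + 0.04 * x))  # true relationship is smooth in x

d <- data.frame(x = x, y = y, x_high = as.integer(x > 50))

fit_cont <- glm(y ~ x, family = binomial, data = d)       # x kept continuous
fit_cut  <- glm(y ~ x_high, family = binomial, data = d)  # x split at 50

# Predicted probabilities at x = 10 and x = 40 (both fall in the "low" bin)
new <- data.frame(x = c(10, 40), x_high = c(0L, 0L))
p_cont <- unname(predict(fit_cont, newdata = new, type = "response"))
p_cut  <- unname(predict(fit_cut,  newdata = new, type = "response"))

print(round(cbind(x = new$x, continuous = p_cont, dichotomized = p_cut), 3))
print(AIC(fit_cont, fit_cut))             # one rough way to compare the two fits
```

The dichotomized model can only return one probability for the whole 0 to 50 range, so x = 10 and x = 40 get the same prediction, while the continuous model lets the prediction move with x.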

John Bullock on November 23, 2006 at 7:23 am said:

Frank Harrell (of R packages "Hmisc" and "Design") has a useful web page full of criticisms about categorization of continuous variables.

http://biostat.mc.vanderbilt.edu/twiki/bin/view/M…

Jose von Roth on November 24, 2006 at 9:52 am said:

Hello Prof Gelman,
those 7 variables are supposed to be used as predictors. Well, actually I was planning not to categorize the continuous variables, but it is part of my project.

Thanks,

Jose
