Default prior distributions for logistic regression coefficients

Christian Robert has some thoughts on my paper with Aleks, Yu-Sung, and Grazia on weakly informative priors for logistic regression. Christian writes:

I [Christian] would have liked to see a comparison of bayesglm with the generalised g-prior perspective we develop in Bayesian Core . . . starting with a g-like prior on the parameters and using a non-informative prior on the factor g allows for both a natural data-based scaling and an accounting of the dependence between the covariates. This non-informative prior on g then amounts to a generalised t prior on the parameter, once g is integrated.

This sounds interesting. I agree that it makes sense to use a hierarchical model for the coefficients so that they are scaled relative to each other.
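
For readers who haven't seen this construction, here's a rough sketch in the familiar linear-regression case; the generalised g-prior in Bayesian Core adapts this to generalised linear models, so the details there differ:

    \beta \mid g \;\sim\; \mathrm{N}\!\left(0,\; g\,\sigma^2 (X^\top X)^{-1}\right),
    \qquad
    p(\beta) \;=\; \int_0^\infty \mathrm{N}\!\left(\beta \mid 0,\; g\,\sigma^2 (X^\top X)^{-1}\right)\,\pi(g)\,dg

The (X^T X)^{-1} factor supplies the data-based scaling and the dependence between covariates that Christian mentions, and a heavy-tailed choice of \pi(g) (for example the Zellner-Siow inverse-gamma(1/2, n/2), which makes the marginal a multivariate Cauchy) turns the integrated prior on \beta into a generalised t.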

Regarding the pre-scaling that we do: I think something of this sort is necessary in order to be able to incorporate prior information. For example, if you are regressing earnings on height, it makes a difference whether height is in inches, feet, meters, kilometers, etc. (Although any scale is OK if you take logs first.) I agree that the pre-scaling can be thought of as an approximation to a more formal hierarchical model of the scaling. Aleks and I discussed this when working on the bayesglm project, but it wasn't clear how to implement such scaling easily. It's possible that the t-family prior can be interpreted as a scale mixture of normals, that is, a normal prior whose scale parameter itself gets a prior.
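
To make the pre-scaling concrete, here is a minimal sketch with made-up data and variable names: rescale the continuous predictor by subtracting its mean and dividing by two standard deviations (the rescaling described in the third comment below), then fit with bayesglm and its default weakly informative prior, an independent Cauchy (t with 1 degree of freedom) with scale 2.5 on each rescaled coefficient:

    library(arm)   # provides bayesglm() and display()

    # Made-up data for illustration: height in inches and a binary
    # high-earnings indicator.
    n <- 500
    height <- rnorm(n, 66, 4)
    high_earner <- rbinom(n, 1, plogis(-10 + 0.15 * height))

    # Pre-scaling: after this, the coefficient does not depend on whether
    # height was recorded in inches, feet, meters, or kilometers.
    height_std <- (height - mean(height)) / (2 * sd(height))

    # Fit the logistic regression with the Cauchy(0, 2.5) prior (these are
    # bayesglm's defaults, written out explicitly here).
    fit <- bayesglm(high_earner ~ height_std,
                    family = binomial(link = "logit"),
                    prior.scale = 2.5, prior.df = 1)
    display(fit)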

In any case, maybe Aleks can try Christian’s model on our corpus and see what happens. Christian links to his code, which would be a good place to start.

3 thoughts on “Default prior distributions for logistic regression coefficients”

  1. Mathematical models using nondimensional groups are a common way to reduce the order of a model. A typical example from engineering is to define the pressure loss in pipe flow in terms of a coefficient called the friction factor, which is a function of the Reynolds number and the relative roughness of the pipe… Both the Reynolds number and the relative roughness are ratios of two quantities with the same units, and are therefore unitless…

    So in your example of the regression of vote preference on age and some discrete variable, you mention that the scale for age is important. One way to eliminate it is to standardize age (measured in some units) against some standard (measured in the same units), such as the median age of the population at large, or life expectancy at birth, or even something arbitrary that doesn't have to be estimated from the data, such as the age at which one can first vote…

    In the physical sciences we can often find groups of parameters which, when combined, form a nondimensional ratio; for example, the Reynolds number is rho * V * L / mu, where rho is density (kg/m^3), V is velocity (m/s), L is some "characteristic length" (m), and mu is viscosity (kg/m/s). It is possible that you could occasionally find such groups in social science settings as well. Perhaps income * age / wealth

    or

    distance driven daily for commute / (average commute speed * wealth / income)

    If you formulate your model in terms of these nondimensional groups, it becomes valid no matter how you measure your variables, as long as you use (or convert to) consistent units for each of them.

    Also, putting a prior distribution on a nondimensional group can often be much easier. For example, if I am trying to estimate the effect of age, which I have standardized as age (yrs) / 18 (yrs), then I can pretty easily see that the standardized age will range from about 1 to 6, and that therefore a reasonable mildly informative prior might be N(0, 3), since 3 * 5 = 15, and on the logit scale 15 is pretty big. (There is a code sketch of this idea just after the comments.)

    Hope that helps somehow.

  2. In our paper we took care of scaling in a very radical way: all continuous variables were discretized and only took values of 0 or 1.

    I agree that the problem of scaling as well as the problem of inter-predictor correlations are important, and I'm looking forward to seeing how this is handled in Bayesian Core. A PDF of the relevant chapter sent via email would be helpful, as I'll forget about the problem by the time I actually put my hands on the book.

    While the models are all fine, the challenge is to implement them in a robust and efficient fashion so that they survive the brutal testing on the corpus. I'll try to put the code and data out there so that others can make their code sufficiently robust.

  3. Just to clarify: In the cross-validation example of our paper, we used only binary predictors. In the general method of our paper, and in the other examples in the paper, we used some binary predictors and some continuous predictors, which we rescaled by subtracting the mean and dividing by two standard deviations.
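
Here is the code sketch referred to in the first comment above: a rough illustration, with simulated data and made-up coefficients, of putting a mildly informative prior on a predictor that has been nondimensionalized by dividing by the voting age (modulo the exact bayesglm arguments):

    library(arm)   # provides bayesglm() and display()

    # Simulated data, for illustration only.
    n <- 1000
    age <- runif(n, 18, 90)                        # age in years
    vote <- rbinom(n, 1, plogis(-1 + 0.02 * age))  # made-up vote preference

    # Nondimensionalize: divide by the voting age (18 yrs), an arbitrary
    # reference that does not have to be estimated from the data.  The
    # resulting ratio runs from 1 to about 5.
    age_ratio <- age / 18

    # With the predictor on a known unitless scale, a mildly informative
    # normal prior with scale 3 is easy to reason about.  prior.df = Inf
    # makes the t prior normal; scaled = FALSE keeps bayesglm from doing
    # its own internal rescaling of the prior.
    fit <- bayesglm(vote ~ age_ratio,
                    family = binomial(link = "logit"),
                    prior.scale = 3, prior.df = Inf, scaled = FALSE)
    display(fit)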
