A Fast Hybrid Algorithm for Large Scale L1-Regularized Logistic Regression

Aleks (I think) sent me this link to a paper by Jianing Shi, Wotao Yin, Stanley Osher, and Paul Sajda. They report impressive computational improvements based on “a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate.” This looks awesome. And they even test their method on the UCI database, which is what Aleks used in the cross-validation studies for bayesglm.
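
To fix notation (mine, not necessarily theirs), the problem being solved is the usual L1-regularized logistic regression:

    \min_{w,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\log\bigl(1+e^{-y_i(w^\top x_i+b)}\bigr)\;+\;\lambda\lVert w\rVert_1

where the y_i are coded as plus or minus 1 and lambda controls the amount of shrinkage. The L1 term is the non-smooth part, which is presumably what motivates pairing a fast, rough iteration with a slower, more accurate one.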

Here are my questions:

1. Could their algorithm work with other regularizations? L1 corresponds to a double-exponential (Laplace) family of prior distributions; I prefer the t model. We implemented the t in bayesglm using an adaptation of the standard weighted least squares algorithm. That’s fine for our little examples, but for big problems, where speed is a concern, maybe this new method is much better. I imagine it could be adapted to our family of prior distributions, but I don’t know enough about the method to be sure. (I spell out the two penalties in a note just after these questions.)

2. What about hierarchical models? That’s the next thing to do. lmer/glmer are running pretty slow for us.
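
To make the comparison in question 1 concrete (this is just the standard penalized-likelihood correspondence, not anything from their paper): the posterior mode maximizes the log likelihood minus a penalty equal to the negative log prior density. For the two families, with scale s,

    \text{Laplace (double-exponential):}\quad -\log p(\beta) \;=\; \frac{|\beta|}{s} + \text{const}

    \text{Student-}t_{\nu}:\quad -\log p(\beta) \;=\; \frac{\nu+1}{2}\,\log\!\Bigl(1+\frac{\beta^{2}}{\nu s^{2}}\Bigr) + \text{const}

The first is the L1 penalty and is convex; the second (Cauchy when nu = 1) is not, so adapting an L1 solver to the t family would mean either accepting a possibly multimodal objective or working from a local quadratic approximation, which is roughly what our weighted least squares iteration does.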

P.S. The funny thing is that Shi and Sajda are at Columbia (in the biomedical engineering department), but, to the best of my knowledge, we’ve never met. I don’t even know where their office is. According to the addresses in the article, their zip code is 10025, but the engineering departments I’ve been to are all in 10027.

3 thoughts on “A Fast Hybrid Algorithm for Large Scale L1-Regularized Logistic Regression”

  1. BTW, what about the convexity of posterior maximization with a Cauchy prior? Some people over here think it is not convex, while L1 still is. Have you observed local maxima, and if so, are they a practical problem?

  2. One of the huge advantages of the L1 regularizer is that with sparse data you can keep the updates themselves sparse. This allows training on problems with 10^7 features (or more) in a reasonable amount of time.

    The reason this works is that you can apply the regularization steps lazily: you only need to catch a weight up on its accumulated regularization when the corresponding feature actually appears in an example (a sketch of this idea is below).

    For text-like data this is a really big deal, since it can speed up training by many orders of magnitude.

    Check out the vowpal wabbit pages some time.
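
    Here is a minimal sketch of the lazy-update idea, assuming plain SGD with a constant learning rate and soft-threshold (proximal) L1 steps; it is not vowpal wabbit’s actual code.

        import math
        from collections import defaultdict

        def train_lazy_l1(examples, lam=1e-4, eta=0.1, epochs=1):
            """examples: list of (features, label) pairs, where features is a
            sparse dict {feature_id: value} and label is +1 or -1."""
            w = defaultdict(float)   # weights; only features we have seen get entries
            last = defaultdict(int)  # step at which each weight last had its L1 step
            t = 0
            for _ in range(epochs):
                for feats, y in examples:
                    t += 1
                    # Catch each active weight up on the shrinkage it has "missed":
                    # repeated soft-thresholding with nothing in between collapses
                    # into one soft-threshold by the accumulated amount.
                    for j in feats:
                        missed = (t - 1 - last[j]) * eta * lam
                        w[j] = math.copysign(max(abs(w[j]) - missed, 0.0), w[j])
                    # Logistic-loss gradient, touching only the active features.
                    margin = y * sum(w[j] * x for j, x in feats.items())
                    g = -y / (1.0 + math.exp(min(margin, 35.0)))  # clip avoids overflow
                    for j, x in feats.items():
                        w[j] -= eta * g * x
                        # This round's own L1 step, applied eagerly.
                        w[j] = math.copysign(max(abs(w[j]) - eta * lam, 0.0), w[j])
                        last[j] = t
            return dict(w)

    The work per example is proportional to the number of active features, not the total number of weights, which is where the big savings come from on text-like data.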
