Why can’t I be more like Bill James, or, The use of default and default-like models

During our discussion of estimates of teacher performance, Steve Sailer wrote:

I suspect we’re going to take years to work the kinks out of overall rating systems.

By way of analogy, Bill James kicked off the modern era of baseball statistics analysis around 1975. But he stuck to doing smaller-scale analyses and avoided trying to build one giant overall model for rating players. In contrast, other analysts such as Pete Palmer rushed into building overall ranking systems, such as his 1984 book, but they tended to generate curious results such as the greatness of Roy Smalley Jr. James held off until 1999 before unveiling his win share model for overall rankings.

I remember looking at Pete Palmer’s book many years ago and being disappointed that he did everything through his Linear Weights formula. A hit is worth X, a walk is worth Y, etc. Some of this is good–it’s presumably an improvement on counting a walk as 0 or 1 hits, and on counting doubles and triples as equal to 2 and 3 hits, and so forth. The problem–besides the inherent inflexibility of a linear model with no interactions–is that Palmer seemed chained to it. When the model gave silly results, Palmer just stuck with it. I don’t do that with my statistical models. When I get a surprising result, I look more carefully. And if it really is a mistake of some sort, I go and change the model (see, for example, the discussion here). Now this is a bit unfair: after all, Palmer’s a sportswriter and I’m a professional statistician–it’s my job to check my models.
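
To make the shape of that model concrete, here is a minimal sketch in Python; the weights are illustrative placeholders, not Palmer’s published values:

```python
# A minimal sketch of a linear-weights run estimate: every event is worth a
# fixed number of runs, with no context and no interactions.
# The coefficients are illustrative placeholders, not Palmer's published values.

WEIGHTS = {"single": 0.47, "double": 0.78, "triple": 1.09, "home_run": 1.40, "walk": 0.33}

def linear_weights_runs(events):
    """Runs estimated as a weighted sum of counting stats."""
    return sum(WEIGHTS[name] * count for name, count in events.items())

print(linear_weights_runs({"single": 120, "double": 30, "triple": 5,
                           "home_run": 25, "walk": 60}))
```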

Still and all, my impression is that Palmer was locked into his regression models and that it hurt his sportswriting. Bill James had a comment once about some analysis of Palmer that gave players negative values in the declining years of their careers. As James wrote, your first assumption is that when a team keeps a player on their roster, they have a good reason. (I’m excepting Jim Rice from this analysis. Whenever he came up to bat with men on base, it was always a relief to see him strike out, as that meant that he’d avoided hitting into a double play.)

Bill James did not limit himself to linear models. He often used expressions of the form (A+B)/(C+D) or sqrt(A^2+B^2). This gave him more flexibility to fit data and also allowed him more entries into the modeling process: more ways to include prior information than simply to throw in variables.
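
For concreteness, here is a sketch of one such composite form (James’s basic runs created formula has roughly the shape (A + B) × C / (D + E)); the point is that the pieces multiply and divide rather than just adding up:

```python
def runs_created(hits, walks, total_bases, at_bats):
    """A James-style ratio-of-sums form, (A + B) * C / (D + E): the value of a
    walk here depends on the team's total bases, unlike in a purely additive
    linear-weights model."""
    return (hits + walks) * total_bases / (at_bats + walks)

# An illustrative season line, not a real player's numbers.
print(runs_created(hits=150, walks=60, total_bases=250, at_bats=550))
```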

What about my own work? I use linear regression a lot, to the extent that a couple of my colleagues once characterized my work on toxicology as being linear modeling. True, these were two of my stupider colleagues (and that’s saying a lot), but the fact that a couple of Ph.D.’s could confuse a nonlinear differential equation with a linear regression does give some sense of statisticians’ insensitivity to functional forms. We tend to focus on what variables go into the model without much concern for how they fit together. True, sometimes we use nonparametric methods–lowess and the like–but it’s not so common that we do a Bill James and carefully construct a reasonable model out of the input variables.

But maybe I should be emulating Bill James in this way. Right now, I get around the constraints of linearity and additivity by adding interaction after interaction after interaction. That’s fine, but perhaps a bit of thoughtful model construction would be a useful supplement to my usual brute-force approach.
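
As a minimal sketch of what that brute-force approach looks like (simulated data, ordinary least squares via numpy), each new interaction is just another column in the design matrix:

```python
import numpy as np

# Simulated data with a genuine interaction between two predictors.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 200))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rng.normal(scale=0.5, size=200)

# Brute force: keep the model linear and additive in its terms, but keep
# appending interaction columns to the design matrix.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly recovers [1.0, 0.5, -0.3, 0.8]
```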

P.S. Actually, I think that James himself could’ve benefited from the discipline of quantitative models. I don’t know about Roy Smalley Jr., but, near the end of the Baseball Abstract period, my impression was that James started to mix in more and more unsupported opinions, for example in 1988 characterizing Phil Bradley as possibly the best player in baseball. That’s fine–I’m no baseball expert, and maybe Phil Bradley really was one of the top players of 1987, or maybe he’s a really nice guy and Bill James wanted to help him out, or maybe James was just kidding on that one. My guess (based on a lot of things in the last couple of Baseball Abstracts, not just that Phil Bradley article) is simply that James had been right on so many things where others had been wrong that he started to trust his hunches without backing them up with statistical analysis. Whatever. In any case, Win Shares was probably a good idea for Bill James as it kept him close to the numbers.

11 thoughts on “Why can’t I be more like Bill James, or, The use of default and default-like models”

  1. I don't think it's a coincidence that linear models are very prevalent in practice. I also use mostly linear models. I think it depends on whether the objective is to rank things or to get the best possible estimate of, say, the odds. In my own experience/applications, it is much more important to get the ranks right than to get the probabilities right. In fact, it's almost impossible to get the probabilities right because the outside environment is changing all the time. So all that extra complexity may not be worth much.

  2. I suppose it depends heavily on what area your models are from, but in my opinion there is a lot of room for the careful modeling approach.

    For example, I recently consulted with some biologists on data from their bone-regrowth experiments. They initially thought of their experiment as basically "we cut the bone, then we either treat or don't treat, and later we measure how big it is and see if there's a difference." By talking with them about the dynamics of bone regrowth, we came to realize that there was a lot more going on. It was much easier to imagine finding differences in growth rates at different times than to find differences in the integrated growth rate over a long time (the length at the end of the experiment). Treatments could affect wound healing, the maximum growth rate, the duration of time during which growth occurred, lots of things.

    In most of my work these days I am doing physics and mechanics type modeling. There I have fundamental laws of physics that can guide me. But ultimately there is a lot of stuff I don't know, so I wind up having to create models of "effective forces" that account for a large number of unknown things (friction is an example of this). This process is as much statistical as anything else. Still, there are guiding principles.

    One of the most powerful tools that I have repeatedly advocated in comments in several locations on this blog is the use of nondimensional groups. That is, expressing the variables that affect the outcome in terms of the ratios of two similar kinds of things. An example might be to compare the density of a substance to the density of water, which is a good idea if you are talking about how substances behave at the bottom of a lake or whatever.

    The point of the nondimensional approach is that it's a way of carefully choosing the scale with which we measure things, and this is extremely important if we want to have a principled way of deciding whether things are ignorable, small, or likely to have a large effect. There is no such thing as "absolutely small"; there is only small compared to something else that we think is important…
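
    A toy sketch of that idea, with round textbook-style densities chosen purely for illustration:

    ```python
    # A toy sketch of nondimensionalization: decide whether something is "small"
    # or "large" only relative to a reference scale, never in absolute terms.
    RHO_WATER = 1000.0      # kg/m^3, the reference scale
    rho_particle = 2650.0   # kg/m^3, a hypothetical sediment grain settling in a lake

    specific_gravity = rho_particle / RHO_WATER  # dimensionless ratio
    print(f"specific gravity = {specific_gravity:.2f}")
    # Near 1, buoyancy nearly cancels gravity and cannot be ignored; at 2.65 the
    # grain clearly sinks, and the interesting question becomes how fast.
    ```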

  3. Regarding brute-force adding interactions versus "thoughtful model construction", I was surprised to read that you tend to rely on the former.

    Kaiser may be right from a practical perspective that the added complexity allowed by non-linear modeling often adds little. But for me the enjoyable part of the model building/refinement process is being forced to rethink model structure and the "shape" of causal impacts based on observed lack of fit – discussing those re-thoughts with people I trust to laugh at me when I'm falling prey to the "that would explain the data, so it must be true" fallacy – and ultimately revising the model to embed a new, hopefully better, understanding.

    Of course, this approach may be a reason I'm just a guy who produces models from time to time as part of my job (or as part of my own interests when time allows) rather than one of the best known and most successful model builders in the world.

    Nevertheless, I humbly recommend to you the "thoughtful" approach, simply because I find it more enjoyable.

  4. I still don't understand why Andrew likes to take ordinal factors and treat them as scalar. For instance, taking a five-level ordinal value for income and treating it as a scalar with values -2, -1, 0, 1, and 2. That seems analogous to treating a triple as three singles in a baseball analysis.

    The other approach you see in Andrew and Jennifer's book is to treat an ordinal factor categorically. That makes more sense to me, because there's often no reason to believe the effect of a factor is monotonic, much less linear, on the ordinal scale.

    One approach to learning cut points like this is to use something like additive regression or classification trees (boosted, Bayesian additive, whatever). From the raw data, they'll fit the discretizing cut points and then blend a bunch of trees to create a non-parametric model that may have non-linear effects. One drawback is that you give up continuity in the predictions (other than in the limit), so if you really do have a linear (or other easily expressible) functional relationship, you can't model it directly, only approximate it in the limit.
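
    A minimal sketch of that idea (scikit-learn and simulated data, both my own assumptions here):

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Simulated data: a 5-level ordinal predictor whose effect is non-monotonic.
    rng = np.random.default_rng(1)
    x = rng.integers(1, 6, size=500)                    # ordinal levels 1..5
    effect = {1: 0.0, 2: 0.2, 3: 1.0, 4: 0.9, 5: 0.1}   # made-up, non-monotonic
    y = np.array([effect[int(v)] for v in x]) + rng.normal(scale=0.3, size=500)

    # A shallow regression tree learns cut points on the ordinal scale from the data.
    tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(x.reshape(-1, 1), y)
    cut_points = tree.tree_.threshold[tree.tree_.threshold > 0]  # internal-node splits
    print(sorted(cut_points))  # splits chosen by fit, not an assumed linear slope
    ```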

  5. Pete Palmer was a radar engineer at Raytheon, not a sportswriter (in the conventional sense). I'm not sure how statistical radar design was back then, but in the context above, you're allowed to hold that against him.

  6. Daniel: I think your "nondimensional" approach is cut from the same cloth as standardizing the variables, and surely should be encouraged.

    Bob: lots of practitioners use a trick that has proven very effective but is still a bit controversial. In the pre-processing stage, you figure out cut points using the response variable, e.g. by building a small CART tree. Some people complain that it's cheating or, said differently, that it could lead to overfitting. But lots of people swear by it because the resulting models work well.
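
    A sketch of one way to guard against that overfitting worry (scikit-learn and simulated data, my assumptions): learn the response-based cut points on each training fold only, then apply them to the held-out fold.

    ```python
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical predictor and response, just to show where the cut-point search belongs.
    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=300)
    y = np.sin(x) + rng.normal(scale=0.3, size=300)

    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
        # Response-based cut points learned from the training fold only...
        binner = DecisionTreeRegressor(max_leaf_nodes=5).fit(x[train, None], y[train])
        # ...then simply applied to the held-out fold, so the evaluation stays honest.
        test_bins = binner.apply(x[test, None])
        print(len(np.unique(test_bins)), "learned bins used on this held-out fold")
    ```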

  7. A few weeks ago I worked on a project with some engineers and seed physiologists about drying seed corn. It was highly informative to match up my linear model to their “deterministic” model built on first principles. With a few slight modifications to the more complex interaction terms in my model we developed something that made physical sense. It was a bit of a kick in the pants that helped me realize that we can do some really neat stuff on our own, but a little understanding of the actual mechanics of the system can go a very long way.

  8. Parinella:

    Interesting background on Pete Palmer. Regarding his methodological rigidity, I've seen the same with statisticians, economists, and political scientists who keep using a statistical procedure that's clearly flawed because they seem to think it's cheating to adapt it to the circumstances.

  9. My vague recollection is that Palmer's Total Baseball was driven a little too far by the desire to come up with One Number ("Total Player Rating") for ranking players. Palmer and Co. had a good approach to hitting and a mediocre approach to fielding, but the system could sometimes go way off the rails in trying to calculate the interaction of hitting, fielding, and position.

    Obviously, a slugging shortstop is better than a slugging first baseman, but how much more valuable? I believe Total Baseball's approach was to recalculate the positional baseline each year from each league's players at each position.

    So, Hal Trosky driving in 162 runs as an American League first baseman in 1936 was kind of ho-hum because Hall of Famers Gehrig, Foxx, and Greenberg made up three of the other seven AL starting first basemen.

    My analogous concern about rating teachers is that there will be strong demand for a One Number system to determine whom to fire and whom to reward before the state of the art is ready for it.

  10. Andrew,

    If the average statistician makes $96,000 a year, and you make $76,000 a year, and I represent you as making -$20,000 a year RELATIVE TO AVERAGE, does this mean you have "negative value"? No, it only means you have negative value relative to average.

    If the average statistical consultant makes $8,000 a month, and I hire a statistical consultant for $6,000 for one month, and I represent that guy as making -$2,000 relative to the average consultant, does this mean he has negative value?

    And does the first guy have more negative value than the second guy?

    This is all Pete Palmer did, to represent the performance of a player relative to the average, if the average player had the same number of opportunities as that player.

  11. Andrew,

    I agree with Tom (Tangotiger) that there was nothing wrong with Palmer making value measurements by reference to average performance. Anyone wishing to use a different reference point (such as the level of play of the typical major league part-time bench player, or the worst player who would make a major league roster) could generally take Palmer's number and make a relatively simple adjustment to get a value measurement by reference to their desired reference point. Usually this would mean simply adding about 15 or 20 runs of overall credit for each 162 games played to Palmer's estimate of runs (offense and defense combined) above or below average.

    One situation in which it is difficult to make such a simple 15-or-20-run adjustment is when, as Steve (Sailer) noticed, there are a bunch of outstanding players at one position, particularly in an 8-team league, such as the first basemen of the 1936 American League. There are still some relatively simple ways to get around even this problem.

    The real problem with Palmer's system is that the fielding runs estimates were _not_ based on any statistical model. (By the way, Palmer's offensive linear weights formulas were _not_ based on regression but on 'change-in-state' Markov chain models similar to those described in books by Tom Tango and Jim Albert; a toy sketch of that idea appears at the end of this comment.) The fielding formulas were initially just made up. (One or two components are now based on some regression results, but the effects are quite minor.) ALL of the significantly odd results in overall career player ratings under the Palmer systems have been caused by fielding runs estimates.

    My book, Wizardry: Baseball's All-Time Greatest Fielders Revealed (Oxford University Press), will be coming out next month, and will reveal the first fielding formulas for seasons since 1893 that are empirically derived from open source data and regression analysis. I look forward to discussing that model with your readers.
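
    To give a feel for the change-in-state idea mentioned above, here is a toy sketch; the run-expectancy numbers and the example play are made up for illustration:

    ```python
    # Toy "change-in-state" linear weights: an event's run value is the change in
    # run expectancy it causes, plus any runs that score on the play.
    # The run-expectancy entries are made-up illustrative numbers.
    RUN_EXPECTANCY = {              # (base state, outs) -> expected runs, rest of inning
        ("empty", 0): 0.50,
        ("runner on 1st", 0): 0.90,
        ("runner on 2nd", 0): 1.10,
        ("empty", 1): 0.27,
    }

    def event_run_value(before, after, runs_scored=0):
        """Run value of one event = RE(after) - RE(before) + runs scored on the play."""
        return RUN_EXPECTANCY[after] - RUN_EXPECTANCY[before] + runs_scored

    # A leadoff walk in this toy table: (empty, 0 outs) -> (runner on 1st, 0 outs).
    print(event_run_value(("empty", 0), ("runner on 1st", 0)))  # about 0.40 runs
    # Averaging such values over every occurrence of an event gives its linear weight.
    ```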
