Baseball stats: innovation, randomness, and other issues

Ubs writes:

I wonder if I can get your thoughts on sabermetric baseball stats. Basically I’m trying to think about them more intelligently so that my instinctive skepticism can be better grounded in real science. There’s one specific issue I’m focusing on, but also some more general stuff.

(1) First of all, my sense of the sabermetric community is that the big-name guys at the top inventing the various stats, guys like Nate Silver and Tom Tango, really are smart about statistics and know what they’re doing. Moving down from there, my sense is that among the subworld of SABR-geeks who like to look at the numbers and play fantasy baseball and all that, some of them understand the concepts pretty well and others not so much. My experience in discussion forums is that a lot of the baseball-stats-geeks are enamored with the idea that everything can be measured exactly, so they tend to have little patience with uncertainty. These are the guys who will say, “No, you’re wrong, Sal Pecadillo is better than Goober Green. I looked it up on FanGraphs, and Sal’s xFPQAA is 3.582 and Goober’s is only 3.421, so that proves it.” (And then of course all the SABR-geeks are still a step removed from the masses who don’t do sabermetrics at all and are still looking at batting averages and RBIs.)

One thing I worry about is that now that there’s a thriving sabermetric community and the data is easily available, it’s that much easier for someone who *doesn’t* really know what he’s doing to step in and make up a stat and get it “published” whereupon it might get picked up by others who don’t really know either. It seems like they’re inventing some new number every day, and I can’t help wondering if the guy who cooked it up really is statistically informed or if he just crunched a bunch of data and makes claims for his formula that perhaps it can’t support. I guess it’s the same idea as the Internet leading to more typos or factual errors. It’s not that everyone got dumber, it’s just that in the old days there was enough of a hump you had to get over to get published at all that those who did manage to get published were more likely to actually be accurate, whereas now anyone can publish so the inaccurate people are diluting the whole.

I guess that’s not a question, but I’m wondering what you think of that generally.

(2) I’m wondering about the conceptual basis for sabermetric stats. I’m not well educated on statistics, and I probably use a lot of terms wrong. I don’t even really know exactly what “Bayesian” means, but I do get the general idea that it’s a way of looking at probability that says there is some underlying “true” probability that exists, and the closer we look at the data that results from that probability, the closer we can estimate what the underlying probability is. (Is that the right general idea, or am I way off?)
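As a concrete version of that general idea, here is a minimal sketch, assuming the standard textbook beta-binomial setup and invented numbers: put a prior distribution on a player’s unknown “true” on-base rate, feed in some observed plate appearances, and the posterior distribution combines the two.

    from scipy.stats import beta

    # Prior guess at the player's "true" on-base rate, centered near a
    # league-ish .330 (Beta(33, 67) has mean 0.33).  Numbers are invented.
    prior_a, prior_b = 33, 67

    # Invented data: 300 plate appearances, on base 105 times.
    on_base = 105
    not_on_base = 300 - on_base

    # Conjugate updating: the posterior for the underlying rate is again a Beta.
    posterior = beta(prior_a + on_base, prior_b + not_on_base)

    print("posterior mean:", round(posterior.mean(), 3))   # about 0.345
    print("95% interval:", [round(x, 3) for x in posterior.interval(0.95)])

The more plate appearances go in, the more the posterior concentrates around the observed rate and the less the prior matters, which is the “the closer we look at the data, the closer we can estimate the underlying probability” idea.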

It seems to me that baseball stats have changed in what they are trying to accomplish. They used to be counting stats, but with the new saber guys they’re trying to be Bayesian. In the old days, if you counted up someone’s RBIs, say, there is no probability involved: he really did drive in that many runs. And likewise with batting average: there’s no probability involved, Willie Mays really did get a hit 30.2% of the time.

But it seems to me that the saber guys aren’t trying to measure what *happened*. As a practical matter, they are attempting to predict what will happen next. So, for example, they are interested in how likely it is the guy will get on base in the next at-bat. Therefore they want to look at his OBP, which measures how often he got on base before, but deep down they’re not interested in it as a record of what happened, but only insofar as it is an indicator of what will happen.

But I often sense that it’s more than just prediction. The stats-geeks I encounter in discussion seem to be driven by a desire to measure how good each player is. But they aren’t satisfied by measuring what he has actually achieved (which might be due to “luck”); rather, they want to know how good he “really is”. This is where it starts to feel “Bayesian” to me, as I understand the idea. They are postulating that there is some underlying measure of how good a player that guy is, and all the new formulas they invent are efforts to capture that underlying truth.

Every actual stat is still descriptive, of course. Even if they invent some fancy formula that corrects for ballpark, normalizes against the league average, regresses to some predefined mean, etc, etc, it’s still taking real data of past events and performing some computation on it. But the way they write the formula to define that computation seems to be part of an ongoing quest to find the imaginary number that says how good the guy “really” is.

Hm, I guess that one’s not a question either.

(3) I’m wondering about uncertainty. I know that in polling there’s always a margin for error, but that’s kind of different, because in that case you’re counting things and there’s a real answer that you’re trying to derive from a sample. If there are a million events that can be out or not-out, you could conceivably count them all up and calculate a correct on-base percentage. Or if you don’t have time to compile it, you can count up 1,000 randomly selected ones instead and estimate based on that. That would give you a margin of error of the ordinary kind that I understand.

But it’s not like that in baseball. Some guy we’re looking at actually has only, say, 300 at-bats, and we really do have every one of them and can count them all. But we’re not trying to measure what his actual OBP really was; we’re trying to guess what his chances of getting on-base are for the future at-bats. So how do you measure the uncertainty then? I know it exists. I hear about “small sample size” all the time. If a guy has 1,000 at-bats his past OBP is a better predictor of future OBP than if he has only 20 at-bats. That is pretty intuitive, too.

I know that in this predictive sort of sampling, a large sample is always going to be better than a small sample. What I’m wondering is how we measure the relationship between the sample size and the certainty. Is there any basis for attaching an actual margin of error to an actual number of data points? Could I say that having 1,000 at-bats to look at lets me state a guy’s OBP with a margin of error of +/- X points at a 95% confidence level, and that with even more at-bats I could make X lower?
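For a simple proportion like OBP there is a standard back-of-the-envelope answer: if each trip to the plate is treated as an independent coin flip with the same underlying on-base probability (a strong simplification), the 95% margin of error shrinks like one over the square root of the number of trials. A quick sketch with invented numbers:

    import math

    def margin_of_error(p_hat, n, z=1.96):
        # Normal-approximation half-width of a 95% interval for a proportion.
        return z * math.sqrt(p_hat * (1 - p_hat) / n)

    p_hat = 0.350  # observed on-base rate, invented
    for n in (20, 100, 300, 1000):
        print(n, "trips to the plate: +/-", round(margin_of_error(p_hat, n), 3))
    # roughly +/- 0.209, 0.093, 0.054, and 0.030

Note that this only says how noisy the observed rate is as an estimate of a stable underlying rate; the separate question of how good the player “really” is, given everyone else’s numbers, is where the regression-to-the-mean and Bayesian machinery come in.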

The other thing I’m wondering, and we’re finally getting close to my specific example, is whether different formulas yield different levels of certainty. As you know, the SABR guys keep inventing fancier and fancier formulas in their attempt to better ascertain some player’s “true” skill level (which I guess would be your Bayesian “prior”, even though it isn’t really prior). If they look at more different factors to try to achieve that, does that put a greater demand on the data in terms of sample size required to reach the same level of certainty?

OK, here’s my example. There are two pitching stats out there which attempt to improve on ERA. Whereas ERA just measures how many runs the pitcher actually did give up, FIP and tRA attempt to isolate the pitcher’s actual pitching skill from external factors like how well the defense played, ballpark factors, and of course “luck”. The two stats take a similar approach, but one is more complicated than the other. FIP sorts every at-bat into four categories: home run, walk, strikeout, or ball in play. It then runs those counts through a formula that weights each category according to how many runs that particular event creates on average (which I assume is in turn derived from data over the whole league), so that the resulting number is parallel to ERA but presumed a better measure of the pitcher’s actual skill.
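For reference, the commonly published version of FIP works roughly as described: count home runs, walks plus hit batters, and strikeouts, weight them 13, 3, and minus 2, divide by innings pitched, and add a league constant chosen each season so that league-average FIP matches league-average ERA. A sketch with an invented season line (the 3.10 constant is just a placeholder):

    def fip(hr, bb, hbp, k, ip, league_constant=3.10):
        # Commonly published weights; the constant is recomputed each season
        # so that the league-average FIP matches the league-average ERA.
        return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + league_constant

    # Invented season line: 200 innings, 20 HR, 50 BB, 5 HBP, 180 strikeouts.
    print(round(fip(hr=20, bb=50, hbp=5, k=180, ip=200.0), 2))  # about 3.4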

tRA does the same thing, but it further subdivides the balls in play into ground balls, fly balls, and line drives, weighting each of those individually. Fans of tRA like to characterize it as “more accurate” than FIP, because it’s more precise in its analysis. My intuition balks at that, because it feels like the greater subdivision of results ought to carry with it a greater uncertainty in the weighting factors. Imagine we could subdivide the balls in play with even greater precision. Suppose we had the exact speed, trajectory, and initial position of every ball put in play. We could categorize those as finely as we choose, up to a maximum precision at which every at-bat sits in its own category of one. But if we do that, then the weighting factor for each category is just the actual result of that one at-bat, at which point we’re back to measuring what actually happened with all the “luck” it entails, bringing us right back to the ERA we were trying to avoid in the first place.
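That intuition can be checked directly: hold the total number of balls in play fixed, split them into more and more categories, and watch the uncertainty in each category’s average run value grow. A toy simulation, with every number invented:

    import numpy as np

    rng = np.random.default_rng(0)
    total_bip = 120_000                             # balls in play, invented
    run_values = rng.normal(0.05, 0.45, total_bip)  # invented run value of each ball

    for n_categories in (1, 3, 30, 3000):
        per_category = total_bip // n_categories
        sample = run_values[:per_category]
        # Standard error of that category's average run value (its "weight").
        se = sample.std(ddof=1) / np.sqrt(per_category)
        print(n_categories, "categories:", per_category, "balls each, s.e. of weight ~", round(se, 4))

Push the number of categories all the way to one ball per category and each “weight” is just that single observed outcome, which is exactly the reductio described above.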

My reply: Wow–there’s someone named Tom Tango? That’s a great name, at the Vance Maverick or Larry Lancaster level of greatness, I’d say.

OK, now back to the questions, in order:

1. In any field, you need innovators and you need people to apply the innovations. From that perspective, I don’t think it’s so horrible that some poorly-informed people are obsessing on the latest update to on-base percentage or whatever. From general numeracy considerations, though, I completely agree that the third decimal place can’t be so important. That’s the kind of lexicographic-ordering thinking that I hate, and it appears in so many different contexts (for example, when people want “the best” of anything without thinking about dollar cost, opportunity cost, etc.).

But, if people are going to be innumerate, I’d rather they be innumerate about on-base percentage than about batting average.

2. There’s often a fuzzy line between (a) making inferences, and (b) simply “measuring what happened.” I mean, what’s a “save”? For that matter, what’s a “hit”? Etc. These definitions are constructed to be relevant for inferential questions about players’ abilities and contributions to the team.

One way things are changing is that there’s a ton of raw, raw data–where every ball landed on the field, things like that. In that case, the steps going from raw data to inference are going to be more apparent. With old-fashioned statistics such as batting and fielding averages, it can be easier to fool yourself into thinking of them as pure measurement.

Regarding your question/comment about abilities vs. achievements: this is something I’ve talked about on occasion with my collaborator Hal Stern. To take an extreme example, suppose two teams play in the World Series, and Team A is obviously better, both in the regular season (as measured by won-lost record and maybe also by Bill James-like aggregates of individual statistics) and even in the series (hitting better, pitching better, and scoring many more runs than Team B), but Team B still manages to eke out a 4-3 series victory on the basis of a couple of 5-4 and 3-2 victories achieved by lucky bloopers into center field when there happened to be a guy on second base, etc. I.e., luck. Well, it doesn’t matter, right? Team B is the winner.
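To attach a rough number to how much room luck has in a short series: suppose the better team wins any single game with probability 0.55 (that per-game edge is an assumption, and a big one by regular-season standards). A quick simulation of best-of-seven series, purely as an illustration:

    import random

    def series_winner(p_game, wins_needed=4):
        # One best-of-seven series; p_game is the better team's per-game win probability.
        a = b = 0
        while a < wins_needed and b < wins_needed:
            if random.random() < p_game:
                a += 1
            else:
                b += 1
        return "better" if a == wins_needed else "worse"

    random.seed(1)
    n_series = 100_000
    upsets = sum(series_winner(0.55) == "worse" for _ in range(n_series))
    print("the worse team wins the series about", round(upsets / n_series, 2), "of the time")
    # roughly 0.39 under this assumption

In other words, even a clearly better team loses a short series a sizable fraction of the time, which is why the series outcome by itself tells you so little.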

In evaluating players, though, you want to factor out the luck, as well as you can–especially if your goal is to evaluate how well the player will perform in future years. So I think it’s important to make your inferential goal clear (which I think Bill James has been pretty good at, in general). This sort of thing might help resolve some of your specific questions about tRA etc.

P.S. I’ve been told that Tom Tango is not his real name. Bummer.

14 thoughts on “Baseball stats: innovation, randomness, and other issues”

  1. When reading baseball statistical arguments, I get the sense that no one has yet managed to come up with a good way to measure the defensive skills of position players. Probably for that reason, the importance of defense at non-pitching positions tends to be underrated.

  2. As a baseball fan (my team, the Seattle Mariners, employs Tom Tango) I totally agree with Ubs' observations about the current oft-misguided fascination with statistics in baseball. And it really makes me glad that people like Ubs exist who have the curiosity and initiative to seek expert advice on these topics.

    It has always struck me that a large proportion of statisticians are baseball fans. And I feel like baseball is, more and more, becoming the reason that some choose to pursue a study of statistics (whether at the basic or advanced level).

    And finally, I would add in response to Ubs' question 3 on margin of error: Yes, you can. There is a basis for quantifying the amount of error, although it gets complicated very quickly. In the case of your question it might behoove you to start on a basic level by looking at the "t" distribution, although I would encourage you to go with your curiosity and take a statistics class or two, which will give you a much more complete and satisfying understanding.

  3. To address a few of the themes in the discussion (Bayesian-ness and fielding stats) – I can point you to a great technical article by Shane Jensen, Ken Shirley and Abraham Wyner in the latest Annals of Applied Statistics – which fits a Bayesian hierarchical model on fielding, based on some very detailed data.

    This paper got some press for its conclusion about how bad Derek Jeter's fielding really is, a neverending source of debate between sabr-heads and old-schoolers.

    see also http://stat.wharton.upenn.edu/~stjensen/research/

  4. Wow, baseball! I work for an MLB team and I'm also a Bayesian (and I own all of your textbooks). Baseball is getting more sophisticated, but raising the issues of risk and uncertainty is demanding a lot from the average person. In baseball, like in many other fields, people want *the* answer, not something they perceive as wishy-washiness or hand-waving.

  5. But, if people are going to be innumerate, I'd rather they be innumerate about on-base percentage than about batting average.

    Why do you prefer this? It's not that much more difficult to compute OBP (same concept, divide successes by opportunities), and it seems to balance positive events (hits, walks, HBPs) vs. negative ones (outs) better.

  6. Ed – have you ever been to the Fangraphs site? One could argue that some stat-types are starting to overvalue defense (or at least, it could be argued that people overestimate the reliability of the defense stats we have at this time). Look up the league leaders on Fangraphs, go to the "value" section, and see who ends up on top. Or google for their article on Adam Dunn vs. Nyjer Morgan.

    At any rate, defense stats are getting better and they definitely receive a lot of attention.

  7. This is a good discussion. Couple of stray comments:

    About tRA. I don't think there is a problem with too much subdivision of the data. There are tens of thousands of ground balls, fly balls, and line drives every year. The issue is something else: ground balls given up by one pitcher may be consistently easier or harder to field than the average. Ditto for line drives and fly balls. As well, some pitchers are better with runners on base, so they give up fewer runs than you would expect just looking at the average result of an at-bat.

    Tom Tango: Great analyst, but I'm pretty sure the name is a pseudonym.

    "Pure measurement": Some baseball stats are indeed pure measurements. Wins and runs, for example. Hits, not quite, because the official scorer subjectively assigns errors. For any particular play, a lot of people are involved — pitcher, batter, fielders, umpires — but the result can be described exactly.

    Measuring defensive skills: there's a lot of progress to be made, but UZR (at Fangraphs) is quite good. Ordinary Zone Rating (at ESPN.com) isn't half bad either.

  8. What struck me in the original questioner's essay was that he has the cart before the horse. A working sabermetrician's goal is not to make predictions. Predictions are part of the hypothesis-testing process. It's just the scientific method: Look at data and make a hypothesis. Then test the hypothesis, tweak the hypothesis, test again, repeat. So you have the data and come up with a method of, say, evaluating defense. You make predictions based on that hypothesis. Then you wait for a season or more and check to see how accurate your predictions were and where they were off the most. You correct the hypothesis as best you can, and repeat.

    Readers, of course, treat the predictions as a goal. What they want is not a method that becomes accurate over time; they want to know how their favorite teams and players are going to do next year. So a working sabermetrician, simply to get an audience, has to cater to that. But it's not actually the goal of the sabermetrician's work. That goal is to persevere with the scientific method, getting, hopefully, better and better results. Predictions are just the first step in testing an existing hypothesis.

    As for accuracy, the main issue, right now, is standards. To make up an extreme fictional example, let's say you have a method for evaluating offense that gets every Cardinal exactly right, except that it predicts Albert Pujols to hit .204, with 4 homers and 24 RBI, all season. Another method gets everyone off by 2%. Which method is better? Judged by how many players it got exactly right, it's the first one. The prediction for Pujols is terrible, but it's the only error. Any sane statistician, though, is going to choose method 2, which means he has to come up with some standard more thoughtful than just counting the predictions a method nailed. This is where you get competing methods, both claiming the best accuracy: competing standards. (There's a small numerical sketch of this at the end of this comment.)

    In fact, one of the best ways to choose a "favorite" sabermetric method is to see which sabermetricians publish their standards as well as their methods, and go with those guys. The guys who don't publish standards may claim accuracy, but you don't know what standard they are using to make that claim. The guys who don't even bother to publish their methods are asking you to take their results on faith. Don't. Statistics are not faith. Encourage that sabermetrician to do better.

    – Brock Hanke
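    A small numerical sketch of the "competing standards" point above (this is an added illustration, not part of the comment itself; the players and numbers are invented):

        import numpy as np

        actual   = np.array([.330, .300, .280, .270, .265, .260, .255,
                             .250, .245, .240, .235, .230, .225])  # invented "true" averages
        method_1 = actual.copy()
        method_1[0] = .204            # perfect on everyone except one big miss
        method_2 = actual * 0.98      # everyone off by 2%

        for name, pred in (("method 1", method_1), ("method 2", method_2)):
            err = np.abs(pred - actual)
            print(name,
                  "| RMSE:", round(float(np.sqrt(np.mean(err ** 2))), 4),
                  "| worst miss:", round(float(err.max()), 3),
                  "| players within 5 points:", int((err < 0.005).sum()))

    RMSE prefers the second method, while a count of players you essentially nailed prefers the first; the ranking depends entirely on the standard you choose to publish.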

  9. Your own example about factoring out luck shows how difficult that might be. What is the difference between a "lucky blooper" and a skillful one? Say, one where a batter manages to get a wicked sinker in the air just enough? Even if the batter never did it before, against a given pitcher that might be an amazingly skillful hit. You can always add more information here, but how predictive is this stuff, really? You read about the A's back when and such, but I don't see the money flowing that way; it is going pretty much toward athleticism and strong arms. Fairly simple predictors really are what the people with the most to lose are relying on.

  10. It's a Darwinian world. Most of the new stats based on rich, new data will not be useful; some will. Eventually the field will settle down — but it will be less exciting then. Bill James will be gone, Tom Tango will be replaced by Wally Waltz …

    As to prediction, there's one of my favorite statistics quotes:

    "The race is not always to the swift nor the battle to the strong, but that's the way to bet."

    ~Damon Runyon, "More Than Somewhat," in reference to Ecclesiastes 9:11, "I returned, and saw under the sun, that the race is not to the swift, nor the battle to the strong, neither yet bread to the wise, nor yet riches to men of understanding, nor yet favour to men of skill; but time and chance happeneth to them all."

    Source: http://www.quotegarden.com/gambling.html

  11. I can't believe no one's recommended Jim Albert and Jay Bennett's awesome book Curve Ball.

    It addresses pretty much every question brought up in this post. It's a great intro to the Bayesian way of thinking about statistical modeling, even if you're not a huge fan of baseball.

    The book's not at all challenging technically (i.e. no calculus or matrices, just counting and graphs), yet clearly introduces all of the important statistical concepts from modeling uncertainty to Markov chains. The examples are fantastic, ranging from batting averages to expected runs given a situation to game-changing play potential to streakiness.

    P.S. There are predictive applications if you work for a major league team or are involved in sports betting on either side. Then there's simulation-based modeling for fantasy baseball. Albert and Bennett's book even explains how to use simulation for prediction just like Andrew suggests in his books.

    P.P.S. Albert's Bayesian stats in R book provides more details along with the R and BUGS code for many of the models discussed in Curve Ball. Albert also wrote an intro (non-calc) stat book based on baseball examples, but I found Curve Ball more interesting and more insightful.

  12. My experience attending meetings of SABR is that sabermetricians (Bill James included) are very clever at coming up with new measures – some useful, some not. What many lack is a good grounding in statistics – Bayesian or otherwise. Generally, the goal is to come up with another statistic to rank players. But the rankings don't seem to change too much. Thus, developing and studying models of the underlying structure of baseball is mostly lost on them, and you will rarely, if ever, find those papers in the Baseball Research Journal. Mark Pankin, with his work on Markov models for run/out situations, is an exception.
