So we’re 106 games into the season and the Braves have scored 601 runs and allowed 442. Let me start this exercise by pointing out that there is no real reason to want to figure out what those numbers will be after 162. I’m using “Will the Braves finish first in both most runs scored and fewest runs allowed in the NL” as a subterfuge to do something I’ve wanted to do for some time: explore the distribution of runs scored. There are a lot of things you can do if you have a decent handle on that question – projecting end-of-season run totals is a simple one.
So there are lots of ways to project end-of-season runs (scored or allowed – I’m going to use the same method for each of them, so for simplicity in explanation let’s just think about runs scored.) The simplest, of course, is to extrapolate current runs per game: 601/106 = 5.67, and 5.67 x 162 = 919. That’s not a terrible method, but it has two problems. The first is that it’s really susceptible to outliers: a 22-run game is highly unlikely to be repeated in the remainder of the season, but by itself it accounts for about 0.2 of those 5.67 runs per game (22/106).
Second, while you might get a good idea of the mean number of runs, this method gives you no way of estimating the uncertainty. How likely is 900 or more? How likely is 875 or less?
To answer this sort of question, you need a model of the distribution of runs scored. There are a lot of such models, but I wanted to experiment with one that hasn’t been used that much because the computer power to estimate it has been in short supply until the last ten years or so. Plus, it lets me use a cool phrase: Zero-inflated negative binomial regression. (OK. The coolness of that depends on your having a fairly uncool peer group, but I do.)
I’m not the first person to use ZINB (as the coolest kids say over their Red Bull sessions) for this purpose. The pseudonymous “Patriot” discussed it here 11 years ago and noted that he was far from the first to think about runs this way. He knows what he’s doing (and gives a pretty good explanation of what it’s all about) but he closes the series with this: “Hopefully I’ve provided enough promising results to encourage those of you who are skilled at this type of problem to consider the negative binomial as a model for runs per game.”
Challenge accepted, Patriot.
The basic idea is to come up with a formula for the probability that a team scores i runs in a game, for all possible values of i. So you need a probability of 0 runs, a probability of 1 run, etc., up to any particular level you want, though in practical terms 16 runs is probably enough. The simplest way to implement that is to just use the empirical percentages: for example, the Braves have scored 3 runs in 10 games so far this year. 10/106 = 9.4%. An advantage of this method is that we have an easy way to simulate the uncertainty. We use this empirical distribution to simulate the remaining 56 games. For each game we pick a run total with the probabilities established by the empirical distribution. Once we have all 56 games simulated, we add up the simulated runs scored. That’s now one possible future. We simulate 10,000 possible futures and we now have a distribution of incremental runs scored that we can add to the 601 observed. Now we can tell you the probability of 875 or less… it’s just the fraction of the 10,000 simulated season totals that come in at 875 runs or fewer.
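For anyone who wants to play along at home, here is a minimal Python sketch of that resampling idea. It is not the code behind the numbers in this post: the function names are made up for illustration, and the ten-game log at the bottom is a hypothetical stand-in, not the Braves’ actual game log.

```python
import random

def simulate_season_totals(observed_runs, games_left, n_sims=10_000, seed=0):
    """Resample past game scores to simulate final season run totals."""
    rng = random.Random(seed)
    runs_so_far = sum(observed_runs)
    totals = []
    for _ in range(n_sims):
        # Each remaining game is drawn from the empirical distribution,
        # i.e. uniformly (with replacement) from the games already played.
        future = rng.choices(observed_runs, k=games_left)
        totals.append(runs_so_far + sum(future))
    return totals

def prob_at_most(totals, threshold):
    """Fraction of simulated seasons finishing at or below the threshold."""
    return sum(t <= threshold for t in totals) / len(totals)

# Hypothetical 10-game log, NOT real data; swap in a real per-game list.
sample_log = [3, 5, 0, 7, 4, 6, 2, 11, 4, 5]
totals = simulate_season_totals(sample_log, games_left=56)
print(prob_at_most(totals, 310))
```

With a real 106-game log in place of sample_log, prob_at_most(totals, 875) would be the “875 or less” probability described above.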
So that method will work, but it still has a problem. No one thinks the empirical probabilities are that predictive. The Braves have had 15 4-run games, 11 5-run games and 13 6-run games, but no one thinks that 6-run games are really more likely than 5-run games. The empirical probabilities describe what has happened, but include substantial uncertainty, and you’d like to squeeze that uncertainty out before you simulate.
Instead, what we want is a discrete probability function which creates the required distribution as a function of a few parameters which we will select to match the observed data as well as we can. The great advantage of doing so is that we can use the divergences observed between the observed fractions (like the 9.4% above) and the estimated fraction to get an idea of uncertainty. For a number of technical reasons, the ZINB model has appealed to people, but it has always been a pain to calculate the parameters. However, now it’s a lot easier to do. So I did.
But before discussing the predictions, I want to talk a little about a philosophical underpinning: every day is different, every day the team is a little different, every day your opponents are a little different, every day you feel a little different, so why is it we think there’s an underlying distribution at all? I want to be clear as someone trained in the philosophy of statistics: That’s a deep and controversial question! And I’m not going to be able to really answer it. (I can be controversial, but I’m rarely that deep.)
Essentially, though, underlying the theory that there’s a functional form that underlies the probabilities is that life is constant enough in its inconstancy that we can derive models that inform us about something we want to know. We can’t really know this. When two statisticians argue over either the functional form to use or the data to use to derive the parameters (and occasionally over the method to use) they are usually arguing about the nature of reality, not the nature of statistics. The numbers inform their views, but they cannot be decisive. As the adage occasionally attributed to Yogi Berra has it: Prediction is hard, especially about the future.
So I’m going to choose to believe that there is a unified entity called The Atlanta Braves and another one called The New York Mets, etc, even though I know that the remaining games will be played by different people on both of those clubs and that their motivations and psychology and skills in August and September may be very different than they were in April-July. My assumption (and yes, I certainly just made an ass of myself, and you too maybe) is that all of those things average out. But there’s no way to prove it, and hindsight is guaranteed to make fools of most of us. That’s the fun part!
Second, even if there is a defined thing called “the probability that the Atlanta Braves will score 7 runs in a game from August through September” we are never going to have all the data we might like to measure it precisely, nor are we going to have enough data to pin it down even if we knew the exact functional form. That seems to really trouble a lot of people, but it doesn’t trouble me. As a great statistician remarked: All models are wrong, but some are useful. And usefulness is a personal quality.
So I estimated a ZINB curve for all 30 MLB teams. There are two parameters for which each team gets its own value, reflecting (in essence) the strength of the team and the probability that it will be shut out in any particular game, which is taken to be partly independent. (That shutout probability is the “ZI” part of ZINB.) Then there is another variance parameter that is shared by all the teams.
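To make the three parameters concrete, here is a sketch of one common way to write the ZINB probabilities in Python. The names mu (team strength, the mean of the negative binomial part), r (the shared dispersion) and pi_zero (the extra shutout probability) are my labels for this illustration, and the mean/dispersion parameterization is an assumption; the fitted model may be parameterized differently. The parameter values are made up, not the fitted estimates.

```python
import math

def nb_pmf(k, mu, r):
    """Negative binomial probability of k runs, with mean mu and dispersion r."""
    p = r / (r + mu)
    # Work on the log scale so the gamma functions do not overflow.
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))

def zinb_pmf(k, mu, r, pi_zero):
    """With probability pi_zero the game is an 'extra' shutout; otherwise NB."""
    base = (1 - pi_zero) * nb_pmf(k, mu, r)
    return pi_zero + base if k == 0 else base

# Illustrative values only (assumed, not estimated from any data).
probs = [zinb_pmf(k, mu=5.5, r=4.0, pi_zero=0.03) for k in range(40)]
for k in range(8):
    print(k, round(probs[k], 3))
```

Note that a shutout can come from either piece: the inflation term pi_zero, or the negative binomial part landing on zero by itself. In this form the mean is (1 - pi_zero) x mu.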
Here are the graphs showing the actual (through 8/1) and estimated probabilities of both runs scored and runs allowed for your Atlanta Braves.
If you’re of a certain malevolent cast of mind, you’ll say “Hey, the curve and the actual are really different,” whereas if you’re a kind and gentle soul you’ll say, “Nice! Pretty good representation!” And I have 29 other graphs that you will similarly either find interesting or reflective of actual stupidity.
But it doesn’t matter. Armed with a model I can answer questions. I can simulate the remaining games from the schedule and answer the question I originally asked: what is the probability that the Braves lead the NL in both runs scored and fewest runs allowed?
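The simulation itself can be sketched roughly as follows. This is a simplified illustration, not the actual code: the three teams and all their numbers are hypothetical, the run totals are capped at 40 per game for convenience, and ties are broken by listing order rather than by anything sensible.

```python
import math
import random

def zinb_pmf(k, mu, r, pi_zero):
    """ZINB probability of k runs (mean/dispersion form, an assumed parameterization)."""
    p = r / (r + mu)
    log_nb = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
              + r * math.log(p) + k * math.log(1 - p))
    base = (1 - pi_zero) * math.exp(log_nb)
    return pi_zero + base if k == 0 else base

def leader_counts(teams, shared_r, n_sims=2000, max_runs=40, seed=1):
    """Count how often each team finishes with the most runs scored."""
    rng = random.Random(seed)
    support = list(range(max_runs + 1))
    weights = {name: [zinb_pmf(k, t["mu"], shared_r, t["pi_zero"]) for k in support]
               for name, t in teams.items()}
    wins = {name: 0 for name in teams}
    for _ in range(n_sims):
        finals = {}
        for name, t in teams.items():
            # Draw every remaining game for this team from its ZINB curve.
            future = rng.choices(support, weights=weights[name], k=t["games_left"])
            finals[name] = t["runs_so_far"] + sum(future)
        wins[max(finals, key=finals.get)] += 1  # ties go to the first team listed
    return wins

# Hypothetical teams and parameters, for illustration only.
teams = {
    "Team A": {"runs_so_far": 600, "games_left": 56, "mu": 5.6, "pi_zero": 0.02},
    "Team B": {"runs_so_far": 590, "games_left": 56, "mu": 5.7, "pi_zero": 0.02},
    "Team C": {"runs_so_far": 520, "games_left": 56, "mu": 4.9, "pi_zero": 0.04},
}
wins = leader_counts(teams, shared_r=4.0)
print(wins)
```

For runs allowed the same loop works with min in place of max. Like the model described here, this sketch treats every game as an independent draw and ignores the opponent entirely.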
In 10,000 simulations, the Braves won NL runs scored 5,721 times while the Dodgers won 4,269 times and the Cubs won 10 times. In runs allowed, though, the Padres won 8,003 times with the Braves winning 1,174 and a smattering of wins for the Giants (356), Brewers (208), Marlins (150), Phillies (70), Cubs (38) and yes, the Mets win once, although the model doesn’t know about dumping Scherzer and Verlander and a bunch of hard-earned Steven Cohen money. (Then again, that might actually improve the Mets’ defensive results… who knows?)
So, as of today, the Braves’ probability of winning both is .5721 x .1174 (treating the two races as independent), or a little under 7 percent. Expected runs scored is 907 with a standard deviation of 26 runs; expected runs allowed is 674 runs with a standard deviation of 22 runs.
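If you want to check the arithmetic, or get a rough answer to the “900 or more” and “875 or less” questions from the top of the post, here is a small sketch. The product uses the simulation counts quoted above. The second half is my own shortcut, not a model output: it assumes the simulated season total is close enough to a normal curve with mean 907 and standard deviation 26, so treat those numbers as approximations.

```python
from statistics import NormalDist

n_sims = 10_000
p_scored = 5_721 / n_sims    # share of simulations where the Braves lead in runs scored
p_allowed = 1_174 / n_sims   # share where they lead in fewest runs allowed
p_both = p_scored * p_allowed  # assumes the two outcomes are independent
print(f"P(both) = {p_both:.4f}")

# Normal approximation to the simulated runs-scored total (an assumption).
scored = NormalDist(mu=907, sigma=26)
p_900_plus = 1 - scored.cdf(899.5)   # half-run continuity correction
p_875_or_less = scored.cdf(875.5)
print(f"P(900 or more) ~ {p_900_plus:.2f}")
print(f"P(875 or less) ~ {p_875_or_less:.2f}")
```

The cleaner way to get the tail probabilities is to count simulated seasons directly, as with the leader counts; the normal curve is just a convenience when all you have is a mean and a standard deviation.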
The biggest flaw in the model is that it doesn’t account in any way for strength of schedule: the Braves’ chance of scoring 5 runs in a game is independent of opposing pitcher or opposing team. The structure to do that is entirely straightforward, but there are some technical reasons, having to do with near-singular matrices, why it’s hard to do, and I won’t go into them at the moment. I’m working on it. But since it probably won’t be finished until everyone already knows the answer to this, and because it’s an off-day, I’m posting now.
I’m opening for comments, so please make them if you’ve got them.
[Editor’s note: I’m closing the comment thread on this post; if you want to comment, please go over to Cliff’s game recap.]