Imagine this: it’s the bottom of the 6th inning and the bases are loaded. You’re walking up to the plate, down two runs, and Gerrit Cole is staring you down from the mound as he decides what to throw for the first pitch of the at-bat. Now, what if I told you that I could predict the next pitch he’d throw with 52% accuracy? Would you want to know what’s coming?
Okay, predicting the next pitch in an at-bat sounds a little crazy and perhaps even slightly magical, but I’m no Tolkien. Instead, I’m a dude who loves baseball and takes any opportunity to work on baseball data projects. For a class I took this spring (shout out to Professor Tomuro and the Neural Network and Deep Learning class), the final project tasked us with using Neural Networks to solve a problem. Though not many of us face the issue of battling Gerrit Cole, I thought it would be a fun project! Let’s get the obvious caveats out of the way. Yes, a detailed scouting report could be more informative and capture the nuances of a pitcher’s strategy, especially as they tinker throughout the year. And yes, it’s possible that a Neural Network is not the best model for this problem, but hey, it was for class and I didn’t want to tick off the professor. For the sake of my sanity and this post, let’s pretend this is the only source we have!
With this out of the way…what in the heck is a Neural Network? Simply put, a Neural Network is a machine learning algorithm loosely modeled after the human brain, hence the name! It takes input data (at-bat logs), learns the patterns present within the data (Gerrit Cole really likes to throw fastballs), and predicts an outcome (here comes that fastball we mentioned earlier).
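To make the "predicts an outcome" part a bit more concrete, here is a minimal sketch of the last step of a typical classification network: turning raw scores into probabilities with a softmax and picking the most likely pitch. The pitch names and scores are made up for illustration, and this is not the code from my notebook.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps math.exp from overflowing on large scores.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores a network might produce for one situation.
pitch_types = ["fastball", "slider", "curveball"]
raw_scores = [2.0, 1.0, 0.1]

probs = softmax(raw_scores)
predicted = pitch_types[probs.index(max(probs))]

for name, p in zip(pitch_types, probs):
    print(f"{name}: {p:.2f}")
print("predicted next pitch:", predicted)
```

The prediction is simply the pitch type with the highest probability, so "accuracy" is how often that top pick matches what was actually thrown.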
For this project, I used a data set from Kaggle that had MLB at-bat logs from 2015-2019, but I focused only on the 2019 season. I cleaned the data, engineered features (used existing features to create new ones), performed an exploratory data analysis, trained the model, and predicted outcomes. I was specifically interested in predicting the next pitch type of an at-bat based on the current in-game situation. Because of this, I only included information that pertained to the in-game situation (inning, score, count, handedness of the batter, etc.). After too much time spent trying to get the most accurate model, I finally ran the model and generated outputs.
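As a rough illustration of what "feature engineering" on the in-game situation can look like, here is a small sketch that turns one situation into a numeric vector a model could consume. The field names, the chosen features, and the example values are all assumptions for illustration; they are not the Kaggle column names or the exact features from the notebook.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Situation:
    """One pitch's in-game context. Field names are hypothetical."""
    inning: int
    balls: int
    strikes: int
    outs: int
    batter_stands: str  # "L" or "R"
    pitcher_score: int  # runs scored by the pitcher's team
    batter_score: int   # runs scored by the batter's team
    on_1b: bool
    on_2b: bool
    on_3b: bool


def engineer_features(s: Situation) -> list[float]:
    """Turn a raw situation into a numeric feature vector."""
    # One-hot encode the count: 4 ball values x 3 strike values = 12 slots.
    count_one_hot = [0.0] * 12
    count_one_hot[s.balls * 3 + s.strikes] = 1.0

    # New features built from existing ones.
    score_diff = float(s.pitcher_score - s.batter_score)
    runners_on = float(s.on_1b + s.on_2b + s.on_3b)

    return [
        float(s.inning),
        float(s.outs),
        1.0 if s.batter_stands == "L" else 0.0,
        score_diff,
        runners_on,
        *count_one_hot,
    ]


# The opening scenario: 6th inning, bases loaded, batting team down two,
# first pitch of the at-bat. Outs, handedness, and scores are made up.
opening = Situation(inning=6, balls=0, strikes=0, outs=1, batter_stands="R",
                    pitcher_score=2, batter_score=0,
                    on_1b=True, on_2b=True, on_3b=True)
features = engineer_features(opening)
print(features)
```

The score difference and the number of runners on base are examples of new features derived from existing ones, which is all "feature engineering" really means here.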
Without further ado, here are the results! As you can see, I picked out 9 current MLB pitchers and ran only their own pitching logs through the model, one pitcher at a time. I did this because there is no use training a model on Gerrit Cole’s patterns to try and predict Lucas Giolito’s next pitch type…that doesn’t make much sense! Though, for fun, I did also train the model on all pitchers from 2019, just because I was curious.
One thing to consider when evaluating the results of the model is how many pitch types a pitcher throws. For example, Yu Darvish was logged as having thrown 9 different types of pitches in 2019. Granted, he did not throw each pitch equally often, but if you were to randomly guess the next pitch, you’d have about an 11% chance (1 in 9) of guessing correctly. The model was able to correctly predict his next pitch 30% of the time.
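The random-guess comparison above can be sketched in a few lines. Since pitchers don’t throw every pitch equally often, a second, tougher yardstick is to always guess the pitcher’s most common pitch. The pitch log below is a toy example, not real 2019 data, and the function names are my own for illustration.

```python
from collections import Counter


def uniform_baseline(num_pitch_types):
    """Chance of a correct guess when picking a pitch type at random."""
    return 1.0 / num_pitch_types


def majority_baseline(pitch_log):
    """Accuracy from always guessing the most frequently thrown pitch."""
    counts = Counter(pitch_log)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(pitch_log)


# Nine pitch types, as logged for Yu Darvish in 2019.
print(f"random guess, 9 pitch types: {uniform_baseline(9):.0%}")

# Toy log (made up): 5 four-seamers, 3 sliders, 2 curveballs.
toy_log = ["FF"] * 5 + ["SL"] * 3 + ["CU"] * 2
print(f"always guess the favorite pitch: {majority_baseline(toy_log):.0%}")
```

By the same math, a pitcher with 6 pitch types gives a random guesser about a 17% chance, which is a useful number to keep in mind for the comparison that follows.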
Another interesting observation is that Hyun-Jin Ryu was logged as having thrown 6 types of pitches. Two other pitchers I focused on also threw 6 types of pitches in 2019: Marcus Stroman and Trevor Bauer. The model was able to predict Ryu’s next pitch type 29% of the time, but it predicted correctly 36% and 38% of the time for Stroman and Bauer, respectively. Why the difference? My best guess is that Ryu split his choice of pitch more equally among the pitch types in his arsenal, whereas Stroman and Bauer leaned on a few pitches. Or, perhaps Ryu did a better job of mixing up when he’d use a pitch, depending on the in-game situation. Or, a combination of the two…or neither of these hypotheses! One of the downsides of Neural Networks is that they’re a “black box” algorithm, so you can’t easily see how the features interact to produce a prediction.
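One way to check the first hypothesis without opening the black box would be to measure how evenly a pitcher spreads his pitches, for example with normalized Shannon entropy (1.0 means a perfectly even mix, values near 0 mean he leans on one pitch). This is a sketch of that idea, not something from the notebook, and both pitch logs below are made up rather than taken from Ryu, Stroman, or Bauer.

```python
import math
from collections import Counter


def normalized_entropy(pitch_log):
    """Evenness of a pitch mix: 1.0 is perfectly even, 0.0 is one pitch only."""
    counts = Counter(pitch_log)
    if len(counts) < 2:
        return 0.0
    total = len(pitch_log)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the maximum possible entropy for this many pitch types.
    return entropy / math.log2(len(counts))


# Two made-up six-pitch arsenals of 60 pitches each.
pitch_types = ["FF", "SL", "CH", "CU", "SI", "FC"]
even_mix = [p for p in pitch_types for _ in range(10)]
skewed_mix = (["FF"] * 40 + ["SL"] * 10 + ["CH"] * 4
              + ["CU"] * 3 + ["SI"] * 2 + ["FC"] * 1)

print(f"even mix:   {normalized_entropy(even_mix):.2f}")
print(f"skewed mix: {normalized_entropy(skewed_mix):.2f}")
```

If Ryu’s real pitch log scored noticeably closer to 1.0 than Stroman’s or Bauer’s, that would support the "even split" explanation. It would not say anything about the second hypothesis, since entropy ignores the in-game situation entirely.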
(How to read: the size of the box is relative to the number of that pitch type thrown in 2019, by the corresponding pitcher).
With all of this in mind, would you want to know, or would you trust your instincts?
The notebook showing some of the code for this project can be found on GitHub here!