Recently, I spent some time thinking about possible machine learning projects to work on. I had planned on taking part in Kaggle's March Madness Machine Learning competition, but school commitments left me unable to fit it into my schedule. Consequently, I set my mind on doing another sports-related analysis. The NFL Draft was right around the corner at the time, so I decided it would make a good target.
Others have attempted similar problems before, most notably here. For my analysis, I decided to look at similar data and see what sort of predictions I could make.
Process and Plans
I decided to work on this project from the ground up, starting by scraping all of the data myself. I had been meaning to do a project involving web scraping in R for a while, so I used it here. My primary source of data was Pro Football Reference's draft pages; for example, https://www.pro-football-reference.com/years/2017/draft.htm is the link I used for the 2017 draft. From there, I was able to iterate through all of the players and get their college stats and combine results, the latter from their NFL pages. When collecting the data, I noticed that tackling stats were not recorded for defensive players until the 2005 college season. As a result, I decided to only use data from 2008 onwards, giving me 10 years of data.
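For illustration, here is a rough sketch of the year-by-year loop, shown in Python for consistency with the analysis code later in the post (my actual scraper was written in R); the table position and header row on the page are assumptions that may need adjusting:

```python
# Rough Python analogue of the year-by-year draft scrape (my actual
# scraper was in R). pandas.read_html pulls the HTML tables from each
# year's page.
import pandas as pd

BASE_URL = "https://www.pro-football-reference.com/years/{year}/draft.htm"

frames = []
for year in range(2008, 2018):  # 2008 through 2017
    # Assumes the draft table is the first table on the page and that
    # its header spans two rows (hence header=1); adjust if needed.
    tables = pd.read_html(BASE_URL.format(year=year), header=1)
    draft = tables[0]
    draft["DraftYear"] = year
    frames.append(draft)

drafts = pd.concat(frames, ignore_index=True)
drafts.to_csv("drafts_2008_2017.csv", index=False)
```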
There were a few drawbacks in my data collection methodology that I only realized very late in my analysis. First of all, I only collected data for players that were drafted; there was another link on the website that would have given me the information for undrafted players as well. Additionally, a lot of players had missing combine stats. As a result, my imputation may have been somewhat unreliable; it would have been good to find additional sources, such as pro day results, to get more accurate numbers.
After scraping the data for all the years, I used the "mice" R library for imputation. I went back and forth between multiple imputation and single imputation; I ended up using a single imputed dataset from a multiple-imputation run, which may not have been the best approach. I used all of the collected combine stats, as well as height, weight, and position, to assist with imputing the missing results. I found this gave reasonable results for my purposes; naturally, the more complete the data, the more accurate the imputations would be.
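For readers who prefer Python, here is a rough single-imputation analogue of this step using scikit-learn's IterativeImputer, which is modeled after MICE; the column names are placeholders for the stats I collected:

```python
# Single-imputation analogue of the mice step, using scikit-learn's
# IterativeImputer (modeled after MICE). Column names are placeholders
# for the combine stats actually collected.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

players = pd.read_csv("drafts_2008_2017.csv")

# Combine results plus height and weight; position is categorical, so
# one-hot encode it before letting it inform the imputation.
combine_cols = ["Forty", "Vertical", "Bench", "BroadJump",
                "Cone", "Shuttle", "Height", "Weight"]
features = pd.concat(
    [players[combine_cols], pd.get_dummies(players["Pos"])], axis=1)

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(features),
                       columns=features.columns)
players[combine_cols] = imputed[combine_cols]
```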
Analysis and Conclusions
All of the data I collected is stored in my GitHub repository for this project. At the time of posting, it does not hold all of my code, and the code is certainly not optimized; however, the data I used is all correct.
After scraping the data, I moved on to analysis. I decided on Python with Jupyter for my machine learning work, as that is what I am most comfortable with. Since all of the players in the data set were drafted, I had a well-defined pick and round for each one. Additionally, the round a player is drafted in generally has a roughly linear relationship with the perceived quality of the player; as such, it seemed feasible to predict the round and use it as the target variable. That is what I ended up doing.
At first, I attempted to fit and predict the round using a simple linear regression, with the goal of minimizing the round prediction error for the 2017 draft. I then settled on a weighted combination of XGBoost, random forest, and linear regression models, in 0.4-0.4-0.2 proportions (a sketch of this blend follows the results table below). The mean absolute error between the predictions and the actual rounds was about 1.4, and the model's top 32 players included 15 of the first 32 actual draft picks, as well as 17 of the first 34. The results for the 2018 draft were as follows:
Rank | Name |
---|---|
1 | Kolton Miller |
2 | Mike McGlinchey |
3 | Saquon Barkley |
4 | David Bright |
5 | Terrell Edmunds |
6 | Harold Landry |
7 | Lamar Jackson |
8 | Bradley Chubb |
9 | Justin Reid |
10 | Derwin James |
11 | Josh Allen |
12 | Martinas Rankin |
13 | Joseph Noteboom |
14 | Harrison Phillips |
15 | D.J. Moore |
16 | Frank Ragnow |
17 | Dane Cruikshank |
18 | Avonte Maddox |
19 | Riley Ferguson |
20 | Sam Darnold |
21 | Troy Apke |
22 | Jaire Alexander |
23 | Minkah Fitzpatrick |
24 | Darius Phillips |
25 | Kylie Fitts |
26 | D.J. Chark |
27 | Taven Bryan |
28 | Malik Jefferson |
29 | Marquis Haynes |
30 | B.J. Hill |
31 | Godwin Igwebuike |
32 | Josh Rosen |
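To make the modeling setup concrete, here is a minimal sketch of the blend mentioned above; the feature columns, hyperparameters, and the 2008-2016 train / 2017 test split are simplified stand-ins:

```python
# Minimal sketch of the 0.4-0.4-0.2 weighted blend. Feature columns,
# hyperparameters, and the train/test split are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

train = players[players["DraftYear"] < 2017]
test = players[players["DraftYear"] == 2017]

# Assumes the remaining columns are numeric (imputed combine stats,
# height/weight, one-hot positions, college production, etc.).
feature_cols = [c for c in players.columns
                if c not in ("Player", "Round", "Pick", "DraftYear")]
X_train, y_train = train[feature_cols], train["Round"]
X_test, y_test = test[feature_cols], test["Round"]

blend = [
    (XGBRegressor(n_estimators=200, random_state=0), 0.4),
    (RandomForestRegressor(n_estimators=200, random_state=0), 0.4),
    (LinearRegression(), 0.2),
]

pred = np.zeros(len(X_test))
for model, weight in blend:
    model.fit(X_train, y_train)
    pred += weight * model.predict(X_test)

print(mean_absolute_error(y_test, pred))  # roughly 1.4 in my runs
```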
These predictions seem reasonable in most cases; in particular, top-caliber prospects such as Saquon Barkley and Bradley Chubb appear high on the list. Curiously, three of the top five ranked players are offensive linemen. Two of them are slated to be first-round picks, but the third is projected to be taken late in the seventh round or to go undrafted. This is probably a byproduct of the lack of meaningful stats for offensive linemen, and there isn't much that can be done to fix this discrepancy. In the end, it appears these projections would have accuracy similar to those for the 2017 draft.
There are a few other surprisingly high-ranked prospects, which could have multiple causes. First, the imputations may be inaccurate, which could inflate some prospects' grades. The regression also may not properly account for level of competition, causing players who put up big statistics against weaker competition to be ranked higher. Looking into these possible sources of discrepancy would be one of the first things I would do to improve my results.
As an aside, these predictions also apply to rounds other than the first. Here, for example, are the five players the model deems most likely to be picked in the seventh round:
Rank | Name |
---|---|
1 | Davin Bellamy |
2 | Keishawn Bierria |
3 | Darius Phillips |
4 | Greg Stroman |
5 | Nick Gates |
Next Actions
Overall, I would say that I was able to achieve some notable results from this analysis. Many of the consensus "top" prospects peppered the upper echelon of my rankings. There was, of course, quite a bit of room for improvement. I alluded to a few ways to do so earlier, such as improving the way I collect data to eliminate inconsistencies in the imputation. Naturally, collecting more data would also make the model more accurate; my dataset currently contains no information about undrafted prospects, which could slightly skew the predictions.

There were also alternative paths I could have taken with my analysis. It would have been relatively simple to predict the probability of a player going in the first round: I could change the "Round" feature to a binary feature indicating whether the player was drafted in the first round or not, and then run some version of a logistic regression (sketched below). There is also certainly more room for feature analysis and selection within this specific regression. The process of collecting NFL draft data could also prove useful if I decide to do further NFL analyses; for example, I could try to predict NFL success given this data. All in all, I found this project to be an interesting foray into web scraping and applied machine learning, and I hope to build on it more in the future.
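As a final illustration, here is a minimal sketch of that first-round variant, reusing the hypothetical split from the blend sketch above:

```python
# Sketch of the first-round classification variant: turn "Round" into a
# binary label and fit a logistic regression. Reuses the hypothetical
# X_train/X_test split from the blend sketch above.
from sklearn.linear_model import LogisticRegression

y_train_bin = (train["Round"] == 1).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train_bin)

# Estimated probability of each test player going in the first round.
first_round_prob = clf.predict_proba(X_test)[:, 1]
```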