
Let Data Improve Your Tennis Game

source link: https://mc.ai/let-data-improve-your-tennis-game/

Data Cleaning

The first step in preprocessing was to deal with columns containing null or zero values and then combine all the individual datasets into one large dataset. Since the datasets shared many of the same features, this was straightforward.
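Assuming the yearly files share column names (the real filenames and full schema differ), the merge step can be sketched with pandas:

```python
import pandas as pd

# Stand-ins for the yearly match files; the real data has many
# more columns, but each year shares the same schema.
year_a = pd.DataFrame({"winner_rank": [1, None], "loser_rank": [3, 5]})
year_b = pd.DataFrame({"winner_rank": [2, 4], "loser_rank": [None, 8]})

# Shared columns make row-wise concatenation straightforward.
matches = pd.concat([year_a, year_b], ignore_index=True)

# Inspect missingness per column before deciding what to drop.
null_counts = matches.isnull().sum()
```

`pd.concat` aligns on column names, so any column missing from one year simply fills with NaN rather than raising an error.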

Figure 2 : Pandas Profiling report of some of the columns with missing values or high cardinality

The second step was to remove leakage and bias from the dataset. The original data labelled columns such as “total points won” and “sets won” according to whether they belonged to the winning or the losing player, so it was necessary to drop some columns to prevent leakage and to relabel others to avoid biasing the model. To do so, I randomly assigned player1 to either the winner or the loser, and player2 to the other player, so that player1 was the winner roughly half the time.
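A minimal sketch of that random relabelling, with hypothetical column names (`winner_rank`/`loser_rank` stand in for the full set of winner/loser columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "winner_rank": [1, 4, 2, 10],
    "loser_rank": [8, 3, 7, 1],
})

# Flip a fair coin per match: when True, player1 is the winner.
rng = np.random.default_rng(0)
flip = rng.random(len(df)) < 0.5

df["player1_rank"] = np.where(flip, df["winner_rank"], df["loser_rank"])
df["player2_rank"] = np.where(flip, df["loser_rank"], df["winner_rank"])
df["player1_won"] = flip.astype(int)  # balanced target, ~50/50 overall

# Drop the leaky originals so the model never sees outcome-tied columns.
df = df.drop(columns=["winner_rank", "loser_rank"])
```

The key point is that the target (`player1_won`) is decided by the coin flip, not by which side of the winner/loser split a row came from.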

The third step was to filter out observations with zero or null values and to keep only those where the rankings of both players were available, since I intuited that ranking would be the strongest predictor. This reduced the dataset from around 10k observations to around 6k. Although this was a dramatic reduction, most of the discarded rows lacked meaningful information anyway.
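The filtering step amounts to a boolean mask over the ranking columns (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "player1_rank": [3.0, None, 7.0, 0.0],
    "player2_rank": [5.0, 2.0, None, 4.0],
})

rank_cols = ["player1_rank", "player2_rank"]

# Keep only matches where both rankings are present and nonzero.
mask = df[rank_cols].notna().all(axis=1) & (df[rank_cols] != 0).all(axis=1)
filtered = df[mask]
```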

Figure 3 : Null values in some of the columns; it turned out these statistics were simply not recorded in those years

Feature Engineering

Some prior machine learning models used only the rankings of the two players to predict match outcome. This makes sense: a ranking captures a player’s performance over the past year and is likely a strong predictor of whether that player can beat a lower-ranked opponent. However, many other types of information might be useful in predicting the outcome of a match, and some of them can even be used for live, in-match prediction. For example, the head-to-head record between two players can be extremely relevant, especially their most recent matches; and after deploying my model I found that winning the first set is a strong indicator of the final result. The quality of a player’s service game and return game is also likely important.

Figure 4 : Relationship between two of the feature-engineered columns

To better capture the nuances of each player, I decided to compute, for each match, the past head-to-head record of the two players and service metrics (service / return scores) from past matches for both players. These data were only available post-1991. To ensure enough history was available, I further restricted the dataset to matches after 2000, so that every match would have at least nine years of prior match data (back to 1991) from which to compute these statistics.
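One way to compute a leak-free head-to-head feature, sketched with made-up players (a real implementation would also sort by match date):

```python
from collections import defaultdict

import pandas as pd

# Matches in chronological order; real data would be sorted by date.
matches = pd.DataFrame({
    "player1": ["A", "B", "A"],
    "player2": ["B", "A", "B"],
    "player1_won": [1, 1, 1],
})

# wins[(p, q)] = number of times p has beaten q so far.
wins = defaultdict(int)
h2h = []
for row in matches.itertuples(index=False):
    # The feature uses only matches strictly before the current one,
    # so the current result never leaks into its own feature.
    h2h.append(wins[(row.player1, row.player2)])
    winner = row.player1 if row.player1_won else row.player2
    loser = row.player2 if row.player1_won else row.player1
    wins[(winner, loser)] += 1

matches["h2h_wins_before"] = h2h
```

Updating the record only *after* appending the feature value is what keeps the current match’s outcome out of its own row.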

The features I computed included aces per point, double faults per point, head-to-head results between the two players, first-serve percentage, second-serve percentage, and so on. I scaled the serve counts by points played to avoid the bias that would arise if I had used, say, the raw number of aces, since one player may have had more opportunities to hit an ace than his opponent. One issue that arose is that some observations involved new players with no prior record of performance. One option was to set all of that player’s statistics to 0, but that would likely bias the results: 0 is the lowest possible value, and a player with no recorded matches should not automatically receive the worst score. I ultimately deleted all observations containing 0s, which is probably not the best solution and is something to revisit in the future.
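The per-point scaling is a simple division by service points played (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "aces": [10, 3, 6],
    "double_faults": [2, 6, 3],
    "service_points": [80, 60, 75],
})

# Raw counts favor players who simply served more points;
# dividing by points played puts both players on the same scale.
df["aces_per_point"] = df["aces"] / df["service_points"]
df["double_faults_per_point"] = df["double_faults"] / df["service_points"]
```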

Finally, most past models combined each player-1 feature and its player-2 counterpart into a single feature. For example, subtracting the two rankings consolidates them into one ranking-difference feature. This has the advantage of producing a symmetric model and halving the feature space, but it also discards information, so I decided not to consolidate any of the features.
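For comparison, the consolidated variant described above would look like this (I kept the separate columns instead):

```python
import pandas as pd

df = pd.DataFrame({"player1_rank": [3, 10], "player2_rank": [7, 2]})

# One symmetric feature instead of two: negative means player1
# holds the better (lower-numbered) ranking.
df["rank_diff"] = df["player1_rank"] - df["player2_rank"]
```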

Predictive Modeling

This is a classification problem (player victory = True/False). To test the models, I first split the data into an 80/20 train-test split, then used 5-fold cross-validation with grid search on the training set to select the best hyperparameters, refit the model with those hyperparameters on the full training set, and finally evaluated it on the held-out test set.
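That evaluation loop can be sketched with scikit-learn on synthetic data standing in for the match features (rankings, head-to-head, serve/return metrics in the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered match features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validated grid search; refit=True (the default)
# retrains the best hyperparameters on the full training set.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```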

The baseline for this problem was the majority-class accuracy of 0.51, meaning that by always guessing the majority class we could predict the winner 51% of the time. I tried a variety of linear and tree-based models, including logistic regression, a single decision tree (useful for spotting leakage), random forest, and XGBoost. XGBoost outperformed the other models by about 1%, reaching a validation accuracy of 96%.
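A sketch of that model comparison on synthetic data, with scikit-learn's GradientBoostingClassifier standing in for XGBoost to keep the example dependency-free:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

models = {
    # Majority-class baseline: any useful model must beat this.
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=1),
    "boosting": GradientBoostingClassifier(random_state=1),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```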

I then used permutation importances to find which features contributed most to my model. The most important feature for predicting the winner was the player’s serve rating, followed by whether the player won the first set. Other important contributors were return rating, break points made, and tiebreaks won.
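Permutation importance is available directly in scikit-learn; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the
# accuracy drop; larger drops indicate more important features.
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0
)
ranked = result.importances_mean.argsort()[::-1]  # most important first
```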

Figure 5 : Permutation importances

I also plotted a single-feature partial dependence plot (fig. 6) to further explore the serve rating feature. Interestingly, serve rating starts to matter for predicting the winner once it surpasses ~230 points, and it gradually improves the chances of winning up to ~330 points. It is worth noting that serve rating loses predictive importance for players rated above ~330 points, since they all have more than a 60% chance of winning the match.

Figure 6 : Single feature partial dependence plot for serve rating

I then investigated how serve rating and break points made could jointly improve match predictions, using a multiple-feature partial dependence plot (fig. 7). With a serve rating above 297 and more than 3 break points made in the match, a player has roughly a 90% chance of winning. Learning this was insightful for me, both for improving my own game and for predicting tennis match winners.

Figure 7 : Multiple feature partial dependence plot for two of the most important features in my model

I also used Shapley plots to briefly explore which features affect the prediction for a single match, positively or negatively, and how strong their impact is. As shown in figure 8, if a player’s serve rating is 178 or lower, he makes only 1 break point, and he loses the first set, we can predict with almost 100% certainty that he will lose the match.

Figure 8 : Shapley plot

Conclusion

Of all the linear and tree-based models I used, XGBoost performed best, delivering 96% accuracy in predicting match results. During the process I learned that some of the most important metrics for predicting the winner of a tennis match are the player’s serve and return ratings, break points made, tiebreaks won, and whether the player won the first set.

I would like to expand this model by identifying the parameters that best predict the winner of the first set, using data derived from the first set itself and the head-to-head history between the players; I will post the results here and in my portfolio. Until then, I am going to put my money where my mouth is: I will try to predict a match live using my findings here, and try to improve my own game by focusing on better serves, stronger returns, converting more break points, winning tiebreaks, and doing my best to win the first set.

