
Can We Beat the Bookmaker with Machine Learning?

Source: https://towardsdatascience.com/can-we-beat-the-bookmaker-with-machine-learning-45e3b30fc921?gi=3c492095140f

Predicting profitable soccer bets with an LSTM model


Photo by Erik Mclean on Unsplash

Note from the editors: This article is for educational and entertainment purposes only. If you want to use the presented model for real bets, do so at your own risk. Please make sure that this is in alignment with the terms and conditions of your bookmaker.

With the outbreak of the pandemic and the corresponding shutdown of the economy, millions of people unfortunately lost their jobs. Desperate times call for desperate measures, and we might be interested in creating new, unconventional sources of income. What about using machine learning to predict soccer game results and thus assist us in placing profitable bets?

In this article, we use a simple Long Short-term Memory (LSTM) model to predict soccer results. We build a program that iteratively trains the model on the previous rounds before predicting the outcomes of the next round.

You can follow along with this Google Colab notebook.

LSTMs

If you don’t know what LSTM models are, I highly recommend Michael Phi’s outstanding article explaining them.

In a nutshell, LSTMs are a special kind of Recurrent Neural Network and are often used for sequence modeling (e.g. natural language processing) or time series prediction (e.g. stock prices, demand, temperature). Since the soccer games of a season form a sequence of data, and a model could use previous events to predict future ones, we give an LSTM betting model a shot.

Getting the data

The foundation of every machine learning model is the data. Fortunately, we can easily google datasets containing game scores and betting odds. For our model, we will use the data of the 2018/2019 Bundesliga season, which can be downloaded from here.

Since the downloaded CSV contains a lot of data we don’t need for our model, let’s simply drop the unnecessary columns. In my GitHub repository you will find a reduced dataset including only the date of the game, the teams, the goals and the full-time result as well as the Bet365 odds (home win, draw and away win). Let’s take a look at the first five rows.

[Table: first five rows of the reduced dataset]
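As a rough sketch, loading and reducing the data could look like the following. The column names follow the football-data.co.uk convention (FTHG/FTAG for full-time home/away goals, FTR for the full-time result, B365H/D/A for the Bet365 odds); the filename is an assumption:

```python
import pandas as pd

# Load the raw 2018/2019 Bundesliga data (football-data.co.uk format assumed)
df = pd.read_csv("D1.csv")

# Keep only the columns the model needs: date, teams, goals,
# full-time result and the Bet365 odds (home win, draw, away win)
df = df[["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR",
         "B365H", "B365D", "B365A"]]

print(df.head())
```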

Data preparation

Instead of building our model to predict the full-time result and neglecting the goal difference, we want it to predict the difference between the goals scored by the home and the away team. Further, we want to organise training and prediction by rounds, e.g. train the model on the first 9 rounds and predict the results of the 10th round to place bets, and so on.
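A minimal sketch of these two steps, assuming the rows are ordered chronologically; the GoalDiff and Round column names are my own:

```python
# Target: goal difference from the home team's perspective
df["GoalDiff"] = df["FTHG"] - df["FTAG"]

# With 18 Bundesliga teams, every round consists of 9 games; assuming the
# CSV is ordered chronologically, we can assign a round number per game
GAMES_PER_ROUND = 9
df["Round"] = df.index // GAMES_PER_ROUND
```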

Since neural networks require tensors of floating-point data or integers, we cannot simply use the teams’ names as input for our model (Chollet, 2018, p. 101). To make the teams usable as input, we assign each team a unique integer value. For this, we create a team name vocabulary and represent every team by its index in that vocabulary. For the actual training, we will feed these indices into an embedding layer, which is explained below.
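A minimal sketch of such a vocabulary (the HomeId/AwayId column names are my own):

```python
# Build the team vocabulary: each team name maps to a unique integer id
teams = sorted(set(df["HomeTeam"]) | set(df["AwayTeam"]))
team_to_id = {team: i for i, team in enumerate(teams)}

df["HomeId"] = df["HomeTeam"].map(team_to_id)
df["AwayId"] = df["AwayTeam"].map(team_to_id)
```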

Chollet (2018, p. 101) also suggests normalising the data (all values should be small and in the same range) to facilitate the training process for the network. Since we predict the difference between the goals scored by the home and the away team, and these values can realistically range from -5 (0:5) to 5 (5:0) or even beyond, we should definitely scale the data.
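One way to do this, sketched with scikit-learn’s StandardScaler (the article may well use a different scaler); we keep the fitted scaler around to invert the predictions later:

```python
from sklearn.preprocessing import StandardScaler

# Scale the goal differences to small values centred around zero
scaler = StandardScaler()
df["GoalDiffScaled"] = scaler.fit_transform(df[["GoalDiff"]]).flatten()
```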

After preparing the data, the first five rows should look like this:

[Table: first five rows of the prepared dataset]

Modeling

As mentioned above, the model should be trained on the previous rounds and then make predictions for the next round’s games. To create distinct feature (X) and label (y) sets, I modified Venelin Valkov’s create_dataset function from his demand prediction model to meet our requirements. If you are interested in further applications of LSTMs, I definitely recommend his article.
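The author’s modified function isn’t reproduced here, but as a rough stand-in, a round-based split using the column conventions from the earlier snippets might look like this:

```python
def create_dataset(df, predict_round):
    """All rounds before `predict_round` become the training set,
    `predict_round` itself becomes the test set."""
    train = df[df["Round"] < predict_round]
    test = df[df["Round"] == predict_round]

    X_train = train[["HomeId", "AwayId"]].to_numpy()
    y_train = train["GoalDiffScaled"].to_numpy()
    X_test = test[["HomeId", "AwayId"]].to_numpy()
    y_test = test["GoalDiffScaled"].to_numpy()
    return X_train, y_train, X_test, y_test
```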

Let’s suppose we want to observe the first five rounds of the season before making any bets. The data of these observation rounds is then used to make some adjustments to the model’s architecture. Specifically, we want to figure out how much training the model needs before it overfits.

[Chart: training and validation loss over epochs]

As we can see in the chart above, the training loss decreases for around 20 epochs before the model seems to start overfitting. Thus, we will later train the model for 20 epochs before making our predictions.
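To reproduce such a chart, one could fit on the observation rounds with a validation split and plot the Keras history. This is a sketch, assuming a compiled model like the one shown further below in the modeling section:

```python
import matplotlib.pyplot as plt

# Fit on the observation rounds and record the loss curves
history = model.fit(X_train, y_train, epochs=100,
                    validation_split=0.2, verbose=0)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```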

Note: You might see different results when you are running the model. Unfortunately, Keras’ suggestions to make the model reproducible do not seem to work with Google Colab.

Creating the program

In order to set up a program to iteratively train on the previous rounds, predict the outcomes of the next round’s games and then place the bets, we should split the whole process into pieces and define functions for each step.

  • Data selection: selecting the previous rounds for training and the current round for prediction
  • Modeling, training and prediction: functions to create and train the model, make predictions and format the output
  • Betting: choosing the games to bet on based on the model’s predictions
  • Putting it all together: iterative training and prediction of rounds and compilation of the results

Data selection

Since we loop over the dataset and train the model on the rounds before the round we want to predict, we have to recreate the dataset after every round. We use the create_dataset function from above to create a train and test set for every iteration. Since we need to return multiple values, let’s put them into a dictionary.
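A sketch of this step, building on the create_dataset function from above (the dictionary keys are my own):

```python
def get_round_data(df, current_round):
    """Recreate the train/test split for the round to predict and
    return everything the later steps need in one dictionary."""
    X_train, y_train, X_test, y_test = create_dataset(df, current_round)
    return {
        "X_train": X_train, "y_train": y_train,
        "X_test": X_test, "y_test": y_test,
        "test_rows": df[df["Round"] == current_round].copy(),
    }
```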

Modeling, training and prediction

Above, I already outlined how we are going to model our LSTM, and you might be wondering what an Embedding layer is and why we need it. According to Chollet (2018, p. 186), this layer can be understood as a dictionary that maps integer indices (i.e. team indices) to dense vectors, which can then be used as input for the model. We need it because otherwise the model might misinterpret our rather meaningless team indices (from 0 to 17) as numeric magnitudes.

I used fairly arbitrary values for the hyperparameters and definitely encourage you to test different combinations in order to optimise the model’s performance.
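A minimal sketch of such a model in Keras; the embedding and LSTM sizes below are placeholders, not the author’s actual hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_TEAMS = 18        # size of the Bundesliga team vocabulary
EMBED_DIM = 8       # placeholder hyperparameter
LSTM_UNITS = 32     # placeholder hyperparameter

def build_model():
    model = keras.Sequential([
        # Map each team id (0..17) to a dense vector; every sample
        # is the pair (home id, away id), hence input_length=2
        layers.Embedding(input_dim=N_TEAMS, output_dim=EMBED_DIM,
                         input_length=2),
        layers.LSTM(LSTM_UNITS),
        layers.Dense(1),   # predicts the scaled goal difference
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```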

Since we let the model predict scaled game scores, we have to inverse-transform these predictions to get them back into the original scale. To create a new data frame containing the model’s predictions, we need to reshape the inverse-transformed game scores. Finally, we stack the actual and predicted scores into a dictionary that is then returned by the function.
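A sketch of the inverse transformation, reusing the scaler fitted earlier and the dictionary returned by get_round_data:

```python
# Bring the scaled predictions back to real goal differences
y_pred_scaled = model.predict(data["X_test"])
y_pred = scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
y_true = scaler.inverse_transform(data["y_test"].reshape(-1, 1)).flatten()

scores = {"actual": y_true, "predicted": y_pred}
```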

Betting

Based on the model’s predictions, we want to place our bets. I am definitely no betting expert, but I would suggest betting only when the model is quite confident that either the home or the away team is going to win (|predicted goal difference| ≥ 1.1).

The calculate_bets function creates two lists, one containing the bet units we invested (1 if we bet on a game, otherwise 0) and another one containing the winnings. The winnings are 0 for games we did not consider for betting or where we placed a losing bet (e.g. a predicted home win that ended in a draw). If the model correctly predicts either a home or an away win, the winnings for this game equal the corresponding odds for that outcome.
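A sketch of such a function; the threshold and column names follow the earlier snippets:

```python
def calculate_bets(round_df, threshold=1.1):
    """Return per-game invested bet units and winnings."""
    invested, winnings = [], []
    for _, game in round_df.iterrows():
        pred, result = game["predicted"], game["FTR"]
        if pred >= threshold:        # model expects a home win
            invested.append(1)
            winnings.append(game["B365H"] if result == "H" else 0)
        elif pred <= -threshold:     # model expects an away win
            invested.append(1)
            winnings.append(game["B365A"] if result == "A" else 0)
        else:                        # not confident enough, no bet
            invested.append(0)
            winnings.append(0)
    return invested, winnings
```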

Before we can create a loop that puts everything together and predicts every round of the season, we create a new data frame based on the test set (one round of 9 games) and add the inverse-transformed actual and predicted scores as well as columns for the invested and won bet units.

Putting it all together

Now we can put all the pieces from above together and define a function that loops over the rounds, creates the train and test sets as well as the model, trains the model, predicts the results and finally compiles them into a data frame.
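Combining the earlier sketches, the loop could look roughly like this (round numbering is 0-based, with the first five rounds reserved for observation):

```python
def predict_season(df, first_round=5, last_round=33):
    rounds = []
    for current_round in range(first_round, last_round + 1):
        data = get_round_data(df, current_round)
        model = build_model()   # fresh model for every round
        model.fit(data["X_train"], data["y_train"], epochs=20, verbose=0)

        round_df = data["test_rows"]
        preds = model.predict(data["X_test"]).reshape(-1, 1)
        round_df["predicted"] = scaler.inverse_transform(preds).flatten()

        round_df["invested"], round_df["winnings"] = calculate_bets(round_df)
        rounds.append(round_df)
    return pd.concat(rounds, ignore_index=True)

season = predict_season(df)
```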

Let’s beat the bookie

Having put all the pieces together, it is time for us to beat the bookmakers!

As I already mentioned above, the model cannot produce reproducible results, which means that your results might differ from mine.

My model placed a total of 92 bets, meaning that it invested 92 bet units. The model is probably a far better game predictor than I am and made winnings of 94.71 bet units, so the net winnings amount to 2.71 bet units 🔥

In order to analyse how our model performed in each round and see the winnings over time, let’s plot the data.
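A sketch of such a plot, using the season data frame returned by predict_season above:

```python
import matplotlib.pyplot as plt

# Sum invested and won units per round and accumulate the net result
per_round = season.groupby("Round")[["invested", "winnings"]].sum()
per_round["net"] = (per_round["winnings"] - per_round["invested"]).cumsum()

plt.plot(per_round.index, per_round["net"])
plt.xlabel("round")
plt.ylabel("cumulative net bet units")
plt.show()
```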

[Chart: bet units invested and won per round]

We can see that the 9th, 20th and 21st rounds were among the worst, since we did not win a single bet in them. To improve the model, it would definitely make sense to look further into these rounds and figure out what might have led to these results. Perhaps the clear favourites were defeated by the underdogs, or our model simply made bad predictions.

Conclusion

Congratulations, you just beat the bookmaker. Well, at least hypothetically, in 2018/2019.

We both know that the model is far from perfect, but it is a simple approach that seemed to work for the 2018/2019 Bundesliga season. Along the way, we worked with some data, built an LSTM model with Keras and plotted the results. I am sure the model could also be applied to other leagues and even to different sports such as ice hockey, basketball or American football.

If you have any suggestions for additional features or model architectures that might improve the model, please let me know. I am considering extending the current model with more advanced features in a future article.

Thank you very much for reading!

Stay safe and happy coding!

References

Chollet, F. (2018). Deep Learning with Python. New York: Manning Publications Co.

