Project by Matej Frnka & Daniel Workinn
This blog post describes what it took to build and run a serverless ML application that predicts the results of upcoming football games and tries to make money by betting on the predicted winners. The focus is on the overall architecture rather than on the data science of beating the bookmakers.
You can find all the code behind this serverless ML app in this public GitHub repository. Don’t forget to give it a star ⭐
The overall MLOps solution is based on 4 pipelines:
- an initial backfill pipeline that scrapes all historical match data,
- a training pipeline that trains the model and registers it,
- a scheduled pipeline that scrapes new results and predicts upcoming games,
- a user interface that displays the predictions.
To build this serverless ML application, we used the following managed services for MLOps:
- Modal for running and scheduling compute,
- Hopsworks for the feature store and model registry,
- Streamlit for the user interface.
As you can see in the diagram below, we kept each part of the application self-contained so that every component can be changed and run independently.
All of the services mentioned above have very generous free tiers, so we can only recommend them for personal “messing around with things” type projects. Modal gives you $30 of compute to spend every month, Hopsworks gives you 25 GB of storage, and Streamlit lets you host a free website for as long as you like.
This initial step gets everything running. It requires a little manual input, but it only needs to run once.
First, as any good object-oriented project should, we wrote a class that scrapes a given league and saves the results as a dataframe to a Parquet file. We won’t go into the implementation details in this blog.
Then, we ran an instance of this class in Modal for every league and country. We mounted a Modal persisted volume to every instance to store all the output files.
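A fan-out like this might look roughly as follows with Modal’s Stub API of the time. This is a sketch, not the project’s actual code: the `LeagueScraper` stand-in, the volume name, the mount path, and the league list are all our assumptions.

```python
import modal
import pandas as pd

stub = modal.Stub("football-scraper")
# persisted volume shared by all instances, mounted at /data in each container
volume = modal.SharedVolume().persist("football-data")

class LeagueScraper:
    """Stand-in for the project's scraper class (implementation not shown in the post)."""
    def __init__(self, country: str, league: str):
        self.country, self.league = country, league

    def scrape(self) -> pd.DataFrame:
        return pd.DataFrame()  # the real class returns one row per match

@stub.function(shared_volumes={"/data": volume}, timeout=60 * 60)
def scrape_league(country: str, league: str):
    # each container scrapes one league and writes its Parquet file to the volume
    df = LeagueScraper(country, league).scrape()
    df.to_parquet(f"/data/{country}_{league}.parquet")

@stub.local_entrypoint()
def main():
    # one container per (country, league) pair, fanned out with starmap
    pairs = [("england", "premier-league"), ("spain", "la-liga")]
    list(scrape_league.starmap(pairs))
```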
We then monitored everything in the Modal dashboard.
Even with this parallelization, the step took almost a day to complete. It could have been sped up further by also parallelizing the scraping within individual leagues, since some leagues were scraped almost instantly while others took much longer.
After everything finished running, we processed the data and uploaded it all to Hopsworks. All we needed was a couple of lines, and we could store our dataframe persistently and access it from anywhere:
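The upload boils down to something like the following sketch. The feature group name `fg_football` comes from later in this post; the primary key column and the processed dataframe `df` are assumptions.

```python
import hopsworks

project = hopsworks.login()  # reads or prompts for a Hopsworks API key
fs = project.get_feature_store()

# create the feature group on first run, fetch it on later runs
fg = fs.get_or_create_feature_group(
    name="fg_football",
    version=1,
    primary_key=["match_id"],  # hypothetical key column
    description="Scraped historical football matches",
)
fg.insert(df)  # df is the processed pandas dataframe
```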
Hopsworks automatically uses the pandas dataframe’s dtypes to set the column types. You can also specify the data types directly. Unfortunately, only numpy dtypes are supported at the time of writing. This limits you a little if you want to store null values, because numpy doesn’t support nulls for some data types, such as booleans. Luckily, support for pandas dtypes in Hopsworks is coming in the near future.
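For example, a boolean column with missing values has no plain numpy representation: pandas falls back to `object`, and a common workaround is to cast to float so nulls become `NaN`:

```python
import pandas as pd

# a boolean column with a missing value cannot use numpy's bool dtype
s = pd.Series([True, False, None])
print(s.dtype)  # object, not bool

# workaround: cast to float64 so True/False become 1.0/0.0 and None becomes NaN
s_float = s.astype(float)
print(s_float.dtype)  # float64
print(s_float.isna().sum())  # 1
```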
To train a model, we download all data from Hopsworks:
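Reading the data back is symmetric to the upload; a minimal sketch, assuming the `fg_football` feature group from before:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# read the whole feature group back as a pandas dataframe
fg = fs.get_feature_group("fg_football", version=1)
df = fg.read()
```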
And then we train the model using TensorFlow and upload it to the Hopsworks model registry. This is done by first saving the model locally and then uploading the local folder like so:
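A sketch of the registration step, assuming a trained Keras `model` and a computed `test_accuracy`; the registered model name `football_predictor` is our placeholder:

```python
import hopsworks

# save the trained keras model to a local folder first
model.save("football_model")

project = hopsworks.login()
mr = project.get_model_registry()

# register the local folder together with evaluation metrics
hw_model = mr.tensorflow.create_model(
    name="football_predictor",          # hypothetical name
    metrics={"accuracy": test_accuracy},
)
hw_model.save("football_model")  # uploads the folder to the registry
```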
Notice that we also upload metrics. We later use them to download either the best-performing model or the newest one. Usually you would just want the best-performing one, but since our test data changes over time, we can’t rely entirely on the performance metric, so we use the newest model instead: a two-year-old model may have performed better on test data from three years ago, but not necessarily on today’s data.
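Picking the newest registered version is just a `max` over the model entries. A small sketch with stand-in objects (with Hopsworks, the list would come from `mr.get_models("football_predictor")`, and `mr.get_best_model(name, metric, direction)` gives the best-performing one):

```python
from types import SimpleNamespace

def pick_newest(models):
    """Return the model entry with the highest registry version."""
    return max(models, key=lambda m: m.version)

# stand-ins for registry entries, which expose a .version attribute
demo = [SimpleNamespace(version=1), SimpleNamespace(version=3), SimpleNamespace(version=2)]
print(pick_newest(demo).version)  # 3
```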
Continuous scraping wasn’t too different from the initial scrape; we changed our scraper to start from the newest matches and stop once it reaches a date we already have.
We also made our predictions for the upcoming games. To do so, we simply downloaded the model from Hopsworks and used it to make predictions:
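A sketch of that inference step, assuming the `football_predictor` model name from our earlier examples and a dataframe `upcoming_features` with features for the upcoming fixtures:

```python
import hopsworks
import tensorflow as tf

project = hopsworks.login()
mr = project.get_model_registry()

# fetch the newest registered version and load it with keras
newest = max(mr.get_models("football_predictor"), key=lambda m: m.version)
model = tf.keras.models.load_model(newest.download())

# upcoming_features: features for the upcoming games (assumed to exist)
predictions = model.predict(upcoming_features)
```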
We saved our newly scraped data directly to Hopsworks: our predictions went to a new feature group and the new results to the "fg_football" feature group, using the exact same code as for creating the feature group. Super simple!
To run the app periodically, we used Modal’s scheduled runs by updating the function annotation:
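With the Modal Stub API of the time, scheduling is a single `schedule` argument on the function decorator; the function name here is our placeholder:

```python
import modal

stub = modal.Stub("football-scraper")

# run once a day; modal.Cron("0 6 * * *") would pin it to a specific time instead
@stub.function(schedule=modal.Period(days=1))
def scrape_and_predict():
    ...  # scrape new results, update feature groups, predict upcoming games
```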
And then deployed it to Modal using `modal deploy scrape.py`.
The final step is to show our predictions to the user. All that was needed was to download the data from the predictions feature group, the same way as shown before, and display it in a Streamlit table.
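The whole app can be sketched in a few lines; the feature group name `fg_predictions` and the page title are our assumptions:

```python
import hopsworks
import streamlit as st

st.title("Football match predictions")  # hypothetical title

project = hopsworks.login()
fs = project.get_feature_store()

# read the predictions feature group and render it as a static table
predictions = fs.get_feature_group("fg_predictions", version=1).read()
st.table(predictions)
```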