How to predict the winner of the race that stops a nation (and other horse races)

We don’t want to stirrup trouble, but we’ve got a hot-to-trot model for predicting the 2020 Melbourne Cup Winner

OK, time to rein in the puns. WARDY IT Solutions have created a tried (but not yet fully tested) Melbourne Cup prediction model. In a bid for financial security, members of the WARDY IT Solutions team put together a strategy for predicting the winner of the Melbourne Cup and any other horse race where the prize money is high.

So what is the secret to predicting the winner?

Data Gathering

The first phase of the model required researching all factors that could affect a win (or be deemed likely to affect a win), specifically:

Historical data on all Cup winners since the inaugural event in 1861 was collected;
Key information about the 2019 runners (including weight, age, handicap, gender and pedigree) was collated in a spreadsheet;
Wagering information noting the favourites and rank outsiders was added to the spreadsheet;
Prior wins and the locations of these wins were added to the spreadsheet, along with each horse's country of origin;
Statistics for participating trainers and their track records were then added;
Information on the barrier drawn by each horse, along with the barriers drawn by historical winners, was included; and
Statistics on each nominated jockey (including age, weight and previous wins) were entered.

The above information was consolidated into a detailed dataset from which to predict the winner of the famous Cup.

Ranking algorithm

In much the same way that Google applies a page ranking algorithm, the participating horses were ranked on:

Barrier advantage;
Official ranking of the horse;
Group wins;
People’s rating;
Weight; and
Age.

The ranking used the following criteria:

Sorting Order	Criteria	Sample Data	Ascending/Descending
1	Barrier advantage	Y/N	Descending
2	Official Ranking of the horse	1, 2, 3	Ascending
3	Group Wins	1, 2, 3	Ascending
4	People’s Rating	4.5, 5	Descending
5	Weight	52.5kg, 54kg	Ascending
6	Age	4,5	Ascending
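
The sorting order in the table can be sketched as a multi-key sort in Python. This is only an illustration, not the team's implementation (the querying was done in Power BI). The Horse class, its field names and the sample horses are hypothetical, and the sketch follows the table literally, including the ascending sort on group wins.

```python
from dataclasses import dataclass


@dataclass
class Horse:
    # Hypothetical record; field names are assumptions, not the spreadsheet's columns.
    name: str
    barrier_advantage: bool   # Y/N in the table
    official_ranking: int
    group_wins: int
    peoples_rating: float
    weight_kg: float
    age: int


def ranking_key(horse):
    """Build a tuple so that a plain ascending sort applies all six criteria in order."""
    return (
        not horse.barrier_advantage,  # descending Y/N: horses with an advantage come first
        horse.official_ranking,       # ascending
        horse.group_wins,             # ascending, as listed in the table
        -horse.peoples_rating,        # descending
        horse.weight_kg,              # ascending
        horse.age,                    # ascending
    )


def rank_horses(horses):
    """Return the horses ordered from best to worst under the criteria."""
    return sorted(horses, key=ranking_key)


# Made-up sample data for illustration only.
horses = [
    Horse("Horse A", False, 1, 2, 5.0, 54.0, 5),
    Horse("Horse B", True, 3, 1, 4.5, 52.5, 4),
    Horse("Horse C", True, 2, 3, 5.0, 54.0, 5),
]

ranked = rank_horses(horses)
print([h.name for h in ranked])
```

Because tuples compare element by element, later criteria only matter when every earlier criterion is tied, which matches the idea of a sorting order.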

Likewise, the participating jockeys were ranked, using a similar technique, based on:

Barrier advantage;
Jockey’s career wins;
Jockey’s group wins; and
Chances of winning (%) based on people’s input.

Technical Implementation

The next step was to connect the data. This was done by using Power BI to connect to the spreadsheet as a data source. Three datasets were then created in the Power BI file, linked to one another by one-to-one relationships. These were:

Horses
Past Cup Winners
Jockeys

Using the predetermined criteria, the data was then queried.

The theory was that any horse appearing in the top five of every dataset would be considered the predicted winner.
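
The overlap idea can be sketched in Python as a set intersection over the top five of each ranking. This is a rough illustration under assumptions: each ranking is taken to be a list of horse names ordered best first (with the jockey ranking expressed by the horse each jockey rides), and the function name and sample names are hypothetical.

```python
def overlap_candidates(*rankings, n=5):
    """Return the names found in the top n of every ranking.

    Results keep the order of the first ranking. An empty list means
    there was no overlap, so no winner can be predicted this way.
    """
    if not rankings:
        return []
    common = set(rankings[0][:n])
    for ranking in rankings[1:]:
        common &= set(ranking[:n])
    return [name for name in rankings[0][:n] if name in common]


# Made-up rankings for illustration only.
horse_ranking = ["Horse C", "Horse B", "Horse A", "Horse D", "Horse E", "Horse F"]
jockey_ranking = ["Horse F", "Horse B", "Horse G", "Horse C", "Horse H", "Horse A"]

candidates = overlap_candidates(horse_ranking, jockey_ranking)
print(candidates)
```

With sparse or incomplete data the intersection can easily come back empty or contain more than one horse, so a tie-break rule would still be needed.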

So unfortunately, the model wasn’t ready for action in 2019… but there’s always next year!

Perfecting the model for 2020

For the 2020 Cup there are several areas where data could be improved to yield the necessary overlap and predict the winner. The gaps in information include:

The historical data for past races was incomplete because of the limited archives available online
Historical pedigree details weren’t available to add to the data
Not all the participating horses had an official ranking yet (some were overseas competitors)
The weather on the day and the condition of the racecourse were not considered or added to the data; both should be included to improve the accuracy of the prediction

Because the data used for the 2019 race was sparse, it was difficult to leverage Azure’s machine learning, which created a barrier for the model. Improved and more detailed data will help overcome this obstacle and improve the prediction for 2020.

All jokes aside, can we actually predict the winner?

Provided that all the additional data can be sourced, there could (maybe) be a few ways that machine learning or AI can predict the winner:

Machine learning: A model could be built so that Azure services can access and learn from all the data fed into it. It would then be easier for Azure to predict the winning horse/jockey.
Reproducing manual predictive analysis steps in Azure: If the technical implementation steps are replicated within the Azure environment using a sorting algorithm (for example, pigeonhole sort written in Python), that could be another way of coming up with the winning horse/jockey.
Pattern recognition: A pattern may have developed over the past 158 years, and if AI can detect it, that could be another way of predicting the winning horse/jockey.
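
For the curious, the pigeonhole sorting algorithm mentioned above could look something like the following in Python. This is a generic sketch of the algorithm with nothing Azure-specific in it; the function name, the (name, official ranking) tuples and the sample data are hypothetical. Pigeonhole sort needs integer keys in a small range, so a value such as weight in half kilograms would have to be scaled to whole numbers first.

```python
def pigeonhole_sort(items, key):
    """Sort items by an integer key using pigeonhole sort.

    Each item is dropped into the hole matching its key, then the holes
    are read back in order. Items with equal keys keep their input order.
    """
    if not items:
        return []
    keys = [key(item) for item in items]
    lowest, highest = min(keys), max(keys)
    # One hole per possible key value between the lowest and highest key.
    holes = [[] for _ in range(highest - lowest + 1)]
    for item, k in zip(items, keys):
        holes[k - lowest].append(item)
    return [item for hole in holes for item in hole]


# Made-up (name, official ranking) pairs for illustration only.
runners = [("Horse A", 3), ("Horse B", 1), ("Horse C", 2), ("Horse D", 1)]

by_ranking = pigeonhole_sort(runners, key=lambda runner: runner[1])
print(by_ranking)
```

The approach suits small integer ranges such as official rankings or ages; for mixed criteria like those in the ranking table, an ordinary multi-key sort is simpler.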

So the answer? The odds might be stacked against the predictor model but at the end of the day, if you want to win the race, you’ve got to be in the race. Watch this space…