Using data to predict movie ratings
With the Oscars just round the corner (March 12th), it seems fitting to keep on top of the latest movies and see if data can help find the diamonds in the rough!
In my previous article I looked at upcoming sequels, through the lens of which movies would come out on top. A question that kept coming back was “can we predict how non-sequel movies are going to do (rating-wise)?”
Where there is data, we find an answer!
So how did we come up with this short list?
Well, for any new movie coming out this year, we know a fair bit about it via IMDB - the cast, director, writer and producer. We then used Machine Learning techniques to look back in time and test whether any of these tell us if a movie will perform well or not.
It shouldn’t come as a complete shock that the people who “design” the movie (director, writer and producer) have a massive influence on the outcome. We looked at how the last movie from each of the director, writer and producer fared and used those data points to predict this movie’s rating.
How can we use it?
Given the accuracy and simplicity of the model, we can spot in real time where the outliers are, which is the real fun - if a movie is massively over-performing against expectation, then that’s a huge win for the director/writer/producer and, if it’s under-performing, then we know which ones to avoid!
What next?
As the new movies are released throughout the year, people will vote and the data will stay fresh. If you want to keep on top of how the ratings are going, the dashboard below updates automatically once a day with the latest scores as the movies come out.
The Machine Learning techniques have shown that, with a small number of data points, we can get a decent view of movie rating performance. This opens up further analysis opportunities, which we’ll share in future.
Methodology
With any prediction project, you have to find a balance between effort and reward. The Machine Learning process can be a lengthy one, as we have almost infinite possibilities for feature creation and new algorithms and modifications coming out daily to test.
I took to the task by utilising the previous code and dataset, giving me a baseline table of movie features to start with. This time, I time travelled back to the end of 2021, with a focus on movies coming out in 2022.
I expanded the features to include not only movies but also TV shows, shorts and other productions, since TV shows have become extremely high quality over the last few years.
I also added each movie’s director, actors, producer and writers to broaden the movie prediction.
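To make the "last movie" idea concrete, here is a minimal sketch of how such features could be built. It is not the code behind the model: the `Film` record, the `last_rating` and `build_features` helpers and the sample films are all hypothetical, and it assumes each film has a single director, writer and producer. The one thing it takes care over is only looking at films released before the one being scored, so the prediction never sees the future.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Film:
    # Hypothetical record; field names are assumptions, not IMDB's schema.
    title: str
    year: int
    rating: Optional[float]  # None when the film has no rating yet
    director: str
    writer: str
    producer: str


def last_rating(history, role, person, before_year):
    """Rating of the most recent rated film where `person` held `role`,
    looking only at films released before `before_year`."""
    earlier = [
        f for f in history
        if getattr(f, role) == person
        and f.year < before_year
        and f.rating is not None
    ]
    if not earlier:
        return None  # no track record for this person in this role
    return max(earlier, key=lambda f: f.year).rating


def build_features(film, history):
    """One 'last film' rating per creative role."""
    return {
        f"last_{role}_rating": last_rating(history, role, getattr(film, role), film.year)
        for role in ("director", "writer", "producer")
    }


# Made-up films purely for illustration.
history = [
    Film("Film A", 2018, 7.9, "Dir X", "Writer Y", "Prod Z"),
    Film("Film B", 2020, 6.4, "Dir X", "Writer Q", "Prod Z"),
    Film("Film C", 2021, 8.1, "Dir W", "Writer Y", "Prod Z"),
]
new_film = Film("Film D", 2022, None, "Dir X", "Writer Y", "Prod Z")

features = build_features(new_film, history)
print(features)
```

A person with no earlier rated film gets `None`, which a real pipeline would have to fill in or handle before modelling.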
As is my way with any ML project, I like to start simple and use a regression model to establish the baseline modelling table and outputs. After a few rounds of modelling, the actors were once again adding very little, but the inclusion of the producer and writers was adding a huge amount.
Nevertheless, the regression model was only predicting about 39% correctly, which just isn’t good enough.
The Random Forest model (combines the output of multiple decision trees to reach a single result) has proved very useful in many modelling projects in the past and lived up to its reputation again here, pushing the accuracy up to 83%.
Accuracy as defined by the model and accuracy that is useful can be two very different things. For our purposes, I want the rating prediction to be in the right ballpark, so I settled on a tolerance of 0.5 - this means that if I predicted a movie rating of 8.0, I’d accept anything in the range of 7.5 to 8.5 as accurate. This gave me an accuracy of 80%, so 4 in 5 movie predictions would be “accurate”. If we widen the tolerance to 0.75, then 93% of predictions count as accurate, so I felt comfortable with the model performance.
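The "within 0.5" measure is easy to compute by hand. A minimal sketch, assuming predictions and actual ratings are held in two equal-length lists; the function name and the numbers are made up for illustration and are not the model's results.

```python
def within_tolerance_accuracy(predicted, actual, tolerance):
    """Share of predictions that land within +/- `tolerance` of the actual rating."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)


# Illustrative numbers only.
predicted = [8.0, 6.5, 7.2, 5.9, 7.0]
actual = [7.6, 6.4, 7.8, 6.3, 6.4]

acc_half = within_tolerance_accuracy(predicted, actual, 0.5)
acc_wide = within_tolerance_accuracy(predicted, actual, 0.75)
print(f"within 0.5: {acc_half:.0%}, within 0.75: {acc_wide:.0%}")
```

Widening the tolerance can only keep the score the same or raise it, which is why the 0.75 figure is always at least as high as the 0.5 one.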
With a reliable prediction model, I could score the 2023 movies (for those that have the data available) and then set about building a measurement process. As the movies are released, the audience votes and scores come in, which can be compared against the model’s predictions. Given our expectation of being in the right ballpark, if a movie is performing drastically differently to the expectation, then we can conclude that the movie is under- or over-performing.
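The comparison described above could look something like the sketch below. The 0.5 threshold reuses the tolerance from the accuracy check, which is my assumption about where to draw the line; the function name, labels and films are hypothetical.

```python
def classify_performance(predicted, actual, tolerance=0.5):
    """Label a film by how far its live rating sits from the predicted one."""
    gap = actual - predicted
    if gap > tolerance:
        return "over-performing"
    if gap < -tolerance:
        return "under-performing"
    return "as expected"


# (title, predicted rating, current audience rating) - made-up values.
tracked = [
    ("Film A", 7.0, 8.1),
    ("Film B", 6.8, 6.6),
    ("Film C", 7.5, 6.2),
]

report = {title: classify_performance(p, a) for title, p, a in tracked}
for title, label in report.items():
    print(f"{title}: {label}")
```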
The dashboard (built in Looker Studio) provides a daily refreshed view of how the movies are tracking.
Note: Where a movie has multiple directors/producers/writers, only one was chosen using IMDB’s order field.
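One way to pick a single person per role is to keep the credit with the lowest order value. The sketch below assumes rows with `tconst`, `ordering`, `nconst` and `category` keys, loosely modelled on IMDB's public credits data; treat those names, the helper and the sample rows as assumptions rather than the exact fields used here.

```python
def first_credit_per_role(credits):
    """For each (title, role) pair, keep the person with the lowest ordering."""
    chosen = {}
    for row in credits:
        key = (row["tconst"], row["category"])
        if key not in chosen or row["ordering"] < chosen[key]["ordering"]:
            chosen[key] = row
    return {key: row["nconst"] for key, row in chosen.items()}


# Made-up credit rows for one title with two directors and two writers.
credits = [
    {"tconst": "tt001", "ordering": 6, "nconst": "nm_b", "category": "director"},
    {"tconst": "tt001", "ordering": 5, "nconst": "nm_a", "category": "director"},
    {"tconst": "tt001", "ordering": 7, "nconst": "nm_c", "category": "writer"},
    {"tconst": "tt001", "ordering": 8, "nconst": "nm_d", "category": "writer"},
]

selected = first_credit_per_role(credits)
print(selected)
```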
While I wait for the next installment of Dune to come out (Nov), I’ll be checking out the George Foreman movie coming out next month (April).