Join Vaidas Armonas, our Machine Learning lead, as he continues his attempt to predict song popularity on Spotify and explores two ML representation design patterns: Rebalancing and Ensembles.
Author: Vaidas Armonas, Machine Learning Lead at Genus AI This post was originally published on his personal blog.
In the first part of this Problem Representation series, we saw that representing what is seemingly a regression problem as a classification problem can bring increased performance. We also saw that constructing a label in a specific way can additionally increase the performance, but results weren’t great: we achieved only a 30% correlation.
To increase performance, we are going to split our problem: separate low-popularity and medium-to-high-popularity songs and treat them separately. This post, then, is about separating unpopular tracks as well as we can. For this, I will take a look at two Machine Learning representation design patterns (discussed in this book): Rebalancing and Ensembles. As last time, the code to reproduce these results is on GitHub.
Looking at the distribution of popularity for our dataset (tracks produced from 2011 onward), we see that a small portion of the songs is very unpopular, with a popularity score below 20:
Low scores are outliers: only around 4% of all popularity scores in our dataset fall below 20. However, the track popularity models developed in Part 1 of this series still tried to accommodate these low scores. Our models might perform better if we remove these “outlier” tracks before predicting a song’s popularity. To do this, we need to classify songs as unpopular (a popularity of 20 or less) or popular (all the rest). Since only 4% of the tracks in our dataset have such a low popularity score, this is a highly imbalanced problem. To solve it, we are going to try a couple of Machine Learning design patterns: Rebalancing and Ensembles.
The Rebalancing design pattern addresses the problem of an imbalanced dataset in three ways:

- weighting: penalizing errors on the minority class more heavily during training;
- undersampling: discarding part of the majority class;
- oversampling: adding real or synthetic minority class examples.
The Ensemble design pattern combines multiple machine learning models, often trained on subsamples of the data, to reduce bias, variance, or both. There are three main approaches to ensembles:

- bagging: train models on random subsamples of the data and average their predictions, reducing variance;
- boosting: train models sequentially, each one focusing on the errors of the previous ones, reducing bias;
- stacking: train a meta-model on the predictions of several base models.
We will use bagging for our problem, since we will train our models on small amounts of data and our goal is to reduce the variance of the combined score.
We are going to run some experiments to see which approach produces the best result for our use-case. First, let’s take a closer look at the data we are working with and define the evaluation metrics.
You will find a bit more EDA in the notebook referenced above.
We have 19,788 tracks collected for the years 2011-2020. We split this dataset randomly into train and test sets of 15,830 and 3,958 tracks, and we have 11 audio features to predict popularity. Based on the plot provided above, let’s mark tracks with a popularity of 20 or less as unpopular and all other songs as popular. With this definition, the summary statistics for the label are:
| Statistic | Popularity (train) | Popularity (test) |
| --- | --- | --- |
Our problem is a standard classification task: the only thing we care about is classifying the examples in the test set as accurately as possible. Since our dataset is highly imbalanced, accuracy is not an appropriate metric. Instead, we will use the f1-score for evaluation, as it captures the model’s performance in our setting better.
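To illustrate why accuracy misleads here, consider a toy sketch with the same ~4% minority rate (hypothetical data, not the notebook’s code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a ~4% minority class, mirroring our popularity labels.
y_true = np.array([1] * 4 + [0] * 96)

# A useless "model" that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.96 -- looks great, learned nothing
print(f1_score(y_true, y_pred))        # 0.0  -- exposes the failure
```

The f1-score immediately exposes a classifier that never finds a single unpopular track, while accuracy rewards it.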
Let’s start our experiments with the naive approach.
The naive approach is not to modify our dataset or classifier’s hyper-parameters (weights for the classes). Since we evaluate our models based on precision, recall, and f1-score, we need to select a score threshold. We will do this by looking at the recall-precision-f1 graph like the one below.
The best point we can find on the above graph is a precision of 0.564, a recall of 0.402, and an f1-score of 0.470. These metrics correspond to a threshold of 0.205.
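The threshold search itself can be sketched with scikit-learn’s `precision_recall_curve`; the synthetic labels and scores below are stand-ins for our test set and model outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-in labels (~4% positives) and model scores, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.04, size=2000)
scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.15, size=2000), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# f1 at every candidate threshold; the final precision/recall pair has
# no associated threshold, so we drop it before taking the argmax.
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = int(np.argmax(f1[:-1]))
print(f"best f1={f1[best]:.3f} at threshold={thresholds[best]:.3f}")
```

The same sweep is repeated for every model below to find its operating point.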
We have our baseline. Next, let’s explore various ways to account for the imbalance in our dataset and see their impact on model performance.
One way to achieve rebalancing is through an algorithm parameter rather than changes to the training data. We can penalize errors made on minority class examples more heavily and, in this way, bias our algorithm to learn a better representation of the minority class.
After running a few experiments, we achieved the best f1-score with a minority-to-majority weight ratio of 7 to 1. The performance of the model trained with this weighting is as follows:
The optimal point on the above graph is an f1-score of 0.489 (a 4% improvement over the naive approach), a precision of 0.524, and a recall of 0.457. We achieve these metrics at a threshold of 0.545.
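A minimal sketch of class weighting, assuming a scikit-learn classifier on synthetic stand-in data (the notebook’s actual model may differ); class 1 (unpopular) gets seven times the weight of class 0:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for our ~4%/96% dataset.
X, y = make_classification(n_samples=4000, weights=[0.96], random_state=42)

# Penalize minority-class errors 7x, matching the 7-to-1 ratio above.
clf = RandomForestClassifier(
    n_estimators=50, class_weight={0: 1, 1: 7}, random_state=42
)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # scores to sweep a threshold over
```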
Now let’s see what impact rebalancing the data has.
Dealing with imbalanced datasets could be a series of blog posts in itself. Here we will look into a couple of techniques for undersampling and then for oversampling, to get a sense of how these techniques perform. In general, throwing away data is unwise, so we expect oversampling to come out on top. Let’s see what our experiments show.
We are going to use a great package that makes this process easy for us: imbalanced-learn.
Random undersampling is the most straightforward approach we can take to alter our dataset: randomly discarding some of the majority class records until we reach a specified “balance” level. Running a few experiments for different values of this balance, we found that the best performing ratio is 0.2: undersampling until the dataset is 20% minority and 80% majority examples (the original dataset was 4% minority and 96% majority).
The optimal point on the above graph is an f1-score of 0.484, a precision of 0.469, and a recall of 0.500, achieved at a threshold of 0.423. This approach gives us a model with a slightly different precision-recall tradeoff: better recall, but a somewhat lower precision score. With the f1-score being almost the same, we can later pick the model whose tradeoff better fits our use-case. Next, let’s investigate a slightly cleverer undersampling technique.
The NearMiss method is based on a paper by Zhang J. and Mani I. (2003). There are three variations of the algorithm:

- NearMiss-1 keeps the majority class examples with the smallest average distance to their closest minority class examples;
- NearMiss-2 keeps the majority class examples with the smallest average distance to their farthest minority class examples;
- NearMiss-3 first keeps, for each minority class example, a number of its nearest majority class neighbors, and then selects among them.
After running nine experiments for each of the above variations, we achieved the best result with version 3 and the equal sampling strategy, downsampling the majority class to a 50/50 ratio.
Given how much data we discard, it is no surprise that the results are worse than those of the other methods. We achieve the best performance at a threshold of 0.692: a precision of 0.483, a recall of 0.341, and an f1-score of 0.400. Let’s see if oversampling can improve results.
It is not a surprise that we lose model performance as we discard records from our dataset. Our dataset is not very big to begin with, so we will try a couple of oversampling methods, SMOTE and ADASYN, next.
SMOTE was proposed by Chawla et al. (2002) as a way to improve model performance on imbalanced datasets and as an alternative to random oversampling with replacement. Instead of duplicating actual records, this approach creates synthetic examples of the minority class. Let’s see how it performs on our dataset.
Running several experiments with different oversampling ratios, we find that oversampling the dataset to the 20/80 level gives us the best performing model.
The best performance is at a threshold of 0.567: a precision of 0.500, a recall of 0.396, and an f1-score of 0.442. Oversampling with SMOTE produced a better model than NearMiss undersampling, but it still does not outperform the naive, weighted, or random-undersampling models. Let’s investigate another oversampling technique.
ADASYN is an algorithm proposed by Haibo et al. (2008). The main difference from SMOTE is that ADASYN generates more synthetic examples in the regions that are harder to learn, instead of treating every minority class example equally.
As with SMOTE, ADASYN oversampling produces the best model with the dataset balance of 20/80.
As is evident from the graph, oversampling with ADASYN gives us a model that outperforms just one other, the NearMiss model: a precision of 0.453, a recall of 0.378, and an overall f1-score of 0.412, at a threshold of 0.578.
The Ensemble design pattern is useful for many problems, not just imbalanced datasets. By training and aggregating multiple algorithms that have seen different parts of the training dataset, we get more stable model performance. Different ensemble techniques target bias or variance issues, and some algorithms already incorporate this technique: Random Forest (bagging) and Boosted Trees (boosting). With an imbalanced dataset, however, there is a twist.
We want to be smart about selecting the subsample of the dataset used to train each ensemble member. One way to be smart about downsampling is to downsample only the majority class: every model is then trained on a slightly different dataset while keeping all of the minority class information. We can repeat this process as many times as needed for stable performance. Running several experiments on our song popularity problem, we found that the best ensemble has 45 models. That is a lot, but the models are tree-based, so inference is not snail-slow.
The ensemble approach gives us the third-best model, with a precision of 0.481, a recall of 0.470, and an f1-score of 0.475 at a threshold of 0.733.
When putting all our results together, we get the following table:
| Model | Precision | Recall | F1-score | Notes |
| --- | --- | --- | --- | --- |
| Weighting training examples | 0.524 | 0.457 | 0.489 | Best performing weighting at 7-to-1; threshold 0.545 |
| Random undersampling | 0.469 | 0.500 | 0.484 | Best performing undersampling to 20/80; threshold 0.423 |
| Ensemble | 0.481 | 0.470 | 0.475 | Best performing ensemble had 45 models; threshold 0.733 |
| Naive | 0.564 | 0.402 | 0.470 | Threshold 0.205 |
| SMOTE oversampling | 0.500 | 0.396 | 0.442 | Best performing oversampling at 20/80; threshold 0.567 |
| ADASYN oversampling | 0.453 | 0.378 | 0.412 | Best performing oversampling at 20/80; threshold 0.578 |
| NearMiss undersampling | 0.483 | 0.341 | 0.400 | Best performing undersampling with version 3 to 50/50; threshold 0.692 |
When evaluating models by the highest f1-score achieved, we have two close contenders with slightly different performance profiles. The model built by weighting training examples has higher precision but lower recall, while the model built by randomly undersampling the majority class in the training data has higher recall and lower precision. The next two models (third and fourth by f1-score) are close in f1-score but also differ in profile: the ensemble model is relatively balanced, with precision and recall close to each other, while the naive model has the highest precision among the selected models, with recall barely above 0.4.
We set out on this imbalanced classification journey to “clean” our dataset from unpopular tracks before predicting the popularity for “normal” tracks. By applying our top-performing model, we get the following popularity distribution:
We remove around 60% of the unpopular songs with our model. We would like to remove more, but we don’t yet know whether this will be enough to impact the overall popularity prediction performance. We will investigate this in Part 3 of this Problem Representation Design Patterns series, where we bring the last two posts together and combine them with the Cascade design pattern.