# Machine Learning Design Patterns: Problem Representation (Part 1)

Join Vaidas Armonas, our Machine Learning lead, as he explores two ML design patterns, Reframing and Neutral Class, in an attempt to predict song popularity on Spotify.

Author: Vaidas Armonas, Machine Learning Lead at Genus AI

This post was originally published on his personal blog.

In my previous post I discussed the data representation patterns presented in **Machine Learning Design Patterns by V. Lakshmanan, S. Robinson & M. Munn**. In this post I would like to talk about the next topic in the book: problem representation. After taking care of our data representation, this is the next logical step (and therefore the next chapter in the book).

This is also probably the most important decision to make for a Machine Learning problem: how we model a given problem determines how well our solution will perform. The good news is that we do not need to get this decision right from the start. As with everything in Machine Learning, it is an iterative process: when you find that your problem cannot be solved by regression, try classification (always try classification if you can).

This time I will do things differently: instead of just discussing patterns, I will define a task that we will solve using different design patterns. This way we will be able to compare results and see the influence of problem representation.

In this post, I will concentrate on **Reframing** and **Neutral Class** design patterns. Next time I will cover **Rebalancing** and **Ensemble** design patterns. But first, let’s define our task.

To illustrate the above-mentioned design patterns I will try to predict track popularity on Spotify using only track (mostly audio) features such as *danceability*, *liveness* and *tempo*. The data, along with the full list of features used, can be downloaded from Kaggle. I will provide the accompanying notebook shortly.

**Popularity** is a Spotify metric calculated for each track mostly based on the number of plays and the recency of those plays. Given the above definition, we expect that new songs will be more popular on average. So we will limit ourselves to the tracks that were released in the past decade. Again, for more details, see the notebook.

As I have already mentioned, in this post I would like to discuss and illustrate two problem representation design patterns: **Reframing** and **Neutral Class**. The former is any data scientist’s bread and butter. As for the latter, it is much rarer (I haven’t seen it anywhere else so far).

The task is simple: we want to predict which song is popular. The popularity rating is an integer from 0 to 100. It is not a real-valued target, as, for example, the price of a house would be, but regression is a reasonable approach here. However, if the only thing we want to predict is popularity (not the rating itself), we can make this task a classification problem by thresholding the popularity index.

More on data can be found in the notebook referenced above.

We have 19,788 tracks collected for the years 2011-2020. We split this dataset randomly into train and test sets of 15,830 and 3,958 tracks respectively. We have 11 audio features to predict popularity from.
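A minimal sketch of such a random split, assuming scikit-learn's `train_test_split`. The feature matrix and popularity scores below are random placeholders rather than the Kaggle data, and the seed is arbitrary; only the row counts match the numbers above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data (assumption): random features and scores.
rng = np.random.default_rng(0)
n_tracks, n_features = 19_788, 11
X = rng.normal(size=(n_tracks, n_features))
popularity = rng.integers(0, 101, size=n_tracks)  # integers in [0, 100]

# An integer test_size is read as an absolute number of test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=3_958, random_state=42
)
print(X_train.shape, X_test.shape)
```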

The popularity has the following distributions for train and test splits:

The match between the two splits is not perfect, but it is close. The summary statistics are close too:

| Statistic | Popularity (train) | Popularity (test) |
| --- | --- | --- |
| Count | 15,830 | 3,958 |
| Mean | 58.89 | 58.58 |
| Std | 15.30 | 15.14 |
| Min | 0 | 0 |
| 25th | 54 | 53 |
| 50th | 61 | 60 |
| 75th | 67 | 67 |
| Max | 99 | 100 |

For these types of problems, I use rank correlation as the evaluation metric. Since I want to know which tracks are or will be popular, what matters is whether a higher predicted score indicates higher actual popularity, and that is exactly what rank correlation measures. I will use **Spearman’s Rho** and **Kendall’s Tau** with corresponding p-values.
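Both coefficients are available in SciPy. The sketch below shows one way to compute them; the helper name `rank_correlations` and the toy numbers are illustrative assumptions, not taken from the notebook.

```python
from scipy.stats import kendalltau, spearmanr

def rank_correlations(y_true, y_score):
    """Return Spearman's rho and Kendall's tau, each with its p-value."""
    rho, rho_p = spearmanr(y_true, y_score)
    tau, tau_p = kendalltau(y_true, y_score)
    return {
        "spearman_rho": (float(rho), float(rho_p)),
        "kendall_tau": (float(tau), float(tau_p)),
    }

# Toy example: the scores order the tracks correctly except for one swapped pair.
actual_popularity = [10, 40, 55, 70, 90]
predicted_score = [0.1, 0.3, 0.2, 0.8, 0.9]
result = rank_correlations(actual_popularity, predicted_score)
print(result)  # rho is 0.9 and tau is 0.8 for this toy input
```

Only the ordering of the scores matters here, which is why the same metrics can later be applied to class probabilities as well as to regression outputs.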

As mentioned, regression is a reasonable approach to model popularity. Running it with sklearn’s *GradientBoostingRegressor* produces the following results:

**Correlation coefficients**

| Metric | Coefficient | p-value |
| --- | --- | --- |
| Spearman’s Rho | 0.270 | 5.25e-67 |
| Kendall’s Tau | 0.184 | 6.20e-66 |

There is a statistically significant correlation between predicted and actual popularity. But let’s see if we can do even better.
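The regression baseline can be sketched roughly as follows. The data is a synthetic stand-in (one informative feature plus noise), and the default hyperparameters and seeds are assumptions, so the numbers it prints will not match the table above.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify data (assumption): 11 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 11))
raw = 59 + 15 * (0.5 * X[:, 0] + rng.normal(size=3000))
popularity = np.clip(np.round(raw), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=0.2, random_state=42
)

# Fit the regressor directly on the popularity rating.
reg = GradientBoostingRegressor(random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

# Rank correlation between predicted and actual popularity on the test set.
rho, rho_p = spearmanr(y_test, pred)
tau, tau_p = kendalltau(y_test, pred)
print(f"Spearman rho={rho:.3f} (p={rho_p:.2e}), Kendall tau={tau:.3f} (p={tau_p:.2e})")
```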

We can reframe our original regression problem to classification by thresholding the popularity index to create classes for our model to predict. This usually works better than regression, because we simplify the problem a bit. If the only thing we need is a relative ordering of the songs, the scores from the classification model are perfectly good for such a goal.

The simplest approach for binary class creation is to split the scores at the median value. Using such an approach, we will have nicely distributed training data. This might not work if values in your dataset are very skewed towards one or the other end of the scale. In that case, you will have to experiment to see whether having the imbalanced dataset works out or some re-balancing design patterns (the subject of my next post!) should be employed.

**Correlation coefficients**

| Metric | Coefficient | p-value |
| --- | --- | --- |
| Spearman’s Rho | 0.306 | 7.69e-87 |
| Kendall’s Tau | 0.210 | 3.73e-85 |

The improvement over regression is roughly 13% and 14% in Spearman’s and Kendall’s correlations respectively (0.270 to 0.306 and 0.184 to 0.210). That is not bad for some simple thresholding. Let’s see if we can improve this result.
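A sketch of the reframing step described above, under a few assumptions: synthetic stand-in data, a `GradientBoostingClassifier` with default settings, a threshold at the training median, and evaluation of the predicted probability of the upper class against the original popularity rating.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify data (assumption): 11 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 11))
raw = 59 + 15 * (0.5 * X[:, 0] + rng.normal(size=3000))
popularity = np.clip(np.round(raw), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=0.2, random_state=42
)

# Reframe: tracks above the training median become class 1, the rest class 0.
threshold = np.median(y_train)
y_train_bin = (y_train > threshold).astype(int)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train_bin)

# The probability of class 1 serves as a ranking score for the test tracks.
score = clf.predict_proba(X_test)[:, 1]

rho, _ = spearmanr(y_test, score)
tau, _ = kendalltau(y_test, score)
print(f"threshold={threshold}, Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```

The threshold is computed on the training split only, so no information from the test set leaks into the labels.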

Another trick that I always experiment with is trying to simplify the problem for the algorithm by removing middle values. Depending on your problem, this might help the model distinguish between `good` and `bad` examples better. However, this means that we are discarding some of our data, so the tradeoff only pays off when the discarded data is worth less than the ambiguity that is removed.

**Correlation coefficients**

| Metric | Coefficient | p-value |
| --- | --- | --- |
| Spearman’s Rho | 0.303 | 7.98e-87 |
| Kendall’s Tau | 0.207 | 1.39e-82 |

In our example, I kept the top and bottom 40 percent of the dataset (discarding the middle 20 percent), and, as we can see, this ended up slightly hurting performance compared to the simple median split. I also varied the amount of data I removed, and it always had a detrimental effect on model performance, the same as in this case.
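Dropping the middle values could look roughly like this. The data is again a synthetic stand-in, and the choice to cut at the 40th and 60th percentiles of the training ratings (and to evaluate on the full, unfiltered test set) is an assumption about how the experiment was set up.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify data (assumption): 11 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 11))
raw = 59 + 15 * (0.5 * X[:, 0] + rng.normal(size=3000))
popularity = np.clip(np.round(raw), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=0.2, random_state=42
)

# Keep only the bottom and top 40 percent of training tracks by popularity.
low_cut, high_cut = np.quantile(y_train, [0.4, 0.6])
keep = (y_train <= low_cut) | (y_train >= high_cut)
X_kept = X_train[keep]
y_kept = (y_train[keep] >= high_cut).astype(int)  # 1 = good, 0 = bad

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_kept, y_kept)

# Evaluate on the full test set; the middle tracks are only removed from training.
score = clf.predict_proba(X_test)[:, 1]
rho, _ = spearmanr(y_test, score)
print(f"kept {keep.mean():.0%} of the training data, Spearman rho={rho:.3f}")
```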

The above example tried to remove the ambiguity introduced by splitting a continuous variable at a threshold, but hurt model performance because part of the data was removed. The **Neutral Class** design pattern takes care of that. Instead of removing part of the data, we give it a class of its own and keep it in the training data. Then, at inference time, we only look at the `high` class probability.
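A sketch of the neutral class idea on synthetic stand-in data. The label names `low`, `neutral` and `high`, and the choice to treat the middle 20 percent of training ratings as neutral, are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify data (assumption): 11 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 11))
raw = 59 + 15 * (0.5 * X[:, 0] + rng.normal(size=3000))
popularity = np.clip(np.round(raw), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=0.2, random_state=42
)

# Three classes instead of two: the ambiguous middle gets its own label.
low_cut, high_cut = np.quantile(y_train, [0.4, 0.6])
labels = np.where(
    y_train <= low_cut, "low", np.where(y_train >= high_cut, "high", "neutral")
)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, labels)

# At inference time, use only the probability assigned to the high class.
proba = clf.predict_proba(X_test)
high_col = list(clf.classes_).index("high")
high_score = proba[:, high_col]

rho, _ = spearmanr(y_test, high_score)
print(f"classes={list(clf.classes_)}, Spearman rho={rho:.3f}")
```

Looking up the column through `clf.classes_` avoids relying on the order in which scikit-learn sorts the string labels.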

The neutral class design pattern is also useful when we train models on the output of human labelers, for example in medical imaging applications. When human labelers do not agree, we can represent that uncertainty to our model via a neutral class.

**Correlation coefficients**

| Metric | Coefficient | p-value |
| --- | --- | --- |
| Spearman’s Rho | 0.308 | 8.23e-88 |
| Kendall’s Tau | 0.211 | 7.16e-86 |

As you can see, this increased the Spearman’s and Kendall’s metrics by roughly 1% and 0.5% respectively compared to the previous simple classification approach. This will not make or break your Machine Learning project, and the size of the effect will depend on the dataset. With such a simple change, any positive effect is welcome.

When looking at the scatterplots (or even looking at the label distribution), we see that more can be done to increase model performance. We could train a model to distinguish `0` (or very low) scores from the higher ones and then train another model to predict higher scores on the output of the first. This is called the **Cascade** design pattern and I will write about it in the future.

In this post, we explored two Machine Learning design patterns, **Reframing** and **Neutral Class**. I have shown that these can work in tandem by reframing the problem at hand from regression to classification and then adding a neutral class to help our model distinguish the high and low values better. These two steps add a nice performance boost if we can live with having an estimate of the probability of the `high` class rather than an estimate of the value itself.

In my next post, I will take a look at the **Rebalancing** and **Ensemble** design patterns. I think they can be useful every day, since most interesting problems are imbalanced by nature (predicting high-value customers, churners, etc.).