Classifying music genres with CNNs

fyINjam.jpg!web

A CNN is one of the top technologies for streaming services and leading music platforms. The neural network can identify similar compositions by genre, mood, etc. All this makes the music world much better and more enjoyable. But how? Let’s take a look at essential things like how CNN works, how it relates to music, and how to implement music classifier in TensorFlow.

It is going to be an intense guide, but relax, it will be exciting and fun ;)

Convolutional Neural Networks basics: Quick overview

So, what is the Convolutional neural network (CNN) in machine learning? In a nutshell, it’s an advanced type of neural network that is mainly used for processing images. CNN has an input layer, an output layer, and various hidden layers. Some of these layers are convolutional, using a mathematical model to pass on results to successive layers. This simulates some of the actions in the human visual cortex.

If you give the neural network hundreds and thousands of images of a particular subject, then CNN will process these photos in several layers. The first layers of CNN will distinguish between the gray points and the edges of the image, the next layers will differentiate between CNN shapes, objects and the more of these layers — the greater the identification of the object that dominates those images.

What does it mean? For example, we have a few photos of elephant images, we feed CNN with many elephant photos, then we can show any image, and CNN will tell if there is an elephant or not.

CNN can also process video through pixels that appear on the screen and change, CNN can examine this changing pattern and recognize the object. Of course, this procedure is much more complicated than identifying a simple image. For example, you could use CNN as a model to distinguish objects and learn to recognize them better.

What’s good about CNN?

I can differentiate several things at once:

CNNs have the same timeless parameters as dense layers . *Dense layer feeds all outputs from the previous layer to all its neurons, each neuron providing one output to the next layer. It’s the most basic layer in neural networks);
CNN has all the image data structures — all pixels, edges, shapes — everything has its position because CNN extracts all that data features and sorts them.

CNN + music

For music, things are a little different, and therefore, more complicated. Any audio recording is a specific amplitude that can also be processed as an image:

7v2AJ3i.png!web

Stereophonic sound

It all sounds like magic, but it’s really just CNN and the simple (or not quite simple) principle behind it. What is the principle? Let’s figure it out.

What is convolution?

How does CNN differ from a simple neural network? The answer is easier than you think, it is — C, or put it differently, a convolution.

A first question to answer with CNNs is why they are called Convolutional in the first place.

Convolution is a mathematical concept used heavily in Digital Signal Processing when dealing with signals that take the form of a time series. Convolution is a mechanism to combine or “blend” two functions of time in a coherent manner. It can be mathematically described as follows:

For a discrete domain of one variable:

For a discrete domain of two variables:

The central point of CNN is convolution, and to understand what it is we need to learn Kernel.

A kernel is a filter that makes a grid of weights like this:

f6FjYjv.png!web

Say we have an image and we need to process it with CNN. Our first step is to apply Kernel to it. How? Let’s see how CNN works in a simple example.

uaYNVjM.png!web

Here is the picture of a cat, and just like any other picture it has a certain amount of pixels with different shades and colors. This picture is black and white so we deal with various shades of these two colors (or more precisely the possible range of values a single pixel can represent is [0, 255]). If assigning each of the shades a certain value like from 0 to 10, we can translate this picture in a grid with all of these numbers representing certain values.

What will happen if we take a colored photo? Nothing special, RGB is our best friend:

6vQzEbi.png!web

Separate color channels (3 in the case of RGB images) introduces an additional ‘depth’ field to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say 255×255 (Width x Height) pixels, we’ll have 3 matrices associated with each image, one for each of the color channels. Thus the image in its entirety constitutes a 3-dimensional structure called the Input Volume (255x255x3).

The next step is what is called a convolution — we overlay the Kernel with a specific formula to change the input data and get an output.

Kernelis an operator applied to the entirety of the image such that it transforms the information encoded in the pixels. In practice, it is a small matrix, which is slid across the image and multiplied with the input such that the output is enhanced in a certain desirable manner.

https://vision.unipv.it/CV/20200113%20-%20Computer%20Vision%20Applications.pdf

So, generally, we use Kernel to extract features. At the beginning, we extract low-level features, but then we go deeper, increase CNN layers and detect the whole object and the smallest details.

Kernel do a larger part of CNN work, but we also have other important parts of this procedure like rectified linear unit (ReLu), the pooling layer, the fully connected layer. All of them create a cutting-edge CNN. Now, I will not stop on each of these parts cause it will take lots of time. If you want to have the whole picture of this procedure in your mind, I recommend watching this video:

To learn more about this you can also follow these links:

CNN for Music Genre Classification

jI3a6fq.png!web

Photo Creds: https://unsplash.com/

As mentioned before, to process audio and extract useful insight from it, we can process it as an image and conduct a standard CNN procedure. A popular method in the audio domain is to use a spectrogram (derived from the Fast Fourier Transform and/or other transformations) as an input to a CNN and to apply convolving filter kernels that extract patterns in 2D.

Everything seems quite logical, but here is the most interesting question: How to classify different music genres? What principle can be behind this task?

Classic algorithm needed for the task of classifying music genres with CNN:

1. create train, validation and test sets

2. build the CNN net

3. compile the network

4. train the CNN

5. evaluate the CNN on the test set

6. make a prediction on a sample

Music genres are a set of descriptive keywords that convey high-level information about a music clip (jazz, classical, rock…). Genre classification is a task that aims to predict music genre using the audio signal.

Building this system requires extracting acoustic features that are good estimators of the type of genres we are interested in, followed by a single or multi-label classification or in some cases, the regression stage. Conventionally, feature extraction relies on a signal processing front-end in order to compute relevant features from time or frequency domain audio

representation. The features are then used as input to the machine learning stage.

CNNs assume features that are in different levels of hierarchy and can be

extracted by convolutional kernels. The hierarchical features are learned to achieve a given task during supervised training. For example, learned features from a CNN that is trained for genre classification exhibit low-level features (e.g., onset) to high-level features (e.g., percussive instrument patterns).