Using Deep Convolutional Neural Networks to Classify Music (TENSORFLOW)


Convolutional neural networks for image recognition can be used to classify music. Why? Because the time series information of music (sound) can be extracted to be in a form similar to image presentation.

The objective here is to classify music samples into a specific genre using convolutional networks. There are three genres - Alternative, Rap/Hiphop, and Rock.

Data set

  • Music Audio Benchmark Data Set [Homburg et al. (2005) and Mierswa et al. (2005)]
  • 294 song excerpts from each of 3 genres in mp3 format
  • file frequency: 44.1 kHz
  • file bitrate: 128 kb
  • duration: 10 seconds each
  • train : test data split: 80% : 20% (236 : 58)
  • For each excerpts of the training data, we extracted 12 segments with length of 5 seconds to be a sample. In total 2832 training samples.
  • For each 5 second signal, we extracted 22050 * 5 time points, with sample rate= 22050.
  • MFCC for each sample is of dimension 40 X 217

Figure shows the three classes of music’s MFCC image.

Feature extraction for audio signals

  • Low-level features: spectral, spectrotemporal features
  • High-level features: tempo, rhythm, key, pitch, and harmonic information
  • Low-level features used here as we want deep learning to learn the complex features by itself. Mel frequency cepstral coefficients (MFCCs) are used as features here.
  • MFCCs are supposed to mimic the logarithmic perception of loudness and pitch of the human auditory system. They have become very popular as features for problems within the music information retrieval community.
  • All audio signal processing was done using librosa [1].


We used a similar model proposed by Choi et al., 2016 [2], with 3 convolutional layers and 2 fully connected layers. We also applied a normalization step before entering the fully connected layer. This implements with accuracy of 86%.

TensorBoard shows the graph of the model:

TensorBoard shows the testing accuracy and loss change of the model during training:


  1. I also tried the following two models: 2 convolutional layers and 2 fully connected layers; 3 convolutional layers and 2 fully connected layers, with an accuracy at around 75% and 81% separately. The running time does not differ much across the three models, but with a data normalizaton step the performance is better.

  2. Actually I have several other genres of music, but with fewer samples than this three. I tried to implement the same models to classify 9 genres, with different number of samples from each genre. This result 53% accuracy. Lesson from this implementation: If you sample from a specific class is farely fewer than the other classes (meaning your sample is really skewed…), you need to consider whether it worth trying before you act..


Source code could be found on my github.


  1. Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th Python in Science Conference, 2015.


Written on February 2, 2017