Analysing music habits with Spotify API and Python

Oct 14, 2019

I’m using Spotify since 2013 as the main source of music, and back at that time the app automatically created a playlist for songs that I liked from artists’ radios. By innertion I’m still using the playlist to save songs that I like. As the playlist became a bit big and a bit old (6 years, huh), I’ve decided to try to analyze it.

Boring preparation

To get the data I used Spotify API and spotipy as a Python client. I’ve created an application in the Spotify Dashboard and gathered the credentials. Then I was able to initialize and authorize the client:

import spotipy
import spotipy.util as util

token = util.prompt_for_user_token(user_id,
                                   'playlist-read-collaborative',
                                   client_id=client_id,
                                   client_secret=client_secret,
                                   redirect_uri='http://localhost:8000/')
sp = spotipy.Spotify(auth=token)

Tracks metadata

As everything is inside just one playlist, it was easy to gather. The only problem was that user_playlist method in spotipy doesn’t support pagination and can only return the first 100 track, but it was easily solved by just going down to private and undocumented _get:

playlist = sp.user_playlist(user_id, playlist_id)
tracks = playlist['tracks']['items']
next_uri = playlist['tracks']['next']
for _ in range(int(playlist['tracks']['total'] / playlist['tracks']['limit'])):
    response = sp._get(next_uri)
    tracks += response['items']
    next_uri = response['next']

tracks_df = pd.DataFrame([(track['track']['id'],
                           track['track']['artists'][0]['name'],
                           track['track']['name'],
                           parse_date(track['track']['album']['release_date']) if track['track']['album']['release_date'] else None,
                           parse_date(track['added_at']))
                          for track in playlist['tracks']['items']],
                         columns=['id', 'artist', 'name', 'release_date', 'added_at'] )

tracks_df.head(10)

id artist name release_date added_at 0 1MLtdVIDLdupSO1PzNNIQg Lindstrøm & Christabelle Looking For What 2009-12-11 2013-06-19 08:28:56+00:00 1 1gWsh0T1gi55K45TMGZxT0 Au Revoir Simone Knight Of Wands - Dam Mantle Remix 2010-07-04 2013-06-19 08:48:30+00:00 2 0LE3YWM0W9OWputCB8Z3qt Fever Ray When I Grow Up - D. Lissvik Version 2010-10-02 2013-06-19 22:09:15+00:00 3 5FyiyLzbZt41IpWyMuiiQy Holy Ghost! Dumb Disco Ideas 2013-05-14 2013-06-19 22:12:42+00:00 4 5cgfva649kw89xznFpWCFd Nouvelle Vague Too Drunk To Fuck 2004-11-01 2013-06-19 22:22:54+00:00 5 3IVc3QK63DngBdW7eVker2 TR/ST F.T.F. 2012-11-16 2013-06-20 11:50:58+00:00 6 0mbpEDdZHNMEDll6woEy8W Art Brut My Little Brother 2005-10-02 2013-06-20 13:58:19+00:00 7 2y8IhUDSpvsuuEePNLjGg5 Niki & The Dove Somebody (drum machine version) 2011-06-14 2013-06-21 09:28:40+00:00 8 1X4RqFAShNL8aHfUIpjIVr Gorillaz Kids with Guns - Hot Chip Remix 2007-11-19 2013-06-23 19:00:57+00:00 9 1cV4DVeAM5AstrDlXgvzJ7 Lykke Li I'm Good, I'm Gone 2008-01-28 2013-06-23 22:31:52+00:00

The first naive idea of data to get was the list of the most appearing artists:

tracks_df \
    .groupby('artist') \
    .count()['id'] \
    .reset_index() \
    .sort_values('id', ascending=False) \
    .rename(columns={'id': 'amount'}) \
    .head(10)

artist amount 260 Pet Shop Boys 12 334 The Knife 11 213 Metronomy 9 303 Soulwax 8 284 Röyksopp 7 180 Ladytron 7 94 Depeche Mode 7 113 Fever Ray 6 324 The Chemical Brothers 6 233 New Order 6

But as taste can change, I’ve decided to get top five artists from each year and check if I was adding them to the playlist in other years:

counted_year_df = tracks_df \
    .assign(year_added=tracks_df.added_at.dt.year) \
    .groupby(['artist', 'year_added']) \
    .count()['id'] \
    .reset_index() \
    .rename(columns={'id': 'amount'}) \
    .sort_values('amount', ascending=False)

in_top_5_year_artist = counted_year_df \
    .groupby('year_added') \
    .head(5) \
    .artist \
    .unique()

counted_year_df \
    [counted_year_df.artist.isin(in_top_5_year_artist)] \
    .pivot('artist', 'year_added', 'amount') \
    .fillna(0) \
    .style.background_gradient()

year_added 2013 2014 2015 2016 2017 2018 2019 artist Arcade Fire 2 0 0 1 3 0 0 Clinic 1 0 0 2 0 0 1 Crystal Castles 0 0 2 2 0 0 0 Depeche Mode 1 0 3 1 0 2 0 Die Antwoord 1 4 0 0 0 1 0 FM Belfast 3 3 0 0 0 0 0 Factory Floor 3 0 0 0 0 0 0 Fever Ray 3 1 1 0 1 0 0 Grimes 1 0 3 1 0 0 0 Holy Ghost! 1 0 0 0 3 1 1 Joe Goddard 0 0 0 0 3 1 0 John Maus 0 0 4 0 0 0 1 KOMPROMAT 0 0 0 0 0 0 2 LCD Soundsystem 0 0 1 0 3 0 0 Ladytron 5 1 0 0 0 1 0 Lindstrøm 0 0 0 0 0 0 2 Marie Davidson 0 0 0 0 0 0 2 Metronomy 0 1 0 6 0 1 1 Midnight Magic 0 4 0 0 1 0 0 Mr. Oizo 0 0 0 1 0 3 0 New Order 1 5 0 0 0 0 0 Pet Shop Boys 0 12 0 0 0 0 0 Röyksopp 0 4 0 3 0 0 0 Schwefelgelb 0 0 0 0 1 0 4 Soulwax 0 0 0 0 5 3 0 Talking Heads 0 0 3 0 0 0 0 The Chemical Brothers 0 0 2 0 1 0 3 The Fall 0 0 0 0 0 2 0 The Knife 5 1 3 1 0 0 1 The Normal 0 0 0 2 0 0 0 The Prodigy 0 0 0 0 0 2 0 Vitalic 0 0 0 0 2 2 0

As a bunch of artists was reappearing in different years, I decided to check if that correlates with new releases, so I’ve checked the last ten years:

counted_release_year_df = tracks_df \
    .assign(year_added=tracks_df.added_at.dt.year,
            year_released=tracks_df.release_date.dt.year) \
    .groupby(['year_released', 'year_added']) \
    .count()['id'] \
    .reset_index() \
    .rename(columns={'id': 'amount'}) \
    .sort_values('amount', ascending=False)

counted_release_year_df \
    [counted_release_year_df.year_released.isin(
        sorted(tracks_df.release_date.dt.year.unique())[-11:]
    )] \
    .pivot('year_released', 'year_added', 'amount') \
    .fillna(0) \
    .style.background_gradient()

year_added 2013 2014 2015 2016 2017 2018 2019 year_released 2010.0 19 8 2 10 6 5 10 2011.0 14 10 4 6 5 5 5 2012.0 11 15 6 5 8 2 0 2013.0 28 17 3 6 5 4 2 2014.0 0 30 2 1 0 10 1 2015.0 0 0 15 5 8 7 9 2016.0 0 0 0 8 7 4 5 2017.0 0 0 0 0 23 5 5 2018.0 0 0 0 0 0 4 8 2019.0 0 0 0 0 0 0 14

Audio features

Spotify API has an endpoint that provides features like danceability, energy, loudness and etc for tracks. So I gathered features for all tracks from the playlist:

features = []
for n, chunk_series in tracks_df.groupby(np.arange(len(tracks_df)) // 50).id:
    features += sp.audio_features([*map(str, chunk_series)])
features_df = pd.DataFrame.from_dict(filter(None, features))
tracks_with_features_df = tracks_df.merge(features_df, on=['id'], how='inner')

tracks_with_features_df.head()

id artist name release_date added_at danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature 0 1MLtdVIDLdupSO1PzNNIQg Lindstrøm & Christabelle Looking For What 2009-12-11 2013-06-19 08:28:56+00:00 0.566 0.726 0 -11.294 1 0.1120 0.04190 0.494000 0.282 0.345 120.055 359091 4 1 1gWsh0T1gi55K45TMGZxT0 Au Revoir Simone Knight Of Wands - Dam Mantle Remix 2010-07-04 2013-06-19 08:48:30+00:00 0.563 0.588 4 -7.205 0 0.0637 0.00573 0.932000 0.104 0.467 89.445 237387 4 2 0LE3YWM0W9OWputCB8Z3qt Fever Ray When I Grow Up - D. Lissvik Version 2010-10-02 2013-06-19 22:09:15+00:00 0.687 0.760 5 -6.236 1 0.0479 0.01160 0.007680 0.417 0.818 92.007 270120 4 3 5FyiyLzbZt41IpWyMuiiQy Holy Ghost! Dumb Disco Ideas 2013-05-14 2013-06-19 22:12:42+00:00 0.752 0.831 10 -4.407 1 0.0401 0.00327 0.729000 0.105 0.845 124.234 483707 4 4 5cgfva649kw89xznFpWCFd Nouvelle Vague Too Drunk To Fuck 2004-11-01 2013-06-19 22:22:54+00:00 0.461 0.786 7 -6.950 1 0.0467 0.47600 0.000003 0.495 0.808 159.882 136160 4

After that I’ve checked changes in features over time, only instrumentalness had some visible difference:

sns.boxplot(x=tracks_with_features_df.added_at.dt.year,
            y=tracks_with_features_df.instrumentalness)

Then I had an idea to check seasonality and valence, and it kind of showed that in depressing months valence is a bit lower:

sns.boxplot(x=tracks_with_features_df.added_at.dt.month,
            y=tracks_with_features_df.valence)

To play a bit more with data, I decided to check that danceability and valence might correlate:

tracks_with_features_df.plot(kind='scatter', x='danceability', y='valence')

And to check that the data is meaningful, I checked instrumentalness vs speechiness, and those featues looked mutually exclusive as expected:

tracks_with_features_df.plot(kind='scatter', x='instrumentalness', y='speechiness')

Tracks difference and similarity

As I already had a bunch of features classifying tracks, it was hard not to make vectors out of them:

encode_fields = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'duration_ms',
    'time_signature',
]

def encode(row):
    return np.array([
        (row[k] - tracks_with_features_df[k].min())
        / (tracks_with_features_df[k].max() - tracks_with_features_df[k].min())
        for k in encode_fields])

tracks_with_features_encoded_df = tracks_with_features_df.assign(
    encoded=tracks_with_features_df.apply(encode, axis=1))

Then I just calculated distance between every two tracks:

tracks_with_features_encoded_product_df = tracks_with_features_encoded_df \
    .assign(temp=0) \
    .merge(tracks_with_features_encoded_df.assign(temp=0), on='temp', how='left') \
    .drop(columns='temp')
tracks_with_features_encoded_product_df = tracks_with_features_encoded_product_df[
    tracks_with_features_encoded_product_df.id_x != tracks_with_features_encoded_product_df.id_y
]
tracks_with_features_encoded_product_df['merge_id'] = tracks_with_features_encoded_product_df \
    .apply(lambda row: ''.join(sorted([row['id_x'], row['id_y']])), axis=1)
tracks_with_features_encoded_product_df['distance'] = tracks_with_features_encoded_product_df \
    .apply(lambda row: np.linalg.norm(row['encoded_x'] - row['encoded_y']), axis=1)

After that I was able to get most similar songs/songs with the minimal distance, and it selected kind of similar songs:

tracks_with_features_encoded_product_df \
    .sort_values('distance') \
    .drop_duplicates('merge_id') \
    [['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
    .head(10)

artist_x name_x release_date_x artist_y name_y release_date_y distance 84370 Labyrinth Ear Wild Flowers 2010-11-21 Labyrinth Ear Navy Light 2010-11-21 0.000000 446773 YACHT I Thought the Future Would Be Cooler 2015-09-11 ADULT. Love Lies 2013-05-13 0.111393 21963 Ladytron Seventeen 2011-03-29 The Juan Maclean Give Me Every Little Thing 2005-07-04 0.125358 11480 Class Actress Careful What You Say 2010-02-09 MGMT Little Dark Age 2017-10-17 0.128865 261780 Queen of Japan I Was Made For Loving You 2001-10-02 Midnight Juggernauts Devil Within 2007-10-02 0.131304 63257 Pixies Bagboy 2013-09-09 Kindness That's Alright 2012-03-16 0.146897 265792 Datarock Computer Camp Love 2005-10-02 Chromeo Night By Night 2010-09-21 0.147235 75359 Midnight Juggernauts Devil Within 2007-10-02 Lykke Li I'm Good, I'm Gone 2008-01-28 0.152680 105246 ADULT. Love Lies 2013-05-13 Dr. Alban Sing Hallelujah! 1992-05-04 0.154475 285180 Gigamesh Don't Stop 2012-05-28 Pet Shop Boys Paninaro 95 - 2003 Remaster 2003-10-02 0.156469

The most different songs weren’t that fun, as two songs were too different from the rest:

tracks_with_features_encoded_product_df \
    .sort_values('distance', ascending=False) \
    .drop_duplicates('merge_id') \
    [['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
    .head(10)

artist_x name_x release_date_x artist_y name_y release_date_y distance 79324 Labyrinth Ear Navy Light 2010-11-21 Boy Harsher Modulations 2014-10-01 2.480206 84804 Labyrinth Ear Wild Flowers 2010-11-21 Boy Harsher Modulations 2014-10-01 2.480206 400840 Charlotte Gainsbourg Deadly Valentine - Soulwax Remix 2017-11-10 Labyrinth Ear Navy Light 2010-11-21 2.478183 84840 Labyrinth Ear Wild Flowers 2010-11-21 Charlotte Gainsbourg Deadly Valentine - Soulwax Remix 2017-11-10 2.478183 388510 Ladytron Paco! 2001-10-02 Labyrinth Ear Navy Light 2010-11-21 2.444927 388518 Ladytron Paco! 2001-10-02 Labyrinth Ear Wild Flowers 2010-11-21 2.444927 20665 Factory Floor Fall Back 2013-01-15 Labyrinth Ear Navy Light 2010-11-21 2.439136 20673 Factory Floor Fall Back 2013-01-15 Labyrinth Ear Wild Flowers 2010-11-21 2.439136 79448 Labyrinth Ear Navy Light 2010-11-21 La Femme Runway 2018-10-01 2.423574 84928 Labyrinth Ear Wild Flowers 2010-11-21 La Femme Runway 2018-10-01 2.423574

Then I calculated the most avarage songs, eg the songs with the least distance from every other song:

tracks_with_features_encoded_product_df \
    .groupby(['artist_x', 'name_x', 'release_date_x']) \
    .sum()['distance'] \
    .reset_index() \
    .sort_values('distance') \
    .head(10)

artist_x name_x release_date_x distance 48 Beirut No Dice 2009-02-17 638.331257 591 The Juan McLean A Place Called Space 2014-09-15 643.436523 347 MGMT Little Dark Age 2017-10-17 645.959770 101 Class Actress Careful What You Say 2010-02-09 646.488998 31 Architecture In Helsinki 2 Time 2014-04-01 648.692344 588 The Juan Maclean Give Me Every Little Thing 2005-07-04 648.878463 323 Lindstrøm Baby Can't Stop 2009-10-26 652.212858 307 Ladytron Seventeen 2011-03-29 652.759843 310 Lauer Mirrors (feat. Jasnau) 2018-11-16 655.498535 451 Pet Shop Boys Always on My Mind 1998-03-31 656.437048

And totally opposite thing – the most outstanding songs:

tracks_with_features_encoded_product_df \
    .groupby(['artist_x', 'name_x', 'release_date_x']) \
    .sum()['distance'] \
    .reset_index() \
    .sort_values('distance', ascending=False) \
    .head(10)

artist_x name_x release_date_x distance 665 YACHT Le Goudron - Long Version 2012-05-25 2823.572387 300 Labyrinth Ear Navy Light 2010-11-21 1329.234390 301 Labyrinth Ear Wild Flowers 2010-11-21 1329.234390 57 Blonde Redhead For the Damaged Coda 2000-06-06 1095.393120 616 The Velvet Underground After Hours 1969-03-02 1080.491779 593 The Knife Forest Families 2006-02-17 1040.114214 615 The Space Lady Major Tom 2013-11-18 1016.881467 107 CocoRosie By Your Side 2004-03-09 1015.970860 170 El Perro Del Mar Party 2015-02-13 1012.163212 403 Mr.Kitty XIII 2014-10-06 1010.115117

Conclusion

Although the dataset is a bit small, it was still fun to have a look at the data.

Gist with a jupyter notebook with even more boring stuff, can be reused by modifying credentials.

Analysing music habits with Spotify API and Python

Analysing music habits with Spotify API and Python

Boring preparation

Tracks metadata

Audio features

Tracks difference and similarity

Conclusion

Recommend

Sound lights with Spotify and ESP8266

Tyler Akidau, Slava Chernyak, Reuven Lax: Streaming Systems

Alex Petrov: Database Internals

Sam Newman: Monolith To Microservices

Brendan Burns: Designing Distributed Systems

Titus Winters, Tom Manshreck, Hyrum Wright: Software Engineering at Google

Andrew S. Grove: High Output Management

Julie Zhuo: The Making of a Manager

Neal Ford, Rebecca Parsons, Patrick Kua: Building Evolutionary Architectures

Composition is not what you think it is! Why “nearly invertible” isn’t.

About Joyk