Spotify Wrapped

25 May 2023 · # Analytics · 1115 words

Spotify Wrapped is a viral marketing campaign by the music streaming service Spotify. At the end of every year, Spotify presents each user with a summary of their listening habits.¹

I set out to see whether users can execute Spotify Wrapped-like queries against the Spotify Web API. Before I embarked on my journey, I was unaware of third-party tools such as stats.fm.²

Keep reading to know what is possible with a laptop and API access. Or use stats.fm.

Keys to the Kingdom

Like any sane internet service, Spotify guards their public API with developer keys.³ In particular, Spotify implements the OAuth 2.0 authorization framework.

Go through the OAuth Authorization Code flow to allow a machine -- your laptop -- to act on your behalf. Spotify for Developers documents the gory details.

The result of the interaction should be an access token. API servers only respond to requests containing a valid access token.

Data wrangling

Spotify organizes data in the following way:

Every track is associated with an album
Every track is associated with one or more artists
Every album is an "album," a "single," or a "compilation"
An album may be associated with one or more genres
An artist may be associated with one or more genres

Categorizing tracks by genre is subjective. In practice, only artists, not albums, are assigned genres. More on that later.

Assuming you have a valid access token, you can call the saved tracks endpoint to fetch "liked" tracks:

curl \
  --request GET \
  --url https://api.spotify.com/v1/me/tracks \
  --header 'Authorization: Bearer acc3sst0k3n'

Spotify pages the response for obvious reasons. Use your favorite scripting language to fetch the whole library.⁴

Please be considerate and run your queries against cached data. Not directly against the Web API. I wrote the saved tracks API response to CSV files and loaded them into SQLite. The rest of this blog post will make sense if you do the same thing.

Each CSV file corresponds to an SQLite table. The table schemas mirror the way Spotify manages track data.

Execute

create table album (
  album_id text primary key not null,
  name text not null,
  type text not null,
  release_year integer not null
);

create table artist (
  artist_id text primary key not null,
  name text not null
);

create table track (
  track_id text primary key not null,
  name text not null,
  album_id text not null references album(id)
);

create table track_artist (
  track_id text not null references track(track_id),
  artist_id text not null references artist(artist_id),
  primary key (track_id, artist_id)
);

to apply the schemas, and then load data with the following commands:

.import track.csv track --csv
.import album.csv album --csv
.import artist.csv artist --csv
.import track_artist.csv track_artist --csv

Popular tracks

We can't determine the most streamed tracks from the Web API. It does not expose track play counts.⁵ Indeed, this is why the track table does not have a play_count column.

stats.fm records track play count on an ongoing basis. If you import historical data, it displays play counts from the beginning of time.

Popular albums

Let's define an album's popularity by the number of saved album tracks.

To find popular albums,

join each track to an album by its album ID,
group tracks by album, and
sort by frequency.

Here is a query that returns albums that have at least five tracks:

select album.name, count(*) as count
from track
join album using(album_id)
group by album_id
having count > 4
order by count desc, album.name;

Note that publishers can re-upload albums under a different ID. As far as I can see, this has two consequences:

Multiple Count: users save the same track multiple times
Split Count: users save different tracks across copies of an album

Neither of these phenomena is easy to notice while browsing Spotify.⁶ Unfortunately, they complicate our query and leave orphaned tracks in your library.

Group tracks by album name, not ID, to counteract Split Count:

select album.name, count(*) as count
from track
join album using(album_id)
group by album.name
having count > 4
order by count desc, album.name;

Deduplicate tracks by name to compensate for Multiple Count:

select album.name, count(distinct track.name) as count
from track
join album using(album_id)
group by album.name
having count > 4
order by count desc, album.name;

The last query outputs the table

Album	Saved tracks
Lover	8
American Idiot	6
Back In Black	5
Brothers In Arms	5
Hard Road	5
Parallel Lines	5
Rumours	5
The Last Stand	5
Voulez-Vous	5
÷	5

when run on my library. (Note: I've removed soundtracks.)

Lover by Taylor Swift is my favorite album 😅. How embarrassing!

Popular artists

The query to find popular artists is similar to that used to find popular albums. This time, you must join three tables:

select artist.name, count(distinct track.name) as count
from track
join track_artist using(track_id)
join artist using(artist_id)
group by artist_id
having count > 10
order by count desc, artist.name;

Artist profiles are almost immutable, so we don't have a Split Count problem. However, we have a Multiple Count problem. Address it in the same way as before -- deduplicate tracks by name.

This query outputs the table

Artist	Saved tracks
Elton John	21
ABBA	20
Sabaton	19
AC/DC	18
Dire Straits	18
Taylor Swift	18
David Bowie	15
Ed Sheeran	13
Bee Gees	12
Queen	12
The Killers	12
Take That	11

when run on my library.

The results are unsurprising. Almost all featured artists enjoy -- or have enjoyed -- enormous commercial success. Sabaton is the only unusual result.

Favorite era

Let's define the popularity of an era by the number of library tracks published in that period.

You can find the number of tracks published per year by grouping tracks by publication year:

select release_year, count(distinct track.name) as count
from track
join album using(album_id)
group by release_year
order by count desc, release_year desc;

(2013 and 1984 were good years for music, at least as far as I am concerned.)

Counting library tracks by release decade is more interesting. You don't need a window function to do this. Notice that the release decade of a track is given by release_year - release_year % 10, so the query

select
  release_year - release_year % 10 as release_decade,
  count(distinct track.name) as count
from track
join album using(album_id)
group by release_decade
order by count desc, release_decade desc;

partitions tracks by release decade.

It outputs

Decade	Saved tracks
1980	474
2010	457
2000	374
1970	274
1990	246
1960	104
2020	44
1950	11
1930	1
0	1

when run on my library.⁷

This child of the 2000s is more interested in music from the 1980s.

Favorite genre

No tracks, let alone albums, are tagged with genres. Suppose artists only perform music within their niche.⁸ In that case, you can safely label a track with the union of all its artists' genres.

The documentation for the saved tracks endpoint claims every track's "artist" key contains artist genres. I found this was not true. You will need to call another endpoint to fetch artist genre information. If your library has $n$ artists, you will make $n$ API calls to determine your favorite genre.⁹

Introduce a new table to represent the many-to-many relationship between artists and genres:

create table artist_genre(
  artist_id text not null references artist(artist_id),
  genre text not null,
  primary key (artist_id, genre)
);

Partition tracks by genres from this join table to figure out the most popular genres:

select genre, count(distinct track.name) as count
from track
join track_artist using(track_id)
join artist_genre using(artist_id)
group by genre
order by count desc, genre;

Rock, mainly classic rock, dominates my library:

Genre	Saved tracks
rock	516
classic rock	308
soft rock	303
pop	292
album rock	278
mellow gold	267
hard rock	212
new wave pop	206

Competing streaming platforms like Apple Music recognized the marketing value of Spotify Wrapped and introduced similar campaigns. ↩
Apple Music Replay is updated weekly. You do not need a third-party service to gain insight into listening habits on Apple Music. ↩
Granular rate limiting is only possible with API keys. ↩
Your favorite scripting language should be Python. Only joking 😛. ↩
However, you can get a list of top tracks based on Spotify's notion of popularity. ↩

You can find out whether Multiple Count affects you by comparing the output of

select count(*) from track;

select count(distinct track.name) from track;

↩

0 is a sentinel value for the publication date of an orphaned album in my library. ↩
I mean, Rammstein is unlikely to collaborate with Taylor Swift. ↩
Alternatively, batch your requests. You will make $\left\lceil\dfrac{n}{50}\right\rceil$ API calls. ↩

PreviousNext