Spotify Wrapped is a viral marketing campaign by the music streaming service Spotify. At the end of every year, Spotify presents each user with a summary of their listening habits.1
Keep reading to know what is possible with a laptop and API access. Or use stats.fm.
Keys to the Kingdom
Go through the OAuth Authorization Code flow to allow a machine -- your laptop -- to act on your behalf. Spotify for Developers documents the gory details.
The result of the interaction should be an access token. API servers only respond to requests containing a valid access token.
Spotify organizes data in the following way:
- Every track is associated with an album
- Every track is associated with one or more artists
- Every album is an "album," a "single," or a "compilation"
- An album may be associated with one or more genres
- An artist may be associated with one or more genres
Categorizing tracks by genre is subjective. In practice, only artists, not albums, are assigned genres. More on that later.
Assuming you have a valid access token, you can call the saved tracks endpoint to fetch "liked" tracks:
curl \ --request GET \ --url https://api.spotify.com/v1/me/tracks \ --header 'Authorization: Bearer acc3sst0k3n'
Spotify pages the response for obvious reasons. Use your favorite scripting language to fetch the whole library.4
Please be considerate and run your queries against cached data. Not directly against the Web API. I wrote the saved tracks API response to CSV files and loaded them into SQLite. The rest of this blog post will make sense if you do the same thing.
Each CSV file corresponds to an SQLite table. The table schemas mirror the way Spotify manages track data.
create table album ( album_id text primary key not null, name text not null, type text not null, release_year integer not null ); create table artist ( artist_id text primary key not null, name text not null ); create table track ( track_id text primary key not null, name text not null, album_id text not null references album(id) ); create table track_artist ( track_id text not null references track(track_id), artist_id text not null references artist(artist_id), primary key (track_id, artist_id) );
to apply the schemas, and then load data with the following commands:
.import track.csv track --csv .import album.csv album --csv .import artist.csv artist --csv .import track_artist.csv track_artist --csv
We can't determine the most streamed tracks from the Web API. It does not expose track play counts.5 Indeed, this is why the track table does not have a
stats.fm records track play count on an ongoing basis. If you import historical data, it displays play counts from the beginning of time.
Let's define an album's popularity by the number of saved album tracks.
To find popular albums,
- join each track to an album by its album ID,
- group tracks by album, and
- sort by frequency.
Here is a query that returns albums that have at least five tracks:
select album.name, count(*) as count from track join album using(album_id) group by album_id having count > 4 order by count desc, album.name;
Note that publishers can re-upload albums under a different ID. As far as I can see, this has two consequences:
- Multiple Count: users save the same track multiple times
- Split Count: users save different tracks across copies of an album
Neither of these phenomena is easy to notice while browsing Spotify.6 Unfortunately, they complicate our query and leave orphaned tracks in your library.
Group tracks by album name, not ID, to counteract Split Count:
select album.name, count(*) as count from track join album using(album_id) group by album.name having count > 4 order by count desc, album.name;
Deduplicate tracks by name to compensate for Multiple Count:
select album.name, count(distinct track.name) as count from track join album using(album_id) group by album.name having count > 4 order by count desc, album.name;
The last query outputs the table
|Back In Black||5|
|Brothers In Arms||5|
|The Last Stand||5|
when run on my library. (Note: I've removed soundtracks.)
Lover by Taylor Swift is my favorite album 😅. How embarrassing!
The query to find popular artists is similar to that used to find popular albums. This time, you must join three tables:
select artist.name, count(distinct track.name) as count from track join track_artist using(track_id) join artist using(artist_id) group by artist_id having count > 10 order by count desc, artist.name;
Artist profiles are almost immutable, so we don't have a Split Count problem. However, we have a Multiple Count problem. Address it in the same way as before -- deduplicate tracks by name.
This query outputs the table
when run on my library.
The results are unsurprising. Almost all featured artists enjoy -- or have enjoyed -- enormous commercial success. Sabaton is the only unusual result.
Let's define the popularity of an era by the number of library tracks published in that period.
You can find the number of tracks published per year by grouping tracks by publication year:
select release_year, count(distinct track.name) as count from track join album using(album_id) group by release_year order by count desc, release_year desc;
(2013 and 1984 were good years for music, at least as far as I am concerned.)
Counting library tracks by release decade is more interesting. You don't need a window function to do this. Notice that the release decade of a track is given by
release_year - release_year % 10, so the query
select release_year - release_year % 10 as release_decade, count(distinct track.name) as count from track join album using(album_id) group by release_decade order by count desc, release_decade desc;
partitions tracks by release decade.
when run on my library.7
This child of the 2000s is more interested in music from the 1980s.
No tracks, let alone albums, are tagged with genres. Suppose artists only perform music within their niche.8 In that case, you can safely label a track with the union of all its artists' genres.
The documentation for the saved tracks endpoint claims every track's
"artist" key contains artist genres. I found this was not true. You will need to call another endpoint to fetch artist genre information. If your library has artists, you will make API calls to determine your favorite genre.9
Introduce a new table to represent the many-to-many relationship between artists and genres:
create table artist_genre( artist_id text not null references artist(artist_id), genre text not null, primary key (artist_id, genre) );
Partition tracks by genres from this join table to figure out the most popular genres:
select genre, count(distinct track.name) as count from track join track_artist using(track_id) join artist_genre using(artist_id) group by genre order by count desc, genre;
Rock, mainly classic rock, dominates my library:
|new wave pop||206|
Competing streaming platforms like Apple Music recognized the marketing value of Spotify Wrapped and introduced similar campaigns. ↩
Granular rate limiting is only possible with API keys. ↩
Your favorite scripting language should be Python. Only joking 😛. ↩
You can find out whether Multiple Count affects you by comparing the output of
select count(*) from track;
select count(distinct track.name) from track;
0 is a sentinel value for the publication date of an orphaned album in my library. ↩
I mean, Rammstein is unlikely to collaborate with Taylor Swift. ↩