As the Internet of Things (IoT) landscape continues to grow, practically everyone and everything, it seems, is compiling data or at least generating statistics for any variable imaginable. And music is no exception. The music streaming service, Spotify, stores an array of features for each song in its library. They record characteristics such as acousticness, danceability, energy and more. With these variables in mind, I conducted some exploratory analysis along with a couple clustering methods on Kanye West’s Spotify discography. The following report utilizes Spotify’s Web API through Charlie Thompson’s spotifyr
package which you can check out here.
Exploratory Data Analysis
Load Packages
The following code and plots make use of these packages:
library(spotifyr)
library(tidyverse)
library(knitr)
library(kableExtra)
library(ggridges)
library(plotly)
library(scales)
library(ggfortify)
library(ggdendro)
library(dendextend)
Import Data
First let’s import the audio features for Kanye West and take a quick look at the data.
kanye <- get_artist_audio_features(artist = "kanye west")
After viewing the data, I noticed a few tracks (rows) are duplicated since some albums contain edited, clean, and/or live versions. Those rows will be removed in addition to some irrelevant columns. Also let’s be sure we don’t have any missing values.
kanye2 <- kanye %>%
filter(!(album_name %in% c("808s & Heartbreak (Softpak)", "Late Orchestration",
"The College Dropout (Edited)",
"Graduation (Alternative Business Partners)"))) %>%
select(-c(artist_uri, album_uri, album_type, is_collaboration, track_uri,
track_preview_url, album_release_year, artist_name, album_img,
album_release_date, track_open_spotify_url, track_number,
disc_number, key, mode, key_mode, album_popularity, time_signature))
sum(is.na(kanye2))
## [1] 0
The dataset we will be working with now has 125 rows and 13 columns with no missing observations!
###Danceability Spotify defines danceability as
“How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.”
Before determining which album Spotify deems the most danceable, let’s take a look at how danceability is distributed on each album.
The dashed vertical line in the middle represents the midpoint on the danceability scale. Also, the smaller tick marks at the bottom of each shape represent one song on an album. Each album appears to be more danceable than not, but let’s weight each song by its duration to get a better picture of each album in its entirety. The following graph outlines the results.
kanye_dance <- kanye2 %>%
mutate(total_dance = danceability*duration_ms) %>%
group_by(album_name) %>%
summarise(avg_danceability = sum(total_dance)/length(album_name)/10000)
Overall, My Beautiful Dark Twisted Fantasy and 808s & Heartbreak represent Kanye’s most danceable albums. The following table lists his top 10 most danceable songs.
kanye_dance_songs <- kanye2 %>%
arrange(desc(danceability)) %>%
select(album_name, track_name, danceability)
kable(head(kanye_dance_songs, 10)) %>%
kable_styling(full_width = F)
album_name | track_name | danceability |
---|---|---|
ye | All Mine | 0.925 |
Late Registration | Gone | 0.851 |
KIDS SEE GHOSTS | Kids See Ghosts | 0.841 |
The Life Of Pablo | Feedback | 0.837 |
The Life Of Pablo | 30 Hours | 0.822 |
808s & Heartbreak | Paranoid | 0.812 |
808s & Heartbreak | Heartless | 0.789 |
Yeezus | Black Skinhead | 0.775 |
The Life Of Pablo | Facts (Charlie Heat Version) | 0.769 |
KIDS SEE GHOSTS | 4th Dimension | 0.765 |
Valence
Spotify defines Valence as
“A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)”"
Again, let’s look at the overall distribution for each album, but for valence this time.
Perhaps a bit more polarizing than the danceability metric, most albums seem to have their fair share of both positive and negative sounding songs. Now weighting valence by song duration, let’s discover Kanye’s happiest (and saddest) album.
kanye_val <- kanye2 %>%
mutate(total_valence = valence*duration_ms) %>%
group_by(album_name) %>%
summarise(avg_valence = sum(total_valence/length(album_name)/10000))
With a valence score of 14.68, the happiest Kanye release came way back in 2004 with his debut hit The College Dropout. Interestingly, the four albums with the lowest valence score also make up his most recent work. The tracks with the lowest valence scores are shown below.
kanye_valence_songs <- kanye2 %>%
select(album_name, track_name, valence) %>%
arrange(valence)
kable(head(kanye_valence_songs,10)) %>%
kable_styling(full_width = F)
album_name | track_name | valence |
---|---|---|
The Life Of Pablo | Frank’s Track | 0.0000 |
Yeezus | Hold My Liquor | 0.0399 |
ye | Violent Crimes | 0.0400 |
The Life Of Pablo | Waves | 0.0565 |
808s & Heartbreak | Welcome To Heartbreak | 0.0734 |
Graduation | Can’t Tell Me Nothing | 0.0963 |
My Beautiful Dark Twisted Fantasy | Monster | 0.0964 |
Graduation | I Wonder | 0.1060 |
My Beautiful Dark Twisted Fantasy | Runaway | 0.1090 |
The Life Of Pablo | Wolves | 0.1180 |
Other Variables
Interested in exploring the rest of the variables? Choose which characteristics to plot and select which albums to compare using the interactive graph below!
Clustering
Hierarchical Clustering
Now on to some clustering methods. Let’s determine which albums sound the most alike using hierarchical clustering. A tree with a height of 4 and complete linkage is shown below.
kanye3 <- kanye2 %>%
select(-c(track_name, track_popularity)) %>%
group_by(album_name) %>%
summarise(dance = sum(danceability*duration_ms)/length(album_name),
energy = sum(energy*duration_ms)/length(album_name),
loudness = sum(loudness*duration_ms)/length(album_name),
speechiness = sum(speechiness*duration_ms)/length(album_name),
acousticness = sum(acousticness*duration_ms)/length(album_name),
instrumentalness = sum(instrumentalness*duration_ms)/length(album_name),
liveness = sum(liveness*duration_ms)/length(album_name),
valence = sum(valence*duration_ms)/length(album_name),
tempo = sum(tempo*duration_ms)/length(album_name)) %>%
remove_rownames() %>%
column_to_rownames("album_name")
kanye.hc <- hclust(dist(scale(kanye3)), method = "complete")
kanye.tree <- dendro_data(kanye.hc, type = "rectangle")
kanye.hc.4 <- cutree(kanye.hc, k = 4)
Kanye’s first three releases belong to one cluster while his last four belong to another. The middle releases, 808s & Heartbreak and Twisted Fantasy, are not only his most danceable records but his most unique sounding as well; they each populate their own cluster.
Principle Component Analysis
Now let’s use those variables and see if we can discover if his more popular songs score similarly for each variable. Using a PCA to reduce dimensionality and account for correlation we can try to reveal any patterns.
kanye_pca <- prcomp(kanye2[,3:11], center = T, scale = T)
summary(kanye_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5958 1.2793 1.0779 0.9988 0.9512 0.84785 0.69853
## Proportion of Variance 0.2829 0.1819 0.1291 0.1108 0.1005 0.07987 0.05422
## Cumulative Proportion 0.2829 0.4648 0.5939 0.7047 0.8053 0.88513 0.93934
## PC8 PC9
## Standard deviation 0.62701 0.39085
## Proportion of Variance 0.04368 0.01697
## Cumulative Proportion 0.98303 1.00000
Using two principle components we can describe a little over 46% of the variability in the data (shown in the plot below). To explain at least 90% of the variability, we need to use 7 principle components. The following graph shows the top two principle components with the top quartile of his most popular songs mapped to one color and the rest to another.
The most popular songs don’t appear to belong to any specific area or cluster on the graph, but it does look like Energy
and Loudness
are correlated. Try graphing them using the interactive plot above!