GitHub - SpotiScryers/SpotiScry: What makes a song reach the top of the charts while others flop? Using data from Spotify, our team will determine what features influence song popularity - such as the danceability or song length. We will then predict a song’s popularity.

About the Project
Goals | Background | The Data | Deliverables | Outline
Data Dictionary
Original Features | Engineered Features
Initial Thoughts & Hypotheses
Thoughts | Hypotheses
Project Steps
Acquire | Prepare | Explore | Model | Conclusions
How to Reproduce & More
Steps | Tools & Requirements | License | Creators

About the Project

What makes a song reach the top of the charts while others flop? Using data from Spotify, our team will determine what features influence song popularity - such as the danceability or song length. We will then predict a song’s popularity. You can check out our presentation here and the Spotify playlist our data comes from here.

Goals

Build a dataset of songs using Spotify's API
Identify the drivers of song popularity
Create a regression model that predicts a song's popularity with an RMSE lower than the baseline

Background

What makes a song popular? According to Splinter News here,

"making a 'good' number one song is not necessarily the same as making a 'good' song in general. It's not about artistry (though sometimes artistry does hit number one). It's about popularity. And not long-term popularity. But popularity right here, right now."

By analyzing Spotify's API data, we will determine ourselves what influences a song's popularity.

The Data

Our dataset came from a Spotify playlist personally curated by Kwame Taylor. It includes almost 6,000 songs in the hip-hop genre from the 80s to today. Browse our playlist at any time by scanning the QR code or just clicking the image below.

Deliverables

Video presentation
Presentation slides via Canva here
Tableau Storybook here
GitHub repository with analysis

Project Outline

The files within the repository are organized as follows. The /images and /sandbox contents are not necessary for reproduction.

Timeline

Project Planning: December 8th
Acquisition and Prep: December 10th
Exploration: December 14th
Modeling: December 15th
Finalize Minimum Viable Product (MVP): EOD December 15th
Improve/Iterate MVP: December 17th
Finalize Presentation: December 31st

Acknowledgments

Continuous data stratification by Danil Zherebtsov
Using Spotipy Library by Max Hilsdorf
The Most Successful Labels in Hip Hop: Every hip hop record label, since 1989, sorted by their artists' chart performance on Billboard, by Matt Daniels and Kevin Beacham
What Is “Escape Room” And Why Is It One Of My Top Genres On Spotify?: Using data to understand how genres understand us, by Cherie Hu
Tunebat
The Case For Lil Jon As One of Hip-Hop’s Greatest Producers by Erich Donaldson

Back to Table of Contents

Data Dictionary

Original Features

Below are the features included in the original data acquired from the Spotify API.

Feature	Description
artists	The artists who performed the track
album	The album in which the track appears
track_name	The name of the track
track_id	The Spotify ID for the track
danceability	A value of 0 - 1 that represents a combination of tempo, rhythm stability, beat strength, and overall regularity
energy	A value of 0 - 1 that represents a perceptual measure of intensity and activity. The faster, louder, noisier a track is the higher the energy
key	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation: 0 = C, 1 = C#, 2 = D, etc. If no key was detected, the value is -1.
loudness	The overall loudness of a track in decibels (dB). Values typically range between -60 and 0
mode	The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major = 1, Minor = 0
speechiness	A value of 0 - 1 that represents how exclusively speech-like the recording is. Values above .66 are made almost entirely of spoken words, .33 - .66 values may contain both music and speech, either in sections or layered. Values below .33 most likely represent music and other non-speech-like tracks.
instrumentalness	Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1, the greater the likelihood the track contains no vocal content. Values above .5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.
liveness	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	A measure from 0 - 1 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (happy, cheerful, euphoric), while tracks with low valence sound more negative (sad, depressed, angry).
tempo	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	The duration of the song in ms
time_signature	An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar.
release_date	The date the album was first released, if only year was given as precision it defaults to YYYY-01-01
popularity	Target variable, value between 0 - 100 that measures how many plays the track has gotten in relation to how recent those plays are.
explicit	Boolean variable for whether or not the track has explicit lyrics.

Engineered Features

Using domain knowledge and exploration insights, we also engineered features using the original data. These created features are below.

Feature Name	Description
duration_seconds/minutes	The track duration converted from milliseconds to seconds and minutes, rounded to integers
is_featured_artist	Boolean value if the track name includes 'feat', meaning an additional artist is on the track
decade	The decade the track was released in based on the release year, 80s - 90s - 2000s - 2010s - 2020s
top_ten_label	Boolean if the track is produced by a top record label (based on count of songs produced by the record and the average popularity)
popularity_bins	Binned values on popularity feature using domain knowledge: 0-10 as 'Very Low', 11-40 as 'Low', 41-70 as 'moderate', and 71-100 as 'High'
danceability_bins	Binned values on danceability feature using qcut to create three equal bins: 0-.69 as 'Low', .70-.80 as 'Medium', .81-1.0 as 'High'
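
Two of the engineered features in the table above can be sketched in plain Python. This is a minimal illustration under assumptions, not the repository's code: the function names are hypothetical, the bin labels are copied from the table, and the project itself most likely builds these columns on a dataframe.

```python
def is_featured_artist(track_name: str) -> bool:
    """Return True when the track name contains 'feat' (case-insensitive)."""
    return "feat" in track_name.lower()


def popularity_bin(popularity: int) -> str:
    """Map a 0-100 popularity score to the bins listed in the table."""
    if not 0 <= popularity <= 100:
        raise ValueError("popularity must be between 0 and 100")
    if popularity <= 10:
        return "Very Low"
    if popularity <= 40:
        return "Low"
    if popularity <= 70:
        return "moderate"
    return "High"
```

A plain substring check is simple but can over-match: a title such as "Defeat" also contains 'feat', so a stricter pattern may be worth considering.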

Back to Table of Contents

Initial Thoughts & Hypotheses

Thoughts

What are the drivers of popularity on Spotify?
Is there a seasonality to the popularity of tracks?
Are originals or remixes more popular?
Since 2020 has been the year of the pandemic, are more people listening to sad songs right now?
Are people's musical tastes expanding or experimenting due to the "new normal" of stay-at-home culture?
Does loudness have a relationship with popularity?
Does the instrumental-to-lyrical ratio of a track have an effect on its popularity?

Hypotheses

𝐻0: Mean of song popularity of explicit tracks = Mean of song popularity of non-explicit tracks
𝐻𝑎: Mean of song popularity of explicit tracks > Mean of song popularity of non-explicit tracks

𝐻0: Mean of popularity of major key songs ≤ Mean of popularity of minor key songs
𝐻𝑎: Mean of popularity of major key songs > Mean of popularity of minor key songs

𝐻0: Mean of popularity of time signature 4 ≤ Mean of popularity of all songs
𝐻𝑎: Mean of popularity of time signature 4 > Mean of popularity of all songs

𝐻0: There is no linear relationship between song length and popularity.
𝐻𝑎: There is a linear relationship between song length and popularity.

𝐻0: There is no linear relationship between liveness and popularity.
𝐻𝑎: There is a linear relationship between liveness and popularity.

𝐻0: There is no difference in popularity between tracks released by the top 10 labels or not.
𝐻𝑎: Tracks released by the top 10 labels are more likely to be popular.

𝐻0: There is no difference in popularity between tracks released by the worst 5 labels or not.
𝐻𝑎: Tracks released by the worst 5 labels are more likely to be unpopular.

𝐻0: There is no difference between the popularity of songs released in 2020 and the overall average.
𝐻𝑎: There is a difference between the popularity of songs released in 2020 and the overall average.
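
The hypotheses above do not state which statistical test was used for each pair. As one possible approach to the one-sided mean comparisons (for example, explicit vs. non-explicit tracks), here is a sketch of a permutation test using only the standard library. It is a stand-in technique chosen for illustration, not necessarily what the team ran. The function name, the sample scores, and the 0.05 significance level are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=42):
    """Estimate a one-sided p-value for H_a: mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled scores and re-split them into two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up popularity scores, for illustration only.
explicit = [60, 62, 65, 70, 68, 64]
non_explicit = [30, 35, 32, 38, 31, 36]

alpha = 0.05  # assumed significance level
p_value = permutation_p_value(explicit, non_explicit)
reject_null = p_value < alpha
```

If `reject_null` is true, the data favor the alternative hypothesis at the chosen significance level. The linear-relationship hypotheses would need a correlation test instead.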

Back to Table of Contents

Project Steps

Acquire

Data was acquired from the Spotify API using the spotipy library. At https://developer.spotify.com/dashboard/login we created a Spotify web app, which gave us a client ID and client secret. With these, the create_spotipy_client function creates a spotipy client that can access the API.
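
For readers new to spotipy, creating a client with a client ID and secret might look like the sketch below. It uses spotipy's Client Credentials flow. The function name make_spotipy_client is hypothetical and is not the repository's create_spotipy_client, whose signature is not shown here.

```python
def make_spotipy_client(client_id: str, client_secret: str):
    """Build a spotipy client using the Client Credentials flow."""
    # Imported inside the function so this sketch can be loaded
    # before spotipy is installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth_manager = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    return spotipy.Spotify(auth_manager=auth_manager)
```

The Client Credentials flow only grants access to public data, which should be enough for reading a public playlist and the audio features of its tracks.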

The dataframe is saved as a CSV file and has around 5,900 observations. Alternatively, the acquire.py file contains a function for grabbing the entire capstone playlist, as well as a function for acquiring any additional playlists you choose. There are 24 columns in the original dataframe, ranging from track and album metadata to audio features for each track. There are very few nulls, and they are marked as null in the data acquisition function for ease of removal later in prepare.

Prepare

Functions to prepare the dataframe are stored in two separate files depending on their purpose, prepare.py and preprocessing.py:

prepare.py: Functions for cleaning and ordering data

release dates that only specify the year are set to '01-01' for month and day
nulls are dropped
set track id to index
change dtypes to correct type
fix tempos
- From Kwame: "As a hip-hop artist and producer, I know firsthand how BPM (beats per minute, aka the tempo of a song) can often be miscalculated as twice their actual value. This is because most song tempos fall in-between 90 and 160 BPM, and a computer can wrongly detect tempo as double-time in slower tempos below 90. There are some genres that have faster BPM, such as 170 to 190 for Drum ’n’ Bass, however, in Hip-Hop I’ve found that the BPM is wrongly miscalculated in this way when it’s shown as 170 and above. Therefore, in our data, I chose to halve the tempos of all tracks with 170 BPM or greater for a more accurate look at tempo."
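
Two of the cleaning steps above, padding year-only release dates and halving tempos of 170 BPM or greater, can be sketched as small functions. This is a minimal illustration, not the code in prepare.py. The function names and the `threshold` parameter are hypothetical.

```python
def normalize_release_date(release_date: str) -> str:
    """Pad a year-only date ('1994') to 'YYYY-01-01'; leave other dates alone."""
    if len(release_date) == 4 and release_date.isdigit():
        return f"{release_date}-01-01"
    return release_date


def fix_tempo(tempo: float, threshold: float = 170.0) -> float:
    """Halve tempos at or above the threshold, treating them as double-time."""
    return tempo / 2 if tempo >= threshold else tempo
```

For example, `fix_tempo(180.0)` returns `90.0`, while a tempo of 95 BPM is returned unchanged.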

preprocessing.py: Functions for adding features we found interesting / modifying data for ease of use in exploration

convert track length from ms to seconds & minutes
lowercase artist, album, and track name
create column for year, month, and day for release date
bin release year by decade
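
The duration and release-date steps above could look roughly like this in plain Python. It is a sketch under assumptions: the function names are hypothetical, dates are assumed to already be in 'YYYY-MM-DD' form, and the decade labels follow the format given in the data dictionary.

```python
from datetime import date


def duration_columns(duration_ms: int) -> tuple[int, int]:
    """Return (seconds, minutes), each rounded to the nearest integer."""
    return round(duration_ms / 1000), round(duration_ms / 60000)


def release_parts(release_date: str) -> dict:
    """Split a 'YYYY-MM-DD' date into year, month, day, and a decade label."""
    parsed = date.fromisoformat(release_date)
    decade_start = parsed.year // 10 * 10
    # Label format follows the data dictionary: 80s, 90s, 2000s, 2010s, 2020s.
    if decade_start < 2000:
        label = f"{decade_start % 100}s"
    else:
        label = f"{decade_start}s"
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "decade": label,
    }
```

Lowercasing the artist, album, and track names is a one-line `str.lower()` call and is omitted here.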

Explore

During exploration we looked at these features:

if a track is explicit
liveness
song length
time signature
key
loudness
original vs remix
instrumentalness
danceability

Model

First, we made a baseline model against which to compare our models' performance. The baseline predicts the average popularity of a track in our train split, which came out to a popularity of 38. The baseline model had an RMSE of 22.9 on the train split. We then created various regression models and fit them to the train data.
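
As a small illustration of how a mean baseline and its RMSE are computed, consider the sketch below. The popularity values are made up so that their mean happens to be 38. They are not taken from the project's train split, and the `rmse` helper is a hypothetical name.

```python
from math import sqrt
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(mean(squared_errors))


# Made-up popularity scores, for illustration only.
train_popularity = [20, 35, 38, 50, 47]

# The baseline predicts the same value, the train mean, for every track.
baseline_prediction = mean(train_popularity)
baseline_rmse = rmse(
    train_popularity, [baseline_prediction] * len(train_popularity)
)
```

A model is only useful if its RMSE comes in below `baseline_rmse`, since the baseline ignores every feature of the track.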

Feature Groups
We used three feature groups.

Select K best: selects features according to the k highest scores (top 5)
Recursive Feature Elimination: features that perform best on a simple linear regression model (top 5)
Combination (unique features from both groups, 7 features)
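
The three feature groups above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's modeling code. The feature names are placeholders, and `f_regression` as the Select K Best scoring function is an assumption, since the scoring function is not stated above.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: 200 rows, 8 placeholder features.
rng = np.random.default_rng(0)
feature_names = np.array([f"feature_{i}" for i in range(8)])
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

k = 5

# Group 1: the k features with the highest univariate scores.
kbest = SelectKBest(score_func=f_regression, k=k).fit(X, y)
kbest_features = set(feature_names[kbest.get_support()])

# Group 2: recursively drop the weakest feature of a linear model.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
rfe_features = set(feature_names[rfe.get_support()])

# Group 3: the unique features from both groups.
combined_features = sorted(kbest_features | rfe_features)
```

Because the two methods usually agree on the strongest features, the combined group tends to be only slightly larger than either group on its own.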

Models Evaluated

OLS Linear Regression
LASSO + LARS
Polynomial Squared + Linear Regression
Support Vector Regression using RBF Kernel
General Linear Model with Normal Distribution

Evaluation Metric
Models are evaluated by calculating the root mean squared error (RMSE), which summarizes the residuals between the predicted values and the actual observations. The smaller the RMSE, the better the model performed. A visual of this error is below.
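
For reference, this is the standard definition of RMSE, where $y_i$ is the actual popularity of track $i$, $\hat{y}_i$ is the predicted popularity, and $n$ is the number of tracks in the split. The symbols are introduced here for illustration.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.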

Final Model:
Polynomial Squared + Linear Regression was our final model. It is the model we evaluated on the test split, where it predicted 6% better than the baseline.

Model	Train RMSE	Validate RMSE	Test RMSE
Polynomial 2nd Degree	21.599581	21.5257	21.5236
OLS Linear Regression	21.796331	21.7566
Support Vector Regression	21.812662	21.6988
General Linear Model - Normal	21.821093
Baseline - Average	22.897138
LASSO + LARS	22.897138
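
The 6% figure can be checked against the table, assuming it compares the final model's test RMSE with the baseline RMSE:

$$
\frac{22.897138 - 21.5236}{22.897138} \approx 0.060 = 6.0\%
$$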

How It Works:
Polynomial Regression: a combination of the polynomial features algorithm and simple linear regression. Polynomial features creates new variables from the existing input variables. Using a degree of 2, the algorithm squares each feature, multiplies each pair of features together, and uses the results as new features. The degree parameter sets the highest power used to create new features. For example, if a degree of 3 is used, each feature would be cubed, squared, and combined with each other feature. Finally, a linear regression model is fit on the expanded features, which yields a curved line of best fit whose shape depends on the degree. An example of determining best fit is below.
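
To make the feature expansion concrete, here is a plain-Python sketch of what a degree-2 (or higher) polynomial expansion produces for a single row. It is an illustration only and the function name is hypothetical. In practice a library transformer such as scikit-learn's PolynomialFeatures does this step, and by default it also adds a constant bias column that this sketch leaves out.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Expand one row into all monomials of total degree 1 through `degree`."""
    expanded = []
    for d in range(1, degree + 1):
        # Each combination of d column indices is one product of features.
        for combo in combinations_with_replacement(range(len(row)), d):
            expanded.append(prod(row[i] for i in combo))
    return expanded


# For a row [a, b] = [2, 3], degree 2 yields [a, b, a*a, a*b, b*b].
example = polynomial_features([2, 3])  # [2, 3, 4, 6, 9]
```

The number of generated features grows quickly: seven input features expand to 35 columns at degree 2, which is one reason to select a small feature group before applying the expansion.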

Conclusions

Key drivers for popularity include danceability with speechiness, whether a track is explicit, energy, track number, and whether a track has featured artists or not. The best performing model was our 2nd Degree Polynomial Regression model with an RMSE of 21.5236 on the testing dataset. The most popular songs were about 2 minutes long.

Back to Table of Contents

How to Reproduce

Steps

~~Read through the README.md file~~ ✅
Download acquire.py, prepare.py, preprocessing.py, and data folder.
If you don't have spotipy installed, run this in your terminal: pip install spotipy
Login/Sign up at https://developer.spotify.com/dashboard/login to create a Spotify webapp that'll give you your client id and client secret.
Create an env.py file in your working directory and save this code after swapping in your individual client ID and secret:

cid = "YOURCLIENTID"
c_secret = "YOURCLIENTSECRET"

Using the functions in acquire.py, create a spotipy client.
Use the functions in prepare.py and preprocessing.py to clean and set up your data.
Enjoy exploring the data!

Tools & Requirements

License

Creators

Brandon Martinez, Bethany Thompson, Kwame V. Taylor, Matthew Mays
Back to Table of Contents
