Data Mining: Smule App Data🎤

Background
Smule is a social singing app that I use to sing with other people around the world. I am quite active on it for various reasons, but I wouldn't go so far as to say that I am Retsuko in real life. My bestie does like to think that way though 😜

Motivation

After singing together with the same people for a few years, I couldn't help thinking, 'How many songs have we sung together in total? What have we sung so far?' Unfortunately, Smule (just like any other social media) does not really give you a year-end report the way your credit card gives you a usage statement😂

smule_page

Getting the songs listed on a Smule page is quite an endless process. You have to scroll through the page. As you scroll down, the page loads more content until you hit the first song you ever sang. Imagine there are thousands of songs on your page and the page shows only 25 at a time...and your internet suddenly crashes☠ You get the idea, right?

Now let's say you were really passionate about finding the list of all the songs and scrolled through those 100s of pages (FYI: Even I don't do that!). You would probably still have to copy and paste each item on the page into some sort of spreadsheet, field by field and one by one. You can pay an intern to do this for months. Or just let the code do it for you🤓

Yes, I am very obsessed with getting this data! What could be better than using my own singing data to demonstrate my coding skills AND my personal interests!?

Tools I used
  • Python: Libraries used include json, pandas and requests
  • PostgreSQL: Database of my choice
  • SQL: Running queries to get info for summary report
  • Jupyter notebook: Where I ran all my python code

Goal

I just want a list of all the songs I have sung, and I want to know what my singing activity has looked like since I started using this app in 2016 ( •̀ ω •́ )✧ That's it! No elaborate hopes and dreams! ...Though, for the purpose of this project, I wrote up a summary report to get some insight!

Blob of Info
json_blob
Nice Table
smule-song-table

I want to be respectful of my singing buddies' privacy, so I will redact everything that can be traced back to them.

Data Fishing

Trying to get my hands on the data was actually the hardest part of this project. Here's my journey:

  • Any API request? Nope... Okay, fine.
  • No year-end report in any format?
  • Web scraping with BeautifulSoup (Python): ...the page can't be soupified...
  • Alright I'm gonna stop wasting time lamenting. Moving on.. MORE GOOGLE!

From googling around, I finally got my hands on the URL where I can get the data I need in JSON format. Bless those beautiful souls for sharing the URL on Stack Overflow and Sing Salon.

STARTER URL ACQUIRED

https://www.smule.com/s/profile/performance/<username>/sing?offset=<#>
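Here is a minimal sketch of how the offset parameter can be used to walk through every page. It is not the notebook's exact code: the names `fetch_page` and `fetch_all_performances` are made up for illustration, and the `"list"` key holding each page's records is an assumption to check against the JSON you actually get back. It uses the standard library's urllib instead of requests so it runs without extra installs.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://www.smule.com/s/profile/performance/{username}/sing?offset={offset}"


def fetch_page(username, offset):
    """Download one page of JSON (this performs a real HTTP request)."""
    with urlopen(BASE_URL.format(username=username, offset=offset)) as response:
        return json.load(response)


def fetch_all_performances(username, fetch_page=fetch_page, max_pages=1000):
    """Collect records page by page until an empty page comes back.

    Assumption: each page is a dict whose "list" key holds the records.
    """
    records = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(username, offset)
        batch = page.get("list", [])
        if not batch:
            break
        records.extend(batch)
        offset += len(batch)  # the next page starts where this one ended
    return records
```

Passing `fetch_page` in as an argument makes it easy to try the loop with a fake function first, without sending any requests to Smule.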

Parsing-out JSON Object

Let the cherry-picking begin!

json_request
For my own purposes, I am only interested in the following fields:
  • "title"
  • "artist" (looks like the song uploader didn't specify artist for this one)
  • "created_at"
  • "web_url"
  • "performed_by" (not shown here)
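The cherry-picking itself can be sketched like this, assuming each record is a flat dictionary with those five keys at the top level (the real JSON may nest some of them). `pick_fields` is a made-up helper name and the sample record is invented for the demo.

```python
FIELDS = ("title", "artist", "created_at", "web_url", "performed_by")


def pick_fields(record, fields=FIELDS):
    """Keep only the wanted keys; a missing key becomes None."""
    return {field: record.get(field) for field in fields}


# Invented sample record, only for illustration.
sample = {
    "title": "Example Song",
    "artist": "",
    "created_at": "2020-08-16T20:54:36-07:00",
    "web_url": "/recording/example-song/KEY/",
    "performed_by": "-username-",
    "message": "a field I don't need",
}
picked = pick_fields(sample)
print(sorted(picked))
```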

I actually created a project on my GitHub that is dedicated to extracting data [and transforming it too]. The code I share there tells you how many songs two people have sung together. For my own purposes, I extracted data from my account only.

Multiple Issues...

Unfortunately, the code that I shared for Data Extraction does not always work on the first go. One day I was able to run the whole notebook in one go; another day the 2nd loop threw an error. Also, each time I ran the code, I had to wait at least 30 minutes before I could run the same request code again, and I could not access the Smule page for a similar amount of time (I got a 418 error code). To this day, I still haven't figured out how to get around the rate limit on requests for the JSON data. I added try-except error handling, which seemed to work at first, but later it gave me the same result as the code without try-except. I tried adding time.sleep(2) and that didn't do anything either.
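For anyone curious, the try-except plus time.sleep idea looks roughly like the sketch below. `fetch_with_retries` and its parameters are made-up names, and doubling the wait after each failure is an extra twist that was not part of the original attempt. There is no claim that this gets past Smule's limit; waiting like this did not help in my case.

```python
import time


def fetch_with_retries(fetch, retries=3, wait_seconds=2, sleep=time.sleep):
    """Call fetch(); on failure, wait and try again, doubling the wait each time."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for this sketch
            if attempt == retries - 1:
                raise  # out of attempts, let the error surface
            print(f"Attempt {attempt + 1} failed ({error}); waiting {wait_seconds}s")
            sleep(wait_seconds)
            wait_seconds *= 2
```

The `sleep` argument exists only so the waiting can be swapped out when trying the function without real delays.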

As a last resort, instead of making two requests (2 usernames, 2 URLs) in the same notebook, I made only 1 request for just 1 username. I exported the output to csv, waited 30 minutes before restarting and rerunning the code with the 2nd username, and exported that 2nd csv. Then, I appended the two dataframes and reset the index (needed for the SQL database) before exporting that final dataframe to csv to be loaded. A whole set of Jupyter notebooks was created for this alternative method. Yeah, I know. I AM obsessed with getting this data.
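The append-and-reset-index step can be sketched with pandas as below. The helper name `combine_exports`, the column name, and the file names in the comments are placeholders, not the ones from my notebooks.

```python
import pandas as pd


def combine_exports(first, second):
    """Stack two dataframes and renumber the rows from 0."""
    # ignore_index=True appends and resets the index in one step.
    return pd.concat([first, second], ignore_index=True)


# With real exports this would start from something like
# pd.read_csv("user1.csv") and pd.read_csv("user2.csv") (placeholder names).
first = pd.DataFrame({"title": ["Song A", "Song B"]})
second = pd.DataFrame({"title": ["Song C"]})

combined = combine_exports(first, second)
print(combined)
# combined.to_csv("combined.csv", index=False) would write the final file.
```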

Cleaning Up

Most of the data returned is actually pretty clean, but I want the data to get spat out in a certain way. I tweaked the output of two fields: date ("created_at") and web URL ("web_url").

Instead of having the date and timestamp (stalkable parameter right there 😱), I only want the date returned:

  • "2020-08-16T20:54:36-07:00" → "2020-08-16"

The web URL returned is only part of the url. I want a full URL:

  • "/recording/asca-koe-tv-size/🤐🤐🤐/" → "https://www.smule.com/recording/asca-koe-tv-size/🤐🤐🤐/"
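Both tweaks fit in a couple of small functions. This is a sketch using only the standard library; `date_only` and `full_url` are made-up names, and `KEY` stands in for the redacted part of the URL.

```python
from datetime import datetime
from urllib.parse import urljoin

SMULE_ROOT = "https://www.smule.com"


def date_only(created_at):
    """Turn an ISO timestamp with a UTC offset into just its date."""
    return datetime.fromisoformat(created_at).date().isoformat()


def full_url(web_url):
    """Prefix the site root to the partial URL from the JSON."""
    return urljoin(SMULE_ROOT, web_url)


print(date_only("2020-08-16T20:54:36-07:00"))  # 2020-08-16
print(full_url("/recording/asca-koe-tv-size/KEY/"))
```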

Extra Clean-up

I did an extra clean-up step to replace the real date with a fake one and to replace the actual username with just '-username-' for demo purposes.

Creating Dataframe🐼

Once all the data is in the format I want, it's time for pandas to shine!

transform_table

Write to CSV

I ran the code to write the dataframe to csv. Then, I did some SQL-querying to present the data and talk about it.
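To give a flavor of the SQL-querying, here is a small sketch that counts songs per year and unique titles. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the table name, column names, and rows are invented for the demo.

```python
import sqlite3

# Invented sample rows: (title, created_at).
rows = [
    ("Song A", "2016-09-10"),
    ("Song B", "2017-01-05"),
    ("Song A", "2017-01-20"),
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE recordings (title TEXT, created_at TEXT)")
connection.executemany("INSERT INTO recordings VALUES (?, ?)", rows)

# Number of songs per year, taken from the first 4 characters of the date.
songs_per_year = connection.execute(
    """
    SELECT substr(created_at, 1, 4) AS year, COUNT(*) AS number_of_songs
    FROM recordings
    GROUP BY year
    ORDER BY year
    """
).fetchall()

# Number of distinct song titles.
unique_titles = connection.execute(
    "SELECT COUNT(DISTINCT title) FROM recordings"
).fetchone()[0]

print(songs_per_year)  # [('2016', 1), ('2017', 2)]
print(unique_titles)   # 2
```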

NICE TABLE ACQUIRED

Results

From the date I started using the app (September 2016) until the date the data was extracted (September 2020), I made a total of 1858 recordings.

Number of Songs: 1858
Number of Song Titles: 1071
Number of Invitation Spawners: 185

Collaboration type    Number of Songs
Inviting              869
Joining               989

Invitation Spawner    Number of Songs
Buddy1                110
Buddy2                56
Buddy3                54
Buddy4                44
Buddy5                39
Buddy6                36
...                   ...
StrangerY             1
StrangerZ             1

There are 869 recordings I created, either solos or duets for others to join. I have joined others on 989 songs.

Out of those 989 songs, I have sung with 185 different Smule users. I sang as many as 110 songs with 1 person and as few as 1 song with others.

Now, let's just look at my own singing activity for these past years:

Year    Number of Songs
2016    106
2017    746
2018    453
2019    327
2020    226

Yearly

  • The most singing happened in 2017.
  • Though there are only 106 songs in 2016, that covers just 4 months (the start date is September).
  • The amount of singing declined in 2018 by almost 40% compared to 2017.
    • What's the Reason? Career progression.
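
The yearly numbers above can be double-checked with a few lines of arithmetic. The counts come straight from the yearly table; `percent_drop` is just a made-up helper name.

```python
songs_per_year = {2016: 106, 2017: 746, 2018: 453, 2019: 327, 2020: 226}


def percent_drop(before, after):
    """Percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100


drop_2018 = percent_drop(songs_per_year[2017], songs_per_year[2018])
monthly_rate_2016 = songs_per_year[2016] / 4  # September to December

print(f"{drop_2018:.1f}%")           # 39.3%
print(monthly_rate_2016)             # 26.5
print(sum(songs_per_year.values()))  # 1858
```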

Month    Number of Songs
1        203
2        182
3        152
4        148
5        173
6        186
7        159
8        139
9        116
10       143
11       139
12       118

Monthly

  • The frequency of singing fluctuates throughout the year, with most of the singing happening in January.
    • Minnesota's winter is no joke!
  • February follows a similar trend, though a little lower, as a wave of short-lived spring invites my presence outside. After that, there's less singing as it gets warmer outside.

Let's look at some of the songs that I sang most frequently next.

Looking at the resulting table, there are 1071 unique titles among the recordings. I say 'unique titles' because if the same song has different titles within the app (depending on how the song was uploaded and whether it's a piano version, a guitar version, etc.), each title is counted separately.
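A tiny sketch of what counting by title means in practice, using Python's Counter. The titles in the list are invented for the demo ("Magnet (Piano ver.)" is a made-up variant) to show that a differently titled upload of the same song is tallied on its own.

```python
from collections import Counter

# Invented sample of recording titles.
titles = ["Magnet", "Magnet", "Magnet (Piano ver.)", "HEAVEN", "Magnet"]

title_counts = Counter(titles)

print(len(title_counts))            # 3 unique titles
print(title_counts.most_common(1))  # [('Magnet', 3)]
print(title_counts["Magnet (Piano ver.)"])  # 1, counted separately
```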

Top 15 Most Frequent Songs
サリシノハラ is pretty easy to sing and very popular, so I can see why I sang it so much. 心做し is also popular, though only the short version. Its notes are pretty high for singing it casually and often. My favorites from this list are actually just HEAVEN and しわ. Nah, those URLs don't lead to my recordings🤭

Song Title                          Frequency
サリシノハラ                        14
心做し【Short.】                    14
独りんぼエンヴィー                  9
HEAVEN                              9
Tokyo Teddy Bear                    8
Magnet                              8
しわ -Romaji-                       8
アイのシナリオ (TV Size)            8
Zoetrope                            8
夜もすがら君想ふ / Romaji           8
背徳の記憶 ~The Lost Memory~        7
only my railgun                     7
小夜子 [Original]                   7
WAVE                                7
Acute                               7

Note: I edited some of the song titles to be more readable.

Insight
retsuko_death_metal

What are some of the inferences we could make from this?

  • I sing a lot.
  • At least from the data, we can say I know how to sing over 1000 songs.
  • I tend to sing with certain people more than others (Buddy vs Stranger).
  • Yes, I am quite an otaku, based on my top 15 most frequent songs.
  • There is not enough data to indicate whether I am a good or a bad singer 😐

Want to see some actual dashboards out of this data? Go to the '2.0' project, Data Visualization: Smule App Data.