Collecting Data:

We found the initial raw data at https://data.world/: a CSV file of more than 20,000 records containing the metadata of Top-100 Billboard songs from the last 50 years.

We used the metadata to retrieve the lyrics of each song from genius.com using the LyricsGenius API, and then filtered out the non-English songs using the Google Translate API.

Further processing was done to clean the data:

Inverted commas and parentheses were removed.
Numbers were converted to a special character (which is replaced by a random number at generation).
All characters were lowercased.
Words composed of special or non-English characters were removed.
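
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the placeholder `NUM_TOKEN`, the set of characters counted as "English", the choice to keep apostrophes inside contractions, and the 0-99 range used when restoring numbers are all assumptions.

```python
import random
import re

NUM_TOKEN = "#"  # hypothetical placeholder standing in for any number

# Assumption: a word is kept if it uses only a-z, apostrophes, the
# placeholder, and basic punctuation.
WORD_RE = re.compile(r"[a-z'.,!?;:\-" + re.escape(NUM_TOKEN) + r"]+")


def clean_lyrics(text):
    """Apply the four cleaning steps while preserving line breaks."""
    text = text.lower()                       # lowercase everything
    text = re.sub(r'["“”()]', "", text)       # drop quotation marks and parentheses
    text = re.sub(r"\d+", NUM_TOKEN, text)    # each run of digits -> placeholder
    cleaned = []
    for line in text.split("\n"):
        # keep only words made entirely of allowed characters
        words = [w for w in line.split() if WORD_RE.fullmatch(w)]
        cleaned.append(" ".join(words))
    return "\n".join(cleaned)


def restore_numbers(text, rng=random):
    """At generation time, swap each placeholder for a random number."""
    return re.sub(re.escape(NUM_TOKEN), lambda m: str(rng.randint(0, 99)), text)
```

For example, `clean_lyrics('Hello (World) "99" problems')` yields `hello world # problems`, and `restore_numbers` later turns the `#` back into some number.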

You can find the final file containing the above data in our git repository, and we invite you to do further research on this data and metadata.

Models' Learning Process Differences:

The two models differ in their learning process, so to make the conditions as similar as possible, we made several decisions:

1) Punctuation marks and '\n' as words for the Trigram-Model:
The Char-Model studies the data character by character, so it can easily recognize the '\n' character and punctuation marks, unlike the Trigram-Model.

Initially, the input for the Trigram-Model was a list of words for each song, which resulted in songs consisting of one long line, regardless of structure.

To correct this, we separated the '\n' character from the adjacent words and treated it as a word of its own.

Similar decisions were taken for punctuation marks.
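
One way to picture this decision is a tokenizer that emits '\n' and each punctuation mark as separate tokens. The sketch below is only an assumption about how such a tokenizer could look; the regular expression and the name `tokenize` are ours, not taken from the project.

```python
import re

# Three alternatives, tried in order:
#   a newline, a run of word characters, or a single punctuation mark.
# Spaces and other whitespace match nothing and are therefore skipped.
TOKEN_RE = re.compile(r"\n|[a-z0-9'#]+|[^\sa-z0-9'#]")


def tokenize(lyrics):
    """Split cleaned, lowercased lyrics into word-level tokens."""
    return TOKEN_RE.findall(lyrics)


tokens = tokenize("hello, world\nbye!")
# tokens == ['hello', ',', 'world', '\n', 'bye', '!']
```

Because '\n' is now an ordinary token, a word-level model can learn where lines tend to break instead of producing one long line.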

2) The history of the Char-Model considers only the last 11 characters, since this is approximately the length of two words, the history used in the Trigram-Model.

3) In both models, padding was added at the beginning and at the end of each song, since we wanted the models to recognize and study the words or characters that appear in those places.

In this way, the models know when to stop generating characters or words.
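
Points 2 and 3 can be illustrated with a single counting routine shared by both models: a history of 2 tokens for the Trigram-Model and 11 characters for the Char-Model, with padding around every song. This is a sketch under assumptions; the padding symbols `START` and `END` and the function names are hypothetical, and the project may store its counts differently.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical padding symbols


def padded(symbols, order):
    """Pad a song with `order` start symbols and one end symbol."""
    return [START] * order + list(symbols) + [END]


def count_ngrams(songs, order):
    """Map each history of length `order` to counts of the next symbol."""
    counts = defaultdict(Counter)
    for song in songs:
        seq = padded(song, order)
        for i in range(order, len(seq)):
            history = tuple(seq[i - order:i])
            counts[history][seq[i]] += 1
    return counts


# Trigram-Model: songs are lists of word tokens, history of 2 words.
word_counts = count_ngrams([["la", "la", "\n"]], order=2)

# Char-Model: songs are strings, history of 11 characters.
char_counts = count_ngrams(["hi"], order=11)
```

The start padding lets a model learn which symbols open a song, and the end symbol gives it something to predict when a song should stop.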

4) By analyzing the Billboard database, we found that a song is composed of 349 words and 1722 characters on average. As a result, the models were limited to generating songs with an upper bound of 700 words for the Trigram-Model and 2000 characters for the Char-Model, to get more reliable results.
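
A generation loop with such an upper bound might look like the sketch below. It assumes the model is stored as a mapping from a history tuple to a `Counter` of next symbols and that the next symbol is sampled in proportion to its count; the padding symbols and all names are hypothetical. Only the two limits, 700 and 2000, come from the text.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical padding symbols
MAX_WORDS = 700             # upper bound for the Trigram-Model
MAX_CHARS = 2000            # upper bound for the Char-Model


def generate(counts, order, max_len, rng=None):
    """Sample symbols until END is drawn or max_len symbols exist.

    Assumes order >= 1.
    """
    rng = rng or random.Random()
    history = (START,) * order
    output = []
    while len(output) < max_len:
        options = counts.get(history)
        if not options:          # unseen history: nothing to sample from
            break
        symbols = list(options.keys())
        weights = list(options.values())
        nxt = rng.choices(symbols, weights=weights, k=1)[0]
        if nxt == END:           # the model decided the song is over
            break
        output.append(nxt)
        history = history[1:] + (nxt,)
    return output


# A toy model that never predicts END, so only the bound stops it.
endless = {
    (START, START): Counter({"la": 1}),
    (START, "la"): Counter({"la": 1}),
    ("la", "la"): Counter({"la": 1}),
}
song = generate(endless, order=2, max_len=MAX_WORDS)
```

The toy model shows why the bound matters: without it, a model that keeps predicting the same history would never terminate.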

Model Creation and Evaluation:

Each model was created by learning the lyrics in the Billboard database, and then generated 20K new songs.

From the generated songs we extracted relevant statistics and additional information, including a description of each song's structure, common words, and sentiment analysis.

Further information can be found under the 'Results' tab.

Word Cloud:

Recall that one of our goals was to generate new hits.

Beyond the fact that the models take into consideration the time each song "spent" on the Billboard chart, we decided to list the 200 most common words in the lyrics of Billboard songs.

Then we tested how many of them appear in the best hit created by each model.
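
The comparison described above can be sketched as a frequency count followed by a membership check. Splitting on whitespace and the function names are assumptions made for illustration; how the "best hit" of each model is selected is not covered here.

```python
from collections import Counter


def top_words(songs, n=200):
    """Return the n most frequent words across all lyrics."""
    counts = Counter(word for song in songs for word in song.split())
    return [word for word, _ in counts.most_common(n)]


def hit_word_coverage(generated_song, common_words):
    """Count how many of the common words appear in a generated song."""
    present = set(generated_song.split())
    return sum(1 for word in common_words if word in present)


billboard = ["love love you", "you and me love"]
common = top_words(billboard, n=2)
overlap = hit_word_coverage("i love it", common)
```

With the real data, `n` would be 200 and `generated_song` would be the best hit produced by each model.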

Sentiment Analysis:

Another element related to the content of the data, which in our case is the lyrics, is whether the lyrics are positive, negative or neutral. After giving each song a score between -1 and 1, where 1 is positive and -1 is negative, we compared the number of songs with each label in each dataset.

In this part we used the NLTK framework.
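
The labelling and comparison step can be sketched independently of the scorer. In the code below, `score_fn` is any function returning a score in [-1, 1]; one candidate is the compound score of NLTK's VADER `SentimentIntensityAnalyzer`, but the text does not say which NLTK component was used, so that is an assumption. The 0.05 threshold for the neutral band is also an assumption.

```python
from collections import Counter


def label(score, threshold=0.05):
    """Turn a score in [-1, 1] into positive / negative / neutral."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def label_distribution(songs, score_fn, threshold=0.05):
    """Count how many songs fall under each sentiment label."""
    return Counter(label(score_fn(song), threshold) for song in songs)


def toy_score(song):
    """Stand-in scorer for illustration only, not a real sentiment model."""
    if "love" in song:
        return 0.8
    if "hate" in song:
        return -0.7
    return 0.0


dist = label_distribution(["i love you", "i hate rain", "the road"], toy_score)
```

Running `label_distribution` once on the Billboard lyrics and once on each model's generated songs gives the per-label counts that can then be compared.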

Language Model Comparison and Songs Generator Based on Top-100 Billboard Songs 1970-2018
