[Corpora-List] The musiXmatch dataset: lyrics from 237K songs

Tue Apr 12 13:38:12 UTC 2011

We are pleased to announce the release of the musiXmatch dataset, a  
corpus of lyrics from 237K songs in a bag-of-words format.
http://labrosa.ee.columbia.edu/millionsong/musixmatch
The musiXmatch dataset is the official lyrics collection of the  
Million Song Dataset (MSD).

Quick numbers: 237,702 lyrics in bag-of-words format, top 5,000 words  
provided.
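
For readers new to the format, here is a minimal sketch of what a
bag-of-words over a fixed vocabulary looks like. The toy vocabulary,
the toy lyric, and the lowercase whitespace tokenization are
assumptions for illustration only; they are not the preprocessing
used to build the dataset.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count vocabulary words in text; word order is discarded."""
    allowed = set(vocabulary)
    tokens = text.lower().split()  # assumed tokenization, illustration only
    return Counter(token for token in tokens if token in allowed)


# Hypothetical toy data, not taken from the dataset.
toy_vocabulary = ["love", "you", "i", "the"]
toy_lyric = "I love you I love the way you sing"

bow = bag_of_words(toy_lyric, toy_vocabulary)
for word in toy_vocabulary:
    print(word, bow[word])
```

Words outside the vocabulary ("way" and "sing" here) are simply
dropped, which is the effect of restricting counts to a top-N word
list.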

To our knowledge, this is the largest lyrics dataset ever released
for research. It is useful on its own, but every bag-of-words is
also resolved directly to an MSD track, which links it to metadata
such as artist name, song title, release year, similar artists, tags,
and audio features.

We are extremely grateful to www.musixmatch.com for the generous
donation of this data and for their help in preparing it.

The data is clean, meaning that we have removed all known duplicates  
and instrumental songs. We also provide you with the musiXmatch track  
ID so you can verify the information yourself. The data comes split  
into train and test sets to encourage the reporting of comparable  
results, even on learning-based tasks.
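
Since each entry carries both an MSD track ID and a musiXmatch track
ID, reading one might look like the sketch below. The line layout
(`msd_track_id,mxm_track_id,index:count,...` with 1-based word
indices), the function name `parse_entry`, and the toy values are all
assumptions for illustration; please check the files at the URL above
for the actual layout.

```python
def parse_entry(line, vocabulary):
    """Parse one entry under the ASSUMED layout
    'msd_track_id,mxm_track_id,index:count,...' (1-based indices)."""
    fields = line.strip().split(",")
    msd_track_id, mxm_track_id = fields[0], fields[1]
    counts = {}
    for pair in fields[2:]:
        index, count = pair.split(":")
        # Convert the assumed 1-based index into a Python list position.
        counts[vocabulary[int(index) - 1]] = int(count)
    return msd_track_id, mxm_track_id, counts


# Hypothetical toy data, not taken from the dataset.
toy_vocabulary = ["i", "the", "you", "love"]
toy_line = "TRXXXXX128F0000000,1234567,1:2,3:2,4:1"

msd_id, mxm_id, counts = parse_entry(toy_line, toy_vocabulary)
print(msd_id, mxm_id, counts)
```

The MSD track ID can then serve as the key for joining lyrics to the
MSD metadata, and the musiXmatch track ID for checking an entry
against musiXmatch.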

Although we have worked hard on this release, we cannot claim it is
perfect. We welcome questions, feedback, and error reports.
Finally, try singing bags-of-words. Now that's a challenge!

Thierry Bertin-Mahieux
for the Million Song Dataset team,
in collaboration with musiXmatch.com
http://labrosa.ee.columbia.edu/millionsong/musixmatch
