[Corpora-List] SMS corpus

Cédrick Fairon cedrick.fairon at uclouvain.be
Fri Sep 1 14:06:29 UTC 2006


Dear Alexander,

The Centre for Natural Language Processing at the University of  
Louvain (http://cental.fltr.ucl.ac.be) has collected a corpus of  
75,000 French SMS messages (more than 2,400 authors, aged 12 to 65).  
Details about the project are available online: http://www.smspourlascience.be

A subset of this corpus (30,000 SMS) has been published on CD-ROM by  
Louvain University Press and is available from  
http://www.i6doc.com/doc/sms (the licence covers non-profit  
organisations only; others may contact us).

Two notable features of the corpus:
- it contains information about the authors' profiles (sex, age,  
occupation, mother tongue, second language, place of residence, etc.).  
These profiles are linked to the messages, so that you can select a  
subset of the corpus matching given sociolinguistic criteria;
- each message is linked to a "transcribed" version in "standard"  
French, so that you can search for a word and get all the variants  
present in the corpus.
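
To make these two features concrete, here is a minimal sketch in Python. It is only an illustration: the record layout, the field names (author_id, original, transcribed), the helper functions and the toy messages are all assumptions made up for this example, not the actual format or content of the CD-ROM. It also assumes a naive word-by-word alignment between the SMS text and its transcription, which real data would not guarantee.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Author:
    author_id: int
    sex: str
    age: int


@dataclass
class Message:
    author_id: int
    original: str     # text as typed in the SMS
    transcribed: str  # "standard" French version


# Toy records invented for illustration; not taken from the corpus.
AUTHORS = {
    1: Author(1, "F", 17),
    2: Author(2, "M", 42),
}
MESSAGES = [
    Message(1, "slt sa va", "salut ça va"),
    Message(2, "salu a 2m1", "salut à demain"),
    Message(1, "a 2m1", "à demain"),
]


def select_by_profile(messages, authors, predicate):
    """Keep the messages whose author profile satisfies the predicate."""
    return [m for m in messages if predicate(authors[m.author_id])]


def variants(messages, word):
    """Count the original spellings aligned with a standard-French word.

    Assumes one SMS token per transcribed token (hypothetical
    simplification); messages that break this are skipped.
    """
    found = defaultdict(int)
    for m in messages:
        orig_tokens = m.original.split()
        std_tokens = m.transcribed.split()
        if len(orig_tokens) != len(std_tokens):
            continue
        for o, s in zip(orig_tokens, std_tokens):
            if s == word:
                found[o] += 1
    return dict(found)
```

For instance, select_by_profile(MESSAGES, AUTHORS, lambda a: a.age < 18) keeps only the messages written by the 17-year-old author, and variants(MESSAGES, "salut") gathers the spellings aligned with "salut" in the toy data.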

Full details are in C. Fairon, S. Paumier (2006). "A translated corpus of  
30,000 French SMS". In Proceedings of LREC 2006. Genoa.

Best Regards,

Cédrick

On 1 Sep 2006, at 15:00, Alexander Osherenko wrote:

> Hello,
>
> Has anybody heard of a text corpus of SMS messages? Ideally it  
> should contain emotional content, but that is not essential at first.
>
> Best
>
> Alexander
>

Cédrick Fairon
cedrick.fairon at uclouvain.be

Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgique
tel: +32 10 47 37 88
fax: +32 10 47 26 06

http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be


