[Corpora-List] SMS Corpus for normalization

Cédrick Fairon cedrick.fairon at uclouvain.be
Mon Mar 31 20:10:28 UTC 2008


Dear colleague,

The Cental (Centre for Natural Language Processing, UCLouvain,
Belgium) is coordinating an international project, called
sms4science (www.sms4science.org), which aims to collect SMS corpora
in various parts of the world and in different languages.

This international project has only just started, but a couple of
years ago we collected a corpus of 30,000 French messages that is
already available for research purposes (see
http://www.i6doc.com/doc/smscd). The corpus is transcribed into
"normalized SMS", as you wanted. Transcription was done manually,
following a precise protocol. A search interface lets the user search
for linguistic patterns using the "standard spelling" and retrieves
the variants. The data are also available in text format, so that you
can analyse them with your own corpus processor.

Detailed information about this corpus and the methodology used for
collecting the data can be found in a book published by the Louvain
University Press: "Le langage SMS, Étude d'un corpus informatisé à
partir de l’enquête «Faites don de vos sms à la science»" (Cédrick
FAIRON, Jean René KLEIN and Sébastien PAUMIER). Info and orders:
www.i6doc.com/doc/sms

If you would like more information, don't hesitate to ask; I will be
pleased to answer.

Louise-Amélie Cougnon
Research assistant


On 20 March 2008 at 10:09, Felipe Sánchez Martínez wrote:
>
> Hello all,
>
> Do you know if there are available corpora (of any language, but
> preferably Spanish, French, English or Catalan) of SMS together with
> their normalization as well-written texts?
>
> thanks in advance,
>
> best regards
>
> -- 
> Felipe Sánchez Martínez <fsanchez at dlsi.ua.es>
> Departamento de Lenguajes y Sistemas Informáticos
> Universidad de Alicante, E-03071 Alicante (Spain)
> Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
> http://www.dlsi.ua.es/~fsanchez
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

Cédrick Fairon
cedrick.fairon at uclouvain.be

Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgium
tel: +32 10 47 37 88
fax: +32 10 47 26 06

http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be






