[Corpora-List] SMS Corpus for normalization
Cédrick Fairon
cedrick.fairon at uclouvain.be
Mon Mar 31 20:10:28 UTC 2008
Dear colleague,
The Cental (Centre for natural language processing, UCLouvain,
Belgium) is coordinating an international project which aims to
collect SMS corpora in various places of the world and for different
languages: it is called the sms4science project (www.sms4science.org).
This international project has just started, but a couple of years
ago we collected a corpus of 30,000 French messages, which is
already available for research purposes (see
http://www.i6doc.com/doc/smscd). The corpus is transcribed in
"normalized SMS", as you wanted.
Transcription was done manually following a precise protocol. A search
interface enables the user to search for linguistic patterns using the
"standard spelling", and it retrieves the variants. The data are also
available in text format so that you can analyse them with your own
corpus processor.
Detailed information about this corpus and the methodology used for
collecting the data can be found in a book published by the Louvain
University Press: "Le langage SMS, Étude d'un corpus informatisé à
partir de l'enquête «Faites don de vos sms à la science»" (Cédrick
FAIRON, Jean René KLEIN and Sébastien PAUMIER). Info and order:
www.i6doc.com/doc/sms
If you would like more information, don't hesitate to ask; I will be
pleased to answer.
Louise-Amélie Cougnon
Research assistant
On 20 March 2008, at 10:09, Felipe Sánchez Martínez wrote:
>
> Hello all,
>
> Do you know if there are available corpora (of any language, but
> preferably Spanish, French, English or Catalan) of SMS together with
> their normalization as well-written texts?
>
> thanks in advance,
>
> best regards
>
> --
> Felipe Sánchez Martínez <fsanchez at dlsi.ua.es>
> Departamento de Lenguajes y Sistemas Informáticos
> Universidad de Alicante, E-03071 Alicante (Spain)
> Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
> http://www.dlsi.ua.es/~fsanchez
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
Cédrick Fairon
cedrick.fairon at uclouvain.be
Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgium
tel: +32 10 47 37 88
fax: +32 10 47 26 06
http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be