[Corpora-List] TweetLID: Corpus for Twitter language identification released

Iñaki San Vicente Roncal i.sanvicente at elhuyar.com
Wed Oct 1 12:20:49 UTC 2014


Dear Colleagues,

      We are happy to announce the release of the TweetLID corpus, built
for the TweetLID Twitter language identification shared task
<http://komunitatea.elhuyar.org/tweetlid>. TweetLID is a corpus of tweets
annotated for language identification. It contains 35K tweets in 6
languages (English, Spanish, Portuguese, Basque, Catalan, Galician). Each
tweet is annotated with the language or languages in which it is written.

      The corpus is released under a Creative Commons Attribution (CC BY)
license and is available for download at the following link:
http://komunitatea.elhuyar.org/tweetlid/files/2014/10/TweetLID_corpusV1.zip


      If you use this corpus, please cite the following paper:

      - Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria,
I., Aranberri, N., Ezeiza, A., Fresno, V. (2014). Overview of TweetLID:
Tweet Language Identification at SEPLN 2014. Proceedings of the TweetLID
Workshop at SEPLN 2014. Girona. pp. 1-11. ISSN: 1613-0076.

      You can find more information about the corpus and the shared task
on the workshop website or in the proceedings (http://ceur-ws.org/Vol-1228/).


      If you have any further questions or suggestions, do not hesitate to
contact us at tweetlid at elhuyar.com.

Regards,

TweetLID organizers.

-- 

*Iñaki San Vicente Roncal*
I+G IKERTZAILEA / R&D RESEARCHER


i.sanvicente at elhuyar.com | inaki.sanvicente at ehu.es
<http://scholar.google.es/citations?user=eb_xVO4AAAAJ&hl=en>
<https://www.researchgate.net/profile/Inaki_San_Vicente/>
tel. Elhuyar: 943363040 | ext.: 225
tel. Ixa: 943015110 | office 314

Zelai Haundi, 3. Osinalde industrialdea
20170 Usurbil

www.elhuyar.org <http://www.elhuyar.org> | ixa.si.ehu.es <http://ixa.si.ehu.es>
