[Corpora-List] CFP: TweetLID Twitter language identification challenge

Iñaki San Vicente Roncal i.sanvicente at elhuyar.com
Wed May 28 10:47:11 UTC 2014

Workshop on Tweet Language Identification
co-located with the XXX Conference of the Spanish Society for Natural
Language Processing, SEPLN 2014
Girona, Spain

September 16, 2014


TweetLID 2014 invites researchers to submit novel and unpublished work on
the identification of the language or languages in which a tweet is
written. We have organized a shared task for this purpose, where we will
provide participants with a suitable corpus and evaluation methodology to
pursue the development of such research.


TweetLID is a workshop and shared task on the automatic identification of
the language in which tweets are written. It will take place on September
16, 2014, in Girona, co-located with SEPLN 2014. The objective of the task
is to bring together researchers interested in the topic, as well as to
join forces to experiment with and compare different approaches for
identification of tweet languages.

The identification of tweet language is attracting increasing interest in
the scientific community (Carter et al., 2013). Identifying the language in
which a tweet is written is crucial if we intend to subsequently apply NLP
techniques to the tweet, e.g., machine translation, sentiment analysis,
or information extraction. Accurately identifying the language will
facilitate the application of resources suitable to the language in
question.

However, despite the increasing volume of research in identification of
major languages such as English, French, or Spanish, the application of
these techniques to other languages with a smaller presence on Twitter has
not been studied in detail. The task will focus on the top five languages
of the Iberian Peninsula (Spanish, Portuguese, Catalan, Basque, and
Galician), plus English. These languages are likely to co-occur in tweets
about news and events relevant to the Iberian Peninsula, and thus accurate
identification of the language is key to making sure that we use the
appropriate resources for linguistic processing.

The workshop aims to be a forum where researchers will have a chance to
compare their algorithms, systems, and results. The organizing committee
will release an annotated development corpus that will enable participants
to train their systems. The final evaluation will be conducted on a
separate, unannotated corpus, which participants will have to annotate with
their systems and return with their results within a short period of time.


The corpus that we will provide to participants of the shared task includes
geolocated tweets posted from different regions of the Iberian Peninsula,
with a strong focus on bilingual areas. We have built a corpus of tweets
annotated with the language(s) they are written in. We will split this
corpus into a training set, which will be shared with participants in the
first stage, and a test set, which will be released in the
evaluation stage. The participants will have to develop their systems to
identify the language(s) of the tweets in the test set, and submit their
responses. Each participant will be allowed to submit the responses of up
to two systems.


Interested participants need to register for the task and workshop by
sending an email to tweetlid at elhuyar.com on or before May 30th.

Paper submission

Submissions must not exceed the maximum length of 4 pages, and must be
formatted following the SEPLN journal styles (

The proceedings of the workshop will be published in the ceur-ws.org
repository, and will be indexed by DBLP.

Important dates

* June 6th: Registration deadline
* June 2nd: Release of the development set
* July 1st: Release of the test set
* July 3rd: Result submission deadline
* July 12th: Result publication
* July 25th: Short paper submission deadline
* August 31st: Camera-ready version of papers
* September 16th: Workshop


*Iñaki San Vicente Roncal*

i.sanvicente at elhuyar.com | inaki.sanvicente at ehu.es
tel. Elhuyar: 943363040 | ext.: 225
tel. Ixa: 943015110 | office 314

Zelai Haundi, 3. Osinalde industrialdea
20170 Usurbil

www.elhuyar.org <http://www.elhuyar.org> | ixa.si.ehu.es
