[Corpora-List] CFP: Shared Task on Transliterated Search at FIRE 2014
Parth Gupta
pgupta at dsic.upv.es
Wed Aug 13 03:48:26 UTC 2014
(Apologies if you receive multiple copies of this call)
-------------------------------------------------
1st Call for Participation
-------------------------------------------------
Transliterated Search Track at FIRE 2014 (2nd Edition)
5 - 7 December 2014, Bangalore, India
http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/fire14st.aspx
-------------------------------------------------
A large number of languages, including Arabic, Russian, and most of
the South and South East Asian languages, are written using indigenous
scripts. However, websites and user-generated content (such as tweets
and blogs) in these languages are often written in Roman script for
various socio-cultural and technological reasons. This
process of phonetically representing the words of a language in a
non-native script is called transliteration. Transliteration,
especially into Roman script, is used abundantly on the Web not only
for documents, but also for user queries that intend to search for
these documents.
A challenge that search engines face while processing transliterated
queries and documents is that of extensive spelling variation. For
instance, the word dhanyavad ("thank you" in Hindi and many other
Indian languages) can be written in Roman script as dhanyavaad,
dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on. The
aim of this shared task is to systematically formalize several
research problems that one must solve to tackle this unique situation
prevalent in Web search for users of many languages around the world,
develop related data sets, test benches and most importantly, build a
research community around this important problem that has received
very little attention to date.
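To make the spelling-variation problem concrete, here is a minimal
Python sketch that collapses Roman-script variants onto a crude
"consonant skeleton" key. It is only an illustration and not part of
the task definition: the function names (skeleton_key, group_variants)
and the normalization rules (merging b/w with v, dropping h after a
consonant, dropping vowels) are assumptions chosen for this one
example, and such a key will wrongly merge many unrelated words.

```python
import re
from collections import defaultdict


def skeleton_key(word):
    """Return a crude consonant-skeleton key for a Roman-script word.

    The rules below are illustrative assumptions, not a standard scheme.
    """
    w = word.lower()
    # Assumption: treat b, w and v as interchangeable spellings.
    w = w.replace("b", "v").replace("w", "v")
    # Drop an 'h' that follows a non-vowel character (e.g. "dh" -> "d").
    w = re.sub(r"(?<=[^aeiou])h", "", w)
    # Drop all vowels, which vary most across spellings.
    w = re.sub(r"[aeiou]", "", w)
    # Collapse any repeated letters that remain.
    w = re.sub(r"(.)\1+", r"\1", w)
    return w


def group_variants(words):
    """Group words that share the same skeleton key."""
    groups = defaultdict(list)
    for word in words:
        groups[skeleton_key(word)].append(word)
    return dict(groups)


variants = [
    "dhanyavad", "dhanyavaad", "dhanyvad", "danyavad",
    "danyavaad", "dhanyavada", "dhanyabad",
]
print(group_variants(variants))
```

With these rules, all seven spellings listed above share the key
"dnyvd". A realistic system would need a far less lossy normalization
or a learned similarity model.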
In the second edition of the shared task, the following two subtasks
will be hosted:
* Subtask-I: Query Word Labeling
Suppose that q: w1 w2 w3 ... wn is a query written in Roman script. The
words w1, w2, etc. could be standard English words or words
transliterated from another language L. The task is to label each word
as E or L, depending on whether it is an English word or a
transliterated L-language word, and then, for each transliterated word,
to provide the correct transliteration in the native script (i.e., the
script that is used for writing L).
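The input/output shape of Subtask-I can be sketched in Python as
follows. This is a toy illustration under stated assumptions, not the
official data format or a baseline: the word lists (ENGLISH_WORDS,
HINDI_LEXICON), the function name label_query, and the choice to label
every non-English word as L are all hypothetical. A real system would
replace the lookups with a language-identification model and a
transliteration model.

```python
# Hypothetical toy resources; real systems need much larger lexicons
# or trained models.
ENGLISH_WORDS = {"lyrics", "of", "the", "song", "for"}
HINDI_LEXICON = {"dhanyavad": "धन्यवाद", "pyar": "प्यार"}


def label_query(query):
    """Label each query word as E (English) or L (transliterated).

    Returns a list of (word, label, native_script_form) tuples. The
    native form is None for English words and for L words that the toy
    lexicon does not cover.
    """
    labeled = []
    for word in query.lower().split():
        if word in ENGLISH_WORDS:
            labeled.append((word, "E", None))
        else:
            # Assumption: anything not known to be English is labeled L.
            labeled.append((word, "L", HINDI_LEXICON.get(word)))
    return labeled


for word, label, native in label_query("lyrics of dhanyavad"):
    print(word, label, native)
```

Spelling variants such as dhanyavaad would miss the exact-match lookup
here, which is exactly why the task asks for more than a dictionary.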
* Subtask-II: Mixed-script Ad hoc retrieval for Hindi Song Lyrics
The input is a query, written in Devanagari script or in its Roman
transliterated form, consisting of a (possibly partial or incorrect)
Hindi song title or some part of the lyrics. The output is a ranked
list of songs in both Devanagari and Roman scripts, retrieved from a
corpus of Hindi film lyrics in which some of the documents are in
Devanagari and some are in Roman transliterated form.
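One simple way to picture Subtask-II is to map queries and documents
into a single script and rank by character-level overlap, which
tolerates some spelling variation. The Python sketch below does this
with character bigrams and Jaccard similarity. Everything in it is an
assumption made for illustration: the word-level lookup table
DEVANAGARI_TO_ROMAN, the three-document CORPUS (invented phrases, not
task data), and the function names. It is not the task's evaluation
setup or a recommended baseline.

```python
# Hypothetical word-level lookup; a real system would use a
# transliteration model instead of a fixed table.
DEVANAGARI_TO_ROMAN = {"धन्यवाद": "dhanyavad", "दोस्त": "dost"}

# Toy mixed-script corpus (invented phrases, not real lyrics).
CORPUS = ["dhanyavaad dost", "धन्यवाद दोस्त", "namaste duniya"]


def to_roman(text):
    """Map known Devanagari tokens to Roman; leave other tokens as-is."""
    tokens = text.lower().split()
    return " ".join(DEVANAGARI_TO_ROMAN.get(tok, tok) for tok in tokens)


def char_bigrams(text):
    """Return the set of character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(query, corpus):
    """Return document indices, best match first."""
    q = char_bigrams(to_roman(query))
    scored = [
        (jaccard(q, char_bigrams(to_roman(doc))), doc_id)
        for doc_id, doc in enumerate(corpus)
    ]
    # Sort by descending score, breaking ties by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


for doc_id in rank("dhanyvad dost", CORPUS):
    print(doc_id, CORPUS[doc_id])
```

In this toy corpus the misspelled Roman query ranks both the Roman and
the Devanagari version of the matching phrase above the unrelated
document, and a Devanagari query goes through the same path.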
-----------------------------------------------
Important Dates
-----------------------------------------------
Registration for the task: 1st Sep 2014
Training/Dev data release: 5th Sep 2014
Test Set release: 30th Sep 2014
Submit Run: 13th Oct 2014
Results distributed: 20th Oct 2014
Working Note due: 10th Nov 2014
FIRE Workshop: 5-7th Dec 2014
-------------------------------------------------------------------------------
Contact
-------------------------------------------------------------------------------
E-mail: Monojit Choudhury (monojitc at microsoft.com) and Parth
Gupta (pgupta at dsic.upv.es)
Track Web page:
http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/fire14st.aspx