[Corpora-List] CFP: Shared Task on Transliterated Search at FIRE 2014

Wed Aug 13 03:48:26 UTC 2014

(Apologies if you receive multiple copies of this call)

-------------------------------------------------
1st Call for Participation
-------------------------------------------------

Transliterated Search Track at FIRE 2014 (2nd Edition)

5 - 7 December 2014, Bangalore, India

http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/fire14st.aspx

-------------------------------------------------

A large number of languages, including Arabic, Russian, and most of
the South and South East Asian languages, are written using indigenous
scripts. However, websites and user-generated content (such as tweets
and blogs) in these languages are often written in Roman script for
various socio-cultural and technological reasons. This process of
phonetically representing the words of a language in a non-native
script is called transliteration. Transliteration, especially into
Roman script, is used abundantly on the Web, not only for documents
but also for the user queries that search for these documents.

A challenge that search engines face while processing transliterated  
queries and documents is that of extensive spelling variation. For  
instance, the word dhanyavad ("thank you" in Hindi and many other  
Indian languages) can be written in Roman script as dhanyavaad,  
dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on. The  
aim of this shared task is to systematically formalize several
research problems that one must solve to tackle this unique situation,
which is prevalent in Web search for users of many languages around
the world; to develop related data sets and test benches; and, most
importantly, to build a research community around this important
problem, which has received very little attention to date.
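
As a rough illustration of the spelling-variation problem, the Python
sketch below collapses the dhanyavad variants listed above onto a
single key. The function name crude_key and its rules (dropping the
aspiration "h", merging b and v, removing vowels) are assumptions made
for this example only, not a method prescribed by the track.

```python
import re

def crude_key(word: str) -> str:
    """Map a Roman-script spelling to a deliberately lossy key (illustrative only)."""
    w = word.lower()
    w = re.sub(r"([bcdgjkpt])h", r"\1", w)  # drop the "h" that marks aspiration
    w = w.replace("b", "v")                 # treat b and v as the same sound
    w = re.sub(r"[aeiou]", "", w)           # ignore vowels entirely
    w = re.sub(r"(.)\1+", r"\1", w)         # collapse repeated letters
    return w

# The variants quoted in the call text.
variants = ["dhanyavad", "dhanyavaad", "dhanyvad", "danyavad",
            "danyavaad", "dhanyavada", "dhanyabad"]
keys = {crude_key(v) for v in variants}
print(keys)  # all seven variants share one key
```

Rules this aggressive would also conflate many unrelated words, which
is one reason the problem calls for systematic study rather than ad
hoc normalization.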

In the second edition of the shared task, the following two subtasks
will be hosted:

* Subtask-I: Query Word Labeling

Suppose that q: w1 w2 w3 ... wn is a query written in Roman script.
The words w1, w2, etc. could be standard English words or words
transliterated from another language L. The task is to label each word
as E or L, depending on whether it is an English word or a
transliterated L-language word, and then, for each transliterated
word, to provide the correct transliteration in the native script
(i.e., the script which is used for writing L).
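
The expected input and output of this subtask can be sketched in
Python as follows. The word list, the transliteration table, and the
function name label_query are hypothetical toy stand-ins; the official
data format and any real system would differ, and a plain lookup is
used here only to show the shape of the labeling.

```python
from typing import List, Optional, Tuple

# Toy resources, assumed for illustration only.
ENGLISH_WORDS = {"for", "the", "song", "lyrics"}
TRANSLITERATIONS = {"dhanyavad": "धन्यवाद"}  # Roman form -> Devanagari

def label_query(query: str) -> List[Tuple[str, str, Optional[str]]]:
    """Label each word E or L; for L words, look up a native-script form."""
    labeled = []
    for word in query.lower().split():
        if word in ENGLISH_WORDS:
            labeled.append((word, "E", None))
        else:
            # None means the toy table has no transliteration for this word.
            labeled.append((word, "L", TRANSLITERATIONS.get(word)))
    return labeled

for item in label_query("dhanyavad for the song"):
    print(item)
```

A dictionary lookup fails on exactly the spelling variants described
earlier (for example "dhanyavaad"), which is what makes the subtask
non-trivial.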

* Subtask-II: Mixed-script Ad hoc retrieval for Hindi Song Lyrics

The input is a query, written in Devanagari script or in its Roman
transliterated form, consisting of a (possibly partial or incorrect)
Hindi song title or some part of the lyrics. The output is a ranked
list of songs in both Devanagari and Roman scripts, retrieved from a
corpus of Hindi film lyrics in which some of the documents are in
Devanagari and some are in Roman transliterated form.
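
One way to picture mixed-script retrieval is to map queries and
documents in either script onto a shared key space and rank by
overlap. The Python sketch below does this with a toy
Devanagari-to-Roman table, a lossy spelling key, and Jaccard overlap.
The names TO_ROMAN, DOCS, crude_key, and rank, as well as the
two-word "documents", are assumptions for illustration and are not
taken from the track's corpus or from any baseline system.

```python
import re
from typing import Dict, List, Set, Tuple

# Toy table and corpus, assumed for illustration (not real lyrics).
TO_ROMAN = {"दिल": "dil", "धन्यवाद": "dhanyavad"}
DOCS = {
    "doc1": "दिल धन्यवाद",      # Devanagari document
    "doc2": "dil dhanyavaad",  # Roman transliterated document
    "doc3": "hello world",     # unrelated document
}

def crude_key(word: str) -> str:
    """Lossy spelling key: drop aspiration, merge b/v, drop vowels and repeats."""
    w = re.sub(r"([bcdgjkpt])h", r"\1", word.lower())
    w = re.sub(r"[aeiou]", "", w.replace("b", "v"))
    return re.sub(r"(.)\1+", r"\1", w)

def keys(text: str) -> Set[str]:
    """Romanize known Devanagari tokens, then reduce every token to its key."""
    return {crude_key(TO_ROMAN.get(tok, tok)) for tok in text.split()}

def rank(query: str, docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (doc_id, Jaccard score) pairs with score > 0, best first."""
    q = keys(query)
    scored = []
    for doc_id, text in docs.items():
        d = keys(text)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

print(rank("danyavad dil", DOCS))  # Roman query with a spelling variant
print(rank("धन्यवाद", DOCS))        # Devanagari query
```

Both queries retrieve the Devanagari and the Roman document while
skipping the unrelated one, which is the behaviour the subtask asks
for; a realistic system would need a proper transliteration model
rather than a lookup table.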

-----------------------------------------------
Important Dates
-----------------------------------------------

     Registration for the task: 1st Sep 2014
     Training/Dev data release: 5th Sep 2014
     Test Set release: 30th Sep 2014
     Submit Run: 13th Oct 2014
     Results distributed: 20th Oct 2014
     Working Note due: 10th Nov 2014
     FIRE Workshop: 5-7th Dec 2014

-------------------------------------------------------------------------------
Contact
-------------------------------------------------------------------------------

     E-mail: Monojit Choudhury (monojitc at microsoft.com) and Parth  
Gupta (pgupta at dsic.upv.es)

     Track Web page:  
http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/fire14st.aspx
