Corpora: MA studentship available to study South Asian languages

Mcenery, Tony eiaamme at exchange.lancs.ac.uk
Tue May 14 14:18:01 UTC 2002


Dear All,

please feel free to pass details of this studentship on to anybody you think
may be interested in applying. Regards,

Tony



MA Studentship - The corpus-based study of South Asian languages



Lancaster University Department of Linguistics and MEL is offering an MA
studentship in computer corpus-related research into the languages of South
Asia. The studentship is part of the EPSRC-funded EMILLE project, which is
collecting 67 million words of corpus data in Bengali, Gujarati, Hindi,
Punjabi, Sinhala, Tamil and Urdu. This corpus data will form the basis of the MA
student's research. See the following website for more details:

  <http://www.emille.lancs.ac.uk/>

 Applicants should be native speakers of whichever language they wish to
undertake research on. Note that the University also requires documentary proof
of an average IELTS score of 6.5 for all non-native speakers of English.

 The studentship will run from the start of October 2002 to the end of
September 2003. The studentship will cover fees (home or overseas) and provide
a living allowance. No assistance with relocation is available from the
studentship.

 Applicants should be willing to undertake research in one of the four research
areas listed below. To apply, download the application forms from the following
URL:

  <http://www.ling.lancs.ac.uk/courses/research/>

 In making the application, candidates should complete the application form and
write 'EMILLE' on the form where a source of funding is asked for.
Additionally, candidates should include a one-page description of how they
propose to pursue the research topic they have chosen. The closing date for
applications is 1st July 2002.

 Students may choose to research one of the following research topics:

 1.      Corpus-based dictionary creation

 Many of the current standard dictionaries of South Asian languages are quite
old and are generally not corpus-based. Using the EMILLE corpora as a source of
data, the student will develop lexicographic resources for Bengali, Gujarati,
Hindi, Punjabi, Sinhala, Tamil or Urdu. Throughout, the goal will be to apply
the latest research in the field of corpus-based lexicography to South Asian
languages.

 2.      Anaphora and anaphor resolution in Bengali, Gujarati, Hindi, Punjabi,
Sinhala, Tamil or Urdu

 Much research has been undertaken on automated anaphor resolution for West
European languages. Research focused on the languages of South Asia is, by
contrast, relatively undeveloped. Students wishing to pursue research on
anaphor resolution for South Asian languages may either focus on a
corpus-based account of the anaphors of one of the languages listed above or
seek to develop algorithms for automated anaphor resolution for one of
Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil or Urdu.

 3.      Machine translation between Hindi and Urdu

 Hindi and Urdu are very similar languages in their spoken form, but differ
greatly in their written form. Using the EMILLE corpus as a data source, the
student will develop, test and evaluate software that can translate Hindi texts
into Urdu (and vice versa).

 4.      Studying spoken language

 The student will study the spoken data in the EMILLE corpus in order to fulfil
one of the following research goals:

*	examining the differences between the spoken and written forms of the
language;
*	contrasting the dialects of the language spoken in the UK and South
Asia;
*	analysing code-switching in spoken texts.

 The languages which may be studied for this project are Bengali, Gujarati,
Hindi-Urdu or Punjabi.






