23.2420, Diss: Computational Ling: Nojoumian: 'Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer'

linguist at linguistlist.org
Mon May 21 20:19:49 UTC 2012


LINGUIST List: Vol-23-2420. Mon May 21 2012. ISSN: 1069 - 4875.

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
         Monica Macaulay, U of Wisconsin-Madison
         Rajiv Rao, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is a non-profit organization dedicated to providing the
discipline of linguistics with the infrastructure necessary to function in
the digital world. Donate to keep our services freely available!
https://linguistlist.org/donation/donate/donate1.cfm

Editor for this issue: Xiyan Wang <xiyan at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.


Date: Mon, 21 May 2012 16:19:18
From: Peyman Nojoumian [nojoumia at usc.edu]
Subject: Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer

Institution: University of Ottawa 
Program: Department of Linguistics 
Dissertation Status: Completed 
Degree Date: 2011 

Author: Peyman Nojoumian

Dissertation Title: Towards the Development of an Automatic Diacritizer for the
Persian Orthography based on the Xerox Finite State Transducer

Dissertation URL:  http://www.ruor.uottawa.ca/en/handle/10393/20158

Linguistic Field(s): Computational Linguistics


Dissertation Director(s):
Diana Inkpen
Paul Hirschbuhler

Dissertation Abstract:

Because Persian orthography omits short vowels (diacritics), many
Natural Language Processing applications for this language, including
information retrieval, machine translation, text-to-speech, and automatic
speech recognition systems, need to disambiguate the input before any
further processing. In machine translation, for example, the whole text
should first be correctly diacritized so that the correct words, parts of
speech and meanings are matched and retrieved from the lexicon. This is
primarily because of Persian's ambiguous orthography. In fact, the core
engine of any Persian language processor should utilize a diacritizer and
a lexical disambiguator. This dissertation describes the design and
implementation of an automatic diacritizer for Persian based on the
state-of-the-art Finite State Transducer technology developed at Xerox by
Beesley & Karttunen (2003). The results of morphological analysis and
generation on a test corpus are presented, including the insertion of
diacritics. The study also examines the phonological and semantic
ambiguities that arise because short vowels are absent from the Persian
writing system. It proposes a hybrid model (rule-based & inductive),
inspired by psycholinguistic experiments on the human mental lexicon, that
uses frequency and collocation information to disambiguate heterophonic
homographs in Persian. A syntactic parser can be developed based on the
proposed model to discover Ezafe (the linking short vowel /e/ within a
noun phrase) or to disambiguate homographs, but its implementation is left
for future work.
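
As a rough illustration of how frequency and collocation information
could be combined to choose among heterophonic homographs, here is a
minimal Python sketch. It is not the dissertation's implementation and
does not use the Xerox finite-state tools. The simplified Latin
transliteration, the toy lexicon, all counts, the scoring formula, and the
names CANDIDATES, COLLOCATIONS and disambiguate are assumptions made up
for this example.

```python
from collections import Counter

# Hypothetical toy lexicon (NOT from the dissertation): unvocalized form ->
# candidate readings with made-up frequency counts.
CANDIDATES = {
    "gl": {"gol": 80, "gel": 20},     # 'flower' vs. 'mud'
    "mrd": {"mard": 70, "mord": 30},  # 'man' vs. 'died'
}

# Hypothetical collocation counts: (reading, neighbouring token) -> count.
COLLOCATIONS = Counter({
    ("gol", "srx"): 15,  # 'srx' stands for the word for 'red'
    ("gel", "ab"): 12,   # 'ab' stands for the word for 'water'
})


def disambiguate(tokens, index, window=2, weight=1.0):
    """Pick a vocalized reading for tokens[index], or return it unchanged."""
    word = tokens[index]
    readings = CANDIDATES.get(word)
    if not readings:
        return word  # not ambiguous in this toy lexicon

    # Neighbouring tokens within the window, excluding the word itself.
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    context = left + right
    total = sum(readings.values())

    def score(reading):
        # Relative frequency acts as a prior; collocation counts add evidence.
        prior = readings[reading] / total
        colloc = sum(COLLOCATIONS[(reading, n)] for n in context)
        return prior + weight * colloc

    return max(readings, key=score)


print(disambiguate(["gl", "srx"], 0))
print(disambiguate(["ab", "gl"], 1))
```

With no informative context the sketch falls back to the most frequent
reading; a matching collocation can override that default. A real system
would estimate both tables from a diacritized corpus.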





