[Corpora-List] homographs in semitic languages

Isabella Chiari isabella.chiari at uniroma1.it
Thu Jun 28 21:14:46 UTC 2012


Dear Eric,
Your observations are exactly the reasons I was trying to find out if
there is any data on the subject.
I am looking for data on any Semitic language precisely because of the
constraints in morphological structure and the relationship with
voweled/unvoweled writing systems. I was expecting a generally high level
of homography, especially in unvoweled texts for Arabic and Hebrew.

I found a general statement on Biblical Hebrew (ranging from 20%
homography, voweled, to 40%, unvoweled), but I could not understand how the
data were collected or whether the figures refer to a lemma list or to
running words (http://www.logos.com/support/windows/L3/homographs). If 20%
is estimated on a lemma list (voweled), it would still seem very high
compared to other languages (Romance - alphabetic), whereas it seems low
(even in unvoweled data) if it refers to running words (word forms).
I wonder if anyone working on lexicography and on lemmatization might have
gathered some data on homographic lemmas and homographic word forms.
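
To make the lemma-list versus running-words distinction concrete, here is a
minimal sketch of how the two rates could be computed from a lemmatized
corpus. It assumes the corpus is available as (word form, lemma) pairs; the
function name and the tiny transliterated sample are illustrative only and
are not taken from any real corpus or from the Logos page.

```python
from collections import defaultdict


def homography_rates(tagged_tokens):
    """Return (type_rate, token_rate) for a list of (form, lemma) pairs.

    A form counts as homographic here if it is attested with more than
    one lemma. This is one possible operational definition, assumed for
    illustration; other definitions (e.g. by sense or by part of speech)
    would give different figures.
    """
    if not tagged_tokens:
        raise ValueError("need at least one token")

    # Collect the set of lemmas observed for each written form.
    lemmas_by_form = defaultdict(set)
    for form, lemma in tagged_tokens:
        lemmas_by_form[form].add(lemma)

    ambiguous_forms = {f for f, ls in lemmas_by_form.items() if len(ls) > 1}

    # Rate over distinct word forms (types).
    type_rate = len(ambiguous_forms) / len(lemmas_by_form)

    # Rate over running words (tokens).
    ambiguous_tokens = sum(1 for form, _ in tagged_tokens if form in ambiguous_forms)
    token_rate = ambiguous_tokens / len(tagged_tokens)

    return type_rate, token_rate


# Toy unvoweled sample (hypothetical data, for illustration only).
sample = [
    ("ktb", "kataba"),
    ("ktb", "kutub"),
    ("ktb", "kataba"),
    ("fy", "fi"),
    ("byt", "bayt"),
    ("ktb", "kutub"),
]

type_rate, token_rate = homography_rates(sample)
print(f"types: {type_rate:.2f}, tokens: {token_rate:.2f}")
```

In this toy sample one of three distinct forms is ambiguous, but it accounts
for four of the six running words, so the same text gives a much higher rate
on tokens than on types. That is why a figure such as 20% cannot be
interpreted without knowing which of the two it was computed on.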

I will let you know if I find something or if anyone points me to some
paper.
Best regards,
Isabella



On 28/06/12 22.40, "Eric Atwell" <E.S.Atwell at leeds.ac.uk> wrote:

>Dear Isabella,
>
>I don't have any quantitative data you ask for - but if you DO find some,
>I'd be very interested to share!
>
>I assume by "homography" rate you mean the percentage of words which are
>ambiguous, with more than one meaning. This clearly depends on the
>writing system as well as the language. Arabic is (usually) written
>without vowels, whereas Maltese (which Habash's textbook on Arabic
>NLP states is a dialect of Arabic, albeit written in a Roman alphabet)
>does include vowels; so you would expect unvoweled Arabic to be
>significantly more ambiguous than voweled Maltese. Other Semitic
>languages use yet different scripts (Hebrew, Amharic)  - so it may not
>make sense to look for generalisations about "percentage of homography
>of texts in semitic languages"
>
>Let me know if you get any quantitative answers please!
>
>
>Eric Atwell, Leeds University
>
>
>
>On Thu, 28 Jun 2012, Isabella Chiari wrote:
>
>> Dear Corpora list members,
>> Can anyone point me to papers that refer to estimates of the rate
>> (percentage) of homography of texts in semitic languages like Arabic.
>> I am interested in quantitative data on word tokens and types and in
>> lexicographic entries also, if available.
>> Thanks for your help!
>> Isabella
>> 
>> -- 
>> 
>> Isabella Chiari
>> 
>> Dipartimento di Scienze documentarie, linguistico-filologiche e
>>geografiche
>> 
>> Università di Roma "La Sapienza"
>> 
>> pl.le Aldo Moro, 5, III Piano, Edificio ex Facoltà di Lettere e
>>Filosofia,
>> 00185 Roma, tel. +30 06 4991 3575
>> 
>> E.mail: isabella.chiari at uniroma1.it
>> 
>> Website: www.alphabit.net
>> 
>> 
>>
>
>-- 
>Eric Atwell, Associate Professor, Language research group,
>  I-AIBS Institute for Artificial Intelligence and Biological Systems
>  School of Computing, Faculty of Engineering, UNIVERSITY OF LEEDS
>  Leeds LS2 9JT, England.        TEL: 0113-3435430  FAX: 0113-3435468
>  WWW: http://www.comp.leeds.ac.uk/eric
>       http://www.comp.leeds.ac.uk/nlp
>       http://www.comp.leeds.ac.uk/arabic


