Corpora: Morphological ambiguity

Mon Jan 8 11:55:04 UTC 2001

At 20:55 07.01.01 +0000, Hristo Tanev wrote:
>Dear all,
>I work in the area of ambiguity resolution for
>Bulgarian. I obtained the following result concerning
>morphological ambiguity. I have measured the ratio
>
>  (number of morphological hypotheses for all words) / (number of words)
>
>This ratio for Bulgarian is 1.27-1.33 and doesn't vary
>much.
>My question is: does any of you know this average
>ratio for English or other languages? Does this ratio
>depend on the genre?
>
>Hristo
>

Dear Hristo,

There are several vague things in your mail. What is "all words"? And are
you looking at tokens or types?

I assume that you are measuring in a corpus, so "all words" are all the
words that occur in the corpus, not those in a lexicon or other list. I also
assume that a high-frequency word such as a preposition is counted as many
times as it appears in the corpus, i.e., that you are counting TOKENS.
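
To make the token/type distinction concrete, here is a minimal Python
sketch. The toy lexicon, the tag names and the sample corpus are invented
for illustration and are not taken from any of the studies mentioned in this
mail; counting an unknown word as having one reading is also just an
assumption.

```python
# Hypothetical toy lexicon: word form -> set of morphological readings.
# All entries are invented for illustration only.
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"VERB", "NOUN"},
    "in": {"PREP"},
    "rain": {"NOUN", "VERB"},
}


def readings(word):
    """Number of readings for a word; unknown words count as 1 (assumption)."""
    return len(LEXICON.get(word, {"UNKNOWN"}))


def ambiguity_per_token(tokens):
    """Average number of readings, counting every occurrence in the corpus."""
    return sum(readings(t) for t in tokens) / len(tokens)


def ambiguity_per_type(tokens):
    """Average number of readings, counting each distinct word form once."""
    types = set(tokens)
    return sum(readings(t) for t in types) / len(types)


corpus = "the can rusts in the rain in the can".split()
token_ratio = ambiguity_per_token(corpus)  # 15 readings over 9 tokens
type_ratio = ambiguity_per_type(corpus)    # 9 readings over 5 types
```

In this toy corpus the two figures differ (15/9 per token versus 9/5 per
type), which is why the counting unit has to be stated together with the
ratio.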

If this is true, you still have to explain what kind of morphological
ambiguity you are interested in: anything from ambiguity between grammatical
categories (PoS) only to ANY possible ambiguity, even one that is systematic
in your language. For example, would you consider ARE (the form of the verb
BE in English) four times ambiguous, or not ambiguous at all, since English
never makes a morphological distinction between plural forms and the second
person singular?
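
The effect of the chosen granularity on the ARE example can be sketched as
follows. The feature layout (part of speech, person, number) and the helper
name count_readings are my own assumptions for illustration, not an existing
tagset or tool.

```python
# Hypothetical fine-grained analyses of the English form "are",
# written as (part of speech, person, number). Invented for illustration.
ARE_ANALYSES = [
    ("VERB", "2", "SG"),
    ("VERB", "1", "PL"),
    ("VERB", "2", "PL"),
    ("VERB", "3", "PL"),
]


def count_readings(analyses, keep):
    """Count distinct readings after keeping only the feature positions in `keep`."""
    return len({tuple(a[i] for i in keep) for a in analyses})


pos_only = count_readings(ARE_ANALYSES, keep=(0,))           # PoS only
pos_number = count_readings(ARE_ANALYSES, keep=(0, 2))       # PoS + number
full = count_readings(ARE_ANALYSES, keep=(0, 1, 2))          # all features
```

With PoS only, ARE has a single reading; with all three features it has
four. The same word form thus contributes 1 or 4 to the numerator of the
ratio depending only on the tagset.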

A long time ago (Medeiros et al. 1993) we took some measurements for
European Portuguese, based on tokens in a corpus, measuring only PoS
ambiguity, and then only among 4 kinds: word belonging to a closed class,
verb, noun/adjective, or past participle. The number obtained was 1.02494.
Not counting words belonging only to a closed class, the number rose to
1.1398, but I would advise you to look more carefully both at the setup and
at what the measures may mean before you directly compare languages (if
that is what you have in mind).
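
How leaving out closed-class-only words changes the ratio can be illustrated
with a small sketch. The token list and class labels below are invented and
do not reproduce the data or the figures of Medeiros et al. (1993); they only
show the mechanism.

```python
# Hypothetical tokens with their possible coarse classes.
# "CLOSED" marks closed-class membership; the data are invented.
TOKENS = [
    ("de", {"CLOSED"}),
    ("a", {"CLOSED"}),
    ("casa", {"NOUN_ADJ", "VERB"}),
    ("que", {"CLOSED"}),
    ("mesa", {"NOUN_ADJ"}),
    ("como", {"CLOSED", "VERB"}),
]


def mean_readings(tokens):
    """Average number of possible classes per token."""
    return sum(len(classes) for _, classes in tokens) / len(tokens)


def without_closed_only(tokens):
    """Drop tokens whose only possible class is the closed class."""
    return [(w, c) for w, c in tokens if c != {"CLOSED"}]


all_ratio = mean_readings(TOKENS)                        # 8 readings / 6 tokens
open_ratio = mean_readings(without_closed_only(TOKENS))  # 5 readings / 3 tokens
```

Removing the unambiguous closed-class tokens shrinks the denominator more
than the numerator, so the ratio goes up even though nothing about the
remaining words has changed.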

Other ambiguity measures for Portuguese that I know of are the ones
published in Eckhard Bick's recent dissertation, and in Bacelar do
Nascimento et al. (1993). References follow:

Bacelar do Nascimento, Maria Fernanda, José Bettencourt Gonçalves, Lucília
Chacoto, Paula Neto & Luísa Alice Santos Pereira. 1993. Ambiguidade
morfológica no Português Fundamental. In Actas do 1.o Encontro de
Processamento de Língua Portuguesa (Escrita e Falada) - EPLP'93. Lisboa,
25-26 de Fevereiro de 1993, pp.101-106.

Bick, Eckhard. 2000. The Parsing System "Palavras". Automatic Grammatical
Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University
Press.

Medeiros, José Carlos, Rui Marques & Diana Santos. 1993. Português
Quantitativo. In Actas do 1.o Encontro de Processamento de Língua
Portuguesa (Escrita e Falada) - EPLP'93. Lisboa, 25-26 de Fevereiro de
1993, pp.33-38.

I hope these may be useful at least to those who are interested in
Portuguese :-)
Diana 

**************************************************************************
Diana Santos				Computational processing of Portuguese

SINTEF Telecom and Informatics	Tel. (direct line) +47 22 06 73 12
Forskningsveien 1			Tel. +47 22 06 73 00
Box 124 Blindern			Fax. +47 22 06 73 50
N-0314 Oslo				Email: Diana.Santos at informatics.sintef.no
Norway					http://www.portugues.mct.pt/
**************************************************************************