[Corpora-List] Ambiguous words in English and their frequency

Kevin Brubeck Unhammer unhammer at gmail.com
Thu Jan 26 09:53:11 UTC 2012


Sebastian Hellmann <hellmann at informatik.uni-leipzig.de> writes:

> Hi Karen,
> I don't have an answer for your question, but I was intrigued how you
> would calculate proof for the claim:
> "in $language p1% of the words represent p2% of the ambiguity"
>
> Here is my try:
>
> You would take a dictionary and then count the number of defined
> meanings per entry.
> Let's define that "ambiguity" only occurs in context and words (or
> tokens) with several meanings in a dictionary are called "polysemous".
> So all polysemous tokens would have more than one meaning in the
> dictionary.
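
To make the "p1% of the words represent p2% of the ambiguity" claim
concrete, here is a minimal Python sketch of the counting step described
above. It assumes, purely for illustration, that a word's "ambiguity" is
its number of dictionary senses minus one; the sense counts and the
function name ambiguity_share are invented, not taken from any real
dictionary. Weighting by token frequency in a corpus would be a different
and equally defensible choice.

```python
def ambiguity_share(sense_counts, top_fraction):
    """Share of all 'extra' senses held by the most polysemous words.

    sense_counts: dict mapping word -> number of dictionary senses.
    top_fraction: fraction of the vocabulary to look at (the p1 in the claim).
    Assumption: a word with n senses contributes n - 1 units of ambiguity.
    """
    extra = sorted((n - 1 for n in sense_counts.values()), reverse=True)
    total = sum(extra)
    if total == 0:
        return 0.0  # no polysemous words at all
    k = max(1, round(len(extra) * top_fraction))
    return sum(extra[:k]) / total

# Invented sense counts for a ten-word toy vocabulary.
SENSE_COUNTS = {
    "bank": 4, "paper": 3, "run": 6, "set": 8, "table": 2,
    "green": 2, "the": 1, "of": 1, "cat": 1, "oxygen": 1,
}

share = ambiguity_share(SENSE_COUNTS, 0.2)
print(f"top 20% of words hold {share:.0%} of the extra senses")
```

The result depends entirely on how finely the dictionary splits senses,
which is the problem discussed below.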

Ambiguity could also mean plain morphological ambiguity, e.g. "a bank",
a noun, vs "to bank", a verb used by airline pilots, and on a more
fine-grained level: "to bank", infinitive, vs "we bank", present tense
indicative non-3SG.

Morphological ambiguity is easier to count than word-sense ambiguity
since (1) corpora and taggers often don't go further, and (2) with
word-sense ambiguity it's very hard to know how far to go ("river bank"
vs "financial institution" is uncontroversial, but do you divide
"building of financial institution" from "legal entity of financial
institution"?). With morphological ambiguity, on the other hand, it is
in most cases easy to test how ambiguous a form is[1]. With word-senses
you need some framework (or dictionary/Wiktionary) to constrain you.


[1]   At least if you stick to observed sentences and don't go 
      "but I could easily verb that noun" all the time.
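
As a rough illustration of counting morphological ambiguity from observed
data, here is a minimal Python sketch. It assumes a corpus that is already
tagged as (form, tag) pairs; the toy sentences, the tag names and the
function names are all invented for illustration and do not come from any
particular corpus or tagger.

```python
from collections import defaultdict

# Toy hand-tagged corpus (invented): (surface form, morphological tag) pairs.
TAGGED = [
    ("the", "DET"), ("bank", "NOUN"), ("closed", "VERB.PAST"),
    ("we", "PRON"), ("bank", "VERB.PRES"), ("here", "ADV"),
    ("planes", "NOUN"), ("can", "AUX"), ("bank", "VERB.INF"),
]

def observed_analyses(tagged):
    """Map each lower-cased surface form to the set of tags seen with it."""
    analyses = defaultdict(set)
    for form, tag in tagged:
        analyses[form.lower()].add(tag)
    return dict(analyses)

def ambiguous_forms(analyses):
    """Forms observed with more than one tag, mapped to their tag count."""
    return {form: len(tags) for form, tags in analyses.items() if len(tags) > 1}

analyses = observed_analyses(TAGGED)
ambiguous = ambiguous_forms(analyses)
print(ambiguous)  # {'bank': 3}
```

Because only observed analyses are counted, the granularity of the tagset
decides the outcome: with a coarse NOUN/VERB tagset "bank" would count as
two-way ambiguous here, not three-way.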

> Then you take all polysemous words and create sensible surface forms
> (such as add plural 's' ).

Collecting all forms complicates things a bit; a word might be
polysemous in singular, but monosemous (is that a word?) in plural. It
happens with mass nouns, e.g. "paper" vs "papers", where the plural
can't mean pieces of paper. (And then "bank" is no longer ambiguous
between a noun and a verb.)
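
The "paper" vs "papers" point can be sketched in Python as follows. The
sense inventory, the form-to-lemma mapping and the function names are
hypothetical, made up only to show how counting senses per lemma can
overstate the ambiguity of an individual surface form.

```python
# Hypothetical sense inventory per surface form (invented, not from a real dictionary).
SENSES_BY_FORM = {
    "paper": {"material", "article", "newspaper"},
    "papers": {"article", "newspaper"},  # the plural lacks the mass-noun sense
    "river": {"stream"},
    "rivers": {"stream"},
}

# Hypothetical mapping from surface form to lemma.
LEMMA_OF = {"paper": "paper", "papers": "paper", "river": "river", "rivers": "river"}

def senses_by_lemma(senses_by_form, lemma_of):
    """Union the senses of all surface forms that share a lemma."""
    merged = {}
    for form, senses in senses_by_form.items():
        merged.setdefault(lemma_of[form], set()).update(senses)
    return merged

def overcounted_forms(senses_by_form, lemma_of):
    """Forms with fewer senses than their lemma: (form senses, lemma senses)."""
    by_lemma = senses_by_lemma(senses_by_form, lemma_of)
    result = {}
    for form, senses in senses_by_form.items():
        lemma_total = len(by_lemma[lemma_of[form]])
        if len(senses) < lemma_total:
            result[form] = (len(senses), lemma_total)
    return result

overcounted = overcounted_forms(SENSES_BY_FORM, LEMMA_OF)
print(overcounted)  # {'papers': (2, 3)}
```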



regards,
Kevin Brubeck Unhammer

