[Corpora-List] Ambiguous words in English and their frequency

Dr. Damir Cavar dcavar at indiana.edu
Thu Feb 2 13:45:42 UTC 2012


Hi Karen,

An interesting question. It would be interesting to see how this differs with extended tag-sets. From a more theoretical perspective, and without going into the technical aspects: if your POS set does not encode semantic or functional features, you might simply conclude that the number reflects ambiguity at the morphological and syntactic levels, restricted to the features your tag-set represents. I would expect a very different number for richer tag-sets that include semantic (or functional, pragmatic, etc.) features. A comparison over specific features for specific languages, i.e. an ambiguity index for each feature, would actually be highly interesting. But I guess we simply do not have enough corpora with such detail. Or do we? Has this been studied somewhere?
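
To make the dependence on the tag-set concrete, here is a minimal sketch of how such a count could be computed. It assumes that a token counts as ambiguous when its lowercased word form occurs with more than one tag somewhere in the same corpus; other definitions (e.g. lookup in an external lexicon) would give different numbers. The function name ambiguity_rate, the toy sample and the coarse mapping are all made up for illustration and are not taken from any real corpus.

```python
from collections import defaultdict


def ambiguity_rate(tagged_tokens, tag_map=None):
    """Share of tokens whose word form is seen with more than one tag.

    tagged_tokens: iterable of (word, tag) pairs.
    tag_map: optional dict that collapses fine tags into coarser ones
             (hypothetical helper for comparing tag-set granularities).
    """
    tags_per_word = defaultdict(set)
    words = []
    for word, tag in tagged_tokens:
        if tag_map is not None:
            tag = tag_map.get(tag, tag)  # unmapped tags are kept as they are
        word = word.lower()
        words.append(word)
        tags_per_word[word].add(tag)
    if not words:
        return 0.0
    ambiguous = sum(1 for w in words if len(tags_per_word[w]) > 1)
    return ambiguous / len(words)


# Invented toy sample with Penn-Treebank-style tags (illustration only).
TOY = [
    ("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ"),
    ("they", "PRP"), ("run", "VBP"), ("daily", "RB"),
    ("she", "PRP"), ("walked", "VBD"), ("home", "NN"),
    ("he", "PRP"), ("has", "VBZ"), ("walked", "VBN"), ("far", "RB"),
]

# Invented coarse mapping: all verb tags collapse to VERB, NN to NOUN.
COARSE = {"NN": "NOUN", "VBD": "VERB", "VBP": "VERB",
          "VBN": "VERB", "VBZ": "VERB"}

fine_rate = ambiguity_rate(TOY)            # "run" and "walked" are ambiguous
coarse_rate = ambiguity_rate(TOY, COARSE)  # only "run" remains ambiguous
print(fine_rate, coarse_rate)
```

On this toy sample the fine-grained tags mark 4 of 14 tokens as ambiguous, the collapsed tags only 2 of 14, which is the effect described above in miniature. Assuming NLTK and its Brown corpus data are installed, the (word, tag) pairs from nltk.corpus.brown.tagged_words() could be passed in the same way.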

Best wishes
DC

On Feb 2, 2012, at 5:25 AM, Karen Fort wrote:

> I could not find the time to make my question more precise, and then received a lot of very interesting answers and references.
> Thank you all for this!
> 
> In fact, I should have said that I'm looking for the number of ambiguous word tokens in terms of POS in an English corpus, for example from the Penn TreeBank. One solution would be to compute this myself from the Brown corpus, but I was curious if there was a ref. on this.
> 
> I found this ref for French, which says that 60% of the French tokens in their corpus were unambiguous in terms of POS:
> Tzoukermann, E.; Radev, D. R. & Gale, W. A. (1999). Tagging French without lexical probabilities: combining linguistic knowledge and statistical learning. In: Ken Church, Susan Armstrong, P. I. E. T. & Yarowsky, D. (ed.), Natural Language Processing Using Very Large Corpora. Kluwer Academic.
> 
> Of course, it all depends on the number of tags, their refinement and so on. It only gives a very rough idea and should be taken in its context, obviously. But that's all I need.

--
Dr. Damir Cavar
http://cavar.me/damir/




