[Corpora-List] Ambiguous words in English and their frequency

Kevin B. Cohen kevin.cohen at gmail.com
Thu Feb 2 12:47:48 UTC 2012


Last semester, one of my students did a project that showed the rate
of growth in ambiguous word/POS pairs as you read in increasing
amounts of text in, if I remember correctly, the WSJ and one
biomedical corpus.  Would his results be helpful?
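
For anyone who would rather compute the figure directly, here is a minimal sketch of both measurements discussed in this thread: the share of POS-ambiguous tokens, and how the number of ambiguous word types grows as more text is read. It is not my student's actual code. The function names, the case folding, and the toy data are assumptions made for illustration, and "ambiguous" here simply means "seen with more than one tag in the corpus itself", so the numbers depend entirely on the tagset and the corpus.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens):
    """Return (ambiguous_types, total_types, ambiguous_tokens, total_tokens).

    tagged_tokens is a list of (word, tag) pairs. A word type counts as
    ambiguous if it occurs with more than one tag in this list.
    Case folding is an assumption; drop .lower() to keep case distinctions.
    """
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    ambiguous_tokens = sum(
        1 for word, _ in tagged_tokens if word.lower() in ambiguous
    )
    return len(ambiguous), len(tags_by_word), ambiguous_tokens, len(tagged_tokens)


def ambiguity_growth(tagged_tokens, step):
    """Return [(tokens_read, ambiguous_types_so_far), ...] sampled every `step` tokens."""
    tags_by_word = defaultdict(set)
    ambiguous = set()
    curve = []
    for i, (word, tag) in enumerate(tagged_tokens, start=1):
        key = word.lower()
        tags_by_word[key].add(tag)
        if len(tags_by_word[key]) > 1:
            ambiguous.add(key)
        if i % step == 0:
            curve.append((i, len(ambiguous)))
    return curve


# Invented toy data, only to show the call shape.
toy = [
    ("The", "DT"), ("can", "NN"), ("can", "MD"), ("rust", "VB"), (".", "."),
    ("Rust", "NN"), ("can", "MD"), ("spread", "VB"), (".", "."),
]

amb_types, all_types, amb_tokens, all_tokens = ambiguity_stats(toy)
print(f"ambiguous types:  {amb_types}/{all_types}")
print(f"ambiguous tokens: {amb_tokens}/{all_tokens} "
      f"({100.0 * amb_tokens / all_tokens:.1f}%)")
print(ambiguity_growth(toy, step=3))
```

The same functions should work on any list of (word, tag) pairs, for instance `list(nltk.corpus.brown.tagged_words())` if NLTK and the Brown corpus are installed. A lexicon-based definition of ambiguity (all tags a word could take, not only those observed) would give different figures.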

Kev

On Thu, Feb 2, 2012 at 10:25 AM, Karen Fort <karen.fort at inist.fr> wrote:
> Hi all,
>
> I did not find the time to make my question more precise, and then received
> a lot of very interesting answers and references.
> Thank you all for this!
>
> In fact, I should have said that I'm looking for the number of ambiguous
> word tokens in terms of POS in an English corpus, for example from the Penn
> TreeBank. One solution would be to compute this myself from the Brown
> corpus, but I was curious if there was a ref. on this.
>
> I found this ref for French that says 60% of the French tokens in their
> corpus were unambiguous in terms of POS:
> Tzoukermann, E.; Radev, D. R. & Gale, W. A. Tagging French without lexical
> probabilities -- combining linguistic knowledge and statistical learning.
> In: Ken Church, Susan Armstrong, P. I. E. T. & Yarowsky, D. (ed.), Natural
> Language Processing Using Very Large Corpora. Kluwer Academic, 1999
>
> Of course, it all depends on the number of tags, their granularity, and so
> on. It only gives a very rough idea and should be taken in its context,
> obviously. But that's all I need.
>
> Best,
>
> Karen
>
>
> On 26/01/2012 10:39, Eckhard Bick wrote:
>>
>> Hello again,
>>
>> I forgot to add that the ambiguous word tokens in my English test run
>> amounted to 49.8%.
>>
>> Best,
>> Eckhard
>>
>> On 2012-01-25 20:33, FORT, Karen wrote:
>>>
>>> Hi all,
>>>
>>> I need to find this information (the proportion of ambiguous words in
>>> English and their frequency).
>>> For example, we know that in French 8% of the words represent 30% of the
>>> ambiguity.
>>> Of course, that figure is very approximate, but a rough idea is all I need.
>>>
>>> Can somebody help me with this (of course, I searched for a ref but could
>>> not find anything precise)?
>>>
>>> Thank you in advance,
>>>
>>> Regards,
>>>
>>>
>>> Karën FORT
>>> Ingénieure/Engineer et/and doctorante/PhD student
>>> INIST-CNRS / LIPN
>>> 2, allée de Brabois
>>> 54500 Vandoeuvre-lès-Nancy
>>> France
>>> Bureau/Office: H112
>>> +33 (0)3 83 50 46 36
>>>
>>> http://www-lipn.univ-paris13.fr/~fort/
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>



-- 
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program,
U. Colorado School of Medicine
303-916-2417 (cell) 303-377-9194 (home)
http://compbio.ucdenver.edu/Hunter_lab/Cohen
