[Corpora-List] Ambiguous words in English and their frequency

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Thu Jan 26 08:58:43 UTC 2012


Hi Karen,
I don't have an answer to your question, but I was intrigued by how you 
would go about proving a claim of the form:
"in $language, p1% of the words represent p2% of the ambiguity"

Here is my try:

You would take a dictionary and then count the number of defined 
meanings per entry.
Let's say that "ambiguity" only occurs in context, and that words (or 
tokens) with several meanings in a dictionary are called "polysemous".  
So all polysemous tokens would have more than one meaning in the dictionary.

Then you take all polysemous words and generate their plausible surface 
forms (for example, by adding a plural 's').
Then you would need a separate corpus that gives word/token 
frequencies in real-life texts.

Then you can calculate what share polysemous tokens take in overall word 
usage, right?
So in English that would be quite a lot:
http://en.wiktionary.org/wiki/the  has several meanings and makes up 
around 7% of all words in the Brown Corpus.
There you would have my first hypothesis:
"in English the word 'the' represents 7% of the ambiguity"
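
The counting steps above could be sketched roughly as follows in Python. 
This is only a minimal sketch: the two tables are toy data invented for 
illustration (they come from neither Wiktionary nor the Brown Corpus), the 
function names are hypothetical, and the surface-form expansion step is 
left out. It also assumes that a word's "share of the ambiguity" means its 
share of all polysemous token occurrences, which is just one possible 
reading of the claim.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration only.
# sense_counts: number of dictionary meanings per entry.
sense_counts = {"the": 3, "bank": 2, "run": 5, "xylophone": 1, "of": 1}
# token_counts: occurrences of each token in some corpus.
token_counts = Counter({"the": 70, "of": 30, "run": 6, "bank": 4, "xylophone": 1})


def is_polysemous(word, sense_counts):
    """A word is polysemous if the dictionary lists more than one meaning."""
    return sense_counts.get(word, 0) > 1


def polysemous_share(sense_counts, token_counts):
    """Share of all corpus tokens that belong to polysemous entries."""
    total = sum(token_counts.values())
    poly = sum(n for w, n in token_counts.items() if is_polysemous(w, sense_counts))
    return poly / total if total else 0.0


def ambiguity_shares(sense_counts, token_counts):
    """Each polysemous word's share of all polysemous token occurrences."""
    poly = {w: n for w, n in token_counts.items() if is_polysemous(w, sense_counts)}
    total = sum(poly.values())
    return {w: n / total for w, n in poly.items()} if total else {}


print(polysemous_share(sense_counts, token_counts))
print(ambiguity_shares(sense_counts, token_counts))
```

Under this reading, the share of polysemous tokens in the whole corpus and 
a single word's share of the ambiguity are two different numbers, so the 7% 
figure for 'the' would correspond to the first kind of quantity.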

Overall, it is a really nice question, as it can only be answered by 
corpus analysis.
Any human rater would probably not consider 'the' ambiguous without a 
certain sensitivity to linguistics.


I am currently trying to integrate Wortschatz and Wiktionary via RDF and 
will try to actually calculate what I sketched above.
It is a very interesting question and can also be used to measure 
coverage and completeness of dictionaries.

All the best,
Sebastian


On 01/25/2012 08:33 PM, FORT, Karen wrote:
> Hi all,
>
> I need to find this information (the proportion of ambiguous words in English and their frequency).
> For example, we know that in French 8% of the words represent 30% of the ambiguity.
> Of course, it's very rough, but it's only to have a rough idea.
>
> Can somebody help me with this (of course, I searched for a ref but could not find anything precise)?
>
> Thank you in advance,
>
> Regards,
>
>
> Karën FORT
> Ingénieure/Engineer et/and doctorante/PhD student
> INIST-CNRS / LIPN
> 2, allée de Brabois
> 54500 Vandoeuvre-lès-Nancy
> France
> Bureau/Office: H112
> +33 (0)3 83 50 46 36
>
> http://www-lipn.univ-paris13.fr/~fort/
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

