[Corpora-List] Ambiguous words in English and their frequency
Sebastian Hellmann
hellmann at informatik.uni-leipzig.de
Thu Jan 26 08:58:43 UTC 2012
Hi Karen,
I don't have an answer to your question, but I was intrigued by how you
would go about substantiating a claim of the form:
"in $language, p1% of the words represent p2% of the ambiguity"
Here is my try:
You would take a dictionary and then count the number of defined
meanings per entry.
Let's say that "ambiguity" only occurs in context, and that words (or
tokens) with several meanings in a dictionary are called "polysemous".
So all polysemous tokens would have more than one meaning in the dictionary.
Then you take all polysemous words and generate sensible surface forms
(such as adding a plural 's').
Then you would need a corpus that gives you word/token frequencies in
real-life texts.
Then you can calculate what share polysemous tokens take in overall word
usage, right?
So in English that would be quite a lot:
http://en.wiktionary.org/wiki/the has several meanings and makes up
around 7% of all words in the Brown Corpus.
There you would have my first hypothesis:
"in English the word 'the' represents 7% of the ambiguity"
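As a rough sketch of the steps above in Python: the sense counts and token counts below are made-up toy numbers, not real dictionary or corpus data, and the function names are just for illustration. The surface-form step (plural 's' etc.) is left out. For the "p1% of the words represent p2% of the ambiguity" style of claim I assume, purely as one possible definition, that a word's contribution to ambiguity is its frequency times its number of extra senses.

```python
from collections import Counter

# Hypothetical toy dictionary: word -> number of senses listed for it.
sense_counts = {"the": 3, "bank": 2, "run": 5, "corpus": 1, "of": 1}

# Hypothetical toy corpus counts: word -> number of tokens observed.
token_counts = Counter(
    {"the": 70, "of": 36, "run": 4, "bank": 2, "corpus": 1, "zebra": 1}
)


def polysemous_share(sense_counts, token_counts):
    """Fraction of corpus tokens whose dictionary entry has more than one sense."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    # Words missing from the dictionary are treated as not polysemous.
    poly = sum(n for w, n in token_counts.items() if sense_counts.get(w, 0) > 1)
    return poly / total


def ambiguity_concentration(sense_counts, token_counts, top_fraction):
    """Share of the total 'ambiguity mass' carried by the top fraction of
    dictionary words, where mass = frequency * (senses - 1) (an assumption)."""
    mass = [token_counts.get(w, 0) * (s - 1) for w, s in sense_counts.items()]
    total = sum(mass)
    if total == 0:
        return 0.0
    ranked = sorted(mass, reverse=True)
    k = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:k]) / total


if __name__ == "__main__":
    print(polysemous_share(sense_counts, token_counts))
    print(ambiguity_concentration(sense_counts, token_counts, 0.2))
```

With real data you would replace the two toy tables by sense counts extracted from a dictionary and by frequency counts from a corpus; the result obviously depends on how finely the dictionary splits senses.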
Overall, it is a really nice question, as it can only be answered by
corpus analysis.
A human rater without a certain sensitivity to linguistics would
probably not consider 'the' ambiguous.
I am currently trying to integrate Wortschatz and Wiktionary via RDF and
will try to actually calculate what I sketched above.
It is a very interesting question and can also be used to measure
coverage and completeness of dictionaries.
All the best,
Sebastian
On 01/25/2012 08:33 PM, FORT, Karen wrote:
> Hi all,
>
> I need to find this information (the proportion of ambiguous words in English and their frequency).
> For example, we know that in French 8% of the words represent 30% of the ambiguity.
> Of course, it's very rough, but it's only to have a rough idea.
>
> Can somebody help me with this (of course, I searched for a ref but could not find anything precise)?
>
> Thank you in advance,
>
> Regards,
>
>
> Karën FORT
> Ingénieure/Engineer et/and doctorante/PhD student
> INIST-CNRS / LIPN
> 2, allée de Brabois
> 54500 Vandoeuvre-lès-Nancy
> France
> Bureau/Office: H112
> +33 (0)3 83 50 46 36
>
> http://www-lipn.univ-paris13.fr/~fort/
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org