[Corpora-List] (no subject)

maxwell maxwell at umiacs.umd.edu
Wed Oct 19 19:26:23 UTC 2011


On Wed, 19 Oct 2011 17:56:27 +0100 (BST), Abu Fahad
<salehosaimi at yahoo.com>
wrote:
> Are you aware of any frequency list of Arabic?

Let me up the generality a bit, since "frequency lists of language X"
seems to be a common request.

I would be surprised if there weren't frequency lists for many languages.
Obviously there are open questions (stemmed or not? what does it mean to
have a "balanced" corpus from which to derive such lists?), but such lists
probably have at least some utility.  Is there a single place that collects
links to frequency lists for multiple languages?
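
To make the "stemmed or not?" question concrete, here is a minimal sketch
in Python, not anyone's actual tooling. The sample sentence and the
crude_stem helper are invented for illustration; crude_stem only strips a
trailing "s" and is no substitute for a real stemmer, least of all for a
language like Arabic.

```python
import re
from collections import Counter


def frequency_list(text, normalize=None):
    """Return (token, count) pairs, most frequent first.

    normalize is an optional function applied to each token before
    counting, e.g. a stemmer or lemmatizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens).most_common()


def crude_stem(token):
    # Hypothetical toy "stemmer": drops a final "s" from longer tokens.
    # This is an assumption for illustration, not a real algorithm.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


# Invented sample text standing in for a corpus.
sample = "The cats chased the cat. The dogs watched the cats."

inflected = frequency_list(sample)                      # counts surface forms
stemmed = frequency_list(sample, normalize=crude_stem)  # counts merged forms
```

In the inflected list "cats" and "cat" are counted separately; in the
stemmed list they collapse into one entry, so the same corpus gives two
different rankings.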

There are lists of "correctly" spelled words for some languages, which are
sometimes grouped into top 10k words, top 20k etc.  I suppose one could
derive a very coarse-grained ranking from such lists.  Obviously these
would be inflected words, not stemmed words.
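
A rough sketch of what I mean by a coarse-grained ranking, again in
Python. It assumes the lists are nested (the top 20k contains the top 10k)
and are given smallest first; the tiny word lists and the coarse_bands
name are made up for the example.

```python
def coarse_bands(tiers):
    """Map each word to the smallest tier it appears in.

    tiers: list of (label, words) pairs, ordered from the smallest list
    (most frequent words) to the largest. Returns {word: (band, label)},
    where band 0 is the most frequent band.
    """
    bands = {}
    for band, (label, words) in enumerate(tiers):
        for word in words:
            # Keep the first (smallest) tier a word was seen in.
            bands.setdefault(word, (band, label))
    return bands


# Invented stand-ins for "top 10k" and "top 20k" spelling lists.
top_small = ["the", "of", "house"]
top_large = ["the", "of", "house", "houses", "housed"]

bands = coarse_bands([("top 3", top_small), ("top 5", top_large)])
```

All this recovers is which band a word falls in, not its rank within the
band, and "house" and "houses" remain separate entries since nothing is
stemmed.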

I looked at the ACL wiki
(http://aclweb.org/aclwiki/index.php?title=Main_Page), but nothing jumped
out at me.  So I'll prime the pump with a few links to such lists I did
find:
   http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
   http://invokeit.wordpress.com/frequency-word-lists/
        (there's a link to a link to an Arabic list here)
   http://borel.slu.edu/crubadan/index.html
        ("...send me an email if you're interested in 
         a particular language and there's plenty of 
         data I am free to share (frequency lists...")

   Mike Maxwell
