[Corpora-List] German document and token frequencies

Adriano Ferraresi a.ferraresi at gmail.com
Tue Jan 8 12:00:26 UTC 2008


Hi Andrea,

you could consider deWaC, a 1.7 billion word general-language corpus
constructed from the Web. From it you can easily derive a token frequency
list, which you could use as a reference for your purposes.
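
As a rough illustration, token and document frequencies could be counted
along the following lines. This is only a sketch under stated assumptions:
it presumes a one-token-per-line file in which each document is closed by a
line starting with "</text", other markup lines start with "<", and the word
form is the first tab-separated field. That layout and the function name
count_frequencies are assumptions made for the example, not a description of
how deWaC is actually distributed.

```python
from collections import Counter


def count_frequencies(lines):
    """Return (token_freq, doc_freq) for a one-token-per-line corpus.

    Assumed layout (hypothetical, adapt to the real files):
      - a line starting with "</text" closes a document
      - any other line starting with "<" is markup and is skipped
      - otherwise the word form is the first tab-separated field
    """
    token_freq = Counter()  # total occurrences of each token
    doc_freq = Counter()    # number of documents containing each token
    seen_in_doc = set()     # distinct tokens seen in the current document

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("</text"):
            # Document boundary: credit each distinct token once.
            doc_freq.update(seen_in_doc)
            seen_in_doc = set()
            continue
        if line.startswith("<"):
            # Other markup; note this also skips a literal "<" token.
            continue
        token = line.split("\t", 1)[0]
        token_freq[token] += 1
        seen_in_doc.add(token)

    # Flush a final document that lacks a closing tag.
    doc_freq.update(seen_in_doc)
    return token_freq, doc_freq
```

Since an open file object yields lines, it can be passed in directly, so the
corpus is read as a stream and never has to be held in memory as a whole. Case is
deliberately preserved, because capitalisation is meaningful in German.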

For further info visit http://wacky.sslmit.unibo.it/ (in the "Available
corpora" section you can find instructions on how to obtain the corpus).

Regards,

Adriano


2008/1/8, Andrea Mulloni <andrea2 at wlv.ac.uk>:
>
>  Dear all,
> I am currently looking for a German document and token frequency list to
> use as a reference. The corpus from which the lists are derived can be of
> any size > 1 M tokens. Can anyone give me any pointers?
>
> Thanks in advance for any suggestion,
>
> Andrea
>
>
>  -------------
>
> Andrea Mulloni
> PT_PhD_S (Part-Time PhD Student)
> Computational Linguistics Group
> University of Wolverhampton
> Wolverhampton
> United Kingdom
>
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>