[Corpora-List] Free text corpora?

Raphael Mudge raffi at automattic.com
Tue Mar 2 21:31:29 UTC 2010


Hi Xin,
A collection of plain-text files of public domain books is available
from Project Gutenberg:

http://www.gutenberg.org/wiki/Main_Page

You can also download Wikipedia and convert the data into plain text.

http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/

If you need to mark up the corpus with part-of-speech tags, Stanford's
POS tagger may work for you.

http://nlp.stanford.edu/software/tagger.shtml

-- Raphael

Raphael Mudge
Code Wrangler, Automattic
http://www.afterthedeadline.com

On Mar 2, 2010, at 6:38 AM, Xin Yan wrote:

> Hello,
>
> Can anyone tell me whether there are any free text corpora that can
> be used for commercial purposes?
> Thank you in advance!
>
> Best,
> Xin Yan
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

