[Corpora-List] English POS tagged corpus

Eric Atwell eric at comp.leeds.ac.uk
Fri Nov 19 15:49:06 UTC 2004


Gaurav,

The SourceForge open-source Python Natural Language Toolkit (NLTK)
http://nltk.sourceforge.net/
is a student-oriented teaching resource with a bundle of corpus and
lexical resources, including the PoS-tagged Brown corpus of US English:

20_newsgroups  genesis    lexicon        roget      treebank
brown          gutenberg  names          semcor1.7  treebank_swb
chunking       ieer       nltk-data-0.3  senseval   wordnet
cmp-lg         levin      ppattach       stopwords  words

It also comes with demo software and easy-to-follow tutorials and
API documentation for tokenization, tagging, parsing, and probabilistic
modelling.  As it's open-source, new contributions keep on coming;
e.g. the latest News says "Christopher Maloof's implementation of the Brill
tagger has been added to the development version of NLTK".
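
To give a feel for what a PoS-tagged corpus looks like once loaded, here is a
minimal sketch in plain Python. It assumes the common word/tag layout (each
token written as word, slash, tag) and uses a made-up sample sentence rather
than the real corpus files; the names SAMPLE and parse_tagged are hypothetical
and are not part of NLTK. In current NLTK releases I believe the equivalent
data comes from nltk.corpus.brown.tagged_words(), but that interface may well
differ from the release described above.

```python
from collections import Counter

# Illustrative sample in word/tag layout; NOT read from the actual corpus files.
SAMPLE = "The/at jury/nn said/vbd the/at election/nn was/bedz fair/jj ./."


def parse_tagged(text):
    """Split whitespace-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash (e.g. "1/2/cd") keeps its slash in the word part.
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # token carries no tag; skip it
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(SAMPLE)

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts.most_common(2))
```

The same (word, tag) pairs are what a tagger is trained and evaluated on, so
tag frequency counts like these are a reasonable first sanity check on any
tagged corpus you download.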

Of course, other tagged corpora are available from ICAME, LDC, ELRA etc.,
but you may have to pay, and they don't come with demo software/tutorials
(admittedly you didn't say you wanted any associated software/tutorials
:-)

hope this helps

Eric
-
Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
School of Computing, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric

On Fri, 19 Nov 2004, Gaurav Malhotra wrote:

> Hi,
>    Is there an English Parts-of-Speech corpus available for download on the internet? I will be very grateful.
>    Gaurav Malhotra

