[Corpora-List] announcing pukwac and wackypedia

Linas Vepstas linasvepstas at gmail.com
Sun Jan 3 18:56:49 UTC 2010


Hi,

2009/12/19 Marco Baroni <marco.baroni at unitn.it>:
> We are happy to announce that you can download two new resources from the
> site of WaCky (Web as Corpus kool ynitiative):
>
> http://wacky.sslmit.unibo.it/
>
> 1) pukWaC: the ukWaC corpus, a 2-billion-word Web-derived corpus of English, now
> enriched with a full dependency parse (POS-tagging and lemmatization done
> with the TreeTagger, parsing done with the MaltParser);
>
> 2) WaCkypedia: a full 2009 English Wikipedia dump (about 800 million
> tokens), POS-tagged, lemmatized and dependency parsed with the same tools
> used for pukWaC.

If I may, I'd like to announce a smaller but similar project that provides
a tagged, dependency-parsed copy of Wikipedia.  Since it is tagged
and parsed with a different set of tools, it may be useful
for comparative purposes.

The data is available here:
http://gnucash.org/linas/nlp/

The texts were dependency parsed with a combination of RelEx
http://opencog.org/wiki/RelEx and Link Grammar
http://www.abisource.com/projects/link-grammar/,
and are marked with dependencies (subject, object, prepositional
relations, etc.), with features (part-of-speech tags, verb-tense
and noun-number tags, etc.), with Link Grammar linkage relations,
and with phrasal constituency structure.  The data is in the RelEx
compact output format http://opencog.org/wiki/RelEx_compact_output,
which captures all of the parser output in a form meant to be
easy to process with basic perl scripts.
An example script is provided.
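
To give a feel for the kind of lightweight processing the format is meant
to allow, here is a minimal sketch in Python (not the provided perl script).
It assumes that each dependency relation sits on its own line in the shape
name(head[i], dependent[j]), for example _subj(throw[2], John[1]).  That line
shape, the sample lines, and the names Relation and parse_relation are
assumptions made for illustration; check them against the RelEx compact
output page before relying on them.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    """One dependency relation (hypothetical container for this sketch)."""
    name: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# ASSUMED line shape: name(head[i], dependent[j]).
# This pattern is an illustration, not taken from the RelEx documentation.
_RELATION = re.compile(r"^\s*(\w+)\((.+?)\[(\d+)\],\s*(.+?)\[(\d+)\]\)\s*$")


def parse_relation(line: str) -> Optional[Relation]:
    """Return a Relation for a matching line, or None for any other line."""
    match = _RELATION.match(line)
    if match is None:
        return None
    name, head, head_idx, dep, dep_idx = match.groups()
    return Relation(name, head, int(head_idx), dep, int(dep_idx))


# Made-up sample lines, for demonstration only.
sample_lines = [
    "_subj(throw[2], John[1])",
    "_obj(throw[2], ball[4])",
    "this line is not a relation",
]

relations = [r for r in map(parse_relation, sample_lines) if r is not None]
for rel in relations:
    print(rel.name, rel.head, rel.dependent)
```

Lines that do not match the assumed shape are skipped rather than treated
as errors, so the same loop can be run over a whole file that also contains
other kinds of parser output.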

Although this is currently a personal project, I am interested
in collaboration to expand its scope and quality.

-- Dr. Linas Vepstas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


