[Corpora-List] announcing pukwac and wackypedia
Linas Vepstas
linasvepstas at gmail.com
Sun Jan 3 18:56:49 UTC 2010
Hi,
2009/12/19 Marco Baroni <marco.baroni at unitn.it>:
> We are happy to announce that you can download two new resources from the
> site of WaCky (Web as Corpus kool ynitiative):
>
> http://wacky.sslmit.unibo.it/
>
> 1) pukWaC: the ukWaC corpus, a 2 billion Web-derived corpus of English, now
> enriched with a full dependency parse (POS-tagging and lemmatization done
> with the TreeTagger, parsing done with the MaltParser);
>
> 2) WaCkypedia: a full 2009 English Wikipedia dump (about 800 million
> tokens), POS-tagged, lemmatized and dependency parsed with the same tools
> used for pukWaC.
If I may, I'd like to announce a smaller but similar project that provides
a tagged, dependency-parsed copy of Wikipedia. Since it is tagged
and parsed with a different set of tools, it may be useful
for comparative purposes.
The data is available here:
http://gnucash.org/linas/nlp/
The texts were dependency parsed with a combination of RelEx
http://opencog.org/wiki/RelEx and Link Grammar
http://www.abisource.com/projects/link-grammar/,
and are marked with dependencies (subject, object, prepositional
relations, etc.), with features (part-of-speech tags, verb-tense
and noun-number tags, etc.), with Link Grammar linkage relations,
and with phrasal constituency structure. The data is in the RelEx
compact output http://opencog.org/wiki/RelEx_compact_output
format. This format captures all of the parser output and is
meant to be easy to process with basic Perl scripts.
An example script is provided.
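
As a rough illustration of the kind of lightweight processing the format
is meant for, here is a minimal Python sketch. It is not the example script
mentioned above. It assumes that dependency relations appear one per line
as name(head, dependent), optionally with a [number] index after each word;
that layout is an assumption and should be checked against the format page.
The function names and the sample lines are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: one relation per line, shaped like  _subj(throw[2], Alice[1])
# or  _subj(throw, Alice).  Verify against the RelEx compact output docs.
RELATION = re.compile(
    r"^\s*(\w+)\(\s*([^,\[\]()]+)(?:\[\d+\])?\s*,"
    r"\s*([^,\[\]()]+)(?:\[\d+\])?\s*\)\s*$"
)


def parse_relations(lines):
    """Yield (relation, head, dependent) for every line that matches."""
    for line in lines:
        match = RELATION.match(line)
        if match:
            name, head, dependent = match.groups()
            yield name, head.strip(), dependent.strip()


def count_relation_types(lines):
    """Count how often each relation name occurs in the given lines."""
    return Counter(name for name, _, _ in parse_relations(lines))


# Hypothetical sample input; lines that are not relations are skipped.
sample = [
    "<relations>",
    "_subj(throw[2], Alice[1])",
    "_obj(throw[2], ball[4])",
    "</relations>",
]

for relation in parse_relations(sample):
    print(relation)
```

Lines that do not match the assumed pattern are ignored, so the same loop
can be run over a whole file without first isolating the relations.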
Although the project is currently a personal project, I am interested
in collaboration to expand its scope and quality.
-- Dr. Linas Vepstas
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora