[Corpora-List] announcing pukwac and wackypedia

Eric Atwell csc6ea at leeds.ac.uk
Mon Jan 4 12:17:03 UTC 2010


Marco, Linas,

thanks for making available these dependency-parsed English corpora.

What do you see these being used for? 
What are the useful applications of dependency-parsed treebanks?

Leeds PhD student Kais Dukes is developing a dependency-parsed
Arabic Quranic Corpus, see http://corpus.quran.com/treebank.jsp
but we are not clear yet what it could be used for; and we'd like to 
persuade scholars in Arabic and Quranic Studies that dependency parses 
can be useful (apart from training machine-learning parsers...)

thanks for any pointers or suggestions


Eric Atwell,
  Senior Lecturer, Language research group, School of Computing,
  Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
  TEL: 0113-3435430  FAX: 0113-3435468  WWW/email: google Eric Atwell


On Sun, 3 Jan 2010, Linas Vepstas wrote:

> Hi,
>
> 2009/12/19 Marco Baroni <marco.baroni at unitn.it>:
>> We are happy to announce that you can download two new resources from the
>> site of WaCky (Web as Corpus kool ynitiative):
>>
>> http://wacky.sslmit.unibo.it/
>>
>> 1) pukWaC: the ukWaC corpus, a 2-billion-word Web-derived corpus of
>> English, now enriched with a full dependency parse (POS-tagging and
>> lemmatization done with the TreeTagger, parsing done with the MaltParser);
>>
>> 2) WaCkypedia: a full 2009 English Wikipedia dump (about 800 million
>> tokens), POS-tagged, lemmatized and dependency parsed with the same tools
>> used for pukWaC.
>
> If I may, I'd like to announce a smaller but similar project to provide
> a tagged, dependency-parsed copy of Wikipedia.  Since it is tagged
> and parsed with a different set of tools, it may be useful
> for comparative purposes.
>
> The data is available here:
> http://gnucash.org/linas/nlp/
>
> The texts were dependency parsed with a combination of RelEx
> http://opencog.org/wiki/RelEx  and Link Grammar
> http://www.abisource.com/projects/link-grammar/,
> and are marked with dependencies (subject, object, prepositional
> relations, etc.), with features (part-of-speech tags, verb-tense
> and noun-number tags, etc.), with Link Grammar linkage relations,
> and with phrasal constituency structure.  The data is in the RelEx
> compact output http://opencog.org/wiki/RelEx_compact_output
> format.  This format captures all of the parser output and is
> meant to be easy to process with basic Perl scripts.
> An example script is provided.
>
> Although the project is currently a personal project, I am interested
> in collaboration to expand its scope and quality.
>
> -- Dr. Linas Vepstas
>

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


