[Corpora-List] coreference annotation for penn treebank
Yannick Versley
yannick.versley at unitn.it
Mon Feb 16 17:03:21 UTC 2009
> Is anyone aware of any other large-scale coreference annotation efforts for
> the Wall Street Journal portion of the Penn TreeBank?
The ARRAU corpus combines the Vieira & Poesio data with some data that has
been annotated more recently:
http://cswww.essex.ac.uk/Research/nle/arrau/arrau-corpus-lrec2008
You would have to ask Massimo Poesio or Ron Artstein about availability;
I'm not sure whether there has been an official release (as in: distributing
it via a website).
The OntoNotes project has annotated a portion of the PTB:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04
The format it comes in is somewhat weird: it is SGML that is *not* meant to be
parsed by an XML parser, and the traces in the treebank appear as tokens, which
means that you have to figure out yourself which "0" in the treebank is
really a token. Even in the second release, there are still obvious errors,
e.g. [Korea and Japan] is coreferent with "Korea", but "[Korea] and Japan"
with "those two countries". Still, it's about 10x as big as MUC-6 and
should definitely be worth a look.
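
One way to tell a trace "0" from a real "0" is to look at the part-of-speech
tag of each leaf in the bracketed parse: in Penn Treebank bracketing, empty
elements are tagged -NONE-. The sketch below assumes the OntoNotes parse files
follow that convention and that each tree is available as a single bracketed
string; the function names and the example sentence are made up for
illustration and are not part of any OntoNotes tooling.

```python
import re

# A preterminal looks like "(TAG word)": a tag and a word, no nested brackets.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(tree):
    """Return (word, tag) pairs for every leaf, traces included."""
    return [(word, tag) for tag, word in PRETERMINAL.findall(tree)]


def real_token_positions(tree):
    """Map each leaf index (traces included) to its index among real tokens.

    Leaves tagged -NONE- (empty elements) map to None.
    """
    mapping = []
    next_real = 0
    for _word, tag in leaves(tree):
        if tag == "-NONE-":
            mapping.append(None)
        else:
            mapping.append(next_real)
            next_real += 1
    return mapping


# Hypothetical example: the first "0" is a null complementizer (a trace),
# the second "0" is an ordinary number token.
EXAMPLE = (
    "(S (NP-SBJ (PRP He)) (VP (VBD said) (SBAR (-NONE- 0) "
    "(S (NP-SBJ (DT the) (NN rate)) (VP (VBD was) (NP (CD 0)))))) (. .))"
)

if __name__ == "__main__":
    for (word, tag), pos in zip(leaves(EXAMPLE), real_token_positions(EXAMPLE)):
        print(word, tag, pos)
```

The mapping is kept in both directions of use in mind: coreference spans
given in trace-inclusive token offsets can be converted to offsets over the
surface text by looking up their start and end leaves.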
The only other coreference resources of that size that I know of would be the
ACE corpora (which annotate only some semantic classes), and the TüBa-D/Z
treebank-plus-coreference (which is in German).
Best,
Yannick