[Corpora-List] coreference annotation for penn treebank

Yannick Versley yannick.versley at unitn.it
Mon Feb 16 17:03:21 UTC 2009


> Is anyone aware of any other large-scale coreference annotation efforts for
> the Wall Street Journal portion of the Penn TreeBank?
The ARRAU corpus combines the Vieira & Poesio data with additional data that
has been annotated more recently:
http://cswww.essex.ac.uk/Research/nle/arrau/arrau-corpus-lrec2008
You would have to ask Massimo Poesio or Ron Artstein about its availability;
I'm not sure whether there has been an official release (as in: distributing
it via a website).
The OntoNotes project has annotated a portion of the PTB:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04
The format it comes in is somewhat weird: it is SGML that is *not* meant to be
parsed by an XML parser, and the traces in the treebank appear as tokens, which
means that you have to figure out yourself which "0" in the treebank is
really a token. Even in the second release there are still obvious errors,
e.g. where [Korea and Japan] is coreferent with "Korea", but "[Korea] and
Japan" with "those two countries". Still, it's about 10x as big as MUC-6 and
should definitely be worth a look.
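
To give an idea of what reading that kind of markup involves, here is a
minimal sketch in Python that scans a line with regular expressions instead
of an XML parser. It assumes (this is an assumption, please check it against
the files in the release) that mentions are marked with nested
<COREF ID="..." TYPE="..."> ... </COREF> tags around whitespace-separated
tokens. The function name read_chains and the trace filter are purely
illustrative: the filter only drops tokens starting with "*" (such as
*T*-1), and it cannot decide whether a "0" is a null element or a real
token; for that you would need to align against the treebank trees.

```python
import re
from collections import defaultdict

# Assumed markup: <COREF ID="..." ...>tokens</COREF>, possibly nested.
# Groups: (1) attributes of an opening COREF tag, (2) a closing COREF tag,
# (3) any other tag (ignored), (4) a whitespace-delimited token.
TAG_OR_TOKEN = re.compile(r'<COREF\s+([^>]*)>|(</COREF>)|(<[^>]*>)|([^\s<]+)')
ATTR_ID = re.compile(r'ID="([^"]*)"')

# Heuristic only: treebank empty elements such as *T*-1 or *PRO* start
# with "*". A bare "0" is ambiguous and is deliberately left alone.
TRACE = re.compile(r'^\*')


def read_chains(line, skip_traces=True):
    """Return (tokens, chains) for one line of COREF-tagged text.

    chains maps a chain ID to a list of (start, end) token spans,
    with end exclusive.
    """
    tokens = []
    chains = defaultdict(list)
    open_mentions = []  # stack of (chain_id, index of first token)
    for match in TAG_OR_TOKEN.finditer(line):
        attrs, close, other, tok = match.groups()
        if attrs is not None:
            id_match = ATTR_ID.search(attrs)
            chain_id = id_match.group(1) if id_match else None
            open_mentions.append((chain_id, len(tokens)))
        elif close is not None:
            chain_id, start = open_mentions.pop()
            chains[chain_id].append((start, len(tokens)))
        elif other is not None:
            continue  # some other tag: skipped in this sketch
        elif not (skip_traces and TRACE.match(tok)):
            tokens.append(tok)
    return tokens, dict(chains)


if __name__ == "__main__":
    example = ('<COREF ID="1" TYPE="IDENT"><COREF ID="2" TYPE="IDENT">Korea'
               '</COREF> and Japan</COREF> said *T*-1 that '
               '<COREF ID="1" TYPE="IDENT">those two countries</COREF> agree')
    toks, chains = read_chains(example)
    for chain_id, spans in sorted(chains.items()):
        print(chain_id, [" ".join(toks[s:e]) for s, e in spans])
```

The stack is what handles nested mentions like "[[Korea] and Japan]".
Printing the mention strings per chain is also a quick way to spot
annotation errors of the kind described above.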
The only other coreference resources of that size that I know of would be the 
ACE corpora (which annotate only some semantic classes), and the TüBa-D/Z 
treebank-plus-coreference (which is in German).

Best,
Yannick

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


