[Corpora-List] Merging verb phrase ellipsis annotations with the WSJ treebank

Rebecca Dridan rdrid at dridan.com
Thu Jun 13 12:09:54 UTC 2013


Hi Alan,

We needed an alignment between the WSJ raw text and the WSJ .mrg files for
our tokenisation work (here:
http://aclweb.org/anthology-new/P/P12/P12-2074.bib). It is not exactly what
you are after, since I was extracting the aligned raw text, rather than
calculating stand-off annotations, but I have code (Perl or C++) which
might be useful, at least for that data set. I'm happy to share the code,
but it is not currently packaged as a general solution. Contact me off-list
if you think it could be helpful.
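
For illustration only (this is not the Perl or C++ code mentioned above),
here is a minimal Python sketch of this kind of alignment. It assumes the
treebank leaves have already been read from the .mrg file with trace
(-NONE-) leaves dropped, and that the stand-off annotations are character
offsets into the raw text. The function names and the small escape table
are hypothetical, and real WSJ data will contain mismatches that this
does not handle.

```python
# Common PTB token escapes mapped back to a likely raw-text form.
# Assumption: this table is illustrative, not exhaustive.
PTB_UNESCAPE = {
    "-LRB-": "(", "-RRB-": ")",
    "-LCB-": "{", "-RCB-": "}",
    "-LSB-": "[", "-RSB-": "]",
    "``": '"', "''": '"',
}


def align_tokens(raw, tokens):
    """Return one (start, end) character span in `raw` per token.

    Walks left to right: skips whitespace in the raw text, then expects
    the next token (as written, or unescaped) to begin at that position.
    """
    spans = []
    pos = 0
    for tok in tokens:
        while pos < len(raw) and raw[pos].isspace():
            pos += 1
        # Try the token as written first, then its unescaped form.
        for cand in (tok, PTB_UNESCAPE.get(tok, tok)):
            if raw.startswith(cand, pos):
                spans.append((pos, pos + len(cand)))
                pos += len(cand)
                break
        else:
            raise ValueError(f"cannot align token {tok!r} at offset {pos}")
    return spans


def tokens_in_span(spans, start, end):
    """Indices of tokens overlapping the character range [start, end)."""
    return [i for i, (s, e) in enumerate(spans) if s < end and e > start]


raw = 'He said "no" (twice), and she did too.'
tokens = ["He", "said", "``", "no", "''", "-LRB-", "twice", "-RRB-",
          ",", "and", "she", "did", "too", "."]
spans = align_tokens(raw, tokens)
# A stand-off annotation covering characters 30-33 ("did") maps to token 11.
did_tokens = tokens_in_span(spans, 30, 33)
```

Raising on the first mismatch, rather than guessing, makes it easy to
find the places where the raw text and the treebank really do diverge.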

Rebecca


On Tue, Jun 11, 2013 at 11:04 PM, E. Alan Hogue
<eahogue at email.arizona.edu> wrote:

> Hello Corpora List,
>
> As you may know, not long ago this article was published:
>
> Bos, J., & Spenader, J. (2011). An annotated corpus for the analysis of VP
> ellipsis. Language Resources and Evaluation, 45(4), 463–494.
> doi:10.1007/s10579-011-9142-3
>
> Along with this, the authors made available a file of standoff annotation
> based on the raw version (non-parsed, non-tagged) of the WSJ in the Penn
> Treebank.
>
> http://www.let.rug.nl/bos/vpe/annotations.html
>
> I am currently trying to figure out the best way to merge or align this
> with the _parsed_ version of the WSJ, and this is turning out to be
> trickier than I expected. It occurs to me that this might in general be a
> problem someone else has solved before.
>
> Does anyone know of any code, modules, packages, algorithms, tricks, etc.
> that already do a good job of this type of thing, and which I might modify
> for this particular task? If it happens to be in Python that is a plus, but
> just about any language/platform will do.
>
> Thank you!
>
> Alan Hogue
> University of Arizona
>
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>