Corpora: Cross Document Coreference

Daniel Winchester d.winchester at cs.bham.ac.uk
Wed May 17 15:22:12 UTC 2000


Dear All,

I have recently undertaken an NLP PhD with the working title
'Cross-Document Coreference' in the Computer Science Department of the
University of Birmingham.  To get to the point, I am using the term
cross-document coreference to denote multiple, and often variant,
references to the same entity in different texts.  This usage follows
the handful of papers from the NLP community that outline systems
designed to disambiguate such references (e.g. the work of Breck
Baldwin and Amit Bagga).

Thus, in different documents, 'Clinton', 'William Clinton', 'William
Jefferson Clinton', etc., when referring to the president, could all be
said to 'corefer', but 'Bill Clinton', the New York policeman, or
'Clinton', the town in Arizona, would not.

I am aware that this 'coreference' is profoundly different from that
found within documents, and that the terminology itself is
problematic.  Coreference within a discourse or text relies on
relationships that are intended to allow the reader to resolve any
ambiguity; this is obviously not the case for references to the same
entity in unrelated texts.  Nevertheless, for the time being I will use
the term cross-document coreference.

I am hoping for some help on the following:

1.  Are there any corpora available that are marked for cross-document
coreference?

    I know that this is unlikely, but any corpus in which all references
to the same entity are linked in some way would be very useful.

2.  Does anyone know if this sort of work is being done or has been done
elsewhere under a different name or in a different discipline?

    It seems the sort of task that Information Retrieval (IR) would
be interested in but, to date, I have found no equivalent work.
    I'm basically after any suggestions that people might have about
where this is already being looked at, other newsgroups to which I
should post a query, or alternative disciplines and terminology
that might be relevant.

I hope that you will be able to help.

Kind Regards

Daniel Winchester

Research Student
Computer Science Dept
University of Birmingham


