[Corpora-List] Apply Coreference Resolution in Wikipedia

Joel Nothman joel.nothman at gmail.com
Sat Apr 21 11:49:58 UTC 2012


Hi Daniel,

General coreference resolution may group all noun phrases into
coreferential clusters. The problem is simplified if you're only
interested in particular entity types (e.g. people), or particular
entities, or if you have additional information about those entities. As
such, in Wikipedia, where you have near-gold-standard links to articles
about some of the entities mentioned in an article, you can use
information contained in the linked article (or contained in other pages
with the same link target).

Depending on the needs of your task, you may also obtain enough reliable
samples by using high-precision methods: ignore pronoun anaphora, and
ignore cases where a name may be ambiguous (e.g. "Washington" in an
article where both person and city are link targets).

I called this task - matching names in a Wikipedia page to entities linked
from that page - "link inference" in my work on transforming Wikipedia to
NER training data (see http://schwa.org/projects/resources/wiki/Wikiner
for references). For this application, we could rely on redundancy to
discard low-confidence matches.

I applied a simple heuristic solution that was evaluated extrinsically as
a variable in the Wiki->NER task: when processing article A, collect all
aliases of the articles that A links to, drawing on various sources that
are ranked according to their reliability. Then find the longest matching
strings in A, preferring more reliable alias information, ignoring some
lowercase variants, and discarding conflicts.
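The steps above could be sketched roughly as follows. This is only an
illustration, not the script mentioned in this message: the function names,
the input format (one alias -> targets table per source, most reliable
first), the rule of skipping all-lowercase aliases, and the regex-based
matching are all assumptions.

```python
import re

def build_alias_map(sources):
    """sources: list of {alias: set of link targets}, most reliable first
    (hypothetical format). Returns alias -> single target."""
    alias_map = {}
    for source in sources:
        for alias, targets in source.items():
            # A more reliable source has already decided this alias;
            # all-lowercase aliases are skipped (an assumed simplification).
            if alias in alias_map or alias.islower():
                continue
            # An alias pointing at several targets is a conflict: mark it
            # so that less reliable sources cannot revive it.
            alias_map[alias] = next(iter(targets)) if len(targets) == 1 else None
    return {a: t for a, t in alias_map.items() if t is not None}

def infer_links(text, alias_map):
    """Return (offset, matched string, target) for each alias found in text."""
    if not alias_map:
        return []
    # Longest aliases first, so the alternation prefers the longest match.
    longest_first = sorted(alias_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:%s)(?!\w)" % "|".join(map(re.escape, longest_first))
    )
    return [(m.start(), m.group(0), alias_map[m.group(0)])
            for m in pattern.finditer(text)]

titles = {
    "Barack Obama": {"Barack Obama"},
    "George Washington": {"George Washington"},
    "Washington, D.C.": {"Washington, D.C."},
}
final_words = {
    "Obama": {"Barack Obama"},
    "Washington": {"George Washington", "Washington, D.C."},
}
alias_map = build_alias_map([titles, final_words])
links = infer_links("Barack Obama moved to Washington. Obama later met the press.",
                    alias_map)
```

Here the bare "Washington" is left unlinked because it is ambiguous between
two link targets, which trades recall for precision.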

For aliases I experimented with:
* Article titles
* Article redirect titles
* Titles and redirect titles of relevant disambiguation pages (important
for
* Final words in titles of person articles
* Text of incoming links
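As a rough illustration of how some of these alias sources could be
gathered for the articles linked from a page, here is a minimal sketch. The
input records and their keys ('title', 'redirects', 'is_person') are
hypothetical, and only three of the sources listed above are covered.

```python
from collections import defaultdict

def collect_aliases(linked_articles):
    """linked_articles: iterable of dicts with hypothetical keys 'title',
    'redirects' (list of str) and 'is_person' (bool).
    Returns one alias -> {targets} table per alias source."""
    titles = defaultdict(set)
    redirects = defaultdict(set)
    final_words = defaultdict(set)
    for art in linked_articles:
        target = art["title"]
        titles[target].add(target)
        for redirect in art.get("redirects", []):
            redirects[redirect].add(target)
        words = target.split()
        # Final word of a person article's title, e.g. a surname.
        if art.get("is_person") and len(words) > 1:
            final_words[words[-1]].add(target)
    return {
        "title": dict(titles),
        "redirect": dict(redirects),
        "final_word": dict(final_words),
    }

tables = collect_aliases([
    {"title": "Barack Obama", "redirects": ["Barack Hussein Obama"], "is_person": True},
    {"title": "Michelle Obama", "is_person": True},
    {"title": "Chicago"},
])
```

Keeping a set of targets per alias makes ambiguity visible: in this example
"Obama" maps to two articles, so a high-precision matcher can discard it.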

It may also be worth including:
* Bold text in the first paragraph of the article
* Foreign language equivalent titles

(Using the text of all incoming links without considering frequency is
probably a bad idea: it consistently reduced my task performance.)
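One way to take frequency into account is to keep an incoming-link text as
an alias only if it is used often enough, and mostly for the same target.
The sketch below assumes the link data is available as (anchor text,
target) pairs; the function name and both thresholds are illustrative
assumptions, not values from my experiments.

```python
from collections import Counter, defaultdict

def filter_anchor_aliases(anchor_target_pairs, min_count=2, min_share=0.5):
    """Keep (anchor, target) aliases seen at least min_count times and
    accounting for at least min_share of that anchor's uses."""
    pairs = list(anchor_target_pairs)
    pair_counts = Counter(pairs)
    anchor_totals = Counter(anchor for anchor, _ in pairs)
    kept = defaultdict(set)
    for (anchor, target), n in pair_counts.items():
        if n >= min_count and n / anchor_totals[anchor] >= min_share:
            kept[anchor].add(target)
    return dict(kept)

pairs = (
    [("Obama", "Barack Obama")] * 3
    + [("Obama", "Michelle Obama")]
    + [("here", "Barack Obama")]
)
anchor_aliases = filter_anchor_aliases(pairs)
```

Rare anchors such as "here" are dropped, as is the minority use of "Obama"
for a different article.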

Given that the entire method uses data of questionable reliability and is
naively heuristic, this does a reasonable job, but results are far from
perfect. In particular, extracting reliable disambiguation data is very
difficult.

While it is not feasible for me to send you a corpus of Wikipedia with
these links identified, I may be able to send you the extracted alias
data, and perhaps the Python script for inferring links. Email me
privately if that is of interest.

You may also be interested in the Named Entity Linking (or Disambiguation)
literature, which isn't interested in coreference in Wikipedia text, but
commonly links to Wikipedia entities. It is therefore also interested in
collecting aliases for Wikipedia entities.

Good luck!

Joel Nothman
PhD candidate
School of IT
University of Sydney

On Fri, 20 Apr 2012 20:55:44 +1000, Gerber Daniel
<dgerber at informatik.uni-leipzig.de> wrote:

> Hello,
> I'm currently working on a distant supervision approach for relation
> extraction. I'm using the English Wikipedia articles to find sentences
> which contain labels of resources, for example a resource's name like
> "Barack Obama". My problem is that this string only occurs in the
> first couple of sentences of the article and is then replaced, for
> example, with pronouns or things like "The president ...". So what I want
> to do is to apply coreference resolution on the complete English
> Wikipedia (ideally also in other languages like German) and replace
> those substitutions with the resource name.
>
> Is there a corpus like this already available? If not, would I need to
> write this myself (using some lib) or are there applications available
> which are able to do this?
> Also, what would be a good library for this task (speed, accuracy)? I
> came across Illinois Coreference Package, StanfordNLP, OpenNLP, Illinois
> but I can't afford to try them all. :/
>
> I would be very happy for some suggestions!
>
> Kind regards,
> Daniel
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora



