[Corpora-List] Any research on long named-entities

John Burger john at mitre.org
Tue Dec 9 16:41:31 UTC 2008


Alexandre Rafalovitch wrote:

> I am looking for any research on recognizing long named entities
> (mostly organisational bodies). When I say long, I mean 10-20 tokens
> in length, rather than more frequently discussed 5-7. A short-ish
> example of such a name would be "the United Nations Educational ,
> Scientific and Cultural Organization".

Sorry, no particular research to offer.  But, for what it's worth,  
there are plenty of examples of these in Wikipedia, possibly useful as  
training data, e.g.:

Office of the Commissioner of the Ministry of Foreign Affairs of the  
People's Republic of China in the Hong Kong Special Administrative  
Region
(http://tinyurl.com/5htdfw)

United Nations Convention Against Illicit Traffic in Narcotic Drugs  
and Psychotropic Substances
(http://tinyurl.com/ytnz7a)

Because of the fairly careful naming conventions in Wikipedia, I think  
simple heuristics could distinguish these from other long article  
titles that are not names, e.g.:

List of individual Chinese Basketball Association scoring leaders by  
season
(http://tinyurl.com/57kehg)
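
To make the "simple heuristics" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a tested method: it relies on Wikipedia titling articles in sentence case, so a long title whose content words are all capitalized is probably a proper name. The function name, the stop list, the prefix list, and the minimum length of 8 tokens are all my own guesses and would need tuning against real titles.

```python
import re

# Assumed stop list: lowercase function words that may appear inside a name.
FUNCTION_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to", "with"}

# Assumed prefixes that mark descriptive (non-name) article titles.
NON_NAME_PREFIXES = ("List of ", "Lists of ", "Timeline of ", "History of ", "Comparison of ")


def looks_like_long_name(title, min_tokens=8):
    """Guess whether a long article title is a proper name (hypothetical helper)."""
    # Drop a trailing parenthetical disambiguator such as "(1988)".
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title.strip())
    if title.startswith(NON_NAME_PREFIXES):
        return False
    tokens = title.split()
    if len(tokens) < min_tokens:
        return False
    for tok in tokens:
        if tok in FUNCTION_WORDS:
            continue  # lowercase function word, allowed inside a name
        if not (tok[0].isupper() or tok[0].isdigit()):
            return False  # lowercase content word: likely a descriptive title
    return True


titles = [
    "Office of the Commissioner of the Ministry of Foreign Affairs of the "
    "People's Republic of China in the Hong Kong Special Administrative Region",
    "United Nations Convention Against Illicit Traffic in Narcotic Drugs "
    "and Psychotropic Substances",
    "List of individual Chinese Basketball Association scoring leaders by season",
]
for t in titles:
    print(looks_like_long_name(t), t[:50])
```

On the three titles above, the first two are accepted and the "List of" title is rejected, both by its prefix and by its lowercase content words.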

And by following the backlinks to each article ("What links here"),  
you can find samples of these entities in context, which amounts to  
free annotated data.  For instance, the UN Convention is referenced here:

http://en.wikipedia.org/wiki/Cocaine#Current_Prohibition
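
A rough sketch of how the backlink idea might be automated, again in Python. The MediaWiki API does offer a `list=backlinks` query with `bltitle`, `bllimit` and `blnamespace` parameters, but the helper names, the context window size, and the sample text below are invented for illustration. The matcher is deliberately naive: it only finds links that use the exact title, so links that go through redirects or differ in first-letter case or underscores would be missed.

```python
import re
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def backlinks_url(title, limit=50):
    """Build a MediaWiki API URL listing articles that link to `title`."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "bllimit": limit,
        "blnamespace": 0,  # main (article) namespace only
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def mentions_in_wikitext(wikitext, title, window=40):
    """Return (left context, surface form, right context) for each link to `title`.

    Hypothetical helper: matches [[Title]] and [[Title|anchor text]] only.
    """
    pattern = re.compile(r"\[\[\s*" + re.escape(title) + r"\s*(?:\|([^\]]*))?\]\]")
    found = []
    for m in pattern.finditer(wikitext):
        surface = m.group(1) or title  # anchor text if given, else the title
        left = wikitext[max(0, m.start() - window):m.start()]
        right = wikitext[m.end():m.end() + window]
        found.append((left, surface, right))
    return found


# Invented sample text, not taken from any real article.
sample = "Text before [[Foo Bar Org|the Org]] text after and [[Foo Bar Org]] again."
mentions = mentions_in_wikitext(sample, "Foo Bar Org")
```

Each returned tuple gives the entity's surface form with its surrounding text, which is the shape of data a tagger would train on. Fetching the wikitext of each backlinking page is left out here.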

Again, for what it's worth.

- John D. Burger
   MITRE


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


