[Corpora-List] Any research on long named-entities
John Burger
john at mitre.org
Tue Dec 9 16:41:31 UTC 2008
Alexandre Rafalovitch wrote:
> I am looking for any research on recognizing long named entities
> (mostly organisational bodies). When I say long, I mean 10-20 tokens
> in length, rather than the more frequently discussed 5-7. A short-ish
> example of such a name would be "the United Nations Educational,
> Scientific and Cultural Organization".
Sorry, no particular research to offer. But, for what it's worth,
there are plenty of examples of these in Wikipedia, possibly useful as
training data, e.g.:
Office of the Commissioner of the Ministry of Foreign Affairs of the
People's Republic of China in the Hong Kong Special Administrative
Region
(http://tinyurl.com/5htdfw)
United Nations Convention Against Illicit Traffic in Narcotic Drugs
and Psychotropic Substances
(http://tinyurl.com/ytnz7a)
Because of the fairly careful naming conventions in Wikipedia, I think
simple heuristics could distinguish these from other long article
titles that are not names, e.g.:
List of individual Chinese Basketball Association scoring leaders by
season
(http://tinyurl.com/57kehg)
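
One way such heuristics might look, as a minimal sketch in Python. The prefix list, the function-word list, the 10-token threshold, and the function name are all assumptions made for illustration, not anything Wikipedia defines; a real filter would need tuning against actual titles.

```python
import re

# Hypothetical title prefixes that signal a non-name article (assumed, not exhaustive).
NON_NAME_PREFIXES = ("List of ", "Lists of ", "Timeline of ", "History of ",
                     "Index of ", "Outline of ")

# Hypothetical set of function words that may stay lowercase inside a name.
FUNCTION_WORDS = {"a", "an", "and", "against", "at", "by", "for", "from",
                  "in", "of", "on", "the", "to", "with"}


def looks_like_long_name(title, min_tokens=10):
    """Guess whether an article title is a long proper name.

    Heuristics: no list-style prefix, at least min_tokens tokens,
    and every token that is not a function word starts with a capital.
    """
    if title.startswith(NON_NAME_PREFIXES):
        return False
    # Drop a trailing parenthetical disambiguator such as "(treaty)".
    title = re.sub(r"\s*\([^)]*\)$", "", title)
    tokens = title.split()
    if len(tokens) < min_tokens:
        return False
    content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    return bool(content) and all(t[0].isupper() for t in content)


if __name__ == "__main__":
    for t in [
        "United Nations Convention Against Illicit Traffic in Narcotic "
        "Drugs and Psychotropic Substances",
        "List of individual Chinese Basketball Association scoring "
        "leaders by season",
    ]:
        print(looks_like_long_name(t), t)
```

The capitalization test alone already rejects the "List of ..." title, since words like "individual" and "scoring" are lowercase; the prefix check is just a cheap first pass.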
And by following the backlinks to each article (its "What links here"
page), you can find samples of these entities in context, essentially
free annotated data. For instance, the UN Convention is referenced here:
http://en.wikipedia.org/wiki/Cocaine#Current_Prohibition
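
Collecting those referring pages could be automated. The sketch below assumes the MediaWiki web API's list=backlinks query on English Wikipedia, which is my choice of mechanism and not something the suggestion above depends on; the function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wikipedia MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"


def backlinks_url(title, limit=50):
    """Build a query URL asking which articles link to `title`."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": 0,   # main (article) namespace only
        "bllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def backlink_titles(response):
    """Extract page titles from a decoded list=backlinks JSON response."""
    pages = response.get("query", {}).get("backlinks", [])
    return [page["title"] for page in pages]


def fetch_backlinks(title, limit=50):
    """Fetch titles of articles linking to `title` (needs network access)."""
    request = Request(
        backlinks_url(title, limit),
        headers={"User-Agent": "long-ne-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request) as resp:
        return backlink_titles(json.load(resp))
```

For example, fetch_backlinks("United Nations Convention Against Illicit Traffic in Narcotic Drugs and Psychotropic Substances") would return the referring article titles. In each of those articles the mention is marked up as a wiki link to the entity's page, so the link span gives the entity boundaries in context.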
Again, for what it's worth.
- John D. Burger
MITRE