[Corpora-List] Any research on long named-entities

Tue Dec 9 08:49:34 UTC 2008

Hi,

Alexandre Rafalovitch wrote:
> I am looking for any research on recognizing long named entities
> (mostly organisational bodies). When I say long, I mean 10-20 tokens
> in length, rather than more frequently discussed 5-7. A short-ish
> example of such a name would be "the United Nations Educational ,
> Scientific and Cultural Organization". Yes, that's names with commas,
> conjunctions and other tokens that are normally excluded.
>
> I suspect legal and biological domains would be closest in their need,
> but so far I have failed to find an especially relevant paper.
>   

We have done some work on IUPAC names, which are longish systematic chemical
names. In our training corpus, each entity has 31 tokens on average
(splitting at hyphens and brackets, which are very frequent in these entities).
In comparison, in the BioCreative 2 corpus each gene/protein entity has
1.8 tokens on average (but with a different tokenization).
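
To illustrate why the token counts get so high, here is a minimal sketch of a
tokenizer that splits at hyphens and brackets. The regular expression, the
function name and the example name are assumptions made for illustration, not
the tokenization used in the paper.

```python
import re

# Hypothetical rule: hyphens, brackets and commas become their own tokens;
# everything else is split at whitespace.
TOKEN_RE = re.compile(r"[\-\(\)\[\]\{\},]|[^\s\-\(\)\[\]\{\},]+")


def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)


# Example name chosen for illustration only.
name = "2-(4-isobutylphenyl)propanoic acid"
tokens = tokenize(name)
print(len(tokens), tokens)
```

Even this short name yields nine tokens under such a scheme, so longer
systematic names grow quickly.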

One special problem with these IUPAC names is wrongly inserted spaces within
the names. A feature (in a linear-chain CRF) detecting
leading or following white space solved some of the main problems. For
the boundary problems that occurred, I added a feature detecting
tokens that frequently occur last in entities.
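
As a rough sketch of what such features could look like, the following
computes per-token feature dictionaries of the kind one might pass to a
linear-chain CRF toolkit. All names (token_features, frequent_last_tokens,
min_count), the tokenization rule and the toy data are assumptions for
illustration; the paper's actual feature set may differ, and no CRF is
trained here.

```python
import re
from collections import Counter

# Hypothetical tokenization: hyphens, brackets and commas are single tokens.
TOKEN_RE = re.compile(r"[\-\(\)\[\]\{\},]|[^\s\-\(\)\[\]\{\},]+")


def tokenize_with_offsets(text):
    """Return (token, start, end) triples so white space can be inspected."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def frequent_last_tokens(entities, min_count=2):
    """Collect tokens that end at least min_count training entities."""
    counts = Counter(toks[-1].lower() for toks in entities if toks)
    return {tok for tok, n in counts.items() if n >= min_count}


def token_features(text, spans, i, last_tokens):
    """Features for the i-th token: white space context and last-token cue."""
    tok, start, end = spans[i]
    return {
        "token": tok.lower(),
        "space_before": start > 0 and text[start - 1].isspace(),
        "space_after": end < len(text) and text[end].isspace(),
        "frequent_last_token": tok.lower() in last_tokens,
    }


# Toy training entities (already tokenized) and a name with a stray space.
train_entities = [["ethanoic", "acid"], ["benzoic", "acid"], ["ethanol"]]
last_tokens = frequent_last_tokens(train_entities)

text = "2-(4-isobutylphenyl) propanoic acid"
spans = tokenize_with_offsets(text)
features = [token_features(text, spans, i, last_tokens) for i in range(len(spans))]
for f in features:
    print(f)
```

The white space flags let the model learn that a space inside a name need
not end the entity, while the last-token flag gives it a cue for the right
boundary.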

Perhaps there are some similarities to your work. You can find my paper here:
http://dx.doi.org/10.1093/bioinformatics/btn181

Regards,
Roman

-- 
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fhg.de/klinger.html
