[Corpora-List] Any research on long named-entities
Roman Klinger
roman.klinger at scai.fhg.de
Tue Dec 9 08:49:34 UTC 2008
Hi,
Alexandre Rafalovitch wrote:
> I am looking for any research on recognizing long named entities
> (mostly organisational bodies). When I say long, I mean 10-20 tokens
> in length, rather than more frequently discussed 5-7. A short-ish
> example of such a name would be "the United Nations Educational ,
> Scientific and Cultural Organization". Yes, that's names with commas,
> conjunctions and other tokens that are normally excluded.
>
> I suspect legal and biological domains would be closest in their need,
> but so far I have failed to find an especially relevant paper.
>
We have done some work on IUPAC names, which are fairly long systematic
chemical names. In our training corpus, each entity has an average of 31
tokens (splitting at hyphens and brackets, which are very frequent in
these entities). By comparison, in the BioCreative 2 corpus each
gene/protein entity has 1.8 tokens on average (but with a different
tokenization).
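
To illustrate how splitting at hyphens and brackets inflates the token count, here is a minimal sketch. The regular expression and the names TOKEN_RE, tokenize and mean_tokens are assumptions made for this example, not the tokenization used for the corpus described above.

```python
import re

# Assumed tokenization rule (illustrative only): runs of letters, runs of
# digits, and every other non-space character as a token of its own.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenize(name):
    """Split a name into tokens, separating hyphens, brackets and commas."""
    return TOKEN_RE.findall(name)


def mean_tokens(entities):
    """Average number of tokens per entity under the rule above."""
    return sum(len(tokenize(e)) for e in entities) / len(entities)


if __name__ == "__main__":
    print(tokenize("2-(4-chlorophenyl)ethanol"))
    print(mean_tokens(["2-(4-chlorophenyl)ethanol", "propan-2-ol"]))
```

Even a short systematic name yields many tokens under such a rule, which is why averages from corpora with different tokenizations are hard to compare directly.
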
One particular problem with these IUPAC names is wrongly inserted spaces
within the names. However, a feature (in a linear-chain CRF) detecting
leading or following white space solved some of the main problems. For
the boundary problems that occurred, I added a feature detecting tokens
that frequently occur as the last token of an entity.
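
The two features mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the tokenization rule, the feature names, the min_count threshold and all function names are hypothetical, and the resulting feature dictionaries would still have to be passed to a CRF toolkit of your choice.

```python
import re
from collections import Counter

# Assumed tokenization rule (illustrative only): letter runs, digit runs,
# and every other non-space character as a separate token.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokens_with_spacing(text):
    """Return (token, space_before, space_after) triples for a text."""
    triples = []
    for m in TOKEN_RE.finditer(text):
        before = m.start() > 0 and text[m.start() - 1].isspace()
        after = m.end() < len(text) and text[m.end()].isspace()
        triples.append((m.group(), before, after))
    return triples


def frequent_last_tokens(entities, min_count=2):
    """Collect tokens that end at least min_count training entities."""
    counts = Counter()
    for entity in entities:
        toks = TOKEN_RE.findall(entity)
        if toks:
            counts[toks[-1].lower()] += 1
    return {tok for tok, n in counts.items() if n >= min_count}


def token_features(text, last_tokens):
    """Build one feature dict per token for a sequence labeller."""
    return [
        {
            "token": tok.lower(),
            "space_before": before,   # white space directly before the token
            "space_after": after,     # white space directly after the token
            "frequent_last": tok.lower() in last_tokens,
        }
        for tok, before, after in tokens_with_spacing(text)
    ]


if __name__ == "__main__":
    last = frequent_last_tokens(["propanoic acid", "ethanoic acid", "propan-2-ol"])
    for feats in token_features("2-chloro benzoic acid", last):
        print(feats)
```

The white space flags let the model learn that a space inside a name does not necessarily end the entity, while the last-token flag gives it a hint about where an entity is likely to stop.
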
Perhaps there are some similarities to your work. You can find my paper here:
http://dx.doi.org/10.1093/bioinformatics/btn181
Regards,
Roman
--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fhg.de/klinger.html
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora