[Corpora-List] Named Entity Extraction from Noisy, Unstructured Texts

Diane M. Napolitano dmnapolitano at gmail.com
Thu Jun 18 21:32:09 UTC 2009


Hello, everyone!  I'm looking for information on named entity recognition
from documents that are almost completely unstructured and incredibly
messy.  I get a lot of documents that are basically text extracted from
PDFs, images, PowerPoint slides and the like, and the resulting text is
often missing a lot of formatting.  I've read a number of papers and I've
tried training a statistical package (Stanford) on data of this kind, but the
resulting model actually performs worse than one trained on clean, narrative data.
Right now, my group has a rule-based system that relies on gazetteer lists,
which only gets us so far...

Does anyone have any insights they could provide? :)

Thanks!
Diane