Hello, everyone! I'm looking for information on named entity recognition in documents that are almost completely unstructured and incredibly messy. Most of what I get is text extracted from PDFs, images, PowerPoint slides, and the like, and the extracted text has usually lost much of its formatting. I've read a number of papers, and I've tried training a statistical package (Stanford) on data of this kind, but it actually performs worse than when it is trained on clean, narrative data. Right now, my group has a rule-based system that relies on gazetteer lists, which only gets us so far...

Does anyone have any insights they could provide? :)

Thanks!
Diane