[Corpora-List] 'imperfect' corpora

Hunter, Duncan D.I.Hunter at warwick.ac.uk
Wed Nov 15 20:58:07 UTC 2006


Hi list members,
 
I have been given access to a large amount of data that has been OCR'd into a digital (.txt file) format. The data is extremely valuable for a number of reasons, and I would like to carry out, among other things, a keyword analysis. However, test runs with corpus investigation tools show that the corpus has some reliability problems caused by OCR errors (mis-copied characters, words fragmented across end-of-line boundaries, etc.).
 
How can valuable but 'imperfect' corpus data be utilised effectively? Does anyone have tips on how anomalous (but generally explicable) results can be described and accounted for in a principled, consistent manner?
 
 
Duncan Hunter

