[Corpora-List] the ebb and flow of inclusion of words in OED?
Martin Reynaert
reynaert at uvt.nl
Tue Apr 26 14:28:23 UTC 2011
> It would be interesting to do a more detailed study of word
> creation and disuse by going back to the original documents,
> when more of them become digitized.
>
> John Sowa
>
Dear John,
Just as a general note of warning on this... The examples are Dutch, but
sobering nevertheless.
The Dutch National Library is putting online 8 million pages of
digitized newspapers. It is a delightful collection going back to 1618,
available for free to all.
If you go to
http://kranten.kb.nl/
and type in the query 'atoomschip' or even 'atoomtram' you will get a
range of hits from about 1900 to 1927. The terms translate as 'nuclear
ship' and 'nuclear tram'...
These are, of course, simple 's' to 'a' OCR-misrecognition errors, steam
ships ('stoomschepen') and steam trams ('stoomtrams') being common at
the time ;0) They are also an example of real-world real-word errors,
far more interesting than the 20 or so word confusion sets used in most
research on context-sensitive spelling correction.
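
To make the 's' to 'a' confusion concrete, here is a minimal sketch of how one might flag such real-word errors by undoing a known OCR character confusion and checking the result against a word list. The function name, the tiny lexicon and the single confusion pair are illustrative assumptions of mine, not part of the KB collection or of any existing tool.

```python
def ocr_confusion_candidates(word, lexicon, confusions=(("a", "s"),)):
    """Return lexicon words reachable from `word` by undoing one OCR confusion.

    Each pair in `confusions` is (character as printed by the OCR,
    character assumed to be in the original text). Hypothetical helper.
    """
    candidates = set()
    for seen, intended in confusions:
        for i, ch in enumerate(word):
            if ch == seen:
                # Swap a single occurrence back and keep it if it is a known word.
                candidate = word[:i] + intended + word[i + 1:]
                if candidate in lexicon:
                    candidates.add(candidate)
    return sorted(candidates)


# Toy lexicon (assumed for illustration): both the period-appropriate words
# and the anachronistic readings are valid Dutch, hence "real-word" errors.
lexicon = {"stoomschip", "stoomtram", "atoomschip", "atoomtram"}

print(ocr_confusion_candidates("atoomschip", lexicon))  # ['stoomschip']
print(ocr_confusion_candidates("atoomtram", lexicon))   # ['stoomtram']
```

A lookup like this only proposes candidates; deciding that a hit dated 1905 should read 'stoomschip' rather than 'atoomschip' still needs context, such as the publication date or the surrounding words.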
Yours,
Martin