[Corpora-List] the ebb and flow of inclusion of words in OED?

Martin Reynaert reynaert at uvt.nl
Tue Apr 26 14:28:23 UTC 2011


> It would be interesting to do a more detailed study of word
> creation and disuse by going back to the original documents,
> when more of them become digitized.
>
> John Sowa
>
Dear John,

Just as a general note of warning on this... The examples are Dutch, but 
sobering nevertheless.

The Dutch National Library is putting online 8 million pages of 
digitized newspapers. It is a delightful collection going back to 1618, 
available for free to all.

If you go to

http://kranten.kb.nl/

and type in the query 'atoomschip' or even 'atoomtram' you will get a 
range of hits from about 1900 to 1927. The terms translate as 'nuclear 
ship' and 'nuclear tram'...

These are, of course, simple 's'-to-'a' OCR misrecognition errors: steam
ships ('stoomschepen') and steam trams ('stoomtrams') were common at
the time ;0) They are also examples of real-world real-word errors, far
more interesting than the 20 or so word confusion sets used in most
research on context-sensitive spelling correction.
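
As a rough illustration of how such real-word errors might be flagged,
here is a minimal Python sketch. It undoes one confusable character at a
time and keeps the results that appear in a lexicon of period-appropriate
words. The confusion table, the function names and the tiny lexicon are
all assumptions made up for this example; they are not taken from the
library's OCR pipeline or from any existing tool.

```python
# Hypothetical confusion table: recognised character -> characters that
# may actually have been printed. Only the 's' read as 'a' case is listed.
OCR_CONFUSIONS = {"a": ["s"]}


def ocr_variants(word, confusions=OCR_CONFUSIONS):
    """Return every string obtained by undoing exactly one confusable character."""
    variants = set()
    for i, ch in enumerate(word):
        for original in confusions.get(ch, []):
            variants.add(word[:i] + original + word[i + 1:])
    return variants


def plausible_sources(word, lexicon, confusions=OCR_CONFUSIONS):
    """Return the variants of `word` that are attested in `lexicon`, sorted."""
    return sorted(ocr_variants(word, confusions) & set(lexicon))


# Toy lexicon standing in for a word list of the period (an assumption).
lexicon = {"stoomschip", "stoomtram", "stoomboot"}

print(plausible_sources("atoomschip", lexicon))  # ['stoomschip']
print(plausible_sources("atoomtram", lexicon))   # ['stoomtram']
```

A hit only says that a plausible printed form exists. Whether 'atoomschip'
in a given article is an OCR error or a genuine later coinage would still
have to be decided from the date and the context.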

Yours,

Martin

