digitizing US newspapers

Jake Berg jsberg at gwu.edu
Tue Apr 12 13:51:39 UTC 2005


I have worked in digital preservation before; the process and software used
to digitize both newspapers and books in this country are geared towards
Romance and Germanic languages because those, and English in particular, are
the easiest to work with.
An image of the page being digitized is saved, usually in .tif format. OCR
software then scans and "strips" the image for letters. At the University of
Michigan, where I used to work, we had a 6-OCR filter, featuring built-in
redundancy that would correct any misreads by the software.  That being
said, there are still some common problems with non-English language
materials. Accent marks routinely throw off the OCR software.  For example,
an umlaut over an 'a' or an 'o' often results in the letter being read as
two 'i's, and the OCR software will read 'cliché' as 'clich6' because the
mark over the 'e' makes it look like the number six.  The digitization
process is supposed to make these works text-searchable, but as a result of
the above examples and others I won't bore you with, the process is far from
perfect.  Although I think he overstates his case, Nicholson Baker's book
"Double Fold" deals with some of these criticisms.
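
As a rough illustration of the redundancy idea described above: the post does
not say how the Michigan filter combined its OCR passes, so the following is
only a sketch under the assumption that several independent readings of the
same word are merged by a per-character majority vote. The function name
majority_vote and the sample readings are hypothetical. The sketch also
assumes every reading has the same length, which would not hold for the
umlaut case (one letter read as two 'i's); handling that would need an
alignment step that is not shown here.

```python
from collections import Counter


def majority_vote(readings):
    """Merge same-length OCR readings by per-position majority vote.

    Hypothetical helper for illustration; not the method of any real system.
    """
    if not readings:
        raise ValueError("need at least one reading")
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("this sketch assumes all readings have equal length")

    merged = []
    for chars in zip(*readings):
        # Pick the character seen most often at this position. On a tie,
        # the character that appeared first among the readings wins.
        merged.append(Counter(chars).most_common(1)[0][0])
    return "".join(merged)


# Six made-up readings of one word: two passes misread the accented 'e'.
readings = ["cliché", "clich6", "cliché", "clich6", "cliché", "cliché"]
print(majority_vote(readings))
```

A vote like this only helps when the passes make different mistakes. If every
pass misreads the accent mark the same way, the error survives.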
One very labor-intensive way to address these problems is to bring in
proofreaders who can check the OCR version against the digitized image and
correct the former, but having done a bit of this myself I can tell you that
after about half an hour your eyes start to glaze over and accuracy becomes
a problem.  Additionally, many of these digitization projects are funded by
grant money, so hiring proofreaders is not feasible.
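
One way to cut down on proofreading time, sketched here purely as an
assumption and not as anything the projects above are known to have used, is
to flag only the words that look like the misreads mentioned earlier and send
just those to a human. The heuristics (a digit inside a word, or a doubled
'i') and the names is_suspicious and flag_tokens are hypothetical.

```python
import re


def is_suspicious(token):
    """Return True if a token looks like a possible OCR misread.

    Two illustrative heuristics only:
      - letters and digits mixed in one token, e.g. "clich6"
      - a doubled 'i', which may be a misread umlaut, e.g. "schiin"
    """
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return (has_letter and has_digit) or ("ii" in token.lower())


def flag_tokens(text):
    """Return the tokens of `text` that a proofreader should check first."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if is_suspicious(t)]


print(flag_tokens("a clich6 and a schiin word"))
```

Heuristics this crude produce false positives ("Hawaii", "1st") and miss many
real errors, so they can only narrow down what a proofreader looks at, not
replace the check against the page image.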
In sum, OCR software reads English more easily than other languages, and when
time and cost are factored in as well, it becomes more difficult to digitize
works in languages other than English.

/Jake


------------------------------------------------
Jacob Berg
Teaching Assistant
Dept. of Political Science
609 21st Street, 2nd floor, #201
OFFICE HOURS:
Mon, 11am-12pm
Tues, 1-2pm, 3:30-4:30pm

----- Original Message -----
From: "Harold F. Schiffman" <haroldfs at ccat.sas.upenn.edu>
To: "Language Policy-List" <lgpolicy-list at ccat.sas.upenn.edu>
Sent: Tuesday, April 12, 2005 9:23 AM
Subject: digitizing US newspapers


> Dear list-members,
>
> The following item from the Chronicle for Higher Education mentions a
> grant program from the US National Endowment for the Humanities for
> digitizing newspapers from the early 20th century.  When you read the fine
> print (go to http://www.neh.fed.us/projects/ndnp.html) you see that this
> is only for newspapers in ENGLISH.  My concern and question is whether
> anyone has experience with using newspapers in languages other than
> English printed in the US during the 19th and 20th centuries, and what the
> prognosis might be for preservation of those materials?
>
> My experience has been that these materials are often in atrocious
> shape--neglected, ignored, denigrated, and often sent for recycling when
> space needs become more important than preservation.  I do note that an
> earlier program, the United States Newspaper Program
> (http://www.neh.fed.us/projects/usnp.html)  mentions some non-English
> newspapers, e.g. a Texas project, "which includes the newspapers of
> Jewish, Czech, and German settlers" and newspapers in Hawaiian (in Hawaii)
> and in Cherokee (in Oklahoma) but in general, the notion conveyed is the
> usual anglo-centric one, i.e. "all we need to know about US history can be
> learned through English."
>
> Is anybody besides me concerned about this issue enough to try to mount a
> grant proposal to save non-English papers on a larger scale?
>
> Hal Schiffman
>
> ---------- Forwarded message ----------
>
> http://chronicle.com/daily/2005/04/2005040501t.htm
>
> Tuesday, April 5, 2005
>
> 4 Universities Will Help Digitize Newspapers From the Early 20th Century
> By DAN CARNEVALE
>
> Washington
>
> Four universities and two public libraries are sharing $1.9-million in
> grants to digitize newspapers from the beginning of the 20th century so
> the publications can be preserved and searched online.
>
> The two-year grants were announced on Monday by the Library of Congress
> and the National Endowment for the Humanities as part of the National
> Digital Newspaper Program, a new program that will eventually preserve old
> newspapers from all over the country in digital form.
>
> With the money, each institution will digitize 100,000 or more pages from
> the most historically significant newspapers published in its state
> between 1900 and 1910. The digital copies will then be available free on
> the Library of Congress's Web site.
>
> The grant recipients and their awards are as follows:
>
>
> Library of Virginia, $201,226.
>
> New York Public Library, $351,500.
>
> University of California at Riverside, $400,000.
>
> University of Florida Libraries, Gainesville, $320,959.
>
> University of Kentucky Research Foundation, $310,000.
>
> University of Utah, $352,693.
>
> "The Library congratulates these institutions for taking a leading role in
> making newspapers -- among our richest records of history -- available
> electronically through our Web site," James H. Billington, librarian of
> Congress, said in a written statement. "We hope the National Digital
> Newspaper Program inspires other institutions to make their public-domain
> newspapers accessible online."
>
> The program's goal is to digitize every historically significant newspaper
> from every U.S. state and territory from 1836 to 1922. Officials say the
> entire program will take about 20 years.
>
> "Newspapers are among the most important historical documents we have as
> Americans. They tell us who we were, who we are, and where we're going,"
> said Bruce Cole, chairman of the humanities endowment. "Students,
> historians, lawyers, politicians -- even newspaper reporters -- will be
> able to go to their computers at home or at work and through a few
> keystrokes get immediate, unfiltered access to the greatest source of our
> history. It will be available to the American public for free, forever."
>
> Andrea Vanek, a librarian for UC-Riverside, is the assistant director of
> the California Newspaper Project, which has been locating and cataloguing
> microfilm of old newspapers under another government program. With the new
> grant, she said in an interview, the project will be able to expand its
> role to include digitizing small newspapers from all over the state.
>
> "We do want to represent the whole state," Ms. Vanek said. "It's going to
> create such a resource for users throughout the world."
>
>
>
> --------------------------------------------------------------------------------
> Copyright © 2005 by The Chronicle of Higher Education
>
>


