searching CDs in Microfilm to CD conversion.

Pat Warren warr0120 at umn.edu
Tue Jan 13 07:07:16 UTC 2004


Carolyn and Jimm,

Short:

1. Even 25-50% OCR accuracy saves time.
2. By doing a good bibliographic description and generating web pages,
contents pages, and indexes from it, even just scanned images are more
accessible than a printed source.
3. Scans from microfilm are better (higher resolution) than copies from
microfilm. Copies from paper might be better than scans from film. Scans
from film allow image manipulation with imaging software that can make them
more legible. Digitization preserves content that can be lost as original
sources decay, so sometimes the film is better than the original.
4. Dorsey's Dhegiha slips are only one source. Invest energy in the
process, not money in the service.

Long:

> 1. Since the materials will be just scanned and probably no OCR will
> work, (Carolyn)

Having not yet seen the film, I can't say how well OCR would work. I always
hold out hope and prefer to try it, and with OCR, even 25% or 50% accuracy
saves a lot of time, I've found.

> would we somehow be able to search the materials more easily than paper
> copies of same? (At the very least perhaps via cataloguing tracks laid in
> the CDs?) (Carolyn)

The way I do my work is to write a very detailed descriptive bibliography
of the source I'm digitizing. In going from microform to digital I write
two: one describing the structure (as much as possible) of the original
document(s), and another describing the structure of the microform itself.
I then create web pages for the images from the source, one page per image
with navigation buttons so you can flip through the images in order, one
page forward or backward, or ten pages forward or backward (but the buttons
only appear if the bibliography says there IS a +10 page, etc.). There's
also a table of contents web page (accessible from any individual page) and
the contents are generated automatically from the bibliographies, written
in XML. For microform I can also set up two versions of web pages. For a
film like the Dorsey Dhegiha slips, I'd create one web version to represent
the original source, with one slip per page. Then I'd create another web
version to represent the film itself, with one frame of microfilm per page.
So in a way I'd consider myself to be creating a reproduction of two
different sources at the same time, though one is noted as itself derived
from a reproduction. I didn't do this with iapi oaye because there was
always only one original page per microfilm frame, but I would still want
the two different bibliographies available.
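
A rough sketch of that button logic, in Python. This is only an
illustration, not the scripts actually used for this work: the function
names, the frame0001-style file names, and the flat list of pages (standing
in for the XML bibliography) are all assumptions made for the example.

```python
def nav_links(page, pages, steps=(-10, -1, 1, 10)):
    """Return (label, target) pairs for the steps that land on a listed page."""
    index = pages.index(page)
    links = []
    for step in steps:
        target = index + step
        # Offer a button only when the page list really has that page.
        if 0 <= target < len(pages):
            links.append((f"{step:+d}", pages[target]))
    return links


def render_page(page, pages):
    """Build a bare-bones HTML page for one image, with only the valid buttons."""
    buttons = " ".join(
        f'<a href="{target}.html">{label}</a>' for label, target in nav_links(page, pages)
    )
    return (
        "<html><body>"
        f'<p>{buttons} <a href="contents.html">contents</a></p>'
        f'<img src="{page}.png" alt="{page}">'
        "</body></html>"
    )


# Hypothetical twelve-frame film; the names are placeholders.
pages = [f"frame{n:04d}" for n in range(1, 13)]
first_page_html = render_page("frame0001", pages)
```

For the first of the twelve frames this gives only the +1 and +10 links,
since there is no earlier page to go back to.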

Even when you're dealing with large, unstructured documents you can easily
create many ways to organize the contents page to make it more accessible,
as long as you do a really good bibliographical description and have solid
naming conventions for files. As an example, when I did a bibliographical
description of Pentland and Wolfart's Bibliography of Algonquian
Linguistics (1982), my XML file lists where logical sections as well as
physical sections begin and end. So the order of pages is there, the order
of general sections like front matter, introductory matter, the
bibliography itself, and index matter. But I also record where the letters
of the alphabet start and end in the bibliography, as though they were
chapters. Even though they're not explicitly marked, the bibliography is
organized alphabetically by author. In making explicit in the bibliography
this type of latent data content you automatically make the images, even
without conversion (yet) to text, way more accessible (though without a
laptop not as all-terrain) than the book.
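
To make the idea concrete, here is a minimal Python sketch of building
contents entries from such a description. The XML layout, the "kind"
attribute, the page numbers, and the page0005.html naming scheme are all
invented for the example; they are not the real file or the real page
ranges of the book.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography file: general sections plus the "latent"
# alphabetical sections, each with its first and last image number.
BIBLIOGRAPHY_XML = """
<source title="Example source">
  <section kind="general" label="Front matter" first="1" last="4"/>
  <section kind="general" label="Bibliography" first="5" last="20"/>
  <section kind="letter" label="A" first="5" last="9"/>
  <section kind="letter" label="B" first="10" last="20"/>
</source>
"""


def contents_entries(xml_text, kind):
    """Return (label, first page file, page count) for sections of one kind."""
    root = ET.fromstring(xml_text)
    entries = []
    for section in root.iter("section"):
        if section.get("kind") == kind:
            first = int(section.get("first"))
            last = int(section.get("last"))
            # A consistent file naming convention lets the link be computed.
            entries.append((section.get("label"), f"page{first:04d}.html", last - first + 1))
    return entries


by_letter = contents_entries(BIBLIOGRAPHY_XML, "letter")
by_section = contents_entries(BIBLIOGRAPHY_XML, "general")
```

The same description yields two different contents pages (by general
section, or by letter of the alphabet) without touching the images.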

> 2. Would we be able to distinguish characters better than on paper
> copies? (Carolyn)

Probably, because most people's copies probably come from the film, and the
scans from the film would come at a higher resolution than copies from the
film. Though microfilming introduces lots of noise, and is pretty low
resolution (200dpi versus the 500dpi that I use), so copies from the
original might (depending on the copier used) end up better quality than
scanning from the microfilm reproduction. And if you make copies from the
paper after the microfilm was produced, you may miss out if the paper has
decayed in some way.

But one option that digitizing offers that's good for handwritten materials
is the endless array of modifications you can do in imaging programs that
can help you see better what's there.

As one example, in scanning the new edition of Buechel's dictionary the
backs of the pages were often showing through because of the paper quality.
So I scanned it all in grayscale. Since the back of the page showed through
lighter than the printing on the page being scanned, it was recorded as
grey while the text on the side being scanned was recorded as black. When I
then saved the image as black and white, it just left out the grey and the
"bleed-through" disappeared. This has now worked well with older books that
have the same problem in scanning, and even for some actual ink bleed-through.
So the scan of the page does sometimes turn out more legible than the page
itself. This doesn't work with microform though (not yet anyway), because
the hardware itself only sees in monochrome (black only). But other imaging
software modifications might make some images easier to read (but these
modifications wouldn't be done to the archived images, you'd have to do it
yourself).
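
The grayscale-to-black-and-white step amounts to a threshold, which can be
sketched in a few lines of Python. This is an illustration under stated
assumptions, not the imaging software actually used: the pixel values and
the threshold of 100 are made up, and on a real scan the cutoff would have
to be chosen by eye.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map each 0-255 grayscale value to 0 (black) or 255 (white)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray_rows]


# Tiny made-up "scan": dark values are ink on the side being scanned,
# mid greys are show-through from the back, light values are bare paper.
scan = [
    [250, 30, 245],   # paper, printed ink, paper
    [170, 25, 165],   # show-through, printed ink, show-through
]

# With the cutoff below the show-through greys, only the real ink stays black.
cleaned = to_black_and_white(scan, threshold=100)
```

Setting the cutoff too low would start to erase faint printing as well, so
it is a trade-off for each source.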

> If the answer to either of these question is positive, then it would be
> worth it to contribute funds to have the material on CDs and acquire a set
> of CDs, even if we already have a paper version of the material, it seems
> to me. (Carolyn)

> It seems the Dhegihanist have the day.  Nevertheless, the microfilm that I
> have for Dorsey are from his much smaller contribution on Jiwere/Chiwere.
> (Jimm)

Don't get too excited about the Dorsey Dhegiha slips. It's only one source.
Whether or not other people realize it, full digitization is going to
happen. Once you actually see the whole process, from scanning to full
hyperlinked, combined, searchable texts, it's very clear. Spend money on it
if you want. But if you feel compelled to spend money, why not invest in
the process rather than just paying for a service? Don't be too
shortsighted on this and waste time, money, and excitement. There's a lot
of stuff that should be done.

My idea of complete digitization: bibliography, images, plain text, coded
text, various display options, and derivative works. If people invest in
the process of how to go about this, and how to do this INSIDE the field,
with very little money involved, this can make cooperation much easier. Let
people with different interests, skills, time, and degrees of compulsion do
what they want, but combine the work together in a well designed, public
process. But you have to start investing in that internal process sometime,
even if it means just investing your interest and curiosity there, instead
of investing money outside the field.

Pat


