brave against the enemy and unicode
Pat Warren
warr0120 at umn.edu
Mon Jan 26 19:52:14 UTC 2004
On 26 Jan 2004, Koontz John E wrote:
> On Mon, 26 Jan 2004, Pat Warren wrote:
> > > Very nice! And nice looking, too. I take it that the text format
> > > material is the OCR version?
I thought I'd mention that a nice look is VERY important to me. Too many
great projects get funding to develop databases and have no money or
interest in a user interface that's comprehensible or a pleasure to use. I
think both are essential if anybody is going to actually use it.
> > Yes, it's a slightly proofed version of the ocr results.
>
> It might be worth somehow pointing out in the modes what the difference is
> between image and text, in terms of implications. The words carry the
> meaning, of course, but there's a certain "imagish" quality to the text as
> it appears, perhaps due to the font properties, and it might help to be
> explicit.
Yeah, that's important too. The help functions (tooltips and help pages)
will explain what's going on there. The TEXT version is a full text version
formatted like the original source, that is, with its PHYSICAL structure
(page breaks, page layout, typefaces, etc.). And the DATA version (these
names are not set in stone) is a full text version that matches the LOGICAL
organization of the original (chapters, letters of the alphabet for
dictionaries...).
> I'm impressed that the OCR software can handle more than the ASCII
> character set. In fact, given your choice of fonts, I assume it might be
> able to handle essentially arbitrary characters? The Microsoft extended
> set is extensive, but missing some critical combinations for Siouanists.
Oh John, I'm sorry to break the news, but your computer knowledge is
becoming outdated! It was amazing for me when I read the article you and
David did in "Making Dictionaries" about the technological history of the
Comparative Siouan Dictionary project. Things have changed so much, and
they're about to change so much more.
OCR software can work with any writing system you throw at it, because you
can train the software character by character and tell it to recognize
several characters together if you like (Arabic, Chinese, Egyptian
hieroglyphs). For non-ASCII characters I tell it to print the Unicode code
point in the output rather than the character itself, so I don't have to mess
with font issues. You can even recognize characters that don't exist in any
font yet: just give them a code. OCR basically has nothing to do with fonts,
so you're not constrained by what fonts you have. It's pattern
recognition, not matching printed characters with fonts on your computer.
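
To make the "print the code, not the character" idea concrete, here is a
minimal Python sketch of the later step that turns such codes back into real
characters. The {U+XXXX} placeholder format and the names PLACEHOLDER and
decode_placeholders are assumptions made up for this illustration; the actual
format the OCR software writes is not given here.

```python
import re

# Hypothetical placeholder format: {U+XXXX}, with 4 to 6 hex digits.
# The real OCR output format is not specified; adjust the pattern to match.
PLACEHOLDER = re.compile(r"\{U\+([0-9A-Fa-f]{4,6})\}")


def decode_placeholders(text: str) -> str:
    """Replace each {U+XXXX} placeholder with the character it names."""
    return PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    # An invented OCR line: s-caron written as a code, plus a combining ogonek.
    print(decode_placeholders("{U+0161}ake a{U+0328}"))
```

Because the OCR output stays plain ASCII until this step, it can be proofed
and edited without any special font installed.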
> I also wonder about the potential of the font for use with Siouan
> languages generally, at least in terms of modern "scholarly usage" and
> perhaps for older symbol sets. Clearly the disadvantage of specialized
> solutions like the Standard Siouan set I've prepared with the SIL software
> is that it doesn't use Unicode encoding.
Yes, all those cumbersome and idiosyncratic fonts are not really the best
way to go anymore. On the one hand, creating a new font is ridiculously
easy now: there's cheap, easy-to-use software, and you can append characters
to any font you want. On the other hand, Unicode already offers a wide range
of characters and combining diacritics, though it'll take a few years before
there are more fonts like Code2000 that support the full Unicode range
(95,221 characters so far). I plan on developing a font for the project
that can be added to as needed, and will be free. But remember that
anything printed can be OCR'd just fine - you don't have to HAVE the
quirky font for the OCR to recognize the characters consistently, you just
tell the software what to call the character. And if it's already a text
file, you can do find and replace to automatically convert to Unicode in an
instant.
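
The find-and-replace conversion could look roughly like the Python sketch
below. The mapping is invented for illustration ("^" standing for s-caron,
"~" for a combining ogonek); it is NOT the actual Standard Siouan or SIL
encoding, and a real table would have to be built from the legacy font's
documentation. LEGACY_TO_UNICODE and to_unicode are hypothetical names.

```python
import unicodedata

# Hypothetical legacy-font mapping, for illustration only.
LEGACY_TO_UNICODE = {
    "^": "\u0161",  # assumed: "^" was displayed as s with caron
    "~": "\u0328",  # assumed: "~" was displayed as an ogonek on the previous letter
}

# str.maketrans builds a one-pass lookup table from single characters.
TABLE = str.maketrans(LEGACY_TO_UNICODE)


def to_unicode(text: str) -> str:
    """Swap legacy stand-in characters for Unicode, then normalize to NFC."""
    converted = text.translate(TABLE)
    # NFC merges base letter + combining mark where a precomposed form exists.
    return unicodedata.normalize("NFC", converted)


if __name__ == "__main__":
    print(to_unicode("^ake a~"))
```

Doing the replacement in a single pass with a table avoids the classic
find-and-replace trap where an early substitution produces a character that
a later substitution then rewrites again.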
It's really worth doing some reading on the Unicode website:
http://www.unicode.org/
And here's another very useful font site:
http://www.identifont.com/
Ciao,
Patrick