Corpora: non-alphabetic language databases

Mcenery, Tony eiaamme at exchange.lancs.ac.uk
Thu Nov 30 12:19:06 UTC 2000


Hi,

I agree with Thomas that Unicode is promising, at least in terms of encoding
the characters. However, rendering Unicode in a readable form can be a major
task. This is especially true for non-alphabetic writing systems, where the
glyphs of the writing system may not themselves be encoded in Unicode, but are
instead generated by a rendering engine. Font rendering engines that can
display Unicode text accurately are few and far between. In terms of corpus
processing, I am aware of no system that renders Unicode accurately for all
languages. Help is on the horizon:

1.) I believe Mike Scott is working on a version of WordSmith which may both
read Unicode text and render it appropriately.
2.) I am currently working with the GATE team at Sheffield towards making a
version of GATE which renders a wide range of writing systems encoded in
Unicode, but it is laborious work.
3.) SIL International is developing a font rendering engine called Graphite,
which should be embeddable in corpus processing systems.

So while I think Unicode is the way for corpus work to go in the future,
treading that path with non-alphabetic writing systems is, at present,
somewhat difficult.
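
To illustrate the gap between encoding and rendering, here is a minimal Python
sketch. The sample string is my own choice (a Devanagari syllable), not
something taken from any of the systems mentioned above, and the function name
describe is hypothetical. It only lists what is stored in the text; joining
these code points into a single on-screen shape is left entirely to whatever
rendering engine displays them.

```python
import unicodedata

# Assumed sample, chosen for illustration: a Devanagari syllable ("kshi")
# stored as four separate code points, written here with escapes.
SAMPLE = "\u0915\u094d\u0937\u093f"


def describe(text):
    """Return (code point, Unicode name) pairs for each encoded character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # The encoding holds four characters; the combined form that a reader
    # expects to see is not itself a code point in the string.
    for code_point, name in describe(SAMPLE):
        print(code_point, name)
    print("code points stored:", len(SAMPLE))
```

Nothing in this listing says how the syllable should look; that knowledge
lives in the font and the rendering engine, which is where the difficulty lies.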

T

> -----Original Message-----
> From:	Thomas Schmidt [SMTP:thomas.schmidt at uni-hamburg.de]
> Sent:	30 November 2000 12:00
> To:	corpora at hd.uib.no
> Subject:	AW: Corpora: non-alphabetic language databases
>
> The Unicode standard is indeed a promising solution for representing
> non-alphabetic characters of any kind. Concerning the original question: I
> don't know much about sign languages, but I wouldn't be surprised if the
> Unicode Consortium has taken or will take these into account. If they don't,
> the design of the Unicode standard leaves room for user-defined symbols, so
> it should be possible, for instance, to code alphabetic and sign language
> symbols within one document.
> The Unicode homepage is at
>
> 	http://www.unicode.org/
>
> -----Original Message-----
> From:	Simon G. J. Smith [SMTP:smithsgj at eee.bham.ac.uk]
> Sent:	Thursday, 30 November 2000 12:34
> To:	corpora at hd.uib.no
> Subject:	Re: Corpora: non-alphabetic language databases
>
>
> Paula
>
> Have a look at www.chinesecomputing.com
>
> Are you a student of one of these languages? Take a look at a website from
> one of the countries, without character-reading software running, and you
> will see that each character shows up as two single-byte characters -
> usually obscure things like ^ or ` and others that are not on the QWERTY
> keyboard at all.
>
> My understanding is this: order of database entry is not based on any
> phonetic system, nor on any arrangement of radicals or character
> components, but on a standard (for Chinese, usually one of Big-5 or GB
> (Guo-Biao)) which maps each character on to an arbitrary pair of bytes
> (shown as two extended-ASCII characters when not decoded). With the advent
> of the Unicode standard, a one-to-one mapping is also now possible, but
> implementations are rare.
>
> I'm not an expert: perhaps there's one around who would care to add their
> comments?
>
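
The two-byte mapping described in the quoted message can be sketched in
Python. The sample character and the function name legacy_view are my own
assumptions for illustration; the codecs used ("gb2312", "big5", "latin-1")
are standard Python codecs. The sketch encodes one Chinese character under
each legacy standard and then shows what those bytes look like when read as
single-byte Western text, which is roughly what an unequipped browser displays.

```python
# Assumed sample, chosen for illustration: the character "zhong" (middle).
CHAR = "\u4e2d"


def legacy_view(ch, codec):
    """Encode ch with a legacy codec; return the raw bytes and the same
    bytes misread as Latin-1 text (one displayed character per byte)."""
    raw = ch.encode(codec)
    return raw, raw.decode("latin-1")


if __name__ == "__main__":
    for codec in ("gb2312", "big5"):
        raw, misread = legacy_view(CHAR, codec)
        print(codec, raw.hex(), repr(misread))
    # Under Unicode the character has a single code point, whatever
    # byte serialisation is used to store it.
    print(f"U+{ord(CHAR):04X}", CHAR.encode("utf-8").hex())
```

The same character gets a different byte pair under each standard, so the
bytes cannot be interpreted without knowing which standard was used.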


