forum
Mia Kalish
MiaKalish at LEARNINGFORPEOPLE.US
Thu Feb 28 22:37:32 UTC 2008
We seem to be in a wonderful techie world, which I love.
But my goal was to create a reasonably simple method that ordinary people
could use to document their language, and to create everyday documents. I
wanted fonts that were affordable on the Rez, where sometimes it's a choice
between food and gas.
I have people here who haven't mastered "URL" so if I had to teach them to
use outboard utilities to convert . . . oh my.
I love these codelets. I think I can reformat a machine somewhere for Linux,
although I like C better, and there is the Java option.
But they need to be embedded for People . . . simple people . . . (and some
days, trust me, I am a Very Simple People . . . :-) ).
Has anyone out there been using Unicode in actual revitalization projects?
Is it working for people? Do first speakers need someone else, someone with
some degree of technical experience, to create their documents for them?
Mia
-----Original Message-----
From: Indigenous Languages and Technology [mailto:ILAT at LISTSERV.ARIZONA.EDU]
On Behalf Of William J Poser
Sent: Tuesday, February 26, 2008 5:10 PM
To: ILAT at LISTSERV.ARIZONA.EDU
Subject: Re: [ILAT] forum
Mia,
The problem of the different encodings of the same character,
such as U+00E1 U+0328 vs. U+0105 U+0301 vs. U+0061 U+0328 U+0301
vs. U+0061 U+0301 U+0328 (the latter two with both of the diacritics
encoded separately) is handled in Unicode by the use of "normalization".
All four of the above encodings of "lower case letter a with acute accent
and subscript hook" are converted to a single canonical representation
in each of the normalizations defined by Unicode. When performing operations
like sorting, the proper procedure is to normalize the text before
sorting. There are a number of freely available libraries for performing
Unicode normalization. My sort utility msort
(http://billposer.org/Software/msort.html) knows all about Unicode and does
Unicode normalization before
sorting. (msort runs on pretty much any Unix variant and on Mac OS X.
It probably can be gotten to run on MS Windows, but I don't do Windows
myself and as far as I know nobody else has compiled it on MS Windows.)
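To make the equivalence concrete, here is a short Python check (using only the standard library's unicodedata module) showing that the four encodings listed above compare as four distinct raw strings but collapse to one and the same string after NFC:

```python
# Quick check: the four encodings of "a with ogonek and acute"
# discussed above are canonically equivalent, so NFC maps them
# all to a single string.
import unicodedata

forms = [
    "\u00E1\u0328",        # a-acute + combining ogonek
    "\u0105\u0301",        # a-ogonek + combining acute
    "\u0061\u0328\u0301",  # a + combining ogonek + combining acute
    "\u0061\u0301\u0328",  # a + combining acute + combining ogonek
]

normalized = [unicodedata.normalize('NFC', f) for f in forms]

print(len(set(forms)))       # 4 -- distinct raw codepoint sequences
print(len(set(normalized)))  # 1 -- identical after NFC
```

This is also why normalizing before sorting matters: without it, the same word spelled with different codepoint sequences will land in different places in a sorted list.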
Some of the widely used scripting languages
also have normalization libraries. Here are scripts that normalize
to normalization form "C" (known to its friends as NFC).
Here is a Unix shell script that calls Perl. It reads its input from the
standard input and writes the normalized output on the standard
output:
perl -CSD -e 'use Unicode::Normalize;
    while ($line = <STDIN>) {
        print NFC($line);
    }'
And here is the equivalent Python program:
#!/usr/bin/env python
import sys
import codecs
import unicodedata

(utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup('utf-8')
outfile = utf8_writer(sys.stdout)
infile = utf8_reader(sys.stdin)
outfile.write(unicodedata.normalize('NFC', infile.read()))
sys.exit(0)
For Java lovers:
//This program is a filter that reads input in UTF-8 Unicode
//and converts it to normal form C.
//To byte-compile: javac NFCUnicode.java
//To run: java NFCUnicode < <inputfilename> > <outputfilename>
//Note: this requires Java SE 6 (1.6.0) released December 11, 2006.
//Author: Bill Poser (wjposer at ldc.upenn.edu)
import java.util.*;
import java.io.*;
import java.text.Normalizer;
public class NFCUnicode {
    public static void main(String[] args) {
        String thisLine;
        try {
            BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in, "UTF8"));
            BufferedWriter out =
                new BufferedWriter(new OutputStreamWriter(System.out, "UTF8"));
            try {
                while ((thisLine = in.readLine()) != null) {
                    out.write(Normalizer.normalize(thisLine, Normalizer.Form.NFC) + "\n");
                }
                out.flush(); // flush the buffer before exiting, or output may be lost
            }
            catch (IOException e) {
                System.err.println("Error: " + e);
                System.exit(2);
            }
        }
        catch (UnsupportedEncodingException e) {
            System.err.println("Error: " + e);
            System.exit(2);
        }
        System.exit(0);
    }
}
Bill