forum

William J Poser wjposer at LDC.UPENN.EDU
Wed Feb 27 00:09:39 UTC 2008


Mia,

The problem of different encodings of the same character,
such as U+00E1 U+0328 vs. U+0105 U+0301 vs. U+0061 U+0328 U+0301
vs. U+0061 U+0301 U+0328 (the last two with both of the diacritics
encoded separately), is handled in Unicode by the use of "normalization".
All four of the above encodings of "lower case letter a with acute accent
and subscript hook" are converted to a single canonical representation
in each of the normalizations defined by Unicode. When performing operations
like sorting, the proper procedure is to normalize the text before
sorting. There are a number of freely available libraries for performing
Unicode normalization. My sort utility msort
(http://billposer.org/Software/msort.html) knows all about Unicode and
does Unicode normalization before sorting. (msort runs on pretty much any
Unix variant and on Mac OS X. It can probably be made to run on MS Windows,
but I don't do Windows myself and as far as I know nobody else has compiled
it on MS Windows.)
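
To make the point about the four encodings concrete, here is a minimal
sketch in Python 3 using the standard unicodedata module. The helper name
code_points is only for illustration. It checks that all four sequences
collapse to one form under NFC and one under NFD, and prints the code
points of each result.

```python
import unicodedata

# The four encodings of "a with acute accent and hook (ogonek)" listed above.
variants = [
    "\u00e1\u0328",        # a-acute, then combining ogonek
    "\u0105\u0301",        # a-ogonek, then combining acute
    "\u0061\u0328\u0301",  # a, combining ogonek, combining acute
    "\u0061\u0301\u0328",  # a, combining acute, combining ogonek
]


def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(c) for c in s)


for form in ("NFC", "NFD"):
    # A set keeps one copy of each distinct result.
    results = {unicodedata.normalize(form, v) for v in variants}
    assert len(results) == 1, "variants did not converge under " + form
    print(form, code_points(results.pop()))
```

On a standard Python 3 installation this should report U+0105 U+0301 for
NFC and U+0061 U+0328 U+0301 for NFD.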

Some of the widely used scripting languages
also have normalization libraries. Here are scripts that normalize
to normalization form "C" (known to its friends as NFC).

Here is a Unix shell script that calls Perl. It reads its input from the
standard input and writes the normalized output on the standard
output:

perl -CSD -e 'use Unicode::Normalize;
while ($line = <STDIN>){
    print NFC($line)
}'

And here is the equivalent Python program:

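#!/usr/bin/env python3
# Written for Python 3, where sys.stdin.buffer and sys.stdout.buffer
# are the underlying byte streams of the standard input and output.
import sys
import codecs
import unicodedata

# Wrap the byte streams so that input and output are always UTF-8,
# whatever the locale settings are.
infile = codecs.getreader('utf-8')(sys.stdin.buffer)
outfile = codecs.getwriter('utf-8')(sys.stdout.buffer)
outfile.write(unicodedata.normalize('NFC', infile.read()))
sys.exit(0)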
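
To connect this with the advice about normalizing before sorting, here is
a rough sketch of using the same unicodedata call as a sort key. The
function name normalized_sort and the sample words are made up for
illustration, and the ordering is plain code point order, not a
language-specific collation such as msort can provide.

```python
import unicodedata


def normalized_sort(strings, form="NFC"):
    """Sort strings by the code point order of their normalized form."""
    return sorted(strings, key=lambda s: unicodedata.normalize(form, s))


# Two spellings of the same made-up word that differ only in how the
# accented letter is encoded, plus two unrelated words.
words = ["m\u0105\u0301z", "maz", "m\u0061\u0301\u0328z", "mbz"]

for w in normalized_sort(words):
    print(" ".join("U+%04X" % ord(c) for c in w))
```

With the normalized key the two spellings of the accented word end up
next to each other. Sorting the raw strings would separate them, because
one begins with U+0061 and the other with U+0105 after the initial m.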

For Java lovers:

//This program is a filter that reads input in UTF-8 Unicode
//and converts it to normal form C.
//To byte-compile: javac NFCUnicode.java
//To run:          java NFCUnicode < <inputfilename> > <outputfilename>
//Note: this requires Java SE 6 (1.6.0) released December 11, 2006.
//Author: Bill Poser (wjposer at ldc.upenn.edu)
 
import java.util.*;
import java.io.*;
import java.text.Normalizer;
 
public class NFCUnicode {
    public static void main(String[] args) {
        String thisLine;
        String encoding = "UTF8";
        try {
            BufferedReader in =  new BufferedReader(new InputStreamReader(System.in, encoding));
            BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out, encoding));
            try {
                while ((thisLine = in.readLine()) != null) {
                    out.write(Normalizer.normalize(thisLine, Normalizer.Form.NFC) + "\n");
                }
                // Flush explicitly: System.exit() does not flush the BufferedWriter,
                // so buffered output would otherwise be lost.
                out.flush();
            }
            catch (IOException e) {
                System.err.println("Error: " + e);
                System.exit(2);
            }
        }
        catch (UnsupportedEncodingException e) {
            System.err.println("Error: " + e);
            System.exit(2);
        }
        System.exit(0);
    }
}

Bill


