<HTML>
<HEAD>
<TITLE>Re: finding information on CLAN utility programs</TITLE>
</HEAD>
<BODY>
<FONT FACE="Verdana">Dear Diane<BR>
Here is a brief description. As you will see, some are used so infrequently that I have even forgotten exactly what they do. However, others are quite handy.<BR>
I will put this into the manual, after I clarify a few fuzzy spots.<BR>
<BR>
--Brian<BR>
<BR>
</FONT><FONT FACE="Helvetica">COMBTIER corrects a problem that typically arises when transcribers create several %com lines. It combines two %com lines into one by removing the second header and moving the material after it into the tier for the first %com.<BR>
<BR>
CP2UTF converts code page ASCII files into UTF-8 Unicode files. If there is an @Font tier in the file, the program uses this to guess the original encoding. If not, it may be necessary to add the +o switch to specify the original language, as in +opcct for Chinese traditional characters on the PC. If the file already has an @UTF8 header, the program will not run. The +c switch uses the unicode.cut file in the Library directory to translate ASCII to Unicode for IPA symbols, depending on the nature of the ASCII IPA being used. For example, +c3 does a translation from IPAPhon. The +t@u switch forces the IPA translation to affect main line forms in the text@u format.<BR>
<BR>
DATACLEAN is no longer used.<BR>
<BR>
DELIM inserts a period at the end of every main line if it does not currently have a final delimiter. It can also be used to insert final periods on other lines.<BR>
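<BR>
As a rough illustration of the rule, here is a minimal Python sketch. It is not the CLAN code: the function name add_final_period is made up, and it assumes that each main line starts with * and sits on a single line, that only ., ?, and ! count as final delimiters, and that the period is set off by a space.<BR>
<PRE>
```python
# Hypothetical sketch of the DELIM idea; not CLAN's implementation.
# Assumption: only ".", "?", and "!" are treated as final delimiters.
TERMINATORS = (".", "?", "!")


def add_final_period(lines):
    """Append " ." to each main line (starting with "*") that lacks a final delimiter."""
    result = []
    for line in lines:
        text = line.rstrip()  # drop trailing whitespace and line endings
        if text.startswith("*") and not text.endswith(TERMINATORS):
            text = text + " ."
        result.append(text)
    return result
```
</PRE>
Lines that do not start with * (headers and dependent tiers) pass through unchanged in this sketch.<BR>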
<BR>
Dos2Unix converts the DOS line endings in a file (carriage return plus line feed) to Unix-style line endings (line feed only).<BR>
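<BR>
In terms of bytes, DOS files end each line with a carriage return followed by a line feed, and Unix files use a line feed alone. A minimal Python sketch of that substitution is below. The function name dos_to_unix is made up for the example, and this is not how the CLAN program itself is written.<BR>
<PRE>
```python
# Hypothetical sketch; not the CLAN Dos2Unix source.
def dos_to_unix(data):
    """Return a copy of the bytes with each CR+LF pair replaced by a single LF."""
    return data.replace(b"\r\n", b"\n")
```
</PRE>
Working on bytes rather than text keeps the sketch independent of the file's character encoding.<BR>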
<BR>
FIXCA is used to change the overlap markers in CA files to raised and lowered form.<BR>
<BR>
FIXIT is used to break up tiers with multiple utterances into standard format with one utterance per main line.<BR>
<BR>
INSERT creates the new XML-oriented @ID headers. It uses information in the @Participants and @Languages lines, as well as @Sex, @Age, and @Group, to construct these headers. The name of the corpus should be given by using the +c switch, as in +cbrown.<BR>
<BR>
LONGTIER removes line wraps on continuation lines so that each main tier and each dependent tier is on one long line. It is useful when cleaning up files, since it eliminates having to think about string replacements across line breaks.<BR>
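<BR>
The joining step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CLAN code: the function name join_continuations is made up, and it assumes that a continuation line is any line beginning with a tab and that the pieces should be joined with a single space.<BR>
<PRE>
```python
# Hypothetical sketch of the LONGTIER idea; not CLAN's implementation.
def join_continuations(lines):
    """Fold each tab-indented continuation line into the tier line before it."""
    joined = []
    for line in lines:
        if line.startswith("\t") and joined:
            # Continuation: attach its text to the previous tier with one space.
            joined[-1] = joined[-1].rstrip() + " " + line.strip()
        else:
            joined.append(line.rstrip("\n"))
    return joined
```
</PRE>
Once every tier is on one line, a plain search-and-replace no longer has to worry about a phrase being split across a line break.<BR>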
<BR>
LOWCASE is used to fix files that were not transcribed using CHAT capitalization conventions. Most commonly, it is used with the +c switch to convert only the initial word in the sentence to lowercase. To protect certain proper nouns in first position from the conversion, you can create a file of proper noun exclusions.<BR>
<BR>
ORT is used to convert HKU style disambiguated pinyin to CMU style in preparation for MOR.<BR>
<BR>
REPEAT inserts [/] markers between repeated words.<BR>
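<BR>
For example, a sequence such as "I I want" would come out as "I [/] I want". A minimal Python sketch of that idea follows. It is not the CLAN code: the function name mark_repeats is made up, and it assumes that a repeat means two adjacent words that match exactly, including case.<BR>
<PRE>
```python
# Hypothetical sketch of the REPEAT idea; not CLAN's implementation.
def mark_repeats(words):
    """Insert "[/]" between each pair of identical adjacent words."""
    marked = []
    for index, word in enumerate(words):
        if index != 0 and words[index - 1] == word:
            marked.append("[/]")
        marked.append(word)
    return marked
```
</PRE>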
<BR>
RETRACE produces a %ret line which codes consecutive utterance line repetitions according to CHAT format. (I'm not actually sure what this does -- Brian)<BR>
<BR>
TIERORDER puts the dependent tiers into a consistent alphabetical order.<BR>
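<BR>
The reordering can be sketched in Python as below. This is an illustration under stated assumptions, not the CLAN code: the function names tier_name and order_dependent_tiers are made up, and the sketch assumes that each tier sits on a single line, that dependent tiers start with %, and that sorting by the tier code before the colon is what is wanted.<BR>
<PRE>
```python
# Hypothetical sketch of the TIERORDER idea; not CLAN's implementation.
def tier_name(line):
    """Return the tier code before the first colon, for example "%mor"."""
    return line.split(":", 1)[0]


def order_dependent_tiers(lines):
    """Sort each run of dependent tiers (lines starting with "%") by tier code."""
    ordered = []
    pending = []  # dependent tiers seen since the last non-dependent line
    for line in lines:
        if line.startswith("%"):
            pending.append(line)
        else:
            ordered.extend(sorted(pending, key=tier_name))
            pending = []
            ordered.append(line)
    ordered.extend(sorted(pending, key=tier_name))
    return ordered
```
</PRE>
Because Python's sort is stable, two tiers with the same code keep their original relative order in this sketch.<BR>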
<BR>
UNIQ gives a list of all the words in the file with no frequency count.<BR>
<BR>
UTF2CP converts UTF-8 files back to code page format. It only works on Windows machines and is seldom needed.<BR>
<BR>
<BR>
</FONT>
</BODY>
</HTML>