finding information on CLAN utility programs

Brian MacWhinney macw at cmu.edu
Fri Mar 5 01:49:16 UTC 2004


Dear Diane
  Here is a brief description.  As you will see, some are used so
infrequently that I have even forgotten exactly what they do.  However,
others are quite handy.
I will put this into the manual, after I clarify a few fuzzy spots.

--Brian

COMBTIER  corrects a problem that typically arises when transcribers create
several %com lines.  It combines two %com lines into one by removing the
second header and moving the material after it into the tier for the first
%com.
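
As a rough illustration of the merge described above, here is a minimal Python sketch. It is not CLAN's code: the function name combine_com_tiers is hypothetical, and it assumes each tier sits on a single line and that the merged text is joined with one space.

```python
def combine_com_tiers(lines):
    """Merge each %com tier that directly follows another %com tier."""
    out = []
    prev_is_com = False
    for line in lines:
        if line.startswith("%com:") and prev_is_com:
            # Drop the second header and append its text to the first %com tier.
            text = line[len("%com:"):].strip()
            out[-1] = out[-1].rstrip() + " " + text
        else:
            out.append(line)
            prev_is_com = line.startswith("%com:")
    return out
```

Because prev_is_com stays set after a merge, a run of three or more %com lines also collapses into the first one.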

CP2UTF converts code page ASCII files into UTF-8 Unicode files. If there is
an @Font tier in the file, the program uses this to guess the original
encoding.  If not, it may be necessary to add the +o switch to specify the
original language, as in +opcct for Chinese traditional characters on the
PC.  If the file already has an @UTF8 header, the program will not run.  The
+c switch uses the unicode.cut file in the Library directory to effect
translation of ASCII to Unicode for IPA symbols, depending on the nature of
the ASCII IPA being used. For example, +c3 does a translation from IPAPhon.
The +t@u switch forces the IPA translation to affect main line forms in the
text@u format.
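
The core re-encoding step can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's actual logic: the function name is hypothetical, cp950 is assumed to be the code page behind traditional Chinese on the PC, and prepending an @UTF8 line to the output is also an assumption. The @Font guessing and the IPA translation are not modeled.

```python
def code_page_to_utf8(raw, encoding="cp950"):
    """Re-encode code page bytes as UTF-8; refuse input already marked @UTF8."""
    if any(line.strip() == b"@UTF8" for line in raw.splitlines()):
        raise ValueError("file already has an @UTF8 header")
    text = raw.decode(encoding)  # interpret the bytes in the original code page
    # Assumption: the converted file is marked with a leading @UTF8 header.
    return ("@UTF8\n" + text).encode("utf-8")
```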

DATACLEAN is no longer used.

DELIM inserts a period at the end of every main line if it does not
currently have a final delimiter.  It can also be used to insert final
periods on other lines.
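
A minimal Python sketch of the main-line case follows. The function name is hypothetical, and the sketch assumes that a final delimiter is one of ".", "?" or "!" at the very end of the line, with nothing trailing it.

```python
FINAL_DELIMITERS = (".", "?", "!")

def add_final_period(line):
    """Append ' .' to a main line (one starting with '*') that lacks a final delimiter."""
    stripped = line.rstrip()
    if stripped.startswith("*") and not stripped.endswith(FINAL_DELIMITERS):
        return stripped + " ."
    return line
```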

Dos2Unix converts DOS line endings (carriage return plus line feed) to
Unix-style line endings (line feed only).
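
In Python, the same kind of line-ending conversion can be sketched in one line. The function name is hypothetical, and working on raw bytes is a choice made here so that the text encoding is left untouched.

```python
def dos_to_unix(data):
    """Replace DOS line endings (CR LF) with Unix line endings (LF) in raw bytes."""
    return data.replace(b"\r\n", b"\n")
```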

FIXCA is used to change the overlap markers in CA files to raised and
lowered form.

FIXIT is used to break up tiers with multiple utterances into standard
format with one utterance per main line.
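
The splitting idea can be sketched in Python as below. This is only an approximation of what such a tool has to do: the function name is hypothetical, and it assumes the speaker code is followed by a tab and that ".", "?" and "!" appear as separate space-delimited tokens.

```python
def split_utterances(line):
    """Split one main line holding several utterances into one line per utterance."""
    speaker, _, text = line.partition("\t")
    utterances, current = [], []
    for token in text.split():
        current.append(token)
        if token in {".", "?", "!"}:
            # A final delimiter closes the current utterance.
            utterances.append(speaker + "\t" + " ".join(current))
            current = []
    if current:
        # Keep any trailing material that has no final delimiter.
        utterances.append(speaker + "\t" + " ".join(current))
    return utterances
```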

INSERT creates the new XML-oriented @ID headers.  It uses information in the
@Participants and @Languages lines, as well as @Sex, @Age, and @Group to
construct these lines.  The name of the corpus should be given by using the
+c switch, as in +cbrown.

LONGTIER removes line wraps on continuation lines so that each main tier and
each dependent tier is on one long line.  It is useful when cleaning up
files, since it eliminates having to think about string replacements across
line breaks.
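
A minimal Python sketch of the unwrapping step follows. It assumes that continuation lines are the ones beginning with a tab and that the pieces are rejoined with a single space; the function name is hypothetical.

```python
def join_continuations(lines):
    """Fold tab-initial continuation lines into the tier line that precedes them."""
    out = []
    for line in lines:
        if line.startswith("\t") and out:
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line.rstrip("\n"))
    return out
```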

LOWCASE is used to fix files that were not transcribed using CHAT
capitalization conventions.  Most commonly, it is used with the +c switch to
only convert the initial word in the sentence to lowercase.  To protect
certain proper nouns in first position from the conversion, you can create a
file of proper noun exclusions.
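
The initial-word conversion with an exclusion list can be sketched in Python like this. The function name and the proper_nouns parameter are hypothetical, and the sketch assumes a tab after the speaker code; it does not reproduce the program's actual file format for exclusions.

```python
def lowercase_initial_word(line, proper_nouns=frozenset()):
    """Lowercase the first word of a main line unless it is a listed proper noun."""
    speaker, sep, text = line.partition("\t")
    if not speaker.startswith("*") or not text:
        return line  # leave headers and dependent tiers alone
    first, space, rest = text.partition(" ")
    if first not in proper_nouns:
        first = first.lower()
    return speaker + sep + first + space + rest
```

In this sketch, words such as the pronoun "I" would need to be added to the exclusion set as well.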

ORT is used to convert HKU style disambiguated pinyin to CMU style in
preparation for MOR.

REPEAT inserts [/] markers between repeated words.
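
A minimal Python sketch of this marking follows. It assumes space-separated tokens and exact string matches between neighbors; the function name is hypothetical.

```python
def mark_repetitions(text):
    """Insert a [/] marker after each word that is immediately repeated."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens) and tokens[i + 1] == token:
            out.append("[/]")
    return " ".join(out)
```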

RETRACE produces a %ret line which codes consecutive utterance line
repetitions according to CHAT format.  (I'm not actually sure what this does
-- Brian)

TIERORDER puts the dependent tiers into a consistent alphabetical order.
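
The reordering can be sketched in Python as below. The function name is hypothetical, and the sketch assumes each tier is on one line and that the sort key is the tier name before the first colon.

```python
def order_dependent_tiers(lines):
    """Sort the dependent (%) tiers under each main line by tier name."""
    def tier_name(line):
        return line.split(":", 1)[0]

    out, pending = [], []
    for line in lines:
        if line.startswith("%"):
            pending.append(line)
        else:
            # A non-dependent line ends the current group of dependent tiers.
            out.extend(sorted(pending, key=tier_name))
            pending = []
            out.append(line)
    out.extend(sorted(pending, key=tier_name))
    return out
```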

UNIQ gives a list of all the words in the file with no frequency count.
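
A rough Python equivalent of such a word list is sketched below. Restricting it to main lines, skipping the delimiters ".", "?" and "!", and sorting the result are all assumptions made for the sketch; the function name is hypothetical.

```python
def unique_words(lines):
    """Return the distinct words on main lines, sorted, without frequency counts."""
    words = set()
    for line in lines:
        if line.startswith("*"):
            _, _, text = line.partition("\t")
            words.update(w for w in text.split() if w not in {".", "?", "!"})
    return sorted(words)
```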

UTF2CP converts UTF8 files back to code page format.  It only works on
Windows machines and is seldom needed.




