parsed English data

Brian MacWhinney macwhinn at
Sat Jul 28 05:52:01 UTC 2001

Dear Info-CHILDES,

   Using a parsing program based on earlier work by Alon Lavie at the CMU
Language Technologies Institute, Kenji Sagae has begun to parse the
English CHILDES corpora.  This work first requires a successful run of the
MOR program for part-of-speech tagging and then uses a pseudo-unification
process to produce LFG (Lexical-Functional Grammar) parses for CHAT
files.  The parser is written in LISP and is not included
in CLAN.  The output of the program is a %syn line, which is virtually
impossible to read unless you are a LISP programmer who has acquired a
module for parsing nested parentheses.
   CLAN on Windows (but not Mac) includes a utility for reading these %syn
lines in an easily understood tree form, much like the trees of Windows
Explorer.  To run this utility, triple-click on any %syn line in a CLAN
file and the viewer will open.  You can then also use this viewer to step
through the %syn lines in a file one by one and press the "good" or "bad"
buttons to record whether or not you accept the parses yielded by the
program.
  I have placed an initial sample of the output of the program at
If people interested in syntactic development could take a look at this and
provide Kenji and me with feedback, we would be most appreciative.
  We have not yet written CLAN programs to search these nested-parenthesis
structures, but that is a logical next step.  There is a discussion in the
manual of how to use COMBO for this, but this is the first time that the
relevant data structures have been available, so I am guessing that more
will be needed.
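  To illustrate what searching such a structure might involve outside of
CLAN, here is a minimal Python sketch.  It is not part of CLAN or of the
parser.  The sample string is invented for illustration and is not the
actual %syn format, and parse_sexpr and find_subtrees are hypothetical
helper names.

```python
def parse_sexpr(text):
    """Parse a nested-parenthesis string into nested Python lists."""
    # Pad parentheses with spaces so a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level items
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def find_subtrees(tree, label):
    """Yield every nested list whose first element equals label."""
    for node in tree:
        if isinstance(node, list):
            if node and node[0] == label:
                yield node
            yield from find_subtrees(node, label)


# Invented example string; the real %syn layout may differ.
sample = "((SUBJ (PRED boy)) (PRED eat) (OBJ (PRED cookie)))"
tree = parse_sexpr(sample)
preds = list(find_subtrees(tree, "PRED"))
```

  Here preds collects every sub-structure labelled PRED at any depth, which
is the kind of query a dedicated search program would need to support.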
  Kenji will be producing a larger corpus soon, but these first four files
provide good examples of the shape of the data.  Eventually, when more files
are available, I will put them in a separate folder on CHILDES.

--Brian MacWhinney
