New Dutch corpus

Brian MacWhinney macw at cmu.edu
Sun Feb 1 21:58:35 UTC 2004


Dear Info-CHILDES,
  I am happy to announce the addition to CHILDES of a new corpus of
transcripts from four children aged 4;9-5 learning Dutch.  This corpus was
contributed by Annick De Houwer.  One emphasis in the study is on features
unique to the Antwerp dialect.  Audio is available for the corpus although
the audio has not yet been linked to the transcripts.  The complete
documentation for the corpus is as follows:

****

This corpus of Dutch child language and child-directed speech was collected
in Antwerp, Belgium. Transcription and coding of the Antwerp Dutch corpus
was made possible through grants to the author from the Belgian Science
Foundation and the University of Antwerp.

The corpus consists of 15 recordings transcribed orthographically and
phonetically. Some transcripts also contain variety codes, speaker codes,
addressee codes and utterance numbers (see further below). Participants are
four children between the ages of ca. 4;9 and 5;0 (two boys, Dieter and
Michiel, and two girls, Kim and Katrien) and their families, with some other
persons occasionally present as well. The families are lower-middle to
middle-middle class. All children are addressed in some form of Dutch common
around the city of Antwerp and go to school full-time (second year of nursery
school). They are being raised monolingually. The interactions are mostly
free and spontaneous, but include some structured interactions as well, in
which the mother or father had a conversation with the 4-year-old about the
past day at school, or prompted the child to describe a picture and tell a
picture book story.

The transcripts consist of 13,602 utterances (children and adults combined).
Both adult and child utterances were phonetically and orthographically
transcribed by three separate coders: the first two made a transcript from
scratch, and the third resolved any differences between the two. For each
transcript there was at least one coder from the Antwerp area, and one coder
not from the Antwerp region. Phonetic transcription was originally carried
out in Dutch UNIBET as developed by Steven Gillis, and is fairly narrow,
especially as regards vowel sounds. However, prosody was not transcribed.
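
The double-transcription step described above can be illustrated with a
minimal sketch. It assumes the two independent transcripts are already
aligned utterance by utterance and simply lists the points where they differ,
which is what a third coder would then resolve. The function name and the
placeholder strings are hypothetical and are not part of the corpus tooling.

```python
from typing import List, Tuple


def find_disagreements(
    coder_a: List[str], coder_b: List[str]
) -> List[Tuple[int, str, str]]:
    """Return (index, version_a, version_b) for each utterance that differs.

    Assumption: both lists hold one transcription per utterance, in the
    same order, so position i in each list refers to the same utterance.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("transcripts must be aligned utterance by utterance")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(coder_a, coder_b))
        if a != b
    ]


# Placeholder strings only; these are not data from the corpus.
first_pass = ["utt one", "utt two", "utt three"]
second_pass = ["utt one", "utt 2", "utt three"]
to_resolve = find_disagreements(first_pass, second_pass)
```

Each returned tuple keeps both versions side by side, so the resolving coder
sees exactly what the two earlier coders heard.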

As most recently described in Nuyts (1989), Antwerp vowel phonemes differ
quite substantially from standard Dutch phonemes both in their type and in
their distribution. The Dutch UNIBET system first used for the phonological
transcription could not handle all the phonemes. Rather than develop a new
system, approximations were used where necessary, with an explanation in a
following %exp line of how a particular phoneme symbol was best interpreted.

The UNIBET symbols were converted to Unicode, but researchers who prefer to
work with the original UNIBET files are welcome to contact the author of the
data for more information. Also, there remain 0Xfa symbols in the Unicode
for sounds that could not be approximated with the UNIBET symbols. Finally,
the files for the child MICHIEL may contain some inaccuracies on the %pho
line with regard to the long low open vowel phoneme used in Antwerp
renderings of HIJ, MIJN and the like. Researchers wanting to work with these
data are welcome to contact the author of the data to resolve these
problems.

While Dutch standard spelling was generally used, the orthographic
transcript stays as close to the phonetic transcript as possible, and
indicates missing initial and final sounds between brackets. Where this is
not the case, and there seems to be a mismatch between the phonetic and
orthographic transcript lines, it is the phonetic line that should be taken
as most closely resembling the original utterance. Utterance lines may be
followed by comment lines. These are in Dutch.

For 10 of the 15 data files there is an additional coding line for each
utterance (5 of these are complete and double-checked; the other 5 are
provisional). This line includes the following:

- an utterance number followed by a slash
- a three-letter code, where the first letter refers to the speaker, the
  second letter refers to the kind of Dutch that is being used (variety
  neutral, or 'local', meaning that the utterance contained a form typical
  of Antwerp dialect), and the third letter refers to the addressee.

More information on these codes can be found in De Houwer, 2003
(reference below), or can be obtained directly from the author of these data
at annick.dehouwer at ua.ac.be. If the coding line indicates that the utterance
contained material coded as 'local', an explanation line follows to identify
what exactly it was in the utterance that led to that coding decision (e.g.,
a particular dialect phoneme, use of a dialect pronoun, use of specific
dialect vocabulary, etc. - see De Houwer 2003).
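
For readers who want to process the coding line programmatically, the
following is a minimal sketch only. It assumes the line content is written as
digits, a slash, and exactly three letters (for example "12/MLK"); the exact
layout and the letter values used in the files are not specified here, so the
pattern, the function name and the sample letters are all hypothetical and
should be checked against the data and De Houwer (2003).

```python
import re
from typing import NamedTuple, Optional


class UtteranceCode(NamedTuple):
    number: int      # utterance number before the slash
    speaker: str     # first letter: speaker
    variety: str     # second letter: kind of Dutch (e.g. neutral or local)
    addressee: str   # third letter: addressee


# Assumed layout: digits, a slash, then exactly three letters.
_CODE_RE = re.compile(r"^\s*(\d+)\s*/\s*([A-Za-z])([A-Za-z])([A-Za-z])\s*$")


def parse_coding_line(text: str) -> Optional[UtteranceCode]:
    """Parse one coding-line value, or return None if it does not match."""
    match = _CODE_RE.match(text)
    if match is None:
        return None
    number, speaker, variety, addressee = match.groups()
    return UtteranceCode(int(number), speaker, variety, addressee)


# Hypothetical sample value; the letters are placeholders, not real codes.
example = parse_coding_line("12/MLK")
```

Returning None for non-matching input, rather than raising, makes it easier
to scan files in which only some utterances carry a coding line.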

The data show that the following distinctions in usage emerge: 'local'
utterances containing dialect elements tend to be used when older children
and adults in the family address each other. 'Neutral' forms that are common
all over Flanders may also be used, while 'distal' features, which are clear
'imports' from a Dutch variety outside Flanders, are avoided. However,
when older children and adults address the younger members of the family,
they increase their use of neutral forms, substantially reduce their use of
local forms, and occasionally use distal forms. The younger children use
mainly utterances categorized as neutral, depending on who they are
addressing. Implications of this variation across family members for
language change are discussed. (Reference: Nuyts, Jan. (1989). Het Antwerps
vokaalsysteem: een synchronische en diachronische schets. Taal en tongval
41(1-2): 22-48.)



Researchers wishing to use these data should cite this publication:

De Houwer, Annick (2003). Language variation and local elements in family
discourse. Language Variation and Change 15: 327-347.


