new Cantonese corpus

Brian MacWhinney macw at cmu.edu
Mon Sep 3 02:49:45 UTC 2001


Dear Info-CHILDES,
  I am happy to announce the addition to CHILDES of a particularly
fascinating new corpus on the acquisition of Cantonese and English from
Virginia Yip and Stephen Matthews of the Chinese University of Hong Kong and
Hong Kong University.  This corpus is the second part of a pair of sample
files on Cantonese-English bilingual acquisition.  The first set, from Timmy,
was contributed about three months ago and features audio linked to
transcripts which can be found at
http://childes.psy.cmu.edu/audio/HongKong/timmy/

The new corpus is from Timmy's younger sister Sophie and includes both audio
and video linked to transcripts.  For Sophie, there are two audio files in
Cantonese and one in English.  One of the Cantonese files is linked to an
MPEG video and the English file is linked to a QuickTime video.

The study is particularly interesting both for the quality of the audio and
video and for the insights it provides on bilingual acquisition.  Here is
the readme file for the new Sophie transcript.

Thanks to Virginia, Stephen, Timmy, and Sophie for this contribution.
Additional data will be added to both the Timmy and Sophie corpora over the
next year.

--Brian MacWhinney




The Hong Kong Bilingual Child Language Corpus: Longitudinal data for Sophie
(1;06.00-3;00.09)

Virginia Yip & Stephen Matthews
Chinese University of Hong Kong & University of Hong Kong

The corpus of Sophie's bilingual development is the second installment of
the Hong Kong Bilingual Child Language Corpus. Born on 28 February 1996,
Sophie is the younger sister of Timmy, the first bilingual subject to be
included in the Hong Kong bilingual corpus, born two years and nine months
earlier (a younger sister was born when she was 4;03). Sophie's mother is a
native speaker of Hong Kong Cantonese and her father of British English, and
her exposure to Cantonese and English started from birth. The one parent-one
language principle was generally followed, especially when addressing the
child, but code-mixing occurred when the parents conversed with each other,
which formed part of the child's input. Apart from parental input,
interaction with her brother took place in both Cantonese and English. She
was regularly video-taped and audio-taped by two research assistants in each
recording session, one responsible for each language, from 28 August 1997 to
28 February 2000 (1;06 - 4;00) on a weekly basis. In each recording session
one research assistant interacted with the child for approximately half an
hour in English and the other for half an hour in Cantonese.

The corpus as initially released covers transcriptions of her data from age
1;06 up to 3;00.09, on an approximately biweekly basis. Pictures chronicling
Sophie and Timmy at different stages from infancy to primary school can be
viewed at  http://www.cuhk.edu.hk/ils/home/bilingual.htm

Sophie lived in Hong Kong continuously throughout the period of recording.
She did not take her first trip abroad (to Australia) until 4;04. She was
cared for primarily by her maternal grandmother who spoke Cantonese and
ChiuChow and a Filipino domestic helper, Belma, who spoke English and some
Cantonese. At 2;6 she started attending a local Chinese kindergarten in the
mornings and, in addition, attended an English-speaking kindergarten in the
afternoon from 3;02. She continued to attend both schools until 5;01. The
kindergartens were each monolingual in the respective language.

While the circumstances are similar overall to those prevailing in Timmy's
case, Sophie's different personality and character lead to differences in
the data. While her brother was reserved and passive, she was typically
lively and talkative in recording sessions, even becoming argumentative as
she grew older. In addition, being cared for primarily by her grandmother
and remaining in Hong Kong exclusively during her preschool years means that
the predominance of Cantonese input is even greater in her case than in
Timmy's. This is reflected in the fact that while recordings eliciting both
languages are available from age 1;06.00, these early recordings are
dominated by Cantonese with occasional English words, and she only began to
produce English sentences after age 2. While in many respects her
development recapitulates that described for Timmy (such as wh in situ, null
objects and prenominal relative clauses: see Yip & Matthews 2000), her
English also shows some forms of transfer which are not evident in Timmy,
such as extension of the verb give to permissive and even passive usages.

Since her grandmother speaks Chaoyang (Chaozhou) dialect as well as
Cantonese, Sophie developed some passive knowledge of this dialect. She
learnt that producing occasional phrases in Chaozhou was a source of
amusement, but did not produce full sentences. There is also the possibility
of syntactic influence from Chaozhou, for example in the ordering of double
objects.

The parents kept a diary of Sophie's utterances to supplement the audio and
video-recording data. The diary continued beyond age 4, when regular
recording ceased, in order to follow up some of the features.

The format of the English and Cantonese data is as described for Timmy in
the first installment of the Hong Kong corpus: the grammatical category
labels for the English corpus are based on the MOR grammars for English in
the CHILDES Windows Tools, while the Cantonese data were tagged using a
program developed by Lawrence Cheung on the basis of the grammatical
categories used in the Hong Kong Cantonese child language corpus (Cancorp)
created by Lee et al. (1996), which contains data from eight monolingual
Cantonese-speaking children aged 1;5 - 3;8. There are three tiers with the main
tier showing Cantonese in the JyutPing romanization, and Cantonese
characters and grammatical categories shown on separate tiers. The
thirty-three grammatical categories used for tagging the corpus are listed
below in Table 1. Details of the Morpheme tier (%mor) and Cantonese tier
(%can) as well as instructions for downloading and viewing the Cantonese
characters can be found in the readme file accompanying the data for Timmy.


(table goes here -omitted here and included in the next message)

Together with this corpus, a total of three sample audio-linked transcripts
and two video-linked transcripts are available for access. The three
audio-linked transcripts feature Cantonese and English as well as some
Cantonese-English code-switching. Two of the audio-linked transcripts have
video-linked counterparts. In the shorter of these videos, the sound track
is less clear than in its audio-linked counterpart. In this shorter video
excerpt (3:00), sound quality may be
improved by adjusting the balance to turn down the right channel. The
video-linked files feature each language respectively as the base language
as well as code-switching in the longer one (4:13).

Acknowledgments

Investigation of Sophie's bilingual development was undertaken as part of
the project "A Cantonese-English Bilingual Child Language Corpus" funded by
a grant from the Research Grants Council (RGC ref. no. CUHK4002/97H) to
Virginia Yip (Chinese University of Hong Kong) and Stephen Matthews
(University of Hong Kong). We gratefully acknowledge the support and help of
the colleagues and students who have been friends and supporters of our work
over the years. Among them, special thanks are due to Huang Yue Yuan, Linda
Peng Ling Ling, Bella Leung, Lawrence Cheung, Simon Huang, Gene Chu, Betty
Chan, Chen Ee San, Shirley Sung, Emily Ma, Uta Lam, Richard Wong and Angel
Chan: a dedicated team who became part of the family and friends of the
children. Brian MacWhinney's impressive technical know-how and practical
tips have greatly facilitated the completion of the corpus and production of
the audio and video clips. His sabbatical in Hong Kong during 2000-2001 has
made all the difference to every aspect of the corpus.

Please cite:
Matthews, S. & V. Yip. (Forthcoming) Relative clauses in early bilingual
development: transfer and universals. In Giacalone, A. (ed.) Typology and
Second Language Acquisition. Mouton de Gruyter.

For this release, there are 80 files in total, half in Cantonese and half
in English. Each recording date has two files, one per language, since both
sessions were recorded on the same day. Though there seems to be a perfect
symmetry in
terms of the files in each language, it should be noted that in the early
English files before Sophie turned 2, she did not yet speak English fluently
despite the investigators' elicitation in English. The file name is made up
of Sophie's initial S, followed by the initial that stands for the language,
either c for Cantonese or e for English, followed by the year, month and
day of recording. For example, Sc970828 refers to the Cantonese file
containing the recording made on 28 August 1997 and Se970828 refers to the
English file for the recording made on the same date. Thus each of the 80
files has a unique file name.
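
The naming scheme just described can be decoded mechanically. The following
is a minimal Python sketch, not part of the CHILDES tools: the function name
parse_file_name, the RecordingFile type, and the rule for expanding the
two-digit year (90-99 read as 19xx, anything else as 20xx) are assumptions
made for illustration only.

```python
import re
from datetime import date
from typing import NamedTuple


class RecordingFile(NamedTuple):
    """Decoded parts of a Sophie file name (hypothetical helper type)."""
    language: str
    recorded: date


# S + language initial (c or e) + yymmdd, as described in the readme.
_NAME_PATTERN = re.compile(r"^S([ce])(\d{2})(\d{2})(\d{2})$")
_LANGUAGES = {"c": "Cantonese", "e": "English"}


def parse_file_name(name: str) -> RecordingFile:
    """Split a name such as 'Sc970828' into language and recording date."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a Sophie file name: {name!r}")
    initial, yy, mm, dd = match.groups()
    # Assumption: two-digit years 90-99 mean 19xx, all others mean 20xx.
    year = (1900 if int(yy) >= 90 else 2000) + int(yy)
    return RecordingFile(_LANGUAGES[initial], date(year, int(mm), int(dd)))


if __name__ == "__main__":
    for file_name in ("Sc970828", "Se970828"):
        parsed = parse_file_name(file_name)
        print(file_name, parsed.language, parsed.recorded.isoformat())
```

Because the two files for one session differ only in the language initial,
sorting the decoded records by date keeps each Cantonese/English pair
together.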

Inventory of Sophie's files

File no.    File name (Scyymmdd)    File no.    File name (Seyymmdd)    Age of CHI
1.     Sc970828    41.    Se970828    1;06.00
2.     Sc970911    42.    Se970911    1;06.14
3.     Sc970925    43.    Se970925    1;06.28
4.     Sc971016    44.    Se971016    1;07.18
5.     Sc971030    45.    Se971030    1;08.02
6.     Sc971113    46.    Se971113    1;08.16
7.     Sc971127    47.    Se971127    1;08.30
8.     Sc971218    48.    Se971218    1;09.20
9.     Sc971230    49.    Se971230    1;10.02
10.     Sc980114    50.    Se980114    1;10.17
11.     Sc980205    51.    Se980205    1;11.08
12.     Sc980219    52.    Se980219    1;11.22
13.     Sc980305    53.    Se980305    2;00.07
14.     Sc980318    54.    Se980318    2;00.20
15.     Sc980403    55.    Se980403    2;01.06
16.     Sc980417    56.    Se980417    2;01.20
17.     Sc980501    57.    Se980501    2;02.01
18.     Sc980514    58.    Se980514    2;02.14
19.     Sc980529    59.    Se980529    2;03.01
20.     Sc980611    60.    Se980611    2;03.13
21.     Sc980622    61.    Se980622    2;03.24
22.     Sc980716    62.    Se980716    2;04.18
23.     Sc980724    63.    Se980724    2;04.26
24.     Sc980730    64.    Se980730    2;05.02
25.     Sc980813    65.    Se980813    2;05.16
26.     Sc980827    66.    Se980827    2;05.30
27.     Sc980909    67.    Se980909    2;06.12
28.     Sc980929    68.    Se980929    2;07.01
29.     Sc981008    69.    Se981008    2;07.10
30.     Sc981022    70.    Se981022    2;07.24
31.     Sc981105    71.    Se981105    2;08.07
32.     Sc981119    72.    Se981119    2;08.21
33.     Sc981203    73.    Se981203    2;09.05
34.     Sc981222    74.    Se981222    2;09.24
35.     Sc990107    75.    Se990107    2;10.10
36.     Sc990121    76.    Se990121    2;10.24
37.     Sc990202    77.    Se990202    2;11.05
38.     Sc990215    78.    Se990215    2;11.18
39.     Sc990302    79.    Se990302    3;00.02
40.     Sc990309    80.    Se990309    3;00.09


