New Bilingual Corpus

Mon May 28 07:23:39 UTC 2001

Dear Info-CHILDES,

  I am happy to announce the addition to CHILDES of a large new longitudinal
corpus of data on the bilingual acquisition of English and Cantonese by
Virginia Yip and Stephen Matthews' son Timmy.  Timmy celebrated his 8th
birthday last week, and we just missed getting out this new corpus in his
honor.  Happily, though, today is his sister Alicia's first birthday,
and we are glad that we can now make the corpus publicly available in her
honor and in Timmy's.  In fact, Stephen and Virginia expect to be able to
add data from Alicia and several other bilingual English and Cantonese
children in the near future.

  The corpus can be retrieved from http://childes.psy.cmu.edu in the
bilingual folder.  It is available as hongkong.zip and hongkong.sit.  The files
use Cantonese encoding for the %can line and there are instructions in the
readme about how to properly view Cantonese characters under Windows.  The
files also include a full %mor line using a tag set that is fully described
in the readme file.  (The formatting will be prettier in Word or in the full
database manual, of course.)

  This is now the largest longitudinal database for bilingual acquisition in
CHILDES.  Moreover, Stephen and Virginia have produced three sample audio
files that are linked to transcripts.  These sample files can be downloaded
(perhaps later today) from http://childes.psy.cmu.edu/audio

   For an excellent discussion of some of the patterns of language
interaction in this corpus, please see:
Yip, V. and S. Matthews. (2000) Syntactic transfer in a bilingual child.
    Bilingualism: Language and Cognition 3.3, 193-208.

Thanks to Virginia, Stephen, and Huang Yue-Yuan for this excellent
contribution to CHILDES.

--Brian MacWhinney

The Hong Kong Bilingual Corpus: Longitudinal data for Timmy
(1;05.20-3;06.25)

Contributors:
Virginia Yip 
Chinese University of Hong Kong

Stephen Matthews
University of Hong Kong

Huang Yue-Yuan
Hong Kong Baptist University

The corpus of Timmy's bilingual development is the first part of a project
investigating a total of five children exposed to both Cantonese and English
regularly from birth. The regular audio recording of Timmy took place
between November 1994 and December 1996.

The subject is the first-born of three siblings, born in Hong Kong on 21 May
1993. His first sister was born when he was 2;09.07 and his second sister
when he was 7;00.07. His mother is a native speaker of Hong Kong Cantonese and
his father of British English. Both are professors of linguistics at
universities in Hong Kong.

Timmy's exposure to Cantonese and English began from birth. His father took
sabbatical leave in the USA when Timmy was three months old, during which time
English tapes were played to him. The primary caretakers in this period
were Timmy's maternal grandmother, his mother and a Mandarin-speaking
domestic helper. His parents took him to Los Angeles from age seven months
to one year. He then spent the summer of 1994 in Canada, the UK and briefly
in France. By the time regular audio recording started at 1;05.20, the
live-in domestic helper was a Filipino woman who spoke fluent English. A
trip to Australia was made at 3;01.17 and he visited his paternal relatives
in England for three weeks at 3;02.28.

The parents followed the one parent-one language principle when addressing
the child. The language between the parents is mainly Cantonese with a great
deal of English mixed in, as is characteristic of the speech of Hong Kong
middle-class families. Despite the one parent-one language principle, the
quantity of input from the two languages is by no means balanced: on the
whole, Timmy had more Cantonese than English input in his first three years.
The language of the community is Cantonese and the extended family (maternal
grandmother and relatives) also speak Cantonese (and in some cases Chiu
Chow). Regular input in English came solely from the father and the family's
Filipino domestic helper, while English-speaking relatives visited only
occasionally. In a number of recording sessions he showed a preference for
using Cantonese, even when the research assistants tried to induce him to
speak English.

In addition to Cantonese, Chiu Chow (or Chaozhou) was spoken by the child's
grandmother and some relatives. The ancestral language of a sizeable
minority in Hong Kong, Chiu Chow is spoken in eastern Guangdong province and
belongs to the southern Min dialect group. Although diverging from Cantonese
in many respects, it shares the broad typological characteristics which
contrast with English. The child showed some passive knowledge of Chiu Chow
but never produced more than occasional words.

The recording continued on a weekly basis until Timmy was 3;06.25, except
when he was away from home on a trip. Transcriptions are initially available
on an approximately biweekly basis, as a number of tapes still await
transcription. Unless otherwise stated, most of the files contain
transcription of one side of the tape, i.e. about thirty minutes of recorded
interaction between the child and other participants. Certain recordings are
unusable for reasons such as technical failure of the recording
instruments or failure to elicit the less preferred language on a few
occasions. The subject was reserved and sensitive as a child, which is
reflected in some of his transcripts as he at times became taciturn.

The files are classified as (I) mixed (File nos. 1-13), (II) Cantonese (File
nos. 14-47) and (III) English (File nos. 48-85). The early mixed files
involve natural interaction between the child, investigators and members of
the family, without conscious prompting of either language. In these mixed
files, a great deal of code-switching occurs during the course of
conversation both on the part of the child and adult speakers. Subsequently,
we tried to elicit one language at a time, e.g. in the first half hour of
recording, English was spoken by one research assistant (RA) in order to
elicit English, while the other RA used Cantonese in the second half hour to
elicit Cantonese. The RAs who interacted with Timmy were all native speakers
of Cantonese except Linda Peng Ling Ling, who is a native speaker of
Mandarin and used primarily English in the later recording sessions. All the
RAs speak English as their second language.

In practice, this one person-one language strategy did not always work as
intended for elicitation purposes. As a result, one or more adults present at
the recording may speak both English and Cantonese to the child, who in
turn code-mixes from time to time. Hence some files, especially the early
ones in the Cantonese category, actually contain a
considerable amount of English and language mixing. As the child's languages
develop, the division into Cantonese and English files can be made more
easily. Spontaneous speech data were recorded at the child's home where the
routines included activities such as role-playing, playing with toys and
reading story books.

The parents also kept a diary to supplement the audio-recording data. This
enabled the researchers to address a wider range of phenomena, as certain
structures (such as relative clauses) scarcely appear in the longitudinal
corpus data (see Yip & Matthews 2000).

To facilitate comparison with monolingual Cantonese and English data, the
data collection and corpus creation were modeled on previous work: Cancorp,
created by Lee et al. (1996), which contains data from eight monolingual
Cantonese-speaking children aged 1;5 to 3;8, and various corpora of
English-speaking children (see MacWhinney 2000).

Background on Cantonese

Cantonese is the native language of some 90% of the population of Hong Kong.
Although widely considered a dialect of Chinese, Cantonese is not mutually
intelligible with Mandarin. It differs substantially from other forms of
Chinese in grammar, as well as phonology and lexicon. There are six tones
in Hong Kong Cantonese. The romanization used is the JyutPing system
developed by the Linguistic Society of Hong Kong. IPA and Yale romanization
equivalents are given in Matthews and Yip (1994:400-401).
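
The romanized forms in this readme (for example hou2teng1) write each
syllable as letters followed by a tone digit from 1 to 6. As a minimal
sketch, assuming that every syllable in a word follows this pattern, a word
can be split into syllables and tones as shown here. The function name
split_jyutping is hypothetical and is not part of CLAN or the corpus tools.

```python
import re

# One syllable: lowercase letters followed by a single tone digit 1-6.
SYLLABLE = re.compile(r"([a-z]+)([1-6])")


def split_jyutping(word):
    """Split a romanized word into (segment, tone) pairs.

    Raises ValueError if any part of the word is not letters + tone digit.
    """
    pairs = []
    pos = 0
    for match in SYLLABLE.finditer(word):
        if match.start() != pos:
            raise ValueError(f"unexpected text in {word!r} at position {pos}")
        pairs.append((match.group(1), int(match.group(2))))
        pos = match.end()
    if pos != len(word) or not pairs:
        raise ValueError(f"not a tone-marked word: {word!r}")
    return pairs


print(split_jyutping("hou2teng1"))  # [('hou', 2), ('teng', 1)]
```

Forms without a tone digit, such as English words in mixed utterances, are
rejected rather than guessed at, so the caller decides how to treat them.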

Grammatical categories

The grammatical category labels for the English corpus are based on the MOR
grammars for English in the CHILDES Windows Tools, while those for the
Cantonese corpus are based on those of Cancorp, with thirty-three categories
distinguished, as shown in Table 1 (see MacWhinney 2000:364-365). These are
as used in Cancorp apart from the following modifications:
(i) the category 'particle' (prt) rather than 'clitic' is used for the
postverbal modal dak1 and postverbal dou3 introducing an extent complement;
(ii) the category 'localizer' (loc) is used for locative expressions such as
dou6 as in zoeng1 toi2 dou6 '(lit.) the table there' as well as for
expressions such as haa6bin6 'down there' which are tagged as locative noun
phrases (nnloc) in Cancorp;
(iii) the category 'onomatopoeic expression' (onoma) is introduced in our
Cantonese corpus for sounds such as wo1wo1 'barking of dogs' and baang4
'crashing/shooting noise';
(iv) the category 'ditransitive verb' (vd) is applied only to verbs which
allow two NP objects such as bei2 'give', excluding other three-place
predicates such as baai2 'put'.

Table 1 Grammatical categories for the Cantonese corpus

No.  Tag    Category
            Examples
            Gloss

1.   adj    adjective
            sau3, leng3, faai3, hou2teng1
            thin, pretty, fast, good to listen to
2.   advf   focus adverb
            dou1, sin1, jau6, zung6
            also, first, again, still
3.   advi   adverb of intensity
            gam3, hou2, taai3, zeoi3
            so, very, too, most
4.   advm   adverb of manner
            gwaai1gwaai1dei2, maan6maan2
            obediently, slowly
5.   advs   sentential adverb
            jan1wai6, so2ji5, bat1jyu4
            because, therefore, how about
6.   asp    aspectual marker
            zo2, gwo6, gan2, hoi1, haa2
            PFV, EXP, PROG, HAB, DEL
7.   aux    auxiliary/modal verb
            jing1goi1, wui5, ho2ji5, m4hou2
            should, would, can, don't
8.   cl     classifier
            bun2, go3, gaa3, tiu4
            CL
9.   com    comparative morpheme
            di1 as in leng3 di1, gwo6 as in leng3 gwo6 keoi5
            more beautiful, prettier than her
10.  conj   connective
            ding6hai6, tung4maai4, waak6ze2
            or, and, or
11.  corr   correlative
            jat1lou6...jat1lou6, jyut6...jyut6
            while, the more...the more
12.  det    determiner
            li1, go2, dai6
            this, that, number
13.  dir    directional verb
            lei4/lai4, heoi3, ceot1, jap6, soeng5, lok6
            come, go, out, in, go up, go down
14.  ex     expressive utterance
            ai1jaa3, e3, m4goi1
            oops, well, please/thanks
15.  gen    genitive marker
            ge3 as in Timmy ge3 pang4jau5
            Timmy's friends
16.  ins    emphatic inserted marker
            gwai2 as in gam3 gwai2 lyun6
            what a mess!
17.  loc    localizer
            dou6 as in zoeng1 toi2 dou6, soeng6min6
            on the table, up there
18.  nn     noun
            ce1, wun6geoi6, sing1sing1, kau3fu2
            car, toy, stars, uncle
19.  nnpr   pronoun
            ngo5, lei5, keoi5, ngo5dei6, lei5dei6, keoi5dei6
            I/me, you, s/he, we/us, you(pl), they/them
20.  nnpp   proper noun
            ciu1jan4, je4sou1, jing1gwok3
            Superman, Jesus, Britain
21.  neg    negative morpheme
            m4, mai6, mou5
            not, not, not have
22.  onoma  onomatopoeic expression
            wou1wou1, baang4, gok4gok4
            ONOMA
23.  prt    (postverbal) particle
            dak1, dou3, saai3, maai4, jyun4
            can, until, all, as well, finish
24.  prep   preposition
            hai2, bei2
            at, for
25.  q      quantifier
            jat1, sap6saam1, mui5
            one, thirteen, each
26.  rfl    reflexive pronoun
            zi6gei2
            self
27.  sfp    sentence-final particle
            aa3, laa1, gaa3, ho2
            SFP
28.  vd     ditransitive verb
            bei2, sung3
            give, give (as a gift)
29.  verg   ergative (unaccusative) verb
            dit3, tyun5
            fall, break
30.  vf     function verb
            hai6, jau5
            be, have
31.  vi     intransitive verb
            siu3, jau1sik1, kei4tou2
            smile, rest, pray
32.  vt     transitive verb
            sik6, gong2, zi1dou6
            eat, say, know
33.  wh     wh-phrase
            bin1go3, mat1je5(me1), bin1dou6, dim2gaai2
            who, what, where, why
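
The thirty-three labels in Table 1 can serve as a checklist when processing
the Cantonese %mor tier. The sketch below counts tags in one tier line and
reports tokens whose tag is not in the table. It assumes space-separated
tokens of the form tag|form, which is the general CHAT convention but is not
confirmed by this readme for these files. The function name tally_tags and
the sample line are constructed for illustration and are not taken from the
corpus.

```python
from collections import Counter

# The 33 category labels listed in Table 1.
CANTONESE_TAGS = {
    "adj", "advf", "advi", "advm", "advs", "asp", "aux", "cl", "com",
    "conj", "corr", "det", "dir", "ex", "gen", "ins", "loc", "nn",
    "nnpr", "nnpp", "neg", "onoma", "prt", "prep", "q", "rfl", "sfp",
    "vd", "verg", "vf", "vi", "vt", "wh",
}


def tally_tags(mor_line):
    """Count Table 1 tags in a %mor line; also return unrecognized tokens.

    Assumption: tokens are space-separated and look like tag|form.
    """
    body = mor_line
    if body.startswith("%mor:"):
        body = body[len("%mor:"):]
    counts = Counter()
    unknown = []
    for token in body.split():
        tag = token.split("|", 1)[0]
        if tag in CANTONESE_TAGS:
            counts[tag] += 1
        else:
            unknown.append(token)
    return counts, unknown


# Constructed example line, not an utterance from the corpus.
counts, unknown = tally_tags("%mor:\tnnpr|ngo5 vt|sik6 asp|zo2 sfp|laa1")
print(dict(counts), unknown)
```

Punctuation and English MOR tags would land in the unknown list, so in
practice that list is a prompt for inspection rather than an error report.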

Morpheme tier %mor 

The %mor tier was generated using a tagging program developed by Lawrence
Cheung. Since Cantonese has many homophonous morphemes, it was necessary to
carry out disambiguation with respect to word class. The disambiguation and
checking were performed by Gene Chu and Simon Huang for both Cantonese and
English files. 

Cantonese Tier %can

The child's Cantonese was first transcribed using romanized Cantonese
instead of Chinese characters. The %can tier was generated at a later stage
to provide readers who can read Chinese characters with quicker access to
the speakers' utterances. Fonts for Cantonese characters are available at
the Hong Kong SAR government website, http://www.5c.org/ as well as through
Microsoft.

The same characters are used for allophonic representations of a morpheme.
Due to ongoing sound changes, there is variation especially between n/l and
ng/Ø (Matthews and Yip 1994: 29-30). For example, the first person pronoun
is represented as ngo5 in the corpus but is often pronounced o5. The second
person pronoun is represented as lei5 although the prescribed form is nei5.
For the demonstrative there are several variant forms: li1/ni1/ji1/nei1/lei1
'this'. The experiential aspect marker may appear as gwo3 or go3. Other
alternative forms result from contraction, for example mat1je5 'what'
becomes me1 and hou2 m4 hou2 'is it okay?' becomes hou2 mou2.
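
Because one morpheme can surface under several spellings, a search for a
single form may miss tokens. As a hedged sketch, the single-word variants
named in the paragraph above can be grouped so that a query for any one
spelling expands to all of them. The names VARIANTS and expand_query are
hypothetical, and the table covers only the forms mentioned in this readme.

```python
# Variant spellings named in the text, keyed by the first form listed there.
VARIANTS = {
    "ngo5": {"ngo5", "o5"},                        # first person pronoun
    "lei5": {"lei5", "nei5"},                      # second person pronoun
    "li1": {"li1", "ni1", "ji1", "nei1", "lei1"},  # demonstrative 'this'
    "gwo3": {"gwo3", "go3"},                       # experiential aspect marker
    "mat1je5": {"mat1je5", "me1"},                 # 'what' and its contraction
}


def expand_query(form):
    """Return every listed variant of `form`, or just `form` if none is listed."""
    for variants in VARIANTS.values():
        if form in variants:
            return sorted(variants)
    return [form]


print(expand_query("o5"))    # ['ngo5', 'o5']
print(expand_query("sik6"))  # ['sik6']
```

Note that go3 is also listed as a classifier in Table 1, so an expanded
search for the aspect marker is best combined with the word class on the
%mor tier. The multiword contraction hou2 mou2 is left out of this sketch.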

Sound-linked files

As an initial demonstration of how transcripts can be read and heard
simultaneously using CLAN, a total of three sample audio files (two English,
one Cantonese) linked to excerpts of transcripts are provided. Subject to
sufficient funding, we hope to make further audio files available at a
later date, as well as to provide English glosses for the Cantonese
transcripts.

Acknowledgments

Longitudinal data of Timmy's language development were collected as part of
two projects funded by the Research Grants Council of Hong Kong: (1) RGC
ref. no. HKU336/94H to Stephen Matthews (University of Hong Kong), Virginia
Yip (Chinese University of Hong Kong) and Huang Yue-Yuan (Hong Kong Baptist
University) and (2) RGC ref. no. CUHK4402/97H to Virginia Yip and Stephen
Matthews. We gratefully acknowledge the help of our students and colleagues,
especially Linda Peng Ling Ling, Bella Leung, Lawrence Cheung, Simon Huang,
Gene Chu, Patricia Man, Winnie Chan, Betty Chan, Tommi Leung, Peggy Leung,
Chen Ee San, Shirley Sung, Uta Lam, Richard Wong and Angel Chan: a dedicated
team who became part of the family and friends of the children. The advice
of Brian MacWhinney in the last stages of the preparation of the corpus was
most timely and indispensable.

Please cite:
Yip, V. and S. Matthews. (2000) Syntactic transfer in a bilingual child.
    Bilingualism: Language and Cognition 3.3, 193-208.

References

Lee, T. H.T., Wong, C.H., Leung, C.S., Man, P., Cheung, A., Szeto, K., and
    Wong, C.S.P. (1996) The development of grammatical competence in
    Cantonese-speaking children. Report of a project funded by Research
    Grants Council, Chinese University of Hong Kong.
MacWhinney, B. (2000) The CHILDES Project: Tools for analyzing talk.
    3rd edition. Mahwah, N.J.: Lawrence Erlbaum Associates.
Matthews, S. and Yip, V. (1994) Cantonese: A comprehensive grammar.
    London: Routledge.
Yip, V. and Matthews, S. (2000) Syntactic transfer in a bilingual child.
    Bilingualism: Language and Cognition 3.3, 193-208.

Contact address:
Prof. Virginia Yip
Dept. of Modern Languages & Intercultural Studies
Chinese University of Hong Kong
Shatin, N.T.
Hong Kong
vyip at humanum.arts.cuhk.edu.hk

Inventory of Files

I. Mixed files (no. 1-13)
File no.    File name (Tiyymmdd)    Age of CHI
1.     Ti941110    1;05.20
2.     Ti941201    1;06.10
3.     Ti941215    1;06.24
4.     Ti941222    1;07.01
5.     Ti950113    1;07.23
6.     Ti950127    1;08.06
7.     Ti950216    1;08.26
8.     Ti950309    1;09.16
9.     Ti950323    1;10.02
10.     Ti950421    1;11.00
11.     Ti950512    1;11.21
12.     Ti950525    2;00.04
13.     Ti950629    2;01.08

II. Cantonese files (no. 14-47)
File no.    File name (Tiyymmdd)    Age of CHI
14.     Ti950713    2;01.22
15.     Ti950720    2;01.29
16.     Ti950810    2;02.20
17.     Ti950817    2;02.27
18.     Ti950907    2;03.17
19.     Ti951005    2;04.14
20.     Ti951019    2;04.28
21.     Ti951102    2;05.12
22.     Ti951116    2;05.26
23.     Ti951130    2;06.09
24.     Ti951207    2;06.16
25.     Ti951221    2;07.00
26.     Ti960104    2;07.14
27.     Ti960118    2;07.28
28.     Ti960208    2;08.18
29.     Ti960229    2;09.08
30.     Ti960314    2;09.22
31.     Ti960328    2;10.07
32.     Ti960418    2;10.28
33.     Ti960503    2;11.12
34.     Ti960516    2;11.25
35.     Ti960530    3;00.09
36.     Ti960606    3;00.16
37.     Ti960613    3;00.23
38.     Ti960621    3;01.00
39.     Ti960704    3;01.13
40.     Ti960724    3;02.03
41.     Ti960816    3;02.26
42.     Ti961006    3;04.15
43.     Ti961021    3;05.00
44.     Ti961104    3;05.14
45.     Ti961118    3;05.28
46.     Ti961202    3;06.11
47.     Ti961216    3;06.25

III. English files (no. 48-85)
File no.    File name (Tiyymmdd)    Age of CHI
48.     Ti950616    2;00.26
49.     Ti950623    2;01.02
50.     Ti950713    2;01.22
51.     Ti950817    2;02.27
52.     Ti950907    2;03.17
53.     Ti950928    2;04.07
54.     Ti951005    2;04.14
55.     Ti951012    2;04.21
56.     Ti951019    2;04.28
57.     Ti951026    2;05.05
58.     Ti951102    2;05.12
59.     Ti951109    2;05.19
60.     Ti951130    2;06.09
61.     Ti951221    2;07.00
62.     Ti951228    2;07.07
63.     Ti960118    2;07.28
64.     Ti960125    2;08.04
65.     Ti960208    2;08.18
66.     Ti960215    2;08.25
67.     Ti960307    2;09.15
68.     Ti960314    2;09.22
69.     Ti960321    2;10.00
70.     Ti960328    2;10.07
71.     Ti960411    2;10.21
72.     Ti960418    2;10.28
73.     Ti960503    2;11.12
74.     Ti960509    2;11.18
75.     Ti960530    3;00.09
76.     Ti960621    3;01.00
77.     Ti960627    3;01.06
78.     Ti960704    3;01.13
79.     Ti960724    3;02.03
80.     Ti961006    3;04.15
81.     Ti961021    3;05.00
82.     Ti961104    3;05.14
83.     Ti961118    3;05.28
84.     Ti961202    3;06.11
85.     Ti961216    3;06.25
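
Each file name encodes the recording date as Tiyymmdd, and the text above
gives Timmy's date of birth as 21 May 1993, so the age column can be
recomputed from the file name. The sketch below does this in the y;mm.dd
format. The borrowing rule (when the day difference is negative, add the
length of the month before the recording month) is an assumption inferred
from the listed ages, not a documented CLAN procedure, and the function
names are hypothetical.

```python
import calendar
from datetime import date

BIRTH = date(1993, 5, 21)  # Timmy's date of birth as given in the text


def file_date(name):
    """Parse a file name of the form Tiyymmdd (all recordings fall in 19yy)."""
    if len(name) != 8 or not name.startswith("Ti") or not name[2:].isdigit():
        raise ValueError(f"not a Tiyymmdd file name: {name!r}")
    return date(1900 + int(name[2:4]), int(name[4:6]), int(name[6:8]))


def chat_age(on, birth=BIRTH):
    """Return the age on date `on` as y;mm.dd."""
    if on < birth:
        raise ValueError("recording date precedes date of birth")
    years = on.year - birth.year
    months = on.month - birth.month
    days = on.day - birth.day
    if days < 0:
        # Assumed rule: borrow the length of the month before the recording.
        months -= 1
        if on.month > 1:
            prev_year, prev_month = on.year, on.month - 1
        else:
            prev_year, prev_month = on.year - 1, 12
        days += calendar.monthrange(prev_year, prev_month)[1]
    if months < 0:
        years -= 1
        months += 12
    return f"{years};{months:02d}.{days:02d}"


for name in ("Ti941110", "Ti950113", "Ti960314", "Ti961216"):
    print(name, chat_age(file_date(name)))
```

For these four file names the sketch gives 1;05.20, 1;07.23, 2;09.22 and
3;06.25, matching the inventory. It is written with a birth day of 21 in
mind; a birth day of 29 to 31 would need extra care in short months.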