Noji corpus
Brian MacWhinney
macwhinn at hku.hk
Sat Aug 4 10:04:25 UTC 2001
Dear Info-CHILDES,
I am happy to announce the addition to CHILDES of a large corpus of
Japanese child language data that was collected beginning in 1948 by Junya
Noji from his son Sumihare. The current corpus was prepared and contributed
by Norio Naka and Susanne Miyata with the permission of Noji. Thanks to all
of them for this contribution.
--Brian MacWhinney
***************************
*** Noji Corpus v.1.1 *****
***************************
Susanne Miyata
Faculty of Creativity and Culture
Aichi Shukutoku University
Sakuragaoka 23, Chikusa-ku
Nagoya, Japan 464-8671
smiyata at asu.aasa.ac.jp
***History***
The Noji Corpus contains diary data collected by the Japanese linguist and
dialectologist Junya Noji. He observed his first-born son Sumihare from
birth (March 9, 1948) until the age of 7, as he was growing up in
Hiroshima. The data is based on handwritten records collected virtually
daily (2243 days over 7 years), although the focus lies on the third year.
In the later years, fewer records were taken, resulting in a lower number of
utterances available per month. A detailed description of the methodology
can be found in the printed edition (Bunka Hyoron Shuppan).
The data contains approximately 40,000 utterances by Sumihare, and about
22,000 utterances by other family members (his mother and father and his
younger brother, Teruki) and other speakers such as the children from the
neighborhood (Seejikun and Keekochan). A comment is provided for each
utterance, establishing the context and interpreting the child's utterance.
The electronic version of this data was entered, checked against the
original, and adapted to CHAT format by Norio Naka (Osaka Gakuin U.). The
final cleanup using CHECK was done by Susanne Miyata (Aichi Shukutoku U.).
***Format***
The print original uses katakana (phonetic syllable script) for the
utterances, and regular hiragana (syllabic) and kanji (Chinese characters)
for the comments, as well as a number of special symbols such as arrows
to indicate the speaker and the addressee. The electronic version was
transcribed in Hebon (the Hepburn romanization system) and separated into
words (wakachi; spoken utterances only). The format follows the Japanese
adaptation of CHAT, JCHAT 1.0 (Oshima-Takane & MacWhinney, eds., 1998).
When the data entry began in 1992, only ASCII was available within the
CHILDES system. But now, even though there is no longer any restriction
concerning the fonts, the use of Hebon (at least in the main line) has the
advantage of compatibility with programs such as MOR, and renders the data
accessible to a greater number of researchers by removing the barrier of
Japanese script.
***Warnings***
1) The wakachi (word separation) format has not yet been adjusted to the
JMOR-compatible WAKACHI99 format.
2) Words are transcribed as pronounced (e.g. 'futachu' for 'futatsu').
3) Proper names are not capitalized.
When using this corpus, please cite:
Noji, Junya. (1973-77). Yooji no gengo seikatsu no jittai I-IV.
Bunka Hyoron Shuppan.
***Table of Contents***
#########################################################
year  month  age   # of    # of utterances
                   files   SUM     others  all
                   (days)
#########################################################
1948  3      0;0   26      0       0       0
      4      0;1   29      0       0       0
      5      0;2   30      1       1       2
      6      0;3   30      6       2       8
      7      0;4   17      2       1       3
      8      0;5   17      0       0       0
      9      0;6   17      2       0       2
      10     0;7   19      0       0       0
      11     0;8   14      4       1       5
      12     0;9   16      18      2       20
1949  1      0;10  19      17      11      28
      2      0;11  27      52      27      79
#########################################################
                   261     102     45      147
#########################################################
      3      1;0   24      65      28      93
      4      1;1   20      67      31      98
      5      1;2   27      61      26      87
      6      1;3   29      110     43      153
      7      1;4   30      81      26      107
      8      1;5   31      342     97      439
      9      1;6   30      453     149     602
      10     1;7   31      436     178     614
      11     1;8   30      425     144     569
      12     1;9   31      414     125     539
1950  1      1;10  30      349     66      415
      2      1;11  28      820     146     966
#########################################################
                   341     3623    1059    4682
#########################################################
      3      2;0   31      800     137     937
      4      2;1   30      1892    571     2463
      5      2;2   31      3201    1050    4251
      6      2;3   30      1198    423     1621
      7      2;4   30      1280    557     1837
      8      2;5   31      1779    971     2750
      9      2;6   30      939     419     1358
      10     2;7   31      1317    524     1841
      11     2;8   30      1368    641     2009
      12     2;9   31      1312    727     2039
1951  1      2;10  31      991     719     1710
      2      2;11  28      771     518     1289
#########################################################
                   364     16848   7257    24105
#########################################################
      3      3;0   31      709     477     1186
      4      3;1   30      847     542     1389
      5      3;2   31      918     584     1502
      6      3;3   30      1071    792     1863
      7      3;4   31      1024    754     1778
      8      3;5   31      689     517     1206
      9      3;6   30      493     375     868
      10     3;7   31      1321    870     2191
      11     3;8   30      865     631     1496
      12     3;9   31      620     519     1139
1952  1      3;10  30      537     337     874
      2      3;11  29      497     375     872
#########################################################
                   365     9591    6773    16364
#########################################################
      3      4;0   31      576     435     1011
      4      4;1   30      523     344     867
      5      4;2   31      285     236     521
      6      4;3   30      365     206     571
      7      4;4   31      315     172     487
      8      4;5   30      242     140     382
      9      4;6   27      202     118     320
      10     4;7   31      249     169     418
      11     4;8   27      262     166     428
      12     4;9   30      612     392     1004
1953  1      4;10  29      476     348     824
      2      4;11  28      410     284     694
#########################################################
                   355     4517    3010    7527
#########################################################
      3      5;0   30      279     189     468
      4      5;1   30      366     262     628
      5      5;2   31      322     238     560
      6      5;3   29      286     186     472
      7      5;4   31      337     217     554
      8      5;5   31      362     296     658
      9      5;6   30      393     347     740
      10     5;7   26      163     161     324
      11     5;8   29      248     186     434
      12     5;9   28      343     313     656
1954  1      5;10  24      172     150     322
      2      5;11  25      167     162     329
#########################################################
                   344     3438    2707    6145
#########################################################
      3      6;0   16      97      64      161
      4      6;1   18      85      65      150
      5      6;2   18      105     77      182
      6      6;3   26      251     224     475
      7      6;4   29      359     297     656
      8      6;5   29      346     233     579
      9      6;6   25      111     115     226
      10     6;7   14      44      50      94
      11     6;8   8       23      29      52
      12     6;9   11      50      43      93
1955  1      6;10  15      47      50      97
      2      6;11  3       4       6       10
#########################################################
                   212     1522    1253    2775
#########################################################
total sum          2242    39641   22104   61745
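
The yearly subtotals in the table can be cross-checked against the
"total sum" row with a few lines of Python. This is only a minimal sketch:
the figures are copied by hand from the table, and the names used here
(subtotals, reported_total, column_sums) are illustrative and not part of
the corpus or of any CHILDES tool.

```python
# Yearly subtotals copied from the table: (days, SUM, others, all utterances).
# Keys are the age ranges covered by each block of twelve months.
subtotals = {
    "0;0-0;11": (261, 102, 45, 147),
    "1;0-1;11": (341, 3623, 1059, 4682),
    "2;0-2;11": (364, 16848, 7257, 24105),
    "3;0-3;11": (365, 9591, 6773, 16364),
    "4;0-4;11": (355, 4517, 3010, 7527),
    "5;0-5;11": (344, 3438, 2707, 6145),
    "6;0-6;11": (212, 1522, 1253, 2775),
}

# The "total sum" row as reported in the table.
reported_total = (2242, 39641, 22104, 61745)


def column_sums(rows):
    """Add up each column across all rows."""
    return tuple(sum(column) for column in zip(*rows))


# Within each year, SUM + others should equal the "all" column.
for age_range, (days, child, others, all_utt) in subtotals.items():
    assert child + others == all_utt, age_range

totals = column_sums(subtotals.values())
print(totals == reported_total)
```

With the figures as copied here, the column sums match the "total sum" row.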
*****end