new Sesotho corpus
Brian MacWhinney
macw at mac.com
Tue Aug 17 14:26:35 UTC 2004
Dear Info-CHILDES,
I am happy to announce the addition to CHILDES of a corpus of
Sesotho child language from Katherine Demuth at Brown University. This
is the first Bantu language corpus in CHILDES and also the first corpus
from an indigenous African language. Audio files are also available,
but they are not yet linked to the transcripts. A brief description of
the corpus follows. A fuller description is available online. Many
thanks to Katherine for this important contribution.
--Brian MacWhinney
The Demuth Sesotho Corpus was compiled by Katherine Demuth in the
southern African country of Lesotho from 1980-82. Data was collected in
a small Lesotho mountain village of 550 people in the district of
Mokhotlong, where it was possible to establish close rapport with both
the children and their families. The Corpus contains a longitudinal
study of four target children’s language development as they interacted
with members of the extended family including mothers and/or
grandmothers, an uncle and occasionally the father (in one family), and
especially older siblings, cousins, and peers. These target children
are: Hlobohang (boy) 2;1-3;0, Litlhare (girl) 2;1-3;2, ‘Neuoe (girl)
2;4-2;9, and Tsebo (girl) 3;8-4;1. The two older girls were cousins
living in the same household, and were therefore recorded together.
Monthly recordings of spontaneous speech, each 3-4 hours long, were made
over approximately one year. The resulting corpus comprises 98 hours of
speech, with approximately 13,250 utterances containing lexical verbs,
or approximately 1/2 million morphemes.
Broad phonemic transcription was conducted by Katherine Demuth with the
assistance of the mothers and grandmothers as soon as recording
sessions were complete. These transcripts were then verified
independently by a researcher at the National University of Lesotho.
The original transcription was done by hand. The data were subsequently
computerized, and one third of the corpus was hand-tagged by Sesotho
speakers at Brown University in the 1990s. A computational
morphological parser was then developed (with the assistance of Mark
Johnson, Brown University) to tag the remaining part of the corpus. The
files were then converted to CHILDES format, and the audio tapes were
digitized. The audio still remains to be linked to the transcripts.
The corpus should therefore be accessible to and of broad interest for
researchers of child language and Bantu linguistic structures, and for
computational linguists interested in morphological parsing and/or
machine translation. Collection of this corpus was supported in part
by Fulbright and SSRC (Social Science Research Council) dissertation
funding. Computerization and tagging of the corpus have been supported
in part by NSF grants BNS-08709938 and SBR-9727897. I thank all who
have assisted with this research over the years, and the children and
families who provided the data.
Those wishing to use this corpus should notify
Katherine_Demuth at brown.edu, and cite the following reference:
Demuth, K. 1992. Acquisition of Sesotho. In D. Slobin (ed.), The
Cross-Linguistic Study of Language Acquisition, vol 3, 557-638.
Hillsdale, N.J.: Lawrence Erlbaum Associates.