new Sesotho corpus

Brian MacWhinney macw at mac.com
Tue Aug 17 14:26:35 UTC 2004


Dear Info-CHILDES,
     I am happy to announce the addition to CHILDES of a corpus of 
Sesotho child language from Katherine Demuth at Brown University.  This 
is the first Bantu language corpus in CHILDES and also the first corpus 
from an indigenous African language.  Audio files are also available, 
but they are not yet linked to the transcripts.  A brief description of 
the corpus follows.  A fuller description is available online.  Many 
thanks to Katherine for this important contribution.

--Brian MacWhinney

The Demuth Sesotho Corpus was compiled by Katherine Demuth in the 
southern African country of Lesotho from 1980 to 1982. Data were collected in 
a small Lesotho mountain village of 550 people in the district of 
Mokhotlong, where it was possible to establish close rapport with both 
the children and their families. The Corpus contains a longitudinal 
study of four target children’s language development as they interacted 
with members of the extended family including mothers and/or 
grandmothers, an uncle and occasionally the father (in one family), and 
especially older siblings, cousins, and peers. These target children 
are:  Hlobohang (boy) 2;1-3;0, Litlhare (girl) 2;1-3;2, ‘Neuoe (girl) 
2;4-2;9, and Tsebo (girl) 3;8-4;1.  The two older girls were cousins 
living in the same household, and were therefore recorded together.  
Monthly recording sessions of spontaneous speech lasted 3-4 hours each 
and continued for approximately one year, yielding a corpus of 98 hours of 
speech with approximately 13,250 utterances containing lexical 
verbs, or approximately half a million morphemes.

Broad phonemic transcription was conducted by Katherine Demuth with the 
assistance of the mothers and grandmothers as soon as recording 
sessions were complete.  These transcripts were then verified 
independently by a researcher at the National University of Lesotho. 
The original transcription was done by hand.  The data were subsequently 
computerized, and one third of the corpus was hand-tagged by Sesotho 
speakers at Brown University in the 1990s.  A computational 
morphological parser was then developed (with the assistance of Mark 
Johnson, Brown University) to tag the remaining part of the corpus. 
The files were then converted to CHILDES format, and the audio tapes were 
digitized.  The audio files remain to be linked to the transcripts.

The corpus should therefore be accessible to, and of broad interest for, 
researchers of child language and Bantu linguistic structures, as well as 
computational linguists interested in morphological parsing and/or 
machine translation.  Collection of this corpus was supported in part 
by Fulbright and SSRC (Social Science Research Council) dissertation 
funding.  Computerization and tagging of the corpus have been supported 
in part by NSF grants BNS-08709938 and SBR-9727897.  I thank all who 
have assisted with this research over the years, and the children and 
families who provided the data.

Those wishing to use this corpus should notify 
Katherine_Demuth at brown.edu, and cite the following reference:

Demuth, K.  1992.  Acquisition of Sesotho.  In D. Slobin (ed.), The 
Cross-Linguistic Study of Language Acquisition, vol. 3, 557-638.  
Hillsdale, N.J.: Lawrence Erlbaum Associates.