Criteria for CHILDES databases that are in the making

Brian MacWhinney macw at cmu.edu
Tue Sep 22 21:55:48 UTC 2015


Dear Alex,

     The Stockman criteria, which derive from earlier analyses by Duncan and others, are reasonable.  However, it would be best to use the CLAN manual as your guide to utterance segmentation.  I have just gone over sections 7.1 to 7.6 of the manual to make sure they cover the relevant issues.  Why don’t you grab a new copy to check this out?

Also, Nan has agreed to revise these sections of the SLP manual to include a summary of the material from those sections of the CLAN manual.

Regarding the transcription of babbling forms, it is nice to transcribe as much as you can with forms like &mmm.  Even if you are just doing a rough transcription, this at least provides a clearer skeleton of the actual production.  Marking everything as just &=babbles might be faster, but it is not as revealing.  Of course, if your transcripts are linked to audio, you can always go back later and transcribe these forms in detail.

The issue of transcribing xxx is similar.  If one can discern the phonology, it is better to transcribe the form as &bala or the like, but this is not going to be crucial for most analyses.
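
To illustrate why the choice between &-forms and xxx matters little for word counts, here is a minimal Python sketch. It rests on the assumption stated in the quoted message below, namely that nothing starting with & is counted, and it also skips xxx, a bare 0, bracketed codes, and utterance terminators. The function name count_words is hypothetical, and this is a simplification, not how CLAN itself computes counts.

```python
import re

TERMINATORS = {".", "?", "!"}


def count_words(main_line):
    """Count the tokens on a CHAT main line that this sketch treats as real words."""
    # Drop bracketed annotations such as [=! babbles].
    text = re.sub(r"\[[^\]]*\]", " ", main_line)
    count = 0
    for token in text.split():
        if token in TERMINATORS:
            continue  # utterance terminator, not a word
        if token.startswith("&"):
            continue  # &mmm, &bala, &=babbles: assumed not to count
        if token in {"xxx", "0"}:
            continue  # unintelligible material or no speech
        count += 1
    return count


print(count_words("&mmm  ma(má) &mmm &mmm  ma(má) [=! babbles] ."))
```

On the four example lines from the quoted message, this sketch gives 0, 0, 1, and 2 words, whether or not the &-forms are written out.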

The decision about what to count as a turn is mostly important for the MLT program, which allows the user to choose various ways of defining turns.
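
As a rough illustration of how the definition of a turn changes the result, the Python sketch below treats a turn as an unbroken run of utterances by the same speaker and reports utterances per turn. This is only one possible definition, chosen here as an assumption. The function name and the (speaker, text) input format are hypothetical, and the code does not reproduce the MLT program or its options.

```python
from itertools import groupby


def mean_length_of_turn(utterances, speaker):
    """Mean number of utterances per turn for one speaker.

    utterances: list of (speaker, text) pairs in transcript order.
    A turn is assumed to be a run of consecutive utterances by one speaker.
    """
    turn_lengths = [
        sum(1 for _ in run)
        for who, run in groupby(utterances, key=lambda u: u[0])
        if who == speaker
    ]
    if not turn_lengths:
        return 0.0
    return sum(turn_lengths) / len(turn_lengths)


transcript = [
    ("MOT", "where are you going ?"),
    ("CHI", "home ."),
    ("CHI", "ma(má) ."),
    ("MOT", "okay ."),
    ("CHI", "&=babbles ."),
]
print(mean_length_of_turn(transcript, "CHI"))
```

Here the child has two turns, of two utterances and one utterance, so the mean is 1.5. If the babble-only line were excluded before counting, the child would have a single turn of two utterances, which is the kind of difference the turn definition controls.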

You are right that existing corpora vary a lot in terms of their adherence to utterance segmentation principles.  For English, the biggest problems are with the Kuczaj, Hall, and Belfast corpora, which have a lot of run-on sentences.  Other corpora, such as Bloom, Gleason, MacWhinney, New England, etc., are typically pretty well done.  I haven’t taken a close look at Spanish in this regard, but I can do that when I have time.

Best regards,

— Brian MacWhinney

> On Sep 22, 2015, at 5:27 AM, A Cristia <alecristia at gmail.com> wrote:
> 
> Dear all,
> 
> With Celia Rosemberg's group, we are working on a database that we hope to contribute to CHILDES. We have been poring over the CHAT and CLAN manuals, the SLP's guide to CLAN, and old chibolts discussions, all hugely helpful, but even so there are some important decisions we are not sure how to make optimally (i.e., jointly optimize minimal work and maximal quality). We would really appreciate a discussion with all of you, who are more experienced than us. 
> 
> 1) On the definition of an utterance in general. The "SLP's guide to CLAN" explains very clearly the "2 out of 3" rule:
> "Each main line should contain only one C-unit  [...] Although defining an utterance (C-unit) may seem easy, it’s a very real area of disagreement in transcription  [...] Because [Stockman (2010)] notes increasing reliability with multiple cues, we have frequently used a “2 out of 3” criterion to define utterances in transcribing. The 3 features are:
> 1. Silence or pause of more than 2 seconds 
> 2. Terminal intonation contour 
> 3. Syntax that makes a complete sentence, or word(s) that make a complete, appropriate contribution in conversation, as in, MOT: where are you going? CHI: home. (one word, but an utterance) 
> Our rule is: If you hear two or three of these, you have an utterance. If you hear only one, keep what follows as the same line, utterance, c-unit."
> But is it the case that CHILDES contributors follow this guideline? If not, doesn't this make it difficult to compare across corpora within CHILDES? 
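
The "2 out of 3" criterion quoted above can be sketched in Python as follows. The function and parameter names are hypothetical, and the three cues are assumed to be judged by the transcriber and passed in; the code only combines them and does not detect pauses, intonation, or syntax itself.

```python
def is_utterance_boundary(pause_seconds, terminal_contour,
                          complete_contribution, pause_threshold=2.0):
    """Apply the "2 out of 3" rule to one candidate boundary.

    pause_seconds: length of the silence after the candidate boundary.
    terminal_contour: True if a terminal intonation contour was heard.
    complete_contribution: True if the syntax forms a complete sentence or
        the word(s) make a complete, appropriate conversational contribution.
    """
    cues = [
        pause_seconds > pause_threshold,  # cue 1: pause of more than 2 seconds
        terminal_contour,                 # cue 2: terminal intonation contour
        complete_contribution,            # cue 3: complete syntax or contribution
    ]
    # Two or three cues: end the utterance here. One or none: keep going.
    return sum(cues) >= 2


print(is_utterance_boundary(2.5, True, False))
print(is_utterance_boundary(0.3, False, True))
```

The first call has a long pause plus a terminal contour, so it marks a boundary. The second has only one cue, so what follows would stay on the same line.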

> 
> 2) On the definition and transcription of young children's utterances in particular. The children we are looking at are between 10 and 30 months of age, so many of them produce sounds that are not easy to recognize (and perhaps not intended) as real words. For our own purposes, we'd like to know when they produce real words and real attempts to intervene in a conversation, but not the phonetic/phonological details of their production. So if we don't hear anything that could be interpreted as potential words in context, we might just code 
> CHI: &=babbles .
> or
> CHI: 0 [=! babbles] .
> 
> But there are lots of gray areas, most saliently "ma" syllables (did he intend to say mother, "mamá"?), leading to transcriptions like:
> CHI: &mmm  ma(má) [=! babbles] .
> CHI: &mmm  ma(má) &mmm &mmm  ma(má) [=! babbles] .
> 
> The first two will be totally ignored in terms of number of words, whereas the latter two will give the child credit for speaking 1 and 2 real words respectively. Since this is all we care about, we could just not code any of the & items, correct? But would this be undesirable for integration with other CHILDES corpora?
> 
> Additionally, should all of these 4 examples count as "turns"? Should they count as "utterances"? Is the last one really a 2-word utterance? (We are assuming that nothing with & should be counted, as explained in a previous discussion https://groups.google.com/forum/#!searchin/chibolts/mlu$20mlt$20/chibolts/KaPRl6SUgTs/d2yTa1qg24gJ)
> 
> 3) Cross-transcriber reliability. We have many different transcribers, and so far each transcript is prepared by only one person, so we do not have any data to draw reliability from. Although they are all looked over by an experienced transcriber, it is clear that different people make slightly different decisions as to when to break utterances, how to code within-sentence repetition, when to code items as being babbling or real words, and even how and when to mark pronunciation variants in adults' speech. There is also a lot of variability across transcripts in the amount of "xxx", and right now we cannot know whether that is a property of the family (a TV being on makes it hard for anyone to transcribe) or the transcriber. So we definitely need some reliability, but how shall we do it, how much should be recoded, and how good should it be? I know there is a sizable literature on reliability, but what I'd like to know is, for the corpus to be a valuable contribution to CHILDES, what should we aim for? In realistic terms, what do CHILDES users and contributors expect/do?
> 
> 4) Prime examples in CHILDES. Coder reliability will allow us to make our data internally consistent, but still doesn't assure us that we're doing it at the same level of quality and/or with the same parameters as others in CHILDES. Is there a corpus that seems to embody the ideal of CHAT, which we could use as benchmark? We are particularly interested in Spanish, because that is accessible to all our coders, but if there is an English corpus that people believe is a great example to train oneself with, we would welcome the recommendation. (To be frank, I did look at the Spanish ones, but it wasn't clear to me that they were following e.g. the 2-out-of-3 rule, and other CHAT guidelines concerning within-sentence repetition, etc... But perhaps I'm not looking at the right things, or in the right way.)
> 
> 
> If any of the above are topics you would also like to discuss broadly, reply to the list, but if not, please don't hesitate to reply to us privately, and we'll post an anonymized general response next week. 
> 
> Thank you all in advance for your candid responses,
> 
> Alex Cristia
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups "chibolts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+unsubscribe at googlegroups.com.
> To post to this group, send email to chibolts at googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/a73ada5f-b8f8-4995-a467-566503e10d78%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
