<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Dear Alex,<div class=""><br class=""><div class=""> The Stockman criteria, which derive from earlier analyses by Duncan and others, are reasonable. However, it would be best to use the CLAN manual as your guide to utterance segmentation. I have just now gone over sections 7.1 to 7.6 of the manual to make sure they cover the relevant issues. Why don’t you grab a new copy to check this out? </div><div class=""><br class=""></div><div class="">Also, Nan has agreed to revise these sections of the SLP manual to include a summary of the material from that section of the CLAN manual. </div><div class=""><br class=""></div><div class="">Regarding the transcription of babbling forms, it is nice to transcribe as much as you can with forms like &mmm. As long as you are just doing a rough transcription, this at least provides a clearer skeleton of the actual production. Marking everything as just &=babbles might be faster, but it is not as revealing. Of course, if your transcripts are linked to audio, you could always go back later and transcribe these forms in detail.</div><div class=""><br class=""></div><div class="">The issue of transcribing xxx is similar. If one can discern the phonology, it is better to transcribe it as &bala or whatever, but really this is not going to be crucial for most analyses.</div><div class=""><br class=""></div><div class="">The decision about what to count as a turn matters mostly for the MLT program, which allows the user to choose various ways of defining turns.</div><div class=""><br class=""></div><div class="">You are right that existing corpora vary a lot in terms of their adherence to utterance segmentation principles. For English, the biggest problems are with the Kuczaj, Hall, and Belfast corpora, which have a lot of run-on sentences. 
Other corpora, such as Bloom, Gleason, MacWhinney, New England, etc., are typically pretty well done. I haven’t taken a close look at Spanish in this regard, but I can do that when I have time. </div><div class=""><br class=""></div><div class="">Best regards,</div><div class=""><br class=""></div><div class="">— Brian MacWhinney</div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Sep 22, 2015, at 5:27 AM, A Cristia &lt;<a href="mailto:alecristia@gmail.com" class="">alecristia@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Dear all,<div class=""><br class=""></div><div class="">With Celia Rosemberg's group, we are working on a database that we hope to contribute to CHILDES. We have been poring over the CHAT and CLAN manuals, the SLP's guide to CLAN, and old chibolts discussions, all hugely helpful, but even so there are some important decisions we are not sure how to make optimally (i.e., jointly optimizing for minimal work and maximal quality). We would really appreciate a discussion with all of you, who are more experienced than we are. </div><div class=""><br class=""></div><div class="">1) On the definition of an utterance in general. The "SLP's guide to CLAN" explains very clearly the "2 out of 3" rule:</div><div class="">"Each main line should contain only one C-unit [...] Although defining an utterance (C-unit) may seem easy, it’s a very real
area of disagreement in transcription [...] Because [Stockman (2010)] notes increasing reliability with multiple cues, we have
frequently used a “2 out of 3” criterion to define utterances in transcribing. The 3 features are:</div><div class="">1. Silence or pause of more than 2 seconds </div><div class="">2. Terminal intonation contour </div><div class="">3. Syntax that makes a complete sentence, or word(s) that make a complete, appropriate
contribution in conversation, as in, MOT: where are you going? CHI: home. (one word,
but an utterance) </div><div class="">Our rule is: If you hear two or three of these, you have an utterance. If you hear only one, keep
what follows as the same line, utterance, c-unit."<br class=""></div><div class="">But is it the case that CHILDES contributors follow this guideline? If not, doesn't this make it difficult to compare across corpora within CHILDES? </div></div></div></blockquote></div><div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">2) On the definition and transcription of young children's utterances in particular. The children we are looking at are between 10 and 30 months of age, so many of them produce sounds that are not easy to recognize (and perhaps not intended) as real words. For our own purposes, we'd like to know when they produce real words and real attempts to intervene in a conversation, but not the phonetic/phonological details of their production. So if we don't hear anything that could be interpreted as potential words in context, we might just code </div><div class="">CHI: &=babbles .</div><div class="">or</div><div class="">CHI: 0 [=! babbles] .</div><div class=""><br class=""></div><div class="">But there are lots of gray areas, most saliently "ma" syllables (did he intend to say mother, "mamá"?), leading to transcriptions like:</div><div class="">CHI: &mmm ma(má) [=! babbles] .</div><div class="">CHI: &mmm ma(má) &mmm &mmm ma(má) [=! babbles] .</div><div class=""><br class=""></div><div class="">The first two will be totally ignored in terms of number of words, whereas the latter two will give the child credit for speaking 1 and 2 real words, respectively. Since this is all we care about, we could just not code any of the & items, correct? But would this be undesirable for integration with other CHILDES corpora?</div><div class=""><br class=""></div><div class="">Additionally, should all of these 4 examples count as "turns"? Should they count as "utterances"? Is the last one really a 2-word utterance? 
(We are assuming that nothing with & should be counted, as explained in a previous discussion <span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap; background-color: transparent;" class=""><a href="https://groups.google.com/forum/#!searchin/chibolts/mlu$20mlt$20/chibolts/KaPRl6SUgTs/d2yTa1qg24gJ" class="">https://groups.google.com/forum/#!searchin/chibolts/mlu$20mlt$20/chibolts/KaPRl6SUgTs/d2yTa1qg24gJ</a>)</span></div><div class=""><br class=""></div><div class="">3) Cross-transcriber reliability. We have many different transcribers, and so far each transcript is prepared by only one person, so we do not have any data from which to estimate reliability. Although the transcripts are all looked over by an experienced transcriber, it is clear that different people make slightly different decisions as to when to break utterances, how to code within-sentence repetition, when to code items as being babbling or real words, and even how and when to mark pronunciation variants in adults' speech. There is also a lot of variability across transcripts in the amount of "xxx", and right now we cannot know whether that is a property of the family (e.g., a TV that is on makes it hard for anyone to transcribe) or the transcriber. So we definitely need some measure of reliability, but how shall we do it, how much should be recoded, and how good should it be? I know there is a sizable literature on reliability, but what I'd like to know is, for the corpus to be a valuable contribution to CHILDES, what should we aim for? In realistic terms, what do CHILDES users and contributors expect/do?<br class=""></div><div class=""><br class=""></div><div class=""><div class="">4) Prime examples in CHILDES. Coder reliability will allow us to make our data internally consistent, but still doesn't assure us that we're doing it at the same level of quality and/or with the same parameters as others in CHILDES. Is there a corpus that seems to embody the ideal of CHAT, which we could use as a benchmark? 
We are particularly interested in Spanish, because that is accessible to all our coders, but if there is an English corpus that people believe is a great example to train oneself with, we would welcome the recommendation. (To be frank, I did look at the Spanish ones, but it wasn't clear to me that they were following e.g. the 2-out-of-3 rule, and other CHAT guidelines concerning within-sentence repetition, etc... But perhaps I'm not looking at the right things, or in the right way.)</div><div class=""><br class=""></div></div><div class=""><br class=""></div><div class="">If any of the above are topics you would also like to discuss broadly, reply to the list, but if not, please don't hesitate to reply to us privately, and we'll post an anonymized general response next week. </div><div class=""><br class=""></div><div class="">Thank you all in advance for your candid responses,</div><div class=""><br class=""></div><div class="">Alex Cristia</div><div class=""><br class=""></div></div><div class=""><br class="webkit-block-placeholder"></div>
-- <br class="">
You received this message because you are subscribed to the Google Groups "chibolts" group.<br class="">
To unsubscribe from this group and stop receiving emails from it, send an email to <a href="mailto:chibolts+unsubscribe@googlegroups.com" class="">chibolts+unsubscribe@googlegroups.com</a>.<br class="">
To post to this group, send email to <a href="mailto:chibolts@googlegroups.com" class="">chibolts@googlegroups.com</a>.<br class="">
To view this discussion on the web visit <a href="https://groups.google.com/d/msgid/chibolts/a73ada5f-b8f8-4995-a467-566503e10d78%40googlegroups.com?utm_medium=email&utm_source=footer" class="">https://groups.google.com/d/msgid/chibolts/a73ada5f-b8f8-4995-a467-566503e10d78%40googlegroups.com</a>.<br class="">
For more options, visit <a href="https://groups.google.com/d/optout" class="">https://groups.google.com/d/optout</a>.<br class="">
</div></blockquote></div><br class=""></div></div></body></html>