Criteria for CHILDES databases that are in the making

Tue Sep 22 09:27:52 UTC 2015

Dear all,

With Celia Rosemberg's group, we are working on a database that we hope to 
contribute to CHILDES. We have been poring over the CHAT and CLAN manuals, 
the SLP's guide to CLAN, and old chibolts discussions, all hugely helpful, 
but even so there are some important decisions we are not sure how to make 
optimally (i.e., to jointly minimize work and maximize quality). We 
would really appreciate a discussion with all of you, who are more 
experienced than us. 

1) On the definition of an utterance in general. The "SLP's guide to CLAN" 
explains very clearly the "2 out of 3" rule:
"Each main line should contain only one C-unit  [...] Although defining an 
utterance (C-unit) may seem easy, it’s a very real area of disagreement in 
transcription  [...] Because [Stockman (2010)] notes increasing reliability 
with multiple cues, we have frequently used a “2 out of 3” criterion 
to define utterances in transcribing. The 3 features are:
1. Silence or pause of more than 2 seconds 
2. Terminal intonation contour 
3. Syntax that makes a complete sentence, or word(s) that make a complete, 
appropriate contribution in conversation, as in, MOT: where are you going? 
CHI: home. (one word, but an utterance) 
Our rule is: If you hear two or three of these, you have an utterance. If 
you hear only one, keep what follows as the same line, utterance, c-unit."
But is it the case that CHILDES contributors follow this guideline? If not, 
doesn't this make it difficult to compare across corpora within CHILDES? 
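
For concreteness, the quoted rule could be sketched as a small decision function. This is only an illustration in Python, not part of CLAN or the SLP's guide: the function name, the argument names, and the idea that a transcriber records each cue as a number or a yes/no judgment are all assumptions made for the example.

```python
def is_utterance_boundary(pause_seconds, terminal_contour, complete_contribution,
                          pause_threshold=2.0):
    """Apply the "2 out of 3" criterion at one candidate boundary.

    pause_seconds         -- length of the silence after the candidate boundary
    terminal_contour      -- True if the transcriber heard a terminal contour
    complete_contribution -- True if the syntax or word(s) form a complete,
                             appropriate contribution (e.g. "home." as an answer)
    All three argument names are hypothetical, chosen for this sketch.
    """
    cues = [
        pause_seconds > pause_threshold,   # cue 1: pause of more than 2 seconds
        bool(terminal_contour),            # cue 2: terminal intonation contour
        bool(complete_contribution),       # cue 3: complete sentence/contribution
    ]
    # Two or three cues: break the line here. Only one (or none): keep going.
    return sum(cues) >= 2
```

With this sketch, a 2.5-second pause plus a terminal contour would end the utterance, while a complete phrase followed by a short pause and no terminal contour would stay on the same line.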

2) On the definition and transcription of young children's utterances in 
particular. The children we are looking at are between 10 and 30 months of 
age, so many of them produce sounds that are not easy to recognize (and 
perhaps not intended) as real words. For our own purposes, we'd like to 
know when they produce real words and real attempts to intervene in a 
conversation, but not the phonetic/phonological details of their 
production. So if we don't hear anything that could be interpreted as 
potential words in context, we might just code 
CHI: &=babbles .
or
CHI: 0 [=! babbles] .

But there are lots of gray areas, most saliently "ma" syllables (did he 
intend to say mother, "mamá"?), leading to transcriptions like:
CHI: &mmm  ma(má) [=! babbles] .
CHI: &mmm  ma(má) &mmm &mmm  ma(má) [=! babbles] .

The first two will be ignored entirely when counting words, whereas the 
latter two give the child credit for speaking 1 and 2 real words 
respectively. Since this is all we care about, we could just not code any 
of the & items, correct? But would this be undesirable for integration 
with other CHILDES corpora?
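
To make the counting assumption explicit, here is a rough Python sketch of how the four example lines would be tallied if nothing beginning with & is counted. It is a deliberately simplified stand-in, not CLAN's own counting logic: the function name is hypothetical, and skipping "0", "xxx", "yyy" and "www" is an assumption of the sketch.

```python
import re

def count_words(main_line):
    """Count creditable words on one CHAT-style main line (simplified)."""
    # Drop the speaker prefix, e.g. "CHI:" or "*CHI:".
    text = re.sub(r"^\*?[A-Z0-9]+:\s*", "", main_line)
    # Drop bracketed codes such as [=! babbles].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Assumption of this sketch: these tokens never count as words.
    skipped = {".", "?", "!", "0", "xxx", "yyy", "www"}
    words = []
    for token in text.split():
        if token.startswith("&"):   # &mmm, &=babbles, and other & items
            continue
        if token in skipped:
            continue
        words.append(token)
    return len(words)

examples = [
    "CHI: &=babbles .",
    "CHI: 0 [=! babbles] .",
    "CHI: &mmm  ma(má) [=! babbles] .",
    "CHI: &mmm  ma(má) &mmm &mmm  ma(má) [=! babbles] .",
]
counts = [count_words(line) for line in examples]   # [0, 0, 1, 2]
```

Under these assumptions, leaving the & items out of the transcript would not change any of the four counts.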

Additionally, should all four of these examples count as "turns"? Should they 
count as "utterances"? Is the last one really a 2-word utterance? (We are 
assuming that nothing with & should be counted, as explained in a previous 
discussion 
https://groups.google.com/forum/#!searchin/chibolts/mlu$20mlt$20/chibolts/KaPRl6SUgTs/d2yTa1qg24gJ)

3) Cross-transcriber reliability. We have many different transcribers, and 
so far each transcript is prepared by only one person, so we do not have 
any data to draw reliability from. Although they are all looked over by an 
experienced transcriber, it is clear that different people make slightly 
different decisions as to when to break utterances, how to code 
within-sentence repetition, when to code items as being babbling or real 
words, and even how and when to mark pronunciation variants in adults' 
speech. There is also a lot of variability across transcripts in the amount 
of "xxx", and right now we cannot know whether that is a property of the 
family (a TV on makes it hard for anyone to transcribe) or the transcriber. 
So we definitely need some reliability data, but how should we collect it, 
how much should be recoded, and how good should it be? I know there is a sizable 
literature on reliability, but what I'd like to know is, for the corpus to 
be a valuable contribution to CHILDES, what should we aim for? In realistic 
terms, what do CHILDES users and contributors expect/do?
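
As one illustration of what such a check might look like, the sketch below computes Cohen's kappa for two transcribers who have each labeled the same child vocalizations as "word" or "babble". Kappa is just one common agreement statistic, not something CHILDES is known to require; the function name and the labels are hypothetical, and the sketch assumes both transcribers judged the same pre-segmented items, which sidesteps disagreement about where utterances break.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical judgments for four vocalizations.
coder_1 = ["word", "babble", "word", "babble"]
coder_2 = ["word", "babble", "babble", "babble"]
kappa = cohens_kappa(coder_1, coder_2)   # observed 0.75, expected 0.5
```

Here the two coders agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5.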

4) Prime examples in CHILDES. Coder reliability will allow us to make our 
data internally consistent, but it still doesn't assure us that we're doing it 
at the same level of quality and/or with the same parameters as others in 
CHILDES. Is there a corpus that seems to embody the ideal of CHAT, which we 
could use as a benchmark? We are particularly interested in Spanish, because 
that is accessible to all our coders, but if there is an English corpus 
that people believe is a great example to train oneself with, we would 
welcome the recommendation. (To be frank, I did look at the Spanish ones, 
but it wasn't clear to me that they were following e.g. the 2-out-of-3 
rule, and other CHAT guidelines concerning within-sentence repetition, 
etc. But perhaps I'm not looking at the right things, or in the right 
way.)

If any of the above are topics you would also like to discuss broadly, 
reply to the list, but if not, please don't hesitate to reply to us 
privately, and we'll post an anonymized general response next week. 

Thank you all in advance for your candid responses,

Alex Cristia
