Criteria for CHILDES databases that are in the making
A Cristia
alecristia at gmail.com
Tue Sep 22 09:27:52 UTC 2015
Dear all,
With Celia Rosemberg's group, we are working on a database that we hope to
contribute to CHILDES. We have been poring over the CHAT and CLAN manuals,
the SLP's guide to CLAN, and old chibolts discussions, all hugely helpful,
but even so there are some important decisions we are not sure how to make
optimally (i.e., so as to minimize work while maximizing quality). We
would really appreciate a discussion with all of you, who are more
experienced than we are.
1) On the definition of an utterance in general. The "SLP's guide to CLAN"
explains very clearly the "2 out of 3" rule:
"Each main line should contain only one C-unit [...] Although defining an
utterance (C-unit) may seem easy, it’s a very real area of disagreement in
transcription [...] Because [Stockman (2010)] notes increasing reliability
with multiple cues, we have frequently used a “2 out of 3” criterion
to define utterances in transcribing. The 3 features are:
1. Silence or pause of more than 2 seconds
2. Terminal intonation contour
3. Syntax that makes a complete sentence, or word(s) that make a complete,
appropriate contribution in conversation, as in, MOT: where are you going?
CHI: home. (one word, but an utterance)
Our rule is: If you hear two or three of these, you have an utterance. If
you hear only one, keep what follows as the same line, utterance, c-unit."
But is it the case that CHILDES contributors follow this guideline? If not,
doesn't this make it difficult to compare across corpora within CHILDES?
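To make the question concrete, here is a minimal Python sketch of the "2 out of 3" rule as quoted above. The names (BoundaryCues, is_utterance_boundary) and the idea of recording the three cues as one numeric and two yes/no judgments are assumptions made for illustration only; they are not part of CHAT, CLAN, or the SLP's guide.

```python
from dataclasses import dataclass


@dataclass
class BoundaryCues:
    """Hypothetical record of the three cues at one candidate boundary."""
    pause_seconds: float         # measured silence after the candidate boundary
    terminal_contour: bool       # transcriber judged the intonation as terminal
    complete_contribution: bool  # complete sentence, or complete and appropriate
                                 # contribution in context (e.g. "home.")


def is_utterance_boundary(cues, pause_threshold=2.0, min_cues=2):
    """Return True if at least `min_cues` of the three cues are present.

    The 2-second threshold and the "more than" comparison follow the
    quoted rule; both are exposed as parameters so a lab can vary them.
    """
    hits = [
        cues.pause_seconds > pause_threshold,
        cues.terminal_contour,
        cues.complete_contribution,
    ]
    return sum(hits) >= min_cues


# Long pause plus terminal contour: two cues, so start a new line.
print(is_utterance_boundary(BoundaryCues(2.5, True, False)))
# Only the syntax cue: one cue, so keep what follows on the same line.
print(is_utterance_boundary(BoundaryCues(0.3, False, True)))
```

Writing the rule down this way mainly shows where transcribers can still diverge: the pause is measurable, but the other two cues remain judgment calls.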
2) On the definition and transcription of young children's utterances in
particular. The children we are looking at are between 10 and 30 months of
age, so many of them produce sounds that are not easy to recognize (and
perhaps not intended) as real words. For our own purposes, we'd like to
know when they produce real words and real attempts to intervene in a
conversation, but not the phonetic/phonological details of their
production. So if we don't hear anything that could be interpreted as
potential words in context, we might just code
CHI: &=babbles .
or
CHI: 0 [=! babbles] .
But there are lots of gray areas, most saliently "ma" syllables (did he
intend to say mother, "mamá"?), leading to transcriptions like:
CHI: &mmm ma(má) [=! babbles] .
CHI: &mmm ma(má) &mmm &mmm ma(má) [=! babbles] .
The first two will be totally ignored in terms of number of words; whereas
the latter two will count as giving the child credit for speaking 1 and 2
real words respectively. Since word counts are all we care about, we could
simply not code any of the & items, correct? But would this be undesirable
for integration with other CHILDES corpora?
Additionally, should all of these 4 examples count as "turns"? Should they
count as "utterances"? Is the last one really a 2-word utterance? (We are
assuming that nothing with & should be counted, as explained in a previous
discussion
https://groups.google.com/forum/#!searchin/chibolts/mlu$20mlt$20/chibolts/KaPRl6SUgTs/d2yTa1qg24gJ)
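The counting assumption behind this question can be sketched in a few lines of Python. This is a simplified illustration, not CLAN's actual MLU or FREQ logic: the function name count_words is hypothetical, and it only handles the conventions used in the four examples above (items starting with &, the 0 placeholder, bracketed codes, and utterance terminators). Other CHAT conventions are not handled.

```python
import re

TERMINATORS = {".", "?", "!"}


def count_words(main_line):
    """Count creditable words on one main line (simplified assumption).

    Skips: the speaker prefix, bracketed codes such as [=! babbles],
    anything starting with "&", the "0" placeholder, and terminators.
    A form like ma(má) is kept as a single word.
    """
    # Drop the speaker prefix ("CHI:") if there is one.
    head, sep, rest = main_line.partition(":")
    content = rest if sep else head
    # Remove bracketed codes before splitting into tokens.
    content = re.sub(r"\[[^\]]*\]", " ", content)
    words = [
        tok for tok in content.split()
        if tok not in TERMINATORS and not tok.startswith("&") and tok != "0"
    ]
    return len(words)


examples = [
    "CHI: &=babbles .",
    "CHI: 0 [=! babbles] .",
    "CHI: &mmm ma(má) [=! babbles] .",
    "CHI: &mmm ma(má) &mmm &mmm ma(má) [=! babbles] .",
]
for line in examples:
    print(count_words(line), line)
```

Under these assumptions the four examples give 0, 0, 1 and 2 words, which is the behavior described above. Whether a line with 0 words should still count as a turn or an utterance is exactly the open question.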
3) Cross-transcriber reliability. We have many different transcribers, and
so far each transcript is prepared by only one person, so we do not have
any data to draw reliability from. Although they are all looked over by an
experienced transcriber, it is clear that different people make slightly
different decisions as to when to break utterances, how to code
within-sentence repetition, when to code items as being babbling or real
words, and even how and when to mark pronunciation variants in adults'
speech. There is also a lot of variability across transcripts in the amount
of "xxx", and right now we cannot know whether that is a property of the
family (e.g., a TV being on makes it hard for anyone to transcribe) or of
the transcriber. So we definitely need some reliability checks, but how
should we do them, how much should be recoded, and how good should the
agreement be? I know there is a sizable
literature on reliability, but what I'd like to know is, for the corpus to
be a valuable contribution to CHILDES, what should we aim for? In realistic
terms, what do CHILDES users and contributors expect/do?
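As one possible starting point, if two transcribers independently code the same stretch of recording, agreement on a categorical decision (for instance, labeling each child vocalization as "word" or "babble") can be summarized with Cohen's kappa. The sketch below is a plain-Python illustration; the function name and the labels are assumptions, and it presupposes that the two transcribers' items have already been aligned one to one, which is itself hard when utterance breaks differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same aligned items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four vocalizations coded by two transcribers.
coder_1 = ["word", "babble", "word", "babble"]
coder_2 = ["word", "babble", "babble", "babble"]
print(cohens_kappa(coder_1, coder_2))
```

This does not answer how much of the corpus should be double-coded or what value is good enough; those are the points on which we would like to hear what CHILDES contributors actually do.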
4) Prime examples in CHILDES. Coder reliability will allow us to make our
data internally consistent, but it still doesn't assure us that we're
working at the same level of quality and/or with the same parameters as
others in CHILDES. Is there a corpus that seems to embody the ideal of CHAT,
which we could use as a benchmark? We are particularly interested in
Spanish, because
that is accessible to all our coders, but if there is an English corpus
that people believe is a great example to train oneself with, we would
welcome the recommendation. (To be frank, I did look at the Spanish ones,
but it wasn't clear to me that they were following, e.g., the 2-out-of-3
rule and other CHAT guidelines concerning within-sentence repetition,
etc. But perhaps I'm not looking at the right things, or in the right
way.)
If any of the above are topics you would also like to discuss broadly,
reply to the list, but if not, please don't hesitate to reply to us
privately, and we'll post an anonymized general response next week.
Thank you all in advance for your candid responses,
Alex Cristia