adult-adult conversation, Santa Barbara corpus
Virginia Valian
vvvstudents at gmail.com
Fri Oct 14 16:51:12 UTC 2011
Dear Colleagues,
I sent a query about sources of adult-adult conversations earlier this year.
My thanks to those of you who responded. Here is a follow-up about what we
did. We settled on the Santa Barbara Corpus of Spoken American English
(SBCSAE), but we are also looking into the Buckeye corpus.
Information about the SBCSAE can be found here:
http://www.linguistics.ucsb.edu/research/sbcorpus.html
And here:
Du Bois, John W., Chafe, Wallace L., Meyer, Charles, and Thompson, Sandra A.
2000. Santa Barbara corpus of spoken American English, Part 1. Philadelphia:
Linguistic Data Consortium. ISBN 1-58563-164-7.
Du Bois, John W., Chafe, Wallace L., Meyer, Charles, Thompson, Sandra A., and
Martey, Nii. 2003. Santa Barbara corpus of spoken American English, Part 2.
Philadelphia: Linguistic Data Consortium. ISBN 1-58563-272-4.
Du Bois, John W., and Englebretson, Robert. 2004. Santa Barbara corpus of
spoken American English, Part 3. Philadelphia: Linguistic Data Consortium.
ISBN 1-58563-308-9.
Du Bois, John W., and Englebretson, Robert. 2005. Santa Barbara corpus of
spoken American English, Part 4. Philadelphia: Linguistic Data Consortium.
ISBN 1-58563-348-8.
There were various glitches in the Santa Barbara files that prevented us
from using them as they were. We had to clean them.
The 60 cleaned .cha and XML-tagged Santa Barbara files that we used are here,
if people want to access them:
http://www.hunter.cuny.edu/littlelinguist/data/SBCSAE/
Paul Feitzinger, the excellent computer scientist in the Language
Acquisition Research Center who cleaned the files, has this to say about how
he proceeded:
- We wanted to quickly tag the SBCSAE and convert it to XML using
Chatter, so that we could run custom analysis scripts on it.
- We removed all occurrences of "ʔ", trailing and compound-joining
"-", and trailing " ' " before tagging.
- After running MOR and POST, we converted all instances of word|?
into word|unk. An appearance of "?" would cause the file to fail CHECK and
break Chatter.
- After some hand disambiguation, the files passed CHECK and could
run through Chatter.
- There was an issue in a couple of spots (e.g., 40.cha: lines 673,
1124) where a "." on the main tier would be represented on the MOR tier
with "none", which CHECK and Chatter rejected.
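For readers who want to try something similar, the character removal and
the word|? replacement described above might be sketched in Python roughly
as follows. This is only an illustration, not Paul's actual script. The
function names clean_word and fix_mor_line are hypothetical, and two
details are assumptions: that a compound-joining "-" is simply deleted so
that the two parts are joined, and that the word|? fix applies only to
lines beginning with "%mor:".

```python
import re

GLOTTAL = "\u0294"  # the glottal stop character removed before tagging


def clean_word(word, joiner=""):
    """Remove the glottal stop, trailing "-" and "'", and internal "-".

    Assumption: a compound-joining hyphen is replaced by `joiner`
    (empty by default, so "well-known" becomes "wellknown").
    """
    word = word.replace(GLOTTAL, "")
    word = word.rstrip("-'")  # trailing "-" and trailing "'"
    # A hyphen with a word character on both sides is treated as
    # compound-joining.
    return re.sub(r"(?<=\w)-(?=\w)", joiner, word)


def fix_mor_line(line):
    """Rewrite word|? as word|unk, on %mor tier lines only (assumed)."""
    if not line.startswith("%mor:"):
        return line
    return re.sub(r"(\S+)\|\?(?=\s|$)", r"\1|unk", line)
```

A real main tier contains other CHAT symbols that a word-level function
like this would need to leave alone, so the output should still be run
through CHECK.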
There are conceptual issues about which examples of adult-adult speech
should be compared with adult-child speech. We have not addressed that
directly. Our comparisons are ongoing, but in our *syntactic* analyses of
part-of-speech bigrams, we see little difference between adults talking to
adults and adults talking to children, per our poster at AMLaP in September
of this year:
Quirk, E., Feitzinger, P., Richter, C., Zeitlin, M., Chodorow, M., & Valian,
V. (2011, September). A computational analysis of grammar change and
grammar similarity. Poster presented at AMLaP, Paris, France.
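As a rough illustration of what counting part-of-speech bigrams over a
tagged file involves, here is a minimal Python sketch. It is not the
analysis used for the poster. The function names pos_tags and
bigram_counts are hypothetical, and the sketch assumes each token on a
%mor tier line has the form label|rest, with everything before the first
"|" taken as the part-of-speech label; prefixes, clitics, and compounds
are not handled specially.

```python
from collections import Counter


def pos_tags(mor_line):
    """Return the part-of-speech labels from one %mor tier line.

    Assumption: tokens look like "label|rest"; tokens without "|"
    (such as the final ".") are skipped.
    """
    _, _, body = mor_line.partition("\t")
    return [tok.split("|", 1)[0] for tok in body.split() if "|" in tok]


def bigram_counts(tag_sequences):
    """Count adjacent label pairs within each utterance."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    return counts
```

Counts from adult-adult and adult-child files could then be normalized
and compared with whatever similarity measure one prefers.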
Best wishes,
VVV
--
Virginia Valian
Distinguished Professor
Department of Psychology, Hunter College
PhD Programs in Linguistics, Psychology, and Speech-Language-Hearing
Sciences, CUNY Grad Center
vvvstudents at gmail.com