phonetically transcribed CDS

Margaret Fleck mfleck at cs.uiuc.edu
Tue Dec 5 22:53:00 UTC 2006


I've been working on a closely related question (automatic learning of
word boundaries) and this isn't 100% obvious even for adult non-CDS
speech in English (e.g. the Switchboard corpus).

One issue is that the problem words are more-or-less cliticized
onto their neighbors, with the details depending in subtle ways on
which specific word you are talking about.   The linguistics literature
seems to have attacked only selected aspects of this problem, in a
very incomplete way.

The term "word" is defined by several criteria, which don't pick out
exactly the same boundaries.   So, some authors (especially in the
computational linguistics literature for Chinese) define "word" in
terms of semantic units.   Other authors use it for the domain of
phonological processes or a domain within which you can't (fluently)
pause.

Finally, the conventional spelling of Western languages was established
some time ago and may not completely reflect the current situation.
E.g. the phonological status of compounds or small function words may
have changed.

Personally, I'd suggest trying to approximate the phonological word,
since that's more-or-less well-defined and comparable between English
and (the various dialects of) Chinese.   On that criterion, "didja know"
is probably two words ("didja" and "know"), because "you" is clearly
cliticized onto "did".   "Gimme" is one word for the same reason.   But
"feed me" is murky: I don't hear phonological changes, yet "me" is often
a clitic and the boundary before it is an unlikely pause location.

No matter what you do, there's going to be a lot of murky cases and
some phonologist will later discover you handled some of them wrong.   The only
thing you can do about that is not worry, be consistent, and write clear
documentation so later researchers can easily re-format your data.
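
As a rough illustration of "easy to re-format" (a sketch under stated
assumptions, not anyone's actual corpus format): if clitic boundaries are
marked with their own symbol, distinct from full word boundaries, then
either segmentation can be derived later and the phoneme sequence counts
recomputed.   The markers ("+" for a clitic boundary, space for a word
boundary, "." between phones), the ad hoc ASCII phone symbols, and the
function names are all hypothetical.

```python
from collections import Counter

# Hypothetical convention: space = full word boundary, "+" = clitic boundary,
# "." separates phones inside a unit. Phone symbols are ad hoc ASCII.
utterance = "d.I.d+j.@ n.oU"   # "didja know"

def segment(utt, split_clitics):
    """Return a list of words, each word being a list of phones."""
    if split_clitics:
        utt = utt.replace("+", " ")   # treat each clitic as its own word
    else:
        utt = utt.replace("+", ".")   # merge each clitic into its host
    return [word.split(".") for word in utt.split()]

def within_word_bigrams(words):
    """Count adjacent phone pairs that do not cross a word boundary."""
    counts = Counter()
    for phones in words:
        counts.update(zip(phones, phones[1:]))
    return counts

phonological = segment(utterance, split_clitics=False)  # clitics merged
orthographic = segment(utterance, split_clitics=True)   # clitics split

print(len(phonological), within_word_bigrams(phonological))
print(len(orthographic), within_word_bigrams(orthographic))
```

The two calls differ only in whether the ("d", "j") sequence counts as
word-internal, which is exactly the kind of difference a boundary
decision introduces into sequence frequencies.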

You might see if the folks behind the Buckeye Corpus (out of Ohio State) put
any useful wisdom into their publications/documentation.   They are first-rate
phonologists and this is a current project, so they might have tried to pin some
of this down for adult non-CDS English.

Margaret
    (Margaret Fleck, U. Illinois)

JAN R EDWARDS wrote:
> Hi everyone,
> 
> While we are on the subject of CDS, I have a question also.
> We are working on developing CDS lexicons for several languages
> (English, Greek, Cantonese, Japanese).  Because we are interested
> in phoneme frequency and phoneme sequence frequency, we need
> to phonetically transcribe and segment the mother's (or other
> caregiver's) speech.  This turns out to be somewhat complicated
> in the case of CDS, because we have to make decisions about
> where the word boundaries should be for infants.  For example,
> how many words are there in "didja know..."?  Is anyone else working
> on this or similar questions in English or other languages?
> 
> Yours,
> Jan


