[Corpora-List] International Phonetic Alphabet transcription tool / software
Mike Maxwell
maxwell at umiacs.umd.edu
Wed Mar 6 17:21:40 UTC 2013
On 3/6/2013 9:45 AM, Matías Guzmán wrote:
> Mario, I doubt there is anything that can do what you want. As I understand it, speech
> recognition systems depend to a great degree on a language model that predicts what word could
> come next, and then try to match what the speaker said to a database. I don't see how a program
> could transcribe for you whether the /t/ is dental or alveolar; I don't even think speech recognition
> software does a decent job working on single syllables. But I'm not an expert, maybe someone can
> correct me.
I share your skepticism about the usefulness of speech recognition for phonetic transcription,
and, like you, I'm not an expert in this area (far from it).
That said, when I hear speech recognition experts discuss the difficulties in their field, it
appears that many of the problems they face come from issues that should be less relevant to
phonetic transcription. First, the word boundary problem: in order to do ASR of a language,
converting the speech to the standard writing system of the language, you have to figure out where
the word boundaries are. This is one of the things the language model gives you; it also uses the
boundaries to compare the likelihood of alternative word sequences. (I believe this is true even for
languages where word boundaries are not marked in the orthography, because the language model still
has to convert the speech stream to a sequence of likely words in order to choose among alternative
sequences.)
In phonetic transcription, on the other hand, you needn't worry about word boundaries: for
the most part (except at utterance boundaries and perhaps hesitation points), they simply don't
exist in the speech stream--so you *can't* do anything about them anyway.
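To make the word boundary problem concrete, here is a minimal sketch of how a language model can pick
word boundaries in an unsegmented stream. It assumes a toy unigram model (a made-up dictionary of word
log-probabilities) and a character string standing in for the speech stream; real ASR systems use much
richer models, and every name below is hypothetical.

```python
import math
from typing import Dict, List, Optional


def best_segmentation(stream: str, word_logprob: Dict[str, float]) -> Optional[List[str]]:
    """Return the most probable split of `stream` into known words, or None."""
    n = len(stream)
    # best[i] = (score of the best segmentation of stream[:i], start of its last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = stream[start:end]
            if word in word_logprob and best[start][0] > -math.inf:
                score = best[start][0] + word_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no sequence of known words covers the whole stream
    # Follow the back-pointers from the end of the stream to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(stream[start:pos])
        pos = start
    return list(reversed(words))


# Toy probabilities, invented for illustration only.
toy_model = {
    "a": math.log(0.30),
    "an": math.log(0.20),
    "nice": math.log(0.20),
    "ice": math.log(0.15),
    "man": math.log(0.15),
}
segmentation = best_segmentation("aniceman", toy_model)
```

With these toy numbers, "a nice man" outscores "an ice man", so the boundaries come entirely from the
model rather than from anything in the stream itself.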
Another problem that practical ASR systems face, which is probably irrelevant for a phonetic
transcription system, is the phonetic reduction that happens in ordinary speech. This is mostly not
what a phonologist would think of as allophonic variation; rather, it's the wholesale omission of
vowels or consonants, or the merger of sequences into a single element, with the result that many
words have multiple pronunciations. The English word 'probably', for example, is often pronounced
something like [prabbly] (geminate [b]) or [prably]; in Spanish, 'necesario' ("necessary") is often
pronounced [nessario] or [nesario].
But in phonetic transcription, you presumably *want* to transcribe these variants as they are
pronounced, rather than as their normalized orthographic forms.
Another issue for practical ASR, which might not be a problem for phonetic transcription, is that
ASR is often intended to work with noisy signals: telephones, cell phones, background noise, bad
microphones, etc. By contrast, if you have a decent place to do your recording for phonetics (no
roosters in the background...), and you invest in a good microphone and recorder, you have far less
noise to deal with.
To the extent that these issues do not arise in phonetics, phonetic transcription might actually be
easier than ASR. Of course, there are other problems that arise in phonetic transcription which
don't arise in ASR. For one, the range of possible phones across languages is orders of magnitude
greater than the number of phonemes in any one language. The obvious (partial) solution to that is to
determine the range of actual phones in the target language, either manually or through machine
learning. There will likely still be more phones than there would be phonemes, but at least the
problem space will have been much reduced.
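As a rough sketch of that reduction: suppose a hypothetical cross-language phone classifier gives a
score for every candidate phone in each frame of audio. Limiting the choice to the phones known to
occur in the target language is then a simple filter. The scores, the inventory, and the function name
below are all invented for illustration.

```python
from typing import Dict, Iterable, List, Optional

DENTAL_T = "t\u032a"  # "t" plus the IPA dental diacritic (combining bridge below)


def restrict_to_inventory(
    frame_scores: List[Dict[str, float]],
    inventory: Iterable[str],
) -> List[Optional[str]]:
    """For each frame, pick the best-scoring phone that is in the language's inventory."""
    allowed = set(inventory)
    labels: List[Optional[str]] = []
    for scores in frame_scores:
        candidates = {phone: s for phone, s in scores.items() if phone in allowed}
        # None marks a frame where no candidate phone belongs to the inventory.
        labels.append(max(candidates, key=candidates.get) if candidates else None)
    return labels


# Made-up classifier output for two frames.
frames = [
    {"t": 0.5, DENTAL_T: 0.3, "\u0288": 0.2},  # alveolar, dental, retroflex
    {"a": 0.6, "\u0251": 0.4},                 # front vs. back open vowel
]
inventory = [DENTAL_T, "a"]  # assumed inventory for the target language
labels = restrict_to_inventory(frames, inventory)
```

In the first frame the unrestricted best guess would be alveolar [t], but the restricted choice is the
dental stop, because the assumed inventory has no alveolar stop.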
I'm also unsure how well tone transcription would work in a phonetic system, assuming you're working
with a tone language. The last I heard, most ASR systems more or less ignore tone, and having to
deal with speaker variation, intonation, and so forth makes tone recognition much more challenging
than even vowel recognition (where formants also vary among speakers).
Finally, I'm not sure the OP was interested in complete phonetic transcription. If the problem was
discovering dialectal variation, then perhaps there are methods that would involve looking at where
a standard ASR system for the language was less certain about the word it was hearing. Most ASR
systems have a notion of certainty, at least under the hood; one source of uncertainty would
presumably be dialectal variants, and that could be harnessed to give alternative analyses.
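A minimal sketch of that idea, assuming the recognizer's output can be reduced to (word, confidence)
pairs with confidence between 0 and 1. How a given ASR toolkit actually exposes confidence varies, so
the data format, the threshold, and the function name here are hypothetical.

```python
from typing import List, Tuple


def flag_uncertain_words(
    hypothesis: List[Tuple[str, float]],
    threshold: float = 0.6,
) -> List[Tuple[int, str, float]]:
    """Return (position, word, confidence) for each word scored below the threshold."""
    return [
        (index, word, confidence)
        for index, (word, confidence) in enumerate(hypothesis)
        if confidence < threshold
    ]


# Invented recognizer output; the low score marks a possible reduced or dialectal form.
hypothesis = [("es", 0.95), ("necesario", 0.41), ("hacerlo", 0.88)]
flagged = flag_uncertain_words(hypothesis)
```

The flagged positions would only be candidates for a human to listen to, since low confidence can
also come from noise, disfluency, or out-of-vocabulary words.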
--
Mike Maxwell
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond