[Corpora-List] International Phonetic Alphabet transcription tool / software

Mike Maxwell maxwell at umiacs.umd.edu
Wed Mar 6 17:21:40 UTC 2013


On 3/6/2013 9:45 AM, Matías Guzmán wrote:
> Mario, I doubt there is anything that can do what you want. As I understand it, speech
> recognition systems depend to a great degree on a language model that predicts what word could
> come next, and then try to match what the speaker said to a database. I don't see how a program
> could tell you whether the /t/ is dental or alveolar; I don't even think speech recognition
> software does a decent job on single syllables. But I'm not an expert, maybe someone can
> correct me.

I share your skepticism about the usefulness of speech recognition for phonetic transcription, 
and like you, I'm not an expert in this area (far from it).

That said, when I hear speech recognition experts discuss the difficulties in their field, it 
appears that many of the problems they face come from issues that should be less relevant to 
phonetic transcription.  First, the word boundary problem: In order to do ASR of a language, 
converting the speech to the standard writing system of the language, you have to figure out where 
the word boundaries are.  The language model both helps you find those boundaries and relies on them 
to compare the likelihood of alternative word sequences.  (I believe this is true even for languages 
where word boundaries are not marked in the orthography, because the language model still has to 
convert the speech stream to a sequence of likely words in order to choose among alternative sequences.)
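
To make the word boundary point concrete, here is a minimal sketch in Python of how a language model can choose among alternative segmentations.  It is a toy, not how any real ASR system is built: the unspaced letter string stands in for the speech stream, and the unigram probabilities in UNIGRAM are made-up values for illustration only.

```python
import math

# Hypothetical toy unigram probabilities (made up for illustration).
# A real language model is estimated from a corpus and conditions on context.
UNIGRAM = {
    "i": 0.03, "am": 0.01, "no": 0.02, "now": 0.01,
    "here": 0.008, "where": 0.005, "nowhere": 0.0005,
}

def segment(stream, lm=UNIGRAM):
    """Return the most probable word sequence covering `stream`, or None."""
    n = len(stream)
    # best[i] = (log-probability of best segmentation of stream[:i],
    #            start index of the last word in that segmentation)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = stream[start:end]
            if word in lm and best[start][0] > -math.inf:
                score = best[start][0] + math.log(lm[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no sequence of known words covers the stream
    # Walk the back-pointers to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(stream[start:end])
        end = start
    return words[::-1]

print(segment("iamnowhere"))
```

With these made-up numbers, "nowhere" beats both "no where" and "now here"; change the probabilities and the preferred boundaries change with them, which is the sense in which the boundaries come from the model rather than from the signal.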

In phonetic transcription, on the other hand, you needn't worry about the word boundaries, because for 
the most part (except for utterance boundaries and perhaps hesitation points), they simply don't 
exist in the speech stream--so you *can't* do anything about them anyway.

Another problem that practical ASR systems face, which is probably irrelevant for a phonetic 
transcription system, is the phonetic reduction that happens in ordinary speech.  This is mostly not 
what a phonologist would think of as allophonic variation; rather, it's the wholesale omission of 
vowels or consonants, or the merger of sequences into a single element, with the result that many 
words have multiple pronunciations.  The English word 'probably', for example, is often pronounced 
something like [prabbly] (geminate [b]) or [prably]; in Spanish, 'necesario' ("necessary") is often 
pronounced [nessario] or [nesario].

But in phonetic transcription, you presumably *want* to transcribe these variants as they are 
pronounced, rather than as their normalized orthographic forms.

Another issue for practical ASR, which might not be a problem for phonetic transcription, is that 
ASR is often intended to work with noisy signals: telephones, cell phones, background noise, bad 
microphones, etc.  By contrast, if you have a decent place to do your recording for phonetics (no 
roosters in the background...), and you invest in a good microphone and recorder, you have far less 
noise to deal with.

To the extent that these issues do not arise in phonetics, phonetic transcription might actually be 
easier than ASR.  Of course, other problems arise in phonetic transcription that don't arise in 
ASR.  For one, the range of possible phones across languages is orders of magnitude greater 
than the number of phonemes in any one language.  The obvious (partial) solution to that is to 
determine the range of actual phones in the target language, either manually or through machine 
learning.  There will likely still be more phones than there would be phonemes, but at least the 
problem space will have been much reduced.
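
As a rough illustration of what restricting the problem space could look like, here is a small Python sketch.  Everything in it is assumed for the example: the phone labels, the per-segment scores in SCORES, and the two inventories are hypothetical, and a real recognizer would produce scores frame by frame from acoustic features.

```python
# Hypothetical scores from an imagined "universal" phone classifier for one
# segment; the labels and numbers are invented for illustration.
SCORES = {
    "t_dental": 0.40,
    "t_alveolar": 0.35,
    "t_retroflex": 0.25,
}

# Hypothetical inventories, determined manually or learned beforehand.
UNIVERSAL = set(SCORES)
ALVEOLAR_ONLY_LANGUAGE = {"t_alveolar", "d_alveolar"}

def best_phone(scores, inventory):
    """Pick the highest-scoring phone among those the language actually uses."""
    allowed = {phone: s for phone, s in scores.items() if phone in inventory}
    if not allowed:
        return None  # none of the candidate phones belongs to the inventory
    return max(allowed, key=allowed.get)

print(best_phone(SCORES, UNIVERSAL))               # unrestricted choice
print(best_phone(SCORES, ALVEOLAR_ONLY_LANGUAGE))  # restricted choice
```

The unrestricted call picks the dental stop, while the restricted call can only return the alveolar one.  The cost of the restriction is the one that matters for dialect work: a phone outside the assumed inventory can never be reported, so the inventory has to be determined with care.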

I'm also unsure how well tone transcription would work in a phonetic system, assuming you're working 
with a tone language.  The last I heard, most ASR systems more or less ignore tone; and trying to 
deal with speaker variation, intonation, and so forth makes tone recognition much more challenging 
than even vowel recognition (and vowel formants themselves vary among speakers).

Finally, I'm not sure the OP was interested in complete phonetic transcription.  If the problem was 
discovering dialectal variation, then perhaps there are methods that would involve looking at where 
a standard ASR system for the language was less certain about the word it was hearing.  Most ASR 
systems have a notion of certainty, at least under the hood; one source of uncertainty would 
presumably be dialectal variants, and that could be harnessed to give alternative analyses.
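
A minimal sketch of that last idea, in Python, assuming the ASR system can be made to emit a confidence value per recognized word.  The (word, confidence) format, the example values, and the 0.6 threshold are all hypothetical; real systems expose confidence in their own ways, if at all.

```python
# Hypothetical ASR output: one (word, confidence) pair per recognized word.
# The format and the values are invented for illustration.
hypothesis = [
    ("es", 0.95),
    ("necesario", 0.42),
    ("hablar", 0.88),
]

def flag_uncertain(words, threshold=0.6):
    """Return (position, word, confidence) for words below the threshold."""
    return [
        (i, word, conf)
        for i, (word, conf) in enumerate(words)
        if conf < threshold
    ]

for position, word, conf in flag_uncertain(hypothesis):
    print(f"word {position} ({word!r}) has low confidence {conf:.2f}")
```

The flagged stretches are only candidates: noise or a plain misrecognition lowers confidence too, so a person would still have to listen to them to decide whether dialectal variation is the cause.
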
-- 
     Mike Maxwell
     maxwell at umiacs.umd.edu
     "My definition of an interesting universe is
     one that has the capacity to study itself."
         --Stephen Eastmond
