[Corpora-List] Algorithm for orthography to IPA conversion in German

Sebastian Nagel wastl.nagel at googlemail.com
Tue May 10 21:16:25 UTC 2011


Hi Thomas,

try this:
  http://tools.webmasterei.com/mbrolatester/
and contact the webmaster who has an amusing blog
with some really interesting NLP stuff.

Personally, I found the link a couple of years ago, got curious,
and installed the MBROLA text-to-speech system that the page suggests
(there are packages for Linux). It has a "pipelined"
architecture, so it was quite easy to set up a pipeline which
does the conversion:
 + the heart is txt2pho (http://www.sk.uni-bonn.de/forschung/phonetik/sprachsynthese/txt2pho/)
 + SAMPA to IPA conversion is done via the Perl module CXS from http://www.theiling.de/ipa/

Here is a minimal script for conversion from the command line (txt2pho, CXS and recode must be installed):

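#!/bin/bash
# txt2ipa.sh: reads German words from stdin (UTF-8, one per line), prints IPA.
# Stages: insert "." separator lines between the input lines, recode
# UTF-8 to Latin-1, run the txt2pho filters and txt2pho itself, keep only
# the phoneme symbols of the output, and convert them to IPA with CXS.

# Set this to the txt2pho installation directory.
TXT2PHO=<path_to_txt2pho>
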
perl -lpe 'print "." if /^\s*$/; print ".\n";' \
    | recode -f u8..l1 \
    | $TXT2PHO/pipefilt/pipefilt \
    | $TXT2PHO/preproc/preproc $TXT2PHO/preproc/Rules.lst $TXT2PHO/preproc/Hadifix.abk \
    | $TXT2PHO/txt2pho -m -p $TXT2PHO/data/ \
    | perl -pe 'chomp; s/\s.+//; s/^_$//; print "\n" if /^$/;' \
    | perl -MCXS -lne '$ipa=cxs2ipa($_); print $ipa'
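For anyone curious what the second-to-last stage does: txt2pho writes MBROLA .pho lines, i.e. a phoneme symbol followed by a duration and optional pitch points, with "_" marking a pause. Below is a rough sketch of the same phoneme-column extraction as a small shell function. The name pho_symbols and the sample input are made up for illustration; the input is not real txt2pho output.

```shell
#!/bin/bash
# pho_symbols: hypothetical helper, not part of txt2pho or MBROLA.
# Reads .pho lines on stdin and prints one word per line, keeping only
# the phoneme symbols and treating "_" (pause) as a word boundary.
pho_symbols() {
    awk '
        /^;/ || NF == 0 { next }                          # skip comments and blank lines
        $1 == "_" { if (w != "") print w; w = ""; next }  # a pause closes the current word
        { w = w $1 }                                      # append the phoneme symbol
        END { if (w != "") print w }                      # flush a word without trailing pause
    '
}

# Made-up sample in .pho layout: symbol, duration in ms, optional pitch points.
printf '%s\n' '_ 50' 'h 60' 'aU 180 50 120' 's 110' '_ 200' | pho_symbols
# prints: haUs
```

The result is still SAMPA; the IPA symbols only appear after the final CXS stage.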

Test:
% echo -e "Haus\nHäuser\nChinaapfel\nPhonetik" | txt2ipa.sh

haʊs

hɔʏzɐ

çiːnaː
apfl

foːneːtɪk

As you can see, I struggled with word segmentation: the output
has stray blank lines, and "Chinaapfel" is split across two lines.
The transcription itself looks impressive to me, though I'm
not very familiar with phonetics.
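In case the final step looks like magic: for these words it boils down to a symbol-by-symbol mapping from SAMPA to IPA. Here is a minimal sketch covering only a handful of symbols. The function name sampa2ipa_sketch is made up, the table is far from complete (CXS handles the full inventory), and the SAMPA input strings are my assumption of what txt2pho produces for two of the test words.

```shell
#!/bin/bash
# sampa2ipa_sketch: hypothetical helper, NOT the CXS module.
# Maps a few German SAMPA symbols to IPA; all other characters pass through.
sampa2ipa_sketch() {
    sed -e 's/U/ʊ/g' -e 's/O/ɔ/g' -e 's/Y/ʏ/g' \
        -e 's/I/ɪ/g' -e 's/6/ɐ/g' -e 's/C/ç/g' -e 's/:/ː/g'
}

# Assumed SAMPA input for "Häuser" and "Phonetik".
printf '%s\n' 'hOYz6' 'fo:ne:tIk' | sampa2ipa_sketch
```

If the assumed input is right, this should reproduce the hɔʏzɐ and foːneːtɪk lines of the test.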

Bye,
Sebastian Nagel
(from Konstanz)


On 05/09/2011 10:47 AM, Thomas Schmidt wrote:
> Dear all,
> 
> I am looking for an algorithm / a tool / a set of rules which can help
> me to automatically derive an IPA transcription for an orthographic
> word (i.e. no lexicon lookup). Can anybody help (I'll post a summary)?
> 
> Thanks,
> 
> Thomas
> 

