Corpora: Syntactic/Phonologic network?

Mike Maxwell maxwell at ldc.upenn.edu
Wed Jan 23 13:36:52 UTC 2002


Yuval Feinstein wrote:
>[Are there]...networks according to
>phonological information?
>(e.g..."fish" and "wish" are similar
>phonologically)

A minimized finite state automaton (FSA) has some of the properties you
mention: it constitutes a network based on spelling similarity (or
phonological similarity, if you spell words "phonemically").  There was an
article about how minimized FSAs can be constructed in a recent issue of
Computational Linguistics.  However,

(1) There's no guarantee (and indeed, probably no way) that all phonological
similarities are captured.  For instance, how would you store the
similarities between "finish" and "fish"?  Without a theory (e.g. codas are
more 'important' than onsets), it would be difficult to decide between two
similarities, if only one can be represented in the network.  (In this case,
only one of the 'i's of "finish" can correspond to the "i" of "fish".)

(2) There's no obvious way to extract the similarities that are implicit in
a minimized FSA, short of asking for the intersection of the FSA with a list
of regular expressions constructed according to some minimal distance
algorithm.  E.g. if you want to find words similar to "fish", you would have
to intersect the FSA with expressions like "?fish", "f?ish" etc. (for
single-point insertions), "?ish", "f?sh" etc. (for single-point
replacements), "ish", "fsh" etc. (for single-point deletions), and "ifsh",
"fsih" etc. (for metathesis).
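
To make (2) concrete, here is a rough Python sketch of the lookup.  It is
only an illustration under some loud assumptions: a plain Python set stands
in for the minimized FSA, the "?" wildcard is expanded by enumerating a
lowercase ASCII alphabet, and the function names and the toy lexicon are
made up for this example.  A real implementation would intersect automata
rather than enumerate strings.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string one insertion, replacement, deletion, or
    adjacent transposition (metathesis) away from `word`."""
    # All ways to cut the word into a prefix and a suffix.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # "?fish", "f?ish", ... with "?" expanded over the alphabet.
    insertions = {a + c + b for a, b in splits for c in alphabet}
    # "?ish", "f?sh", ...
    replacements = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    # "ish", "fsh", ...
    deletions = {a + b[1:] for a, b in splits if b}
    # "ifsh", "fsih", ...
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replacing a letter with itself reproduces the word, so drop it.
    return (insertions | replacements | deletions | transpositions) - {word}


def similar_words(word, lexicon):
    """Intersect the candidate strings with the lexicon.
    Assumption: a set lookup stands in for FSA intersection."""
    return sorted(single_edit_variants(word) & set(lexicon))


# Hypothetical toy lexicon, for illustration only.
lexicon = ["dish", "finish", "fish", "fishy", "fist", "wish"]

print(similar_words("fish", lexicon))
```

Note that "finish" is not found from "fish" this way, since it is two
insertions away; widening the search means composing the edits, and the
candidate set grows quickly.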

Rhyming dictionaries do something like (1), under the theory that codas
(and, if I recall, stress patterns) are more important than other factors.
And if you're only interested in English or some other "commercially viable"
language, spell checkers do something like (2).  Of course they're more
concerned with the kinds of errors that arise from spelling conventions than
with sound, so e.g. the fact that the 'esh' sound is written in English with
two letters gives you a possible spelling error ("fsih") that has no basis
in phonology.

     Mike Maxwell
     Linguistic Data Consortium
     maxwell at ldc.upenn.edu


