[Corpora-List] automatic search for orthographic recurring patterns
Tylman Ule
ule at sfs.uni-tuebingen.de
Mon Dec 13 08:06:13 UTC 2004
Dear Marc,
If you are not interested in morphologically exact analyses, but rather in
substrings of words that might carry some meaning (or have some function),
you might find the enclosed small program useful.
It proved quite useful for segmenting German words from the medical domain
into meaningful substrings. Those words tend to be quite long and to
contain morphemes too specialized to be included in standard morphological
resources.
The idea is to map shorter words onto longer words of an input list, and to
consider the differences (read: affixes) as extensions to the input list
(this idea was actually not mine but Jorn Veenstra's). Restricting the
affixes to letter n-grams with roughly n>3 tends to yield quite meaningful
substrings. Iterating finds internal substrings, too.
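The idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not the attached Perl script: the function names `find_affixes` and `segment_vocabulary`, the `min_len` and `max_rounds` parameters, and the tiny word list are all assumptions made for the example. "n>3" is read here as a minimum remainder length of 4 letters.

```python
def find_affixes(words, min_len=4):
    """Collect the remainders left when a shorter word matches the
    start or the end of a longer word (assumed reading of the idea)."""
    known = {w for w in words if w}
    found = set()
    for long_word in known:
        for short_word in known:
            # Only keep remainders of at least min_len letters.
            if len(long_word) - len(short_word) < min_len:
                continue
            if long_word.startswith(short_word):
                found.add(long_word[len(short_word):])   # remainder is a suffix
            if long_word.endswith(short_word):
                found.add(long_word[:-len(short_word)])  # remainder is a prefix
    # Report only substrings that are not already in the list.
    return found - known


def segment_vocabulary(words, min_len=4, max_rounds=5):
    """Feed the new substrings back into the list and repeat, so that
    substrings of substrings are found as well."""
    original = {w for w in words if w}
    known = set(original)
    for _ in range(max_rounds):
        new = find_affixes(known, min_len)
        if not new:
            break
        known |= new
    return known - original


if __name__ == "__main__":
    sample = ["iris", "irisabriß", "harnleiter", "harnleiterabriß", "leiter"]
    print(sorted(segment_vocabulary(sample)))
```

With this small sample, "abriß" and "harn" turn up in the first round although neither is in the input list, and a second round adds "leiterabriß" because "harn" is by then available as a known prefix.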
An example would be the string "abriß" participating as prefix or suffix in
many terms (but not seen alone in the input list):
abriß * harnleiter-(0) iris-(0) urethra-(0) frenulum-(0) augapfel-(0)
meniskus-(0) ureter-(0) nagel-(0) harnröhren-(1) ziliarkörper-(1) bulbus-(1)
leiter-(1) mesenterial-(1) nierenstiel-(1) körper-(1)
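A listing of this kind could be assembled along the following lines. This is only a sketch under assumptions: the function name `partners` and the sample words are hypothetical, the trailing or leading hyphen is taken to mark the side on which the new substring attaches, and the numbers in parentheses are not reproduced because the message does not explain them.

```python
from collections import defaultdict


def partners(words, min_len=4):
    """Map each new substring to the known words it combines with.

    "iris-" means the substring follows "iris"; "-iris" would mean
    it precedes "iris" (assumed reading of the hyphen notation).
    """
    known = {w for w in words if w}
    table = defaultdict(set)
    for long_word in known:
        for short_word in known:
            if len(long_word) - len(short_word) < min_len:
                continue
            if long_word.startswith(short_word):
                rest = long_word[len(short_word):]
                if rest not in known:          # substring not seen alone
                    table[rest].add(short_word + "-")
            if long_word.endswith(short_word):
                rest = long_word[:-len(short_word)]
                if rest not in known:
                    table[rest].add("-" + short_word)
    return table


if __name__ == "__main__":
    sample = ["iris", "irisabriß", "harnleiter", "harnleiterabriß",
              "nagel", "nagelabriß"]
    for substring, combos in partners(sample).items():
        print(substring, "*", " ".join(sorted(combos)))
```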
Please see the included short documentation for more info.
Hope that helps,
Tylman
On Wednesday, 8 December 2004 09:38, you wrote:
> I would like
> to determine recurring orthographic patterns whether initial, i.e.
> "CARPO" (carpogenic, carpogenous, carpolite), final i.e. "IONALISM"
> (sensationalism, functionalism, etc.), or internal, i.e. "CHRON"
> (synchrony, synchronize, etc.).
--
Tylman Ule, Tel. *Oct.-Dec. 0251/83-31984*, Fax 07071/551335
Seminar für Sprachwissenschaft, Universität Tübingen
Haußerstraße 11, 72076 Tübingen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: morphstrings.pl
Type: application/x-perl
Size: 7403 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20041213/65e213c6/attachment-0001.bin>