[Corpora-List] automatic search for orthographic recurring patterns

Tylman Ule ule at sfs.uni-tuebingen.de
Mon Dec 13 08:06:13 UTC 2004


Dear Marc,

If you are not interested in morphologically exact analyses, but rather in 
substrings of words that might carry some meaning (or have some function), 
you might find the enclosed small program useful.

It proved quite useful for segmenting German words from the medical domain 
into meaningful substrings.  Those words tend to be quite long, and to 
contain morphemes too specialized to be included in standard morphological 
resources.

The idea is to map shorter words onto longer words of an input list, and to 
consider the differences (read: affixes) as extensions to the input list 
(this idea was actually not mine but Jorn Veenstra's).  Restricting the 
affixes to letter n-grams with roughly n>3 tends to yield quite meaningful 
substrings.  Iterating the procedure over the extended list finds internal 
substrings, too.
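
To make the idea concrete, here is a minimal Python sketch of it.  It is not 
the enclosed Perl program, and the names extend_wordlist, min_len and 
max_rounds are invented for illustration.  It assumes that "mapping" means 
exact prefix or suffix matching, and that an affix must be at least four 
letters long (the "roughly n>3" above).

```python
def extend_wordlist(words, min_len=4, max_rounds=5):
    """Return {new_substring: round in which it was found}.

    A word is split at every position; if one half is already known and
    the other half is at least `min_len` letters long, the other half is
    added to the list.  Repeating the pass lets substrings found earlier
    expose further ones.
    """
    vocab = set(words)
    found = {}
    for rnd in range(max_rounds):
        new = set()
        for w in vocab:
            for i in range(1, len(w)):
                head, tail = w[:i], w[i:]
                # known word + unknown remainder -> remainder is a suffix
                if head in vocab and len(tail) >= min_len and tail not in vocab:
                    new.add(tail)
                # unknown remainder + known word -> remainder is a prefix
                if tail in vocab and len(head) >= min_len and head not in vocab:
                    new.add(head)
        if not new:
            break  # nothing left to discover
        for s in new:
            found[s] = rnd
        vocab |= new
    return found


if __name__ == "__main__":
    sample = ["iris", "nagel", "irisabriß", "nagelabriß", "harnröhrenabriß"]
    for substring, rnd in sorted(extend_wordlist(sample).items()):
        print(substring, rnd)
```

On this toy list the first pass yields "abriß" (from "iris" and "nagel"), 
and only the second pass can then split "harnröhrenabriß" and yield 
"harnröhren".  The round numbers are a convenience of this sketch.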

An example would be the string "abriß", which occurs as a prefix or suffix in 
many terms (but is never seen on its own in the input list):

abriß * harnleiter-(0) iris-(0) urethra-(0) frenulum-(0) augapfel-(0) 
meniskus-(0) ureter-(0) nagel-(0) harnröhren-(1) ziliarkörper-(1) bulbus-(1) 
leiter-(1) mesenterial-(1) nierenstiel-(1) körper-(1)

Please see the included short documentation for more info.


Hope that helps,
Tylman


On Wednesday, 8 December 2004 09:38 you wrote:
> I would like
> to determine recurring orthographic patterns whether initial, i.e.
> "CARPO" (carpogenic, carpogenous, carpolite), final i.e.  "IONALISM"
> (sensationalism, functionalism, etc.) , or internal, i.e. "CHRON"
> (synchrony, synchronize, etc.).


-- 
Tylman Ule,  Tel. *Oct.-Dec. 0251/83-31984*, Fax 07071/551335  
  Seminar für Sprachwissenschaft, Universität Tübingen  
  Haußerstraße 11, 72076 Tübingen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: morphstrings.pl
Type: application/x-perl
Size: 7403 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20041213/65e213c6/attachment-0001.bin>

