[Corpora-List] automatic search for orthographic recurring patterns

William Fletcher fletcher at usna.edu
Wed Dec 8 11:29:15 UTC 2004


Hello Marc,

For my "Phrases in English" site, where all "char-grams" of 1-3
characters in the BNC are tallied by initial, medial and final position
  http://pie.usna.edu/explorec.html
I proceeded as follows:

- normalize and tokenize the corpus and tally the tokens

- take all types above a given frequency cutoff (I believe I used 15, to
avoid foreign sequences in non-English names etc.) and output a list of
types and frequencies

- In view of memory constraints (with higher values of n you get a lot
of unique chargrams), I made one pass for each combination of position
and number of characters as follows:

   - initialize an "associative array" to tally the frequency of each
chargram (I used the Windows dictionary object with PowerBasic)

   - read in the list of types and frequencies

   - break up each type into chargrams and add its frequency to the
frequency of that chargram in that position, e.g. for the type "corpus"
and a value of 2,
     "initial" pass:  co
     "medial" pass:  or rp pu
     "final" pass:  us

   - sort the array in reverse frequency order and output all chargrams
that met my threshold

   - loop back and do next combination of position and number
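
The steps above could be sketched in Python roughly as follows. This is
not the original PowerBasic code, only an illustration under stated
assumptions: the function names (type_frequencies, chargrams,
tally_pass) are hypothetical, tokenization is reduced to lowercase runs
of a-z, types shorter than n are skipped, and a type of exactly n
characters counts as both its own initial and its own final chargram.

```python
import re
from collections import Counter


def type_frequencies(text, cutoff=15):
    """Tally lowercase alphabetic tokens; keep types at or above the cutoff.

    The regex tokenizer is an assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t: f for t, f in Counter(tokens).items() if f >= cutoff}


def chargrams(word, n, position):
    """Return the n-character chargrams of word for one position."""
    if len(word) < n:
        return []  # assumption: types shorter than n contribute nothing
    if position == "initial":
        return [word[:n]]
    if position == "final":
        return [word[-n:]]
    # "medial": every chargram that touches neither edge of the type
    return [word[i:i + n] for i in range(1, len(word) - n)]


def tally_pass(type_freqs, n, position, threshold=1):
    """One pass: add each type's frequency to each of its chargrams."""
    counts = Counter()
    for word, freq in type_freqs.items():
        for gram in chargrams(word, n, position):
            counts[gram] += freq
    # most_common() returns the chargrams in descending frequency order
    return [(g, c) for g, c in counts.most_common() if c >= threshold]


if __name__ == "__main__":
    # Toy type list standing in for the real list of types and frequencies
    types = {"corpus": 20, "corpora": 30, "chorus": 15}
    for n in (1, 2, 3):
        for position in ("initial", "medial", "final"):
            print(n, position, tally_pass(types, n, position)[:5])
```

Running one pass per (n, position) pair keeps only a single tally in
memory at a time, which is the point of the loop described above.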

(I used a "quick and dirty" ad-hoc implementation for PIE which could
easily be adapted for command-line use.  "Someday" I may integrate this
capability into kfNgram to give it a nicer interface.)

Hope this helps,
Bill Fletcher


>>> MARC FRYD <marc.fryd at univ-poitiers.fr> 12/08/04 3:38 AM >>>
Hi,
Perhaps someone on the List will be able to help me with the following
datamining problem:

Given a corpus of isolated lexical units or collocations, I would like
to determine recurring orthographic patterns, whether initial, e.g.
"CARPO" (carpogenic, carpogenous, carpolite), final, e.g. "IONALISM"
(sensationalism, functionalism, etc.), or internal, e.g. "CHRON"
(synchrony, synchronize, etc.).
The output should be arranged so as to show respective productivity for
each pattern.
Important constraint: the various patterns will *not* be fed in
initially but should be extracted as a result of the algorithm.
I'll post a summary if I get several replies.
Regards to all list members.
Marc Fryd


