Corpora: Creating wordlists / 2-5 word clusters / **freq = 1**

Mark Davies mdavies at ilstu.edu
Tue Apr 3 12:12:54 UTC 2001


Can anyone recommend a PC-based program that creates wordlists with the
following three characteristics:

1) 2 / 3 / 4 / 5 word clusters
2) ** clusters that occur as few as one time **
3) wordlists for multi-million-word texts (processing smaller chunks and
merging the results would be fine)

For my present needs, #2 is the most important.  I've been using WordSmith,
and it can of course create wordlists of word clusters, but it deliberately
limits the lists to clusters that occur two or more times.  (In
Settings / Min/Max Frequencies / Word Frequency you can set the minimum as
low as 1, but for clusters of 2+ words it won't actually return any with a
frequency below 2.)  This limitation does make sense, since the number
of clusters that occur only once will be extremely large, easily in the
millions of distinct strings for 4-5 word clusters.  Nevertheless, for a
project that I am doing, this is (unfortunately) exactly what I need.
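
To make the request concrete, here is a minimal Python sketch of the kind
of output I am after.  It is only an illustration under simple assumptions:
tokens are lowercased and split on whitespace, and the names
count_clusters and merge_counts are made up for this example.  It keeps
every cluster, including those with a frequency of 1.

```python
from collections import Counter


def count_clusters(tokens, n_min=2, n_max=5):
    """Count every n_min- to n_max-word cluster, keeping freq = 1 items."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def merge_counts(chunk_counts):
    """Add up the per-chunk counters into one combined wordlist."""
    total = Counter()
    for chunk in chunk_counts:
        total.update(chunk)  # Counter.update adds counts together
    return total


# Toy example text; a real run would read each chunk from a file.
text = "the cat sat on the mat and the cat sat down"
counts = count_clusters(text.lower().split())

# List clusters by descending frequency, then alphabetically.
for cluster, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(freq, cluster)
```

Merging per-chunk counts this way misses any cluster that straddles a
chunk boundary, and holding every distinct cluster in memory may not scale
to multi-million-word texts, which is why I am hoping an existing program
already handles this.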

Thanks in advance for your help.

Mark Davies

=======================================
Mark Davies, Associate Professor, Spanish Linguistics
http://mdavies.for.ilstu.edu/

"Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
-- T.S. Eliot

4300 Foreign Languages
Illinois State University
Normal, IL 61790-4300
Voice:309/438-7975 / Fax:309/438-8038
=======================================


