Corpora: Corpus indexing program

E.S. esantoro at poczta.onet.pl
Sat Jun 1 10:53:16 UTC 2002


Can anyone direct me to a corpus indexing program that does fast
searches?  I have dabbled in Wordsmith and Winconcord for Windows, but
neither builds a complete index of my entire database of text,
approximately 2 GB, and both seem to take about 20 minutes on a Pentium
233 for one search.

My database is a collection of texts on U.S. and British literature,
history, and culture; linguistics; writing studies; history of English;
philosophy; critical and cultural theory; psychology.  The database also
includes daily postings from about 20 listservs and several newspapers
and journals, covering corpus linguistics, media studies, and the
other disciplines already mentioned.

This database serves two purposes for me and my students: a somewhat
customized research database of scanned and WWW material and an
extensive searchable corpus for language research.  (I believe I have a
much better collection of texts than either the BNC or Collins on
the Web.  However, my collection consists of at least 85% professional
American English and is not tagged, as you will see below.)

Currently I use Asksam 4 and 5 and Adobe Acrobat 4 and 5 to search this
2 GB database.  Adobe Acrobat will accomplish any search in 15 seconds
or less.  It's great for locating information, and quite useful for
looking at words and phrases in context, though it doesn't give any
empirical data.  But the whole database is indexed and searches are very
fast, and the current database can always easily be transferred to some
other system.

I also use Asksam 4 and 5.  It, too, indexes my entire database, but it
has the advantage of being able to do more complex proximity searches,
so any permutation is possible.  The only drawbacks are its slowness
(perhaps a minute or two on a P-233 machine for any complex search) and
that it, too, doesn't give empirical data.  At least Adobe Acrobat will
yield a list of all files that contain instances of the queried string.
Asksam, on the other hand, yields one file at a time.


I would be grateful if anyone can point me toward a program that is a
combination of the database programs I am already using and a bona fide
corpus program.


Thanks for the consideration,


Edward

PWSZ/NKJO Poland


