[Corpora-List] On tools for indexing and searching large corpora

Sun Dec 8 15:29:23 UTC 2002

Dear all,

Some time ago I sent a query to the Corpora List about indexing and
searching tools for BNC-like corpora (of about 100 MW, i.e. 100 million
words). The reason for the query is that using a 100 MW corpus without a
reasonably fast and compact indexing scheme is a nightmare. (The original
query is included at the end of this message.)

The responses I got fall into three categories:
1. a relational database can be used. The XML corpus is converted into a
set of tables (some databases have tools for importing XML files;
alternatively, the corpus can be preprocessed and imported as plain text).
Queries to the database are written in SQL, possibly behind a more
user-friendly interface; an example of this approach is the Spanish corpus
by Mark Davies (http://www.corpusdelespanol.org). Another possibility is
Berkeley DB XML (http://www.sleepycat.com/xml/), which can load XML
documents and uses XPath as its query language (currently in alpha release);
2. the IMS Corpus Workbench can be used. It handles 300+ MW corpora
successfully, though it uses its own input format (not TEI), and it is
unclear whether and how it can handle ambiguity in annotations (multiple
<ana> tags). This is also the software used with the Uppsala Corpus
(http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html);
3. the new BNC indexer, which is designed to work with any tagging scheme.
It is currently in its testing phase. By design, it is aimed at handling
very large corpora, and it uses SARA as the query interface
(http://www.hcu.ox.ac.uk/SARA).
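
To make option 1 more concrete, here is a minimal sketch of how ambiguous
annotations might be stored in a relational database. It is only an
illustration under stated assumptions: it uses Python's standard sqlite3 and
xml.etree.ElementTree modules, and the two-table schema (token, analysis)
and the function names are hypothetical, not taken from any of the systems
mentioned above. The point is that one row per <ana> tag keeps every
competing reading available to SQL queries.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample sentence in the TEI-based format quoted at the end of this message.
SAMPLE = """<s id="nashi.535">
  <w>глава
    <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
    <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
  </w>
  <w>Владивостока
    <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
  </w>
</s>"""


def build_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (
            id INTEGER PRIMARY KEY,
            sent_id TEXT,
            position INTEGER,
            form TEXT
        );
        CREATE TABLE analysis (
            token_id INTEGER REFERENCES token(id),
            lemma TEXT,
            pos TEXT,
            feats TEXT
        );
        CREATE INDEX analysis_lemma ON analysis(lemma);
    """)
    return conn


def load_sentence(conn, xml_text):
    """Insert one <s> element: one token row per <w>, one analysis row per <ana>."""
    s = ET.fromstring(xml_text)
    sent_id = s.get("id")
    for position, w in enumerate(s.iter("w")):
        form = (w.text or "").strip()
        cur = conn.execute(
            "INSERT INTO token (sent_id, position, form) VALUES (?, ?, ?)",
            (sent_id, position, form),
        )
        for ana in w.iter("ana"):
            conn.execute(
                "INSERT INTO analysis (token_id, lemma, pos, feats) "
                "VALUES (?, ?, ?, ?)",
                (cur.lastrowid, ana.get("lemma"), ana.get("pos"), ana.get("feats")),
            )


def ambiguous_forms(conn):
    """Return (form, number_of_analyses) for tokens with more than one reading."""
    rows = conn.execute("""
        SELECT t.form, COUNT(*)
        FROM token t JOIN analysis a ON a.token_id = t.id
        GROUP BY t.id
        HAVING COUNT(*) > 1
        ORDER BY t.id
    """)
    return rows.fetchall()
```

A typical session would call build_db(), then load_sentence(conn, SAMPLE)
for each sentence, and then query with ambiguous_forms(conn) or with any
other SQL over the two tables. A production system would need batched
inserts and further indexes, which are left out here.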

I'd prefer the third option once it is available, though the other options
can be useful, depending on your corpus.  I tested the alpha release of
Berkeley DB XML on my 40 MW corpus. It seems to cope well with
megaword-scale data and with Unicode characters.
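
For readers unfamiliar with XPath-style querying of such a corpus, the
following small sketch shows the general idea. It does not use Berkeley DB
XML or its API; it assumes Python's standard xml.etree.ElementTree module,
which supports only a limited subset of XPath, and the function names are
hypothetical. It works on a single in-memory sentence, so it says nothing
about performance on a full corpus.

```python
import xml.etree.ElementTree as ET

# Sample sentence in the TEI-based format quoted at the end of this message.
SAMPLE = """<s id="nashi.535">
  <w>глава
    <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
    <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
  </w>
  <w>Владивостока
    <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
  </w>
</s>"""


def forms_with_lemma(root, lemma):
    """Word forms that have at least one analysis with the given lemma.

    Assumption: the lemma contains no single quote, since it is pasted
    directly into the path predicate.
    """
    path = f"ana[@lemma='{lemma}']"
    return [
        (w.text or "").strip()
        for w in root.iterfind(".//w")
        if w.find(path) is not None
    ]


def forms_with_several_analyses(root):
    """Word forms whose ambiguity the tagger left unresolved (more than one <ana>)."""
    return [
        (w.text or "").strip()
        for w in root.iterfind(".//w")
        if len(w.findall("ana")) > 1
    ]
```

After root = ET.fromstring(SAMPLE), forms_with_lemma(root, "Владивосток")
selects the second word, and forms_with_several_analyses(root) selects the
first one, whose two readings are both kept.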

Many thanks for responses from
Lou Burnard <lou.burnard at computing-services.oxford.ac.uk>
Mark Davies <mdavies at ilstu.edu>
Arne Fitschen <fitschen at ims.uni-stuttgart.de>
Sylvain Loiseau <sylvain at toucheraveclesyeux.com>
Sergei Olonichev <olonichev at scnsoft.com>

Best wishes,
Serge

----- Original Message -----
From: Serge Sharoff <sharoff at aha.ru>
To: <corpora at lists.uib.no>
Sent: Tuesday, November 19, 2002 2:03 PM
Subject: [Corpora-List] On tools for indexing and searching large corpora

> Dear all,
>
> I'm in the process of compiling a corpus of modern Russian comparable to
> the BNC in its size and coverage. The format of the corpus is based on
> TEI, for instance,
> <s id="nashi.535">
> ...
>    <w>глава
>       <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
>       <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
>    </w>
>    <w>Владивостока
>       <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
>    </w>
> ...
> </s>
> In the first case, the POS tagger detects but cannot resolve an ambiguity
> between two possible readings (masc., animate, i.e. "the head of", and
> fem., inanimate, i.e. "the chapter of"), so both analyses are kept.
>
> Currently for searching the corpus I use custom tools written in Perl and
> based on regular expressions. As the corpus gets larger (currently 40
> million words), the indexing scheme gets totally inefficient and I'm
> reluctant to reinvent the wheel by improving it.
>
> What is the technology used in the BNC and other annotated corpora of
> similar size? Can it be applied in this case (given the need to cope with
> possible ambiguity)?  The corpus uses Win-1251 encoding, but eventually I
> plan to convert it to Unicode. Any suggestions?
>
> Best,
> Serge