[Corpora-List] On tools for indexing and searching large corpora

Sylvain Loiseau sylvain at toucheraveclesyeux.com
Wed Nov 20 17:41:32 UTC 2002


Dear Serge,

If you have a valid XML-encoded corpus (and, basically, if you want to
check that it is valid XML), regexes are not the best tool: you could
consider using a parser, and for efficiency a C parser. This allows you
to keep Perl as your main language, since Perl wrappers exist for the
best C libraries: XML::LibXML (a wrapper for the libxml2 library, which
now seems to provide a real SAX parser, i.e. one that does not buffer
the whole document) and XML::SAX::Expat (which moves James Clark's
Expat library into the SAX2 idiom). Both are available on CPAN.
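
For the first pass, such a well-formedness/validity check can be very
small. A minimal sketch, untested; 'corpus.xml' is a placeholder name,
and the validation(1) call assumes your corpus declares a DTD:

#!/usr/bin/perl
use strict;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->validation(1);   # validate against the DTD as well,
                          # not only check well-formedness
eval { $parser->parse_file('corpus.xml') };
print $@ ? "invalid: $@" : "corpus.xml parsed and validated OK\n";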

If you use XSLT/XPath, which is the best way to get a powerful (and
standard) query language without reinventing the wheel, you could
consider using Splitter and Merger SAX handlers to split your document
into middle-sized units (such as <text> in TEI), buffer each chunk, and
process the chunks with an XSLT processor (which is easy with
XML::LibXML and XML::LibXSLT; see XML::Filter::XSLT on CPAN for an
example of an XSLT filter in a SAX handler).
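
Here is a rough, untested sketch of such a splitter: each <text>
element is buffered as a string through XML::SAX::Writer, then parsed
and transformed with XML::LibXSLT, so the whole corpus is never held in
memory at once ('corpus.xml' and 'query.xsl' are placeholder names, and
the Merger stage is left out):

#!/usr/bin/perl
use strict;

package Splitter;
use base 'XML::SAX::Base';
use XML::SAX::Writer;
use XML::LibXML;

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'text') {              # open a new chunk
        $self->{buf}    = '';
        $self->{writer} = XML::SAX::Writer->new(Output => \$self->{buf});
        $self->{writer}->start_document({});
    }
    $self->{writer}->start_element($el) if $self->{writer};
}

sub characters {
    my ($self, $chars) = @_;
    $self->{writer}->characters($chars) if $self->{writer};
}

sub end_element {
    my ($self, $el) = @_;
    return unless $self->{writer};
    $self->{writer}->end_element($el);
    if ($el->{Name} eq 'text') {              # chunk complete: transform
        $self->{writer}->end_document({});
        my $doc = XML::LibXML->new->parse_string($self->{buf});
        my $res = $self->{style}->transform($doc);
        print $self->{style}->output_string($res);
        delete $self->{writer};
    }
}

package main;
use XML::SAX::Expat;
use XML::LibXSLT;
use XML::LibXML;

my $style = XML::LibXSLT->new->parse_stylesheet(
    XML::LibXML->new->parse_file('query.xsl'));
my $splitter = Splitter->new;
$splitter->{style} = $style;
XML::SAX::Expat->new(Handler => $splitter)->parse_uri('corpus.xml');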

Another solution to consider is storing your TEI-XML document in a
native XML database. Sleepycat's Berkeley DB XML is no doubt helpful (a
new alpha has just been released) and allows you to run XPath queries
on very large corpora. But I wonder (without having tested it further)
whether the size of the index needed for a deeply-annotated corpus
wouldn't simply replace the memory-consumption problem of the buffering
(XPath, XSLT) approach.

Berkeley DB XML: http://www.sleepycat.com/xml/index.html
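
Whichever backend you choose, the XPath queries themselves stay the
same. For instance, on the <w>/<ana> markup you quote below, finding
the tokens the tagger left ambiguous in a single corpus file could be a
one-liner (a sketch, untested; the file name is a placeholder):

use XML::LibXML;
my $doc = XML::LibXML->new->parse_file('corpus.xml');
# ambiguous tokens = <w> elements carrying more than one <ana> reading
my @ambiguous = $doc->findnodes('//w[count(ana) > 1]');
printf "%d ambiguous tokens\n", scalar @ambiguous;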

Please let me know your choice.
Regards,

Sylvain Loiseau




----- Original Message -----
From: "Serge Sharoff" <sharoff at aha.ru>
To: <corpora at lists.uib.no>
Sent: Tuesday, November 19, 2002 12:03 PM
Subject: [Corpora-List] On tools for indexing and searching large corpora


> Dear all,
>
> I'm in the process of compiling a corpus of modern Russian comparable
> to the BNC in its size and coverage. The format of the corpus is based
> on TEI, for instance:
> <s id="nashi.535">
> ...
>    <w>глава
>       <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
>       <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
>    </w>
>    <w>Владивостока
>       <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
>    </w>
> ...
> </s>
> In the first case, the POS tagger detects but cannot resolve an
> ambiguity between two possible readings (masc., animate, i.e. "the
> head of", and fem., inanimate, i.e. "the chapter of"), so both
> analyses are left.
>
> Currently, for searching the corpus, I use custom tools written in
> Perl and based on regular expressions. As the corpus gets larger
> (currently 40 million words), the indexing scheme gets totally
> inefficient, and I'm reluctant to reinvent the wheel by improving it.
>
> What is the technology used in the BNC and other annotated corpora of
> similar size? Can it be applied in this case (given the need to cope
> with possible ambiguity)? The corpus uses Win-1251 encoding, but
> eventually I plan to convert it to Unicode. Any suggestions?
>
> Best,
> Serge
>


