[Corpora-List] On tools for indexing and searching large corpora

Tue Nov 19 11:03:59 UTC 2002

Dear all,

I'm compiling a corpus of modern Russian comparable to the BNC in size and
coverage. The corpus format is based on TEI, for example:
<s id="nashi.535">
...
   <w>глава
      <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
      <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
   </w>
   <w>Владивостока
      <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
   </w>
...
</s>
In the first case, the POS tagger detects but cannot resolve an ambiguity
between two possible readings (masculine animate, i.e. "the head of", and
feminine inanimate, i.e. "the chapter of"), so both analyses are kept.
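
To make the structure concrete, here is a minimal Python sketch of reading
such a sentence while keeping every competing analysis. It assumes a
well-formed fragment containing only the two words shown (the elided parts
of the real sentence are left out), and the function name read_sentence is
hypothetical, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Assumption: a reduced, well-formed version of the sample sentence,
# containing only the two words shown in the message.
SAMPLE = """<s id="nashi.535">
   <w>глава
      <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
      <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
   </w>
   <w>Владивостока
      <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
   </w>
</s>"""


def read_sentence(xml_text):
    """Return (sentence id, [(word form, [analysis dicts])])."""
    s = ET.fromstring(xml_text)
    tokens = []
    for w in s.iter("w"):
        # The word form is the text that precedes the first <ana> child.
        form = (w.text or "").strip()
        analyses = [
            {
                "lemma": a.get("lemma"),
                "pos": a.get("pos"),
                "feats": a.get("feats", "").split(","),
            }
            for a in w.findall("ana")
        ]
        tokens.append((form, analyses))
    return s.get("id"), tokens


sent_id, tokens = read_sentence(SAMPLE)
for form, analyses in tokens:
    # More than one <ana> means the tagger left the ambiguity unresolved.
    status = "ambiguous" if len(analyses) > 1 else "unambiguous"
    print(sent_id, form, status, [a["feats"] for a in analyses])
```

A word counts as ambiguous here simply when it carries more than one
<ana> element.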

At present I search the corpus with custom tools written in Perl and based
on regular expressions. As the corpus grows (it is currently 40 million
words), this indexing scheme is becoming too inefficient, and I'm reluctant
to reinvent the wheel by improving it.
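
For comparison, one common alternative to scanning text with regular
expressions is an inverted index from attribute values to token positions.
The following is only a toy Python sketch of that idea, not a description
of my Perl tools or of any existing corpus engine; the names TOKENS,
build_index and search are hypothetical, and the data is held in memory
purely for illustration. An ambiguous token is indexed under every one of
its readings.

```python
from collections import defaultdict

# Assumption: tokens already parsed into (form, [(lemma, pos, feats), ...]).
TOKENS = [
    ("глава", [("глава", "noun", "мр,од,ед,им"),
               ("глава", "noun", "жр,но,ед,им")]),
    ("Владивостока", [("Владивосток", "noun", "мр,но,ед,рд,геог")]),
]


def build_index(tokens):
    """Map (attribute, value) pairs to the set of token positions."""
    index = defaultdict(set)
    for position, (form, analyses) in enumerate(tokens):
        index[("form", form)].add(position)
        # Index the token under each reading the tagger left in place.
        for lemma, pos, feats in analyses:
            index[("lemma", lemma)].add(position)
            index[("pos", pos)].add(position)
            for feat in feats.split(","):
                index[("feat", feat)].add(position)
    return index


def search(index, **criteria):
    """Return sorted positions of tokens matching all given criteria."""
    sets = [index.get(item, set()) for item in criteria.items()]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(TOKENS)
print(search(index, feat="жр"))                 # [0]
print(search(index, feat="мр"))                 # [0, 1]
print(search(index, lemma="глава", feat="од"))  # [0]
```

Note that this sketch intersects criteria per token rather than per
reading, so a query may match a token whose lemma comes from one analysis
and whose feature comes from another. A real index would probably need to
key on (token, reading) pairs to avoid that.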

What technology is used for the BNC and other annotated corpora of similar
size? Can it be applied in this case, given the need to cope with possible
ambiguity? The corpus uses the Windows-1251 encoding, but I plan to convert
it to Unicode eventually. Any suggestions?

Best,
Serge