[Corpora-List] On tools for indexing and searching large corpora

Olonichev Sergei olonichev at scnsoft.com
Thu Nov 21 08:28:25 UTC 2002


> Dear Serge,
>
> If you have a valid XML-encoded corpus (and, basically, if you want to
> check if it is valid XML), regexes are not the best tool: you could

Regexes have always been expressive for linguistic queries.

The search speed depends on the index implementation.
With a word-based index you can speed up regexp search drastically.
For example, to find the construction " word1 .+ word2 ", let the index
select only the documents that contain both words, then run the regexp
over those candidates:

echo "word1 & word2" | mgquery | grep -Ei " word1 .+ word2 "
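
The same two-stage idea (index lookup first, regexp only on the candidates)
can be sketched in Python. This is a toy illustration, not what mgquery does
internally: the in-memory index, the function names build_index and search,
and the sample sentences are all assumptions made up for the example, and
word boundaries (\b) stand in for the literal spaces in the pattern above.

```python
import re
from collections import defaultdict


def build_index(sentences):
    """Map each lowercased word to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, text in enumerate(sentences):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(sid)
    return index


def search(sentences, index, word1, word2):
    """Stage 1: index lookup (word1 AND word2). Stage 2: regexp on candidates."""
    # Cheap set intersection narrows the search space before any regexp runs.
    candidates = index.get(word1.lower(), set()) & index.get(word2.lower(), set())
    # Equivalent of "word1 .+ word2": word1, then at least one character, then word2.
    pattern = re.compile(
        r"\b" + re.escape(word1) + r"\b.+\b" + re.escape(word2) + r"\b",
        re.IGNORECASE,
    )
    return [sentences[sid] for sid in sorted(candidates)
            if pattern.search(sentences[sid])]


# Hypothetical sample data, for illustration only.
sentences = [
    "The head of the city spoke first.",
    "The city has no head.",
    "A chapter of the book.",
]
index = build_index(sentences)
hits = search(sentences, index, "head", "city")
print(hits)
```

Both of the first two sentences pass the index stage, because each contains
both words; only the regexp stage checks the word order. On a large corpus
the regexp then runs over a small fraction of the text.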

[skipped]

> Berkeley DB XML: http://www.sleepycat.com/xml/index.html
>
> Please let me know your choice.
> Regards,
>
> Sylvain Loiseau
>
>
>
>
> ----- Original Message -----
> From: "Serge Sharoff" <sharoff at aha.ru>
> To: <corpora at lists.uib.no>
> Sent: Tuesday, November 19, 2002 12:03 PM
> Subject: [Corpora-List] On tools for indexing and searching large corpora
>
>
> > Dear all,
> >
> > I'm in the process of compiling a corpus of modern Russian comparable to the
> > BNC in its size and coverage. The format of the corpus is based on TEI, for
> > instance,
> > <s id="nashi.535">
> > ...
> >    <w>глава
> >       <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
> >       <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
> >    </w>
> >    <w>Владивостока
> >       <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
> >    </w>
> > ...
> > </s>
> > in the first case, the POS tagger detects and cannot resolve an ambiguity
> > between two possible readings (masc, animate, i.e. the head of, and fem.,
> > inanimate, i.e. the chapter of), so both analyses are left.
> >
> > Currently for searching the corpus I use custom tools written in Perl and
> > based on regular expressions. As the corpus gets larger (currently 40
> > million words), the indexing scheme gets totally inefficient and I'm
> > reluctant to reinvent the wheel by improving it.
> >
> > What is the technology used in the BNC and other annotated corpora of
> > similar size? Can it be applied in this case (given the need to cope with
> > possible ambiguity)?  The corpus uses Win-1251 encoding, but eventually I
> > plan to convert it to Unicode. Any suggestions?
> >
> > Best,
> > Serge
> >
> >
> >
> >
> >
>
>


