[Corpora-List] On tools for indexing and searching large corpora

Pavel Rychly pary at textforge.cz
Fri Nov 22 00:41:43 UTC 2002


On Tue, Nov 19, 2002 at 02:03:59PM +0300, Serge Sharoff wrote:
> What is the technology used in the BNC and other annotated corpora of
> similar size? Can it be applied in this case (given the need to cope with
> possible ambiguity)?  The corpus uses Win-1251 encoding, but eventually I
> plan to convert it to Unicode. Any suggestions?

At the NLPlab of FI MU, Brno, Czech Republic, the Manatee system is in
regular use.  We work with corpora (including the BNC) in many different
languages and encodings.  Even the largest Czech corpus (more than 620
million tokens) has ambiguous lemma and grammatical annotation.
Manatee handles both the ambiguity and the large size of such corpora
well.  The Manatee system is available from www.textforge.cz
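
On the planned move from Win-1251 to Unicode: the conversion itself is
independent of the corpus tool.  A minimal sketch in Python, assuming
the corpus is stored as plain text files; the function name and the
chunk size are only illustrative, and this is not part of Manatee:

```python
def convert_cp1251_to_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Re-encode a Windows-1251 text file as UTF-8 (illustrative helper)."""
    # newline="" keeps the original line endings untouched.
    # Decoding is strict, so a byte that is undefined in cp1251
    # raises UnicodeDecodeError instead of being silently replaced.
    with open(src_path, "r", encoding="cp1251", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)  # read in chunks to bound memory use
            if not chunk:
                break
            dst.write(chunk)
```

Reading in chunks keeps memory use bounded, which matters for files of
this size.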

Best
Pavel


