[Corpora-List] Apache Lucene

Wed May 23 08:20:21 UTC 2007

yes, lucene is a full-text retrieval toolbox, but at least it lets you 
index several fields that are associated with the same document. on its 
own, though, that does not give you the link between words and their 
POS tags ....

but you could attach POS tags to the words and create two different 
fields: one with the words only and one with the words and their POS 
tags attached. you can then easily create queries that search both 
fields together. you get a lot of redundancy in the index, but lucene 
can handle large datasets. I did something similar for our QA system, 
which includes several fields with various kinds of annotation data on 
a corpus of more than 70 million words. no problem at all.
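
a minimal sketch of the two-field idea, in plain python rather than 
the lucene API (so nothing here is a lucene call). the field names 
"words" and "words_pos", the "_" separator between word and tag, and 
the toy documents are all assumptions made for illustration:

```python
from collections import defaultdict

# Hypothetical tagged corpus: each document is a list of (word, POS) pairs.
docs = {
    "d1": [("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ")],
    "d2": [("they", "PRP"), ("run", "VBP"), ("fast", "RB")],
}


def build_fields(tagged):
    """Return the two redundant fields for one document."""
    return {
        "words": [w for w, _ in tagged],                # plain words
        "words_pos": [f"{w}_{t}" for w, t in tagged],   # word with POS attached
    }


def build_index(docs):
    """Build a toy inverted index: field -> term -> set of document ids."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tagged in docs.items():
        for field, terms in build_fields(tagged).items():
            for term in terms:
                index[field][term].add(doc_id)
    return index


def search(index, **field_terms):
    """AND query over fields, e.g. search(index, words="fast", words_pos="run_VBP")."""
    result = None
    for field, term in field_terms.items():
        hits = index[field].get(term, set())
        result = set(hits) if result is None else result & hits
    return result if result is not None else set()


index = build_index(docs)
print(search(index, words="run"))          # matches any use of "run"
print(search(index, words_pos="run_NN"))   # matches "run" tagged as a noun only
```

the plain field answers ordinary word queries, while the combined 
field restricts a match to one tag. a real lucene setup would put the 
same two token streams into two fields of one document.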

I also came across another IR toolbox that supports annotation data: 
Lemur. in the list of features (http://www.lemurproject.org/features.php) 
they state:

...
# Indexes inline and offset text annotations (e.g., part-of-speech and 
named entities)
# Indexes document attributes

I haven't tried it yet but it sounds promising. maybe your IT people 
are happy with this one?

good luck!

Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********