[Corpora-List] Apache Lucene
Joerg Tiedemann
tiedeman at let.rug.nl
Wed May 23 08:20:21 UTC 2007
Yes, Lucene is a full-text retrieval toolbox.
It does at least support indexing several fields that are
associated with the same document, but that alone does not give you
the link between words and their POS tags.
You could, however, attach the POS tags to the words and create two
different fields, one with the words only and one with the words and
their POS tags attached. You can then easily write queries that search
both fields together. This puts a lot of redundancy in the index, but
Lucene can handle large datasets. I did something similar for our QA
system, including several fields with various annotation data, on a
corpus of >70 million words. No problem at all.
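To make the two-field idea a bit more concrete, here is a minimal sketch
in plain Python. It does not use Lucene's API: `make_fields`, `ToyIndex`,
the field names `words` and `words_pos`, and the `_` separator are all
assumptions made up for illustration, and the small in-memory index only
stands in for a real one.

```python
from collections import defaultdict


def make_fields(tagged_tokens, sep="_"):
    """Build two field values for one document from (word, tag) pairs.

    'words' holds the bare words; 'words_pos' holds each word with its
    POS tag attached (hypothetical naming, e.g. 'run_NN').
    """
    words = [w.lower() for w, _ in tagged_tokens]
    words_pos = [f"{w.lower()}{sep}{t}" for w, t in tagged_tokens]
    return {"words": words, "words_pos": words_pos}


class ToyIndex:
    """Tiny inverted index keyed by (field, term); illustration only."""

    def __init__(self):
        # (field, term) -> set of document ids
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        for field, terms in fields.items():
            for term in terms:
                self.postings[(field, term)].add(doc_id)

    def search(self, *clauses):
        """AND together (field, term) clauses and return matching doc ids."""
        sets = [self.postings.get((f, t), set()) for f, t in clauses]
        return set.intersection(*sets) if sets else set()


index = ToyIndex()
index.add(1, make_fields([("the", "DT"), ("run", "NN"), ("ended", "VBD")]))
index.add(2, make_fields([("they", "PRP"), ("run", "VBP"), ("fast", "RB")]))

# Word-only query: matches both documents.
print(sorted(index.search(("words", "run"))))
# Word+tag query: only the document where 'run' is a noun.
print(sorted(index.search(("words_pos", "run_NN"))))
# Query over both fields together.
print(sorted(index.search(("words", "fast"), ("words_pos", "run_VBP"))))
```

Every token is stored twice, once per field, which is where the
redundancy mentioned above comes from.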
I also saw another IR toolbox that supports annotation data: Lemur.
In its list of features (http://www.lemurproject.org/features.php)
they state:
...
# Indexes inline and offset text annotations (e.g., part-of-speech and
named entities)
# Indexes document attributes
I haven't tried it yet, but it sounds promising. Maybe your IT people
would be happy with this one?
good luck!
Jörg
***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann tiedeman at let.rug.nl **
** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
** 9712 EK Groningen fax: +31 (0)50-363 6855 **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********