[Corpora-List] Looking for a corpus management tool

Wed Oct 14 11:08:18 UTC 2009

Hi,

We're developing a large, diverse corpus of (written) Modern Hebrew,  
along with morphological processing tools. In order to facilitate  
access to the corpus, and in particular to make it usable for  
linguistic research, we're looking for a corpus management tool that
supports as many of the following features as possible:

- Ability to store and index millions of texts, hundreds of millions  
of tokens
- Ability to process morphologically analyzed text where each token  
can be associated with multiple analyses, each consisting of  
structured complex information (currently in XML, can be converted to  
SQL)
- Ability to handle UTF-8-encoded data and right-to-left script
- GUI for searching the corpus based on meta-information, tokens,  
lemmas, morphological information, etc.
- More information retrieval options than concordance (e.g., retrieve  
collocations, compute mutual information measures, etc.)
- Open source or freely available
- Easy to maintain (specifically, add texts, change morphological  
annotation, add search options)
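
To make the collocation and mutual information item concrete, here is a
minimal sketch of the kind of measure meant: pointwise mutual information
over adjacent token pairs. The function name pmi_scores and the min_count
parameter are illustrative assumptions, not part of any existing tool, and
a real system would compute this from an index rather than an in-memory
list.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Return PMI (log base 2) for each adjacent token pair.

    Hypothetical helper for illustration only; `tokens` is a plain list
    of strings (surface forms or lemmas).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI is unreliable
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    sample = ["בית", "ספר", "גדול", "בית", "ספר", "קטן"]
    for pair, score in sorted(pmi_scores(sample).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

The same computation could be run over lemmas or over a chosen
morphological analysis instead of surface tokens.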

If you reply to me, I'd be happy to send a summary of the responses to
the List. Many thanks for your help,

Shuly

-- 
Shuly Wintner
Dept. of Computer Science, University of Haifa, 31905 Haifa, Israel
Phone: +972 (4) 8288180  Fax: +972 (4) 8249331
shuly at cs.haifa.ac.il   http://cs.haifa.ac.il/~shuly
