[Corpora-List] Looking for a corpus management tool

Shuly Wintner shuly at cs.haifa.ac.il
Thu Oct 15 15:58:44 UTC 2009


Hi,

> We're developing a large, diverse corpus of (written) Modern Hebrew,
> along with morphological processing tools. In order to facilitate
> access to the corpus, and in particular to make it usable for
> linguistic research, we're looking for a corpus management tool that
> supports as many as possible of the following features:

I was impressed by the number of responses and the vast array of  
potential solutions. It would take us some time to evaluate the  
various platforms, so I list below the responses I received without  
any evaluation of their compliance with our needs. Many thanks to all  
who responded.

Shuly

On Oct 14, 2009, at 13:23 , Diana Maynard wrote:

> GATE should be able to fulfil pretty much all those requirements.  
> The IR features are a bit limited but you can add your own plugins  
> as necessary.
> http://gate.ac.uk for more info and download

On Oct 14, 2009, at 15:35 , Siva Reddy wrote:

> SketchEngine (http://www.sketchengine.co.uk) has some of the  
> facilities you require. But I don't know if it can handle right-to-left
> script. You can confirm it.

On Oct 14, 2009, at 17:10 , Janne Bondi Johannessen wrote:

> Our corpus system Glossa is freely downloadable, and can do a good  
> many things. It is based on a combination of Corpus Workbench (CWB)  
> and MySQL.
> CWB is used to search for word forms, lemmas and morphological
> information, while metadata (e.g., author, publisher, etc.) is stored
> in a MySQL database. Users may annotate search results, and these
> annotations are also stored in the database. The system consists
> mainly of CGI scripts written in Perl, with a few parts written in
> PHP. Collocation lists can be generated based on a wide range of
> statistical measures (mutual information, log-likelihood ratio, Dice
> coefficient, etc.) through the use of a Perl module written by Ted
> Pedersen and others. Glossa currently handles up to approx. 30 million
> tokens in a single corpus.
> More information can be found here: http://www.hf.uio.no/tekstlab/English/glossa.html
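
As an illustration of two of the collocation measures named above
(mutual information and the Dice coefficient), here is a minimal
Python sketch that scores adjacent word pairs. It is not Glossa's
implementation, nor the Perl module mentioned in the reply; the
function name collocation_scores and the restriction to adjacent
bigrams are assumptions made for the example.

```python
import math
from collections import Counter

def collocation_scores(tokens):
    """Score adjacent word pairs by pointwise mutual information and Dice.

    Hypothetical helper for illustration only; real tools usually also
    apply frequency cut-offs and wider co-occurrence windows.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        p12 = f12 / n_bigrams            # relative frequency of the pair
        p1 = unigrams[w1] / n_tokens     # relative frequency of each word
        p2 = unigrams[w2] / n_tokens
        pmi = math.log2(p12 / (p1 * p2))
        dice = 2 * f12 / (unigrams[w1] + unigrams[w2])
        scores[(w1, w2)] = {"freq": f12, "pmi": pmi, "dice": dice}
    return scores

if __name__ == "__main__":
    sample = "a b a b c".split()
    for pair, s in sorted(collocation_scores(sample).items()):
        print(pair, s)
```

Mutual information tends to favour rare pairs, which is one reason
tools commonly offer several measures side by side.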

On Oct 14, 2009, at 19:29 , Eric Atwell wrote:

> See http://corpus.leeds.ac.uk/internet.html - Serge Sharoff has put
> together a website and Perl tools which cover most of your needs:
>
> - Corpora of hundreds of millions of words, in several languages
> - PoS-tagged, tho not with multiple analyses
> - Arabic included, UTF-8 right-to-left script (but not Hebrew (yet))
> - various concordance options, also collocations, by mutual info /
>   T-score / log-likelihood
> - open-source and easy to adapt IF you like perl
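
For readers unfamiliar with the "concordance options" mentioned in the
list above, a keyword-in-context (KWIC) concordance can be sketched in
a few lines of Python. This is only an illustration of the idea, not
the Leeds tools (which are written in Perl); the function name kwic
and its window parameter are hypothetical.

```python
def kwic(tokens, keyword, window=3):
    """Return (left, hit, right) context triples for each occurrence of keyword.

    Hypothetical helper: contexts are joined in logical (storage) order,
    so displaying right-to-left text correctly is left to the viewer.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

if __name__ == "__main__":
    text = "the cat sat on the mat".split()
    for left, hit, right in kwic(text, "the", window=2):
        print(f"{left:>10} [{hit}] {right}")
```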

On Oct 14, 2009, at 21:35 , Bruce Anderson wrote:

> If I were facing your challenges, I'd be looking into building  
> something on top of the NLTK (Natural Language Toolkit).  I suspect  
> that you are hoping to find something that would involve a lot less  
> programming.  If you don't find the right tool, check out NLTK
> (www.nltk.org).
>
> All of the features you mention could be fairly readily implemented  
> - provided you have access to some Python programming talent.

On Oct 15, 2009, at 03:06 , Adam Kilgarriff wrote:

> Sketch Engine meets all the criteria except 'open source or freely  
> available'.

On Oct 15, 2009, at 09:01 , Adam Przepiorkowski wrote:

> Poliqarp (http://poliqarp.sourceforge.net/) satisfies all these
> requirements, perhaps apart from the last one:
>
>> - Easy to maintain (specifically, add texts, change morphological
>> annotation, add search options)
>
> It's a typical product of a scientific project, so you may find
> documentation somewhat lacking, but we use it for various Polish
> corpora, including the currently 500 million-token morphosyntactically
> encoded demo of the National Corpus of Polish
> (http://nkjp.pl/index.php?page=6&lang=1, see "IPI PAN Search Engine
> for NKJP data"; you'll find "query syntax" there), which will grow to
> 1 billion within a year or so.  Poliqarp is also used by Antonio
> Branco's group for their corpus of Portuguese.
>
> I am not sure whether the right-to-left script is an issue -- you tell
> me ;-)

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
