[Corpora-List] Looking for a corpus management tool
Shuly Wintner
shuly at cs.haifa.ac.il
Thu Oct 15 15:58:44 UTC 2009
Hi,
> We're developing a large, diverse corpus of (written) Modern Hebrew,
> along with morphological processing tools. In order to facilitate
> access to the corpus, and in particular to make it usable for
> linguistic research, we're looking for a corpus management tool that
> supports as many as possible of the following features:
I was impressed by the number of responses and the vast array of
potential solutions. It would take us some time to evaluate the
various platforms, so I list below the responses I received without
any evaluation of their compliance with our needs. Many thanks to all
who responded.
Shuly
On Oct 14, 2009, at 13:23 , Diana Maynard wrote:
> GATE should be able to fulfil pretty much all those requirements.
> The IR features are a bit limited but you can add your own plugins
> as necessary.
> http://gate.ac.uk for more info and download
On Oct 14, 2009, at 15:35 , Siva Reddy wrote:
> SketchEngine (http://www.sketchengine.co.uk) has some of the
> facilities you require. But I don't know if it can handle right-to-
> left script. You can confirm it.
On Oct 14, 2009, at 17:10 , Janne Bondi Johannessen wrote:
> Our corpus system Glossa is freely downloadable, and can do a good
> many things. It is based on a combination of Corpus Workbench (CWB)
> and MySQL.
> CWB is used to search for word forms, lemmas and morphological
> information, while metadata (e.g., author, publisher, etc.) is stored
> in a MySQL database. Users may annotate search results, and these
> annotations are also stored in the database. The system consists
> mainly of CGI scripts written in Perl, with a few parts written in
> PHP. Collocation lists can be generated based on a wide range of
> statistical measures (mutual information, log-likelihood ratio, Dice
> coefficient, etc.) through the use of a Perl module written by Ted
> Pedersen and others. Glossa currently handles up to approx. 30 million
> tokens in a single corpus.
> More information can be found here: http://www.hf.uio.no/tekstlab/English/glossa.html
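
For readers unfamiliar with the association measures named above, here is a minimal Python sketch of how they can be computed from co-occurrence counts. It is not Glossa's code (Glossa uses a Perl module for this); the function name `association_scores` and its parameters are hypothetical, and the formulas are the standard textbook definitions over a 2x2 contingency table.

```python
import math

def association_scores(c12, c1, c2, n):
    """Score a word pair (w1, w2) from raw counts.

    c12: how often w1 and w2 co-occur (assumed > 0)
    c1, c2: total occurrences of w1 and of w2
    n: total number of pairs observed in the corpus
    """
    # Pointwise mutual information: log2 of observed / expected co-occurrence.
    pmi = math.log2((c12 * n) / (c1 * c2))

    # Dice coefficient: overlap relative to the two marginal counts.
    dice = 2 * c12 / (c1 + c2)

    # Log-likelihood ratio (G^2) over the four cells of the 2x2 table.
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    return {"pmi": pmi, "dice": dice, "log_likelihood": g2}
```

When the pair co-occurs exactly as often as chance predicts, both the mutual information and the log-likelihood ratio come out as zero; larger values indicate a stronger collocation.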
On Oct 14, 2009, at 19:29 , Eric Atwell wrote:
> See http://corpus.leeds.ac.uk/internet.html - Serge Sharoff has put
> together a website and Perl tools which cover most of your needs:
>
> - Corpora of hundreds of millions of words, in several languages
> - PoS-tagged, tho not with multiple analyses
> - Arabic included, UTF-8 right-to-left script (but not Hebrew (yet))
> - various concordance options, also collocations, by mutual info /
> T-score / log-likelihood
> - open-source and easy to adapt IF you like perl
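
As a rough illustration of what a concordance option involves, the following Python sketch produces keyword-in-context (KWIC) lines from a token list. It is an assumption-laden toy, not the Leeds tools (which are written in Perl); the name `kwic` and the `width` parameter are hypothetical. It works on tokens in logical order, so right-to-left display is left to the rendering layer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    tokens: list of word tokens in logical (reading) order
    keyword: exact token to search for
    width: number of context tokens to keep on each side
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    sample = "the cat sat on the mat".split()
    for left, kw, right in kwic(sample, "the", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>12} [{kw}] {right}")
```

A real corpus tool would match on lemmas or morphological tags rather than exact strings and would use an index instead of a linear scan.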
On Oct 14, 2009, at 21:35 , Bruce Anderson wrote:
> If I were facing your challenges, I'd be looking into building
> something on top of the NLTK (Natural Language Toolkit). I suspect
> that you are hoping to find something that would involve a lot less
> programming. If you don't find the right tool, check out NLTK
> (www.nltk.org).
>
> All of the features you mention could be fairly readily implemented
> - provided you have access to some (PYTHON) programming talent.
On Oct 15, 2009, at 03:06 , Adam Kilgarriff wrote:
> Sketch Engine meets all the criteria except 'open source or freely
> available'.
On Oct 15, 2009, at 09:01 , Adam Przepiorkowski wrote:
> Poliqarp (http://poliqarp.sourceforge.net/) satisfies all these
> requirements, perhaps apart from the last one:
>
>> - Easy to maintain (specifically, add texts, change morphological
>> annotation, add search options)
>
> It's a typical product of a scientific project, so you may find
> documentation somewhat lacking, but we use it for various Polish
> corpora, including the currently 500 million-token morphosyntactically
> encoded demo of the National Corpus of Polish
> (http://nkjp.pl/index.php?page=6&lang=1, see "IPI PAN Search Engine
> for NKJP data"; you'll find "query syntax" there), which will grow to
> 1 billion within a year or so. Poliqarp is also used by Antonio
> Branco's group for their corpus of Portuguese.
>
> I am not sure whether the right-to-left script is an issue -- you tell
> me ;-)
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora