[Corpora-List] software for co-occurrence/collocation analysis in German texts

Maarten van Gompel proycon at anaproy.nl
Wed Aug 20 19:18:43 UTC 2014


Quoting Abdoulaye Dramé (2014-08-15 12:16:26)
> I would like to find co-occurring words in German texts. The number of texts I
> have is about 1 000 000 (one million), with each text having about 10 sentences.
> 
> Does anybody know where I can find a software to do the analysis on such a big
> amount of texts?
> 
> I would prefer Java software, but others are also OK provided they run on
> Ubuntu.
> 
> Any help would be appreciated.

Hi Abdoulaye,

You could use colibri-core for that purpose. It has facilities to find
co-occurring words or patterns, using (normalised) pointwise mutual information.
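
For reference, normalised pointwise mutual information (NPMI) for a pair (x, y)
is log(p(x,y) / (p(x) p(y))) divided by -log p(x,y). It ranges from -1 to 1,
with 0 meaning independence and 1 meaning the two words only ever occur
together. The sketch below is plain Python and only illustrates the measure; it
is not colibri-core's implementation or API. The function name npmi_scores, the
min_count parameter, the whitespace tokenisation and the choice of the sentence
as the co-occurrence window are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(sentences, min_count=1):
    """Return {(w1, w2): npmi} for word pairs that share a sentence.

    Hypothetical helper for illustration only. Probabilities are estimated
    as the fraction of sentences containing a word (or both words of a pair).
    """
    word_counts = Counter()
    pair_counts = Counter()
    n = 0  # number of non-empty sentences

    for sentence in sentences:
        # Naive whitespace tokenisation; each word type counts once per sentence.
        types = sorted(set(sentence.lower().split()))
        if not types:
            continue
        n += 1
        word_counts.update(types)
        # Sorting above makes every pair key alphabetically ordered.
        pair_counts.update(combinations(types, 2))

    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue
        if count == n:
            # Both words occur in every sentence: NPMI is 0/0, so skip the pair.
            continue
        p_xy = count / n
        p_x = word_counts[w1] / n
        p_y = word_counts[w2] / n
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(w1, w2)] = pmi / -math.log(p_xy)
    return scores
```

Calling npmi_scores() on a list of sentence strings returns a dictionary that
can be sorted by value to rank pairs. On a million texts the pair counter grows
very quickly, so a frequency threshold (min_count here) and a memory-efficient
representation matter in practice.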

It is optimised for handling big data, but a machine with lots of memory
is required nevertheless. The software is written in C++, can
be used as a command-line tool, and there is a Python binding as well.

The software resides here:
 https://github.com/proycon/colibri-core

An introductory blog post is here:
 http://proycon.github.io/blog/2014/03/31/colibri-core/

Documentation is here:
 http://proycon.github.io/colibri-core/doc/

A Python tutorial to use the library is here:
 http://proycon.github.io/colibri-core/doc/colibricore-python-tutorial.html

It should run fine on Ubuntu or any other Linux distribution (it was developed
on Linux).

Full disclosure: I developed this software as part of my PhD project and am
still actively maintaining it.

Regards,

--

Maarten van Gompel
 Centre for Language Studies
 Radboud Universiteit Nijmegen

proycon at anaproy.nl
http://proycon.anaproy.nl
http://github.com/proycon

GnuPG key:  0x1A31555C  XMPP: proycon at anaproy.nl
Bitcoin:    1BRptZsKQtqRGSZ5qKbX2azbfiygHxJPsd 
