[Corpora-List] software for cooccurence/ collocations analysis in german texts
Maarten van Gompel
proycon at anaproy.nl
Wed Aug 20 19:18:43 UTC 2014
Quoting Abdoulaye Dramé (2014-08-15 12:16:26)
> I would like to find co-occurring words in German texts. The number of texts I
> have is about 1 000 000 (one million), with each text having about 10 sentences.
>
> Does anybody know where I can find software to do the analysis on such a large
> number of texts?
>
> I would prefer Java software, but others are also OK provided they run on
> Ubuntu.
>
> Any help would be appreciated.
Hi Abdoulaye,
You could consider using colibri-core for that purpose. It has facilities to find
co-occurring words or patterns, using (normalised) pointwise mutual information.
It is optimised for handling big data, but a machine with lots of memory
is required nevertheless. The software is written in C++, can
be used as a command-line tool, and there is a Python binding as well.
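To make the scoring concrete, here is a minimal sketch of sentence-level
normalised pointwise mutual information in plain Python. It does not use
colibri-core or its API; the function name npmi_scores, the min_count
parameter and the toy sentences are assumptions made up for illustration, and
since it keeps all counts in memory it is not meant for a million texts.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(sentences, min_count=1):
    """Return {(word_a, word_b): NPMI} for words sharing a sentence.

    Hypothetical helper for illustration only, not part of colibri-core.
    Probabilities are estimated as fractions of sentences.
    """
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word type once per sentence; naive whitespace tokenisation.
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        if joint < min_count:
            continue
        p_ab = joint / n
        if p_ab == 1.0:
            # -log(1) is 0, so NPMI is undefined for pairs in every sentence.
            continue
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        pmi = math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores


if __name__ == "__main__":
    toy = [
        "der hund bellt",
        "der hund rennt",
        "die katze rennt",
        "die katze miaut",
    ]
    ranked = sorted(npmi_scores(toy).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 3))
```

NPMI ranges from -1 to 1: values near 1 mean two words almost always appear
together, and values near 0 mean they co-occur about as often as chance would
predict.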
The software resides here:
https://github.com/proycon/colibri-core
An introductory blog post is here:
http://proycon.github.io/blog/2014/03/31/colibri-core/
Documentation is here:
http://proycon.github.io/colibri-core/doc/
A Python tutorial to use the library is here:
http://proycon.github.io/colibri-core/doc/colibricore-python-tutorial.html
It should run fine on Ubuntu or any other Linux (it was developed on that
platform).
Full disclosure: I developed this software as part of my PhD project and am
still actively maintaining it.
Regards,
--
Maarten van Gompel
Centre for Language Studies
Radboud Universiteit Nijmegen
proycon at anaproy.nl
http://proycon.anaproy.nl
http://github.com/proycon
GnuPG key: 0x1A31555C XMPP: proycon at anaproy.nl
Bitcoin: 1BRptZsKQtqRGSZ5qKbX2azbfiygHxJPsd