[Corpora-List] Semantic analysis tool

Radim Rehurek xrehurek at aisa.fi.muni.cz
Sun Sep 19 17:51:34 UTC 2010


Dear all,

I'd like to draw your attention to 'gensim', a new NLP tool I developed 
recently. Its aim is to make unsupervised "semantic analysis" (in the 
mundane statistical sense, no psychology/linguistics) of texts accessible 
to non-mathematicians.

Features:
* can process corpora larger than RAM (streamed algorithms)
* simple to trivial interfaces: you can get going quickly, no Java-esque 
madness
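
To illustrate what "streamed" means here, below is a minimal Python sketch. 
It is not gensim's actual API: the names StreamedCorpus and 
document_frequencies are hypothetical, and the one-document-per-line input 
format is an assumption. The point is only that documents are produced one 
at a time, so a statistic can be accumulated without holding the corpus in 
RAM.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus wrapper: yields one bag-of-words per document.

    `open_source` is a callable returning a fresh iterable of lines, so the
    corpus can be iterated several times (once per pass of an algorithm).
    Assumes one document per line, tokens separated by whitespace.
    """

    def __init__(self, open_source):
        self.open_source = open_source

    def __iter__(self):
        for line in self.open_source():
            # Only the current document is held in memory.
            yield Counter(line.lower().split())


def document_frequencies(corpus):
    """Single pass over the stream: count documents containing each token."""
    df = Counter()
    num_docs = 0
    for bow in corpus:
        df.update(bow.keys())
        num_docs += 1
    return num_docs, df


# Toy in-memory source; a real one would open a file on disk instead.
text = (
    "human computer interaction\n"
    "computer system response time\n"
    "graph of trees\n"
)
corpus = StreamedCorpus(lambda: io.StringIO(text))
num_docs, df = document_frequencies(corpus)
```

Replacing the lambda with something like `lambda: open(path)` (path being a 
placeholder) would stream from disk in the same way.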

gensim contains unique incremental implementations of popular algorithms 
like:

* Latent Semantic Analysis: takes 2.5 hours on a 2-billion-token corpus of 3.2M 
documents (the entire English Wikipedia), on a single laptop. LSA has not 
been used much in practical NLP due to its massive computational demands; 
this is no longer an issue.

* Latent Dirichlet Allocation: a more recent but slower technique that can 
be run in distributed mode over a cluster of computers.


The tool has reached a reasonable level of maturity; what is needed now is 
your (user) feedback on which parts are useful, what to improve, etc. If 
this brief summary caught your attention, please do visit 
http://nlp.fi.muni.cz/projekty/gensim/ for more.

Looking forward to your comments,
Radim


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


