[Corpora-List] Semantic analysis tool
Radim Rehurek
xrehurek at aisa.fi.muni.cz
Sun Sep 19 17:51:34 UTC 2010
Dear all,
I'd like to draw your attention to 'gensim', a new NLP tool I developed
recently. Its aim is to make unsupervised "semantic analysis" (in the
mundane statistical sense, not the psychological or linguistic one) of
texts accessible to non-mathematicians.
Features:
* can process corpora larger than RAM (streamed algorithms)
* simple to trivial interfaces: you can get going quickly, no Java-esque
madness
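
To illustrate what "streamed" means in the first feature above, here is a minimal sketch in plain Python. It is not gensim's actual code or API: the class name StreamedCorpus, the one-document-per-line file layout and the whitespace tokenizer are all assumptions made for this example. The point is only that documents are read and converted to sparse vectors one at a time, so the full corpus never has to sit in RAM.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus wrapper: yields one sparse bag-of-words per line.

    Assumes one document per line and whitespace tokenization; neither
    assumption comes from the announcement itself.
    """

    def __init__(self, path):
        self.path = path
        self.token2id = {}  # vocabulary, grows as new tokens are seen

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be
        # iterated repeatedly while holding only one document in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(line.lower().split())
                # Sparse vector: sorted list of (token_id, count) pairs.
                yield sorted(
                    (self.token2id.setdefault(tok, len(self.token2id)), n)
                    for tok, n in counts.items()
                )


def demo():
    """Write a two-document toy corpus to disk and stream it back."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("human computer interaction\ncomputer system computer\n")
    try:
        return list(StreamedCorpus(tmp.name))
    finally:
        os.remove(tmp.name)
```

Calling demo() streams the two toy documents back as lists of (token_id, count) pairs. In this sketch the vocabulary mapping still lives in memory; only the documents themselves are streamed.
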
gensim contains unique incremental implementations of popular algorithms
like:
* Latent Semantic Analysis: takes 2.5 hours on a 2 billion corpus of 3.2M
documents (the entire English Wikipedia), on a single laptop. LSA has not
been used much in practical NLP due to its massive computational demands;
these are no longer an issue.
* Latent Dirichlet Allocation: a more recent but slower technique; it can
be run in distributed mode over a cluster of computers.
The tool has reached a reasonable level of maturity; what is needed now is
your (user) feedback on which parts are useful, what to improve, etc. If
this brief summary caught your attention, please visit
http://nlp.fi.muni.cz/projekty/gensim/ for more.
Looking forward to your comments,
Radim