[Corpora-List] New Semantic Vectors Package using Lucene and Random Projection

Dominic Widdows widdows at google.com
Thu Nov 29 22:17:21 UTC 2007


Dear Colleagues,

I'd like to announce the release of a new open source package for building
semantic vector models from corpora.
The source code, source distributions, and binary distributions are freely
available for download at http://code.google.com/p/semanticvectors/
The project site also hosts a wiki, an issue tracker, and links to a user
group and mailing list.

The package was created during a project with the University of Pittsburgh's
Office of Technology Management, with the purpose of creating semantic matches
between technology disclosures and companies that might be interested in
licensing the technology. The software written so far is copyrighted to the
University of Pittsburgh, and released under the terms of the (deliberately
permissive) new BSD license.

The package was created partly with lessons learned from the Infomap-NLP
package (which, as some of you know, I helped to release and maintain for
some years). The main problems we've had with Infomap-NLP have been
difficulty of installation and scalability, and the new SemanticVectors
package is designed to improve on both fronts. For ease of
installation and use, it's written entirely in Java and reuses Apache Lucene
for creating the initial term-document matrix, so the only dependencies are
Apache Lucene and Ant. (A side benefit is that you get a Lucene keyword
search index for free, which makes for interesting comparison.) For
scalability, the package uses a Random Projection algorithm instead of
Singular Value Decomposition, which means in practice that it can build a
lot more vectors with the same amount of memory and much less computation.
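
To give a rough idea of why random projection is cheap, here is a minimal
sketch in Java. It is not the SemanticVectors code or API: the class name
RandomProjectionSketch, its methods, and the toy matrix are all hypothetical,
and the sparse +1/-1 index vectors are one common variant of the technique
that I am assuming for illustration. Each document gets a sparse random
vector, and each term vector is the count-weighted sum of the random vectors
of the documents the term occurs in, so no matrix decomposition is needed.

```java
import java.util.Random;

/** Illustrative random projection sketch; NOT the SemanticVectors API. */
public class RandomProjectionSketch {

    /** Sparse ternary random vector: nonZero entries alternate +1 and -1, the rest stay 0. */
    static float[] randomIndexVector(int dim, int nonZero, Random rng) {
        if (nonZero > dim) {
            throw new IllegalArgumentException("nonZero must not exceed dim");
        }
        float[] v = new float[dim];
        int placed = 0;
        while (placed < nonZero) {
            int i = rng.nextInt(dim);
            if (v[i] == 0f) {              // only fill positions that are still empty
                v[i] = (placed % 2 == 0) ? 1f : -1f;
                placed++;
            }
        }
        return v;
    }

    /**
     * Reduces a term-document count matrix (rows = terms, columns = documents)
     * to dim dimensions. Each term vector is the count-weighted sum of the
     * random index vectors of the documents in which the term occurs.
     */
    static float[][] project(int[][] termDoc, int dim, int nonZero, long seed) {
        Random rng = new Random(seed);
        int numDocs = termDoc[0].length;
        float[][] docVectors = new float[numDocs][];
        for (int d = 0; d < numDocs; d++) {
            docVectors[d] = randomIndexVector(dim, nonZero, rng);
        }
        float[][] termVectors = new float[termDoc.length][dim];
        for (int t = 0; t < termDoc.length; t++) {
            for (int d = 0; d < numDocs; d++) {
                int count = termDoc[t][d];
                if (count == 0) continue;  // skip documents the term does not occur in
                for (int i = 0; i < dim; i++) {
                    termVectors[t][i] += count * docVectors[d][i];
                }
            }
        }
        return termVectors;
    }

    /** Cosine similarity; returns 0 if either vector has zero length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Toy counts: terms 0 and 1 share documents, term 2 occurs elsewhere.
        int[][] termDoc = {
            {2, 1, 0, 0},
            {1, 2, 0, 0},
            {0, 0, 3, 1},
        };
        float[][] vecs = project(termDoc, 1000, 10, 42L);
        System.out.println("sim(term0, term1) = " + cosine(vecs[0], vecs[1]));
        System.out.println("sim(term0, term2) = " + cosine(vecs[0], vecs[2]));
    }
}
```

In this toy setup the terms that share documents should come out much more
similar than the terms that do not. The work is a single pass over the
nonzero counts, and memory grows with the number of vectors times the
reduced dimension, which is where the savings over SVD come from.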

There are many items on the TODO list for the package, including comparative
evaluation with other dimension reduction techniques, experimenting with
iterative training phases, different vector product structures, adding and
extending the quantum connectives that Infomap-NLP implements, clustering,
visualization, etc. Nonetheless, I hope that the software is already stable
and useful enough to significantly lower the barrier to entry for
researchers and developers who want to try working with semantic vector
models.

Please feel free to try it out, and needless to say, let me know if you have
any comments or issues or questions.

Best wishes,
Dominic