Dear Colleagues,<br><br>I'd like to announce the release of a new open source package for building semantic vector models from corpora. <br>The source code, source distributions, and binary distributions are freely available for download at
<a href="http://code.google.com/p/semanticvectors/">http://code.google.com/p/semanticvectors/</a><br>There is also a wiki, an issue tracker, and links to a user group and mailing list.<br><br>The package was created during a project with the University of Pittsburgh's Office of Technology Management, with the purpose of creating semantic matches between technology disclosures and companies that might be interested in licensing the technology. The software written so far is copyrighted by the University of Pittsburgh and released under the terms of the (deliberately permissive) new BSD license.
<br><br>The package was created partly with lessons learned from the Infomap-NLP package (which, as some of you know, I helped to release and maintain for some years). The main problems we've had with Infomap-NLP have been difficulty of installation and limited scalability, and the new SemanticVectors package is designed to improve on both fronts. For ease of installation and use, it's written entirely in Java and reuses Apache Lucene for creating the initial term-document matrix, so the only dependencies are Apache Lucene and Ant. (A side benefit is that you get a Lucene keyword search index for free, which makes for an interesting comparison.) For scalability, the package uses a Random Projection algorithm instead of Singular Value Decomposition, which means in practice that it can build many more vectors with the same amount of memory and much less computation.
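<br><br>For anyone unfamiliar with the idea, here is a rough sketch of how a random projection can stand in for Singular Value Decomposition. It is not the package's actual code: the class name, method names, and the toy count matrix are hypothetical, and I am assuming one common variant in which each document gets a sparse random vector of +1 and -1 entries and each term vector is the count-weighted sum of the random vectors of the documents it occurs in.

```java
import java.util.Random;

// Hypothetical illustration of random projection over a term-document matrix.
// Not taken from the SemanticVectors source; all names here are made up.
class RandomProjectionSketch {

    // Build a sparse ternary vector with `seedEntries` entries set to +1
    // and the same number set to -1; every other entry stays 0.
    static double[] randomIndexVector(int dim, int seedEntries, Random rng) {
        if (2 * seedEntries > dim) {
            throw new IllegalArgumentException("need 2 * seedEntries <= dim");
        }
        double[] v = new double[dim];
        int placed = 0;
        while (placed < 2 * seedEntries) {
            int i = rng.nextInt(dim);
            if (v[i] == 0.0) {
                v[i] = (placed % 2 == 0) ? 1.0 : -1.0;
                placed++;
            }
        }
        return v;
    }

    // termDocCounts[t][d] is how often term t occurs in document d.
    // Each term vector is the count-weighted sum of random document vectors,
    // so no matrix factorization is needed.
    static double[][] buildTermVectors(int[][] termDocCounts, int dim,
                                       int seedEntries, long seed) {
        int numTerms = termDocCounts.length;
        int numDocs = termDocCounts[0].length;
        Random rng = new Random(seed);

        double[][] docVectors = new double[numDocs][];
        for (int d = 0; d < numDocs; d++) {
            docVectors[d] = randomIndexVector(dim, seedEntries, rng);
        }

        double[][] termVectors = new double[numTerms][dim];
        for (int t = 0; t < numTerms; t++) {
            for (int d = 0; d < numDocs; d++) {
                int count = termDocCounts[t][d];
                if (count == 0) {
                    continue;
                }
                for (int i = 0; i < dim; i++) {
                    termVectors[t][i] += count * docVectors[d][i];
                }
            }
        }
        return termVectors;
    }

    // Cosine similarity; returns 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy matrix: terms 0 and 1 occur in the same documents, term 2 does not.
        int[][] counts = { {2, 0, 4}, {1, 0, 2}, {0, 3, 0} };
        double[][] tv = buildTermVectors(counts, 16, 2, 42L);
        System.out.println("cos(term0, term1) = " + cosine(tv[0], tv[1]));
        System.out.println("cos(term0, term2) = " + cosine(tv[0], tv[2]));
    }
}
```

The point of the sketch is that training is a single pass of additions over the term-document counts, and memory grows with the chosen dimension rather than with the number of documents.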
<br><br>There are many items on the TODO list for the package, including comparative evaluation with other dimension reduction techniques, experimenting with iterative training phases and different vector product structures, adding and extending the quantum connectives that Infomap-NLP implements, clustering, and visualization. Nonetheless, I hope that the software is already stable and useful enough to significantly lower the barrier to entry for researchers and developers who want to try working with semantic vector models.
<br><br>Please feel free to try it out and, needless to say, let me know if you have any comments, issues, or questions.<br><br>Best wishes,<br>Dominic<br>