[Corpora-List] Request for help concerning a LSA problem

Christopher Manning manning at cs.stanford.edu
Fri May 5 15:28:32 UTC 2006


Cecilie Desiree Widsteen wrote:
> Hello all,
> 
> I'm currently trying to implement Latent Semantic Analysis, as part of
> an automatic classification system. I'm programming in Java, and using
> the Jama Matrix package for the matrix stuff. I have stumbled over some
> strange problems, and would be grateful if anyone on this list could
> offer some help.
> My problem is: I have implemented a class which takes care of building a
> matrix representation of a corpus, and performs SVD over the
> term-by-document matrix. Most of the operations are done by the Jama
> class "Matrix". This works fine, except for the fact that when I ran
> the program over various small test corpora (like, for instance, the one
> from Chapter 15 in Schütze and Manning's book Foundations of Statistical
> NLP) most of the right and left singular vectors contained the correct
> values but with wrong/reversed sign?! E.g. a vector that should have the
> values [-0.75,-0.28,-0.20, ...] is assigned the values [0.75,0.28,
> ...]. Unfortunately, I have limited experience with linear algebra and
> the like, so now I find myself completely at a loss in debugging this...

This isn't a problem!!!  This is the content of fn. 2 on p.561 of 
anything-other-than-early printings of FSNLP:

   For any given SVD solution, you can get additional non-identical
   ones by flipping signs in corresponding left and right singular
   vectors of T and D, and, if there are two or more identical singular
   values, then the subspace determined by the corresponding singular
   vectors is unique, but can be described by any appropriate
   orthonormal basis vectors.  But, apart from these cases, SVD is
   unique.

The minuses cancel out when the left and right singular vectors are 
multiplied back together, and so don't affect the solution.
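
To see the cancellation concretely, here is a minimal sketch in plain Java 
arrays (no Jama). The class and method names are made up for illustration, 
the first vector reuses the three values quoted in the question, and the 
singular value and second vector are invented numbers, not the ones from 
the book example. It builds one rank-1 term sigma * u * v^T of the 
decomposition, then rebuilds it with both u and v negated:

```java
// Illustrative sketch only: names and numbers are assumptions, not Jama's API.
final class SignFlipDemo {

    // One rank-1 term of an SVD: sigma * u * v^T, as a dense array.
    static double[][] rankOne(double sigma, double[] u, double[] v) {
        double[][] m = new double[u.length][v.length];
        for (int i = 0; i < u.length; i++) {
            for (int j = 0; j < v.length; j++) {
                m[i][j] = sigma * u[i] * v[j];
            }
        }
        return m;
    }

    // Returns a copy of x with every sign reversed.
    static double[] negate(double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = -x[i];
        }
        return y;
    }

    // Largest absolute entry-wise difference between two same-shaped matrices.
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double sigma = 2.0;                  // invented singular value
        double[] u = {-0.75, -0.28, -0.20};  // values quoted in the question
        double[] v = {-0.6, 0.8};            // invented right singular vector

        double[][] original = rankOne(sigma, u, v);
        double[][] bothFlipped = rankOne(sigma, negate(u), negate(v));
        double[][] oneFlipped = rankOne(sigma, negate(u), v);

        // Flipping the matched pair changes nothing; flipping only one does.
        System.out.println("both flipped: " + maxAbsDiff(original, bothFlipped));
        System.out.println("one flipped:  " + maxAbsDiff(original, oneFlipped));
    }
}
```

So when checking output against the book, compare each singular vector 
pair up to a common sign rather than entry by entry.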

But, beyond that, I think you will have trouble doing anything 'large 
scale' (i.e., text collections with vocabularies of 20,000 words or things 
like that) using Jama, because it only supports dense SVD calculations 
(that is, using 20,000x20,000 matrices, which require a lot of RAM).  For 
text applications, it's usual to use something that supports doing SVD on 
sparse matrices, like the classic SVDPACK, Matlab, or, if you're using 
Java, you might try MTJ:

	http://rs.cipr.uib.no/mtj/
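
To put a rough number on "a lot of RAM", here is a back-of-the-envelope 
sketch. The class and method names are hypothetical, the sparse estimate 
counts only one 8-byte value and one 4-byte index per non-zero entry 
(ignoring row pointers and object overhead), and the figure of 2,000,000 
non-zeros is an assumed example, not a measurement of any real corpus:

```java
// Illustrative estimate only: names and the non-zero count are assumptions.
final class DenseMemoryEstimate {

    // Bytes for a rows x cols matrix of 8-byte doubles, ignoring array overhead.
    static long denseBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    // Rough bytes for a sparse matrix: one double value plus one int index
    // per stored non-zero entry.
    static long sparseBytes(long nonZeros) {
        return nonZeros * (Double.BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        long dense = denseBytes(20_000, 20_000);
        long sparse = sparseBytes(2_000_000); // assumed non-zero count
        System.out.println("dense:  " + dense / 1_000_000 + " MB");
        System.out.println("sparse: " + sparse / 1_000_000 + " MB");
    }
}
```

A single dense 20,000x20,000 matrix of doubles comes to 3.2 GB before any 
working storage for the decomposition itself, which is why a sparse 
representation matters for term-by-document matrices.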

Chris.




> As far as I can understand, this means that my vectors are pointing in
> the opposite direction from the one they should, but why this is escapes
> my understanding :)
> Any help, hints, tricks and the like are extremely welcome! I can also
> send over the source code on request.
> 
> Regards,
> -- 
> Cecilie D. Widsteen
> Department of Linguistics
> University of Oslo
> 
> 


