[Corpora-List] Request for help concerning a LSA problem

Cecilie Desiree Widsteen cecilidw at student.iln.uio.no
Mon May 8 08:21:24 UTC 2006


Christopher Manning wrote:

> Cecilie Desiree Widsteen wrote:
>
>> My problem is: I have implemented a class which takes care of building a
>> matrix representation of a corpus, and performs SVD over the
>> term-by-document matrix. Most of the operations are done by the Jama
>> class "Matrix".  This works fine, except for the fact that when I ran
>> the program over various small test corpora (like, for instance, the one
>> from Chapter 15 in Manning and Schütze's book Foundations of Statistical
>> NLP) most of the right and left singular vectors contained the correct
>> values but with the wrong/reversed sign?! E.g. a vector that should have the
>> values [-0.75, -0.28, -0.20, ...] is assigned the values [0.75, 0.28,
>> ...]. Unfortunately, I have limited experience with linear algebra and
>> the like, so now I find myself completely at a loss in debugging this...
>
>
> This isn't a problem!!!  This is the content of fn. 2 on p.561 of 
> anything-other-than-early printings of FSNLP:
>
>   For any given SVD solution, you can get additional non-identical
>   ones by flipping signs in corresponding left and right singular
>   vectors of T and D, and, if there are two or more identical
>   singular values, then the subspace determined by the corresponding
>   singular vectors is unique, but can be described by any appropriate
>   orthonormal basis vectors.  But, apart from these cases, SVD is
>   unique.
>
> The minuses cancel out and so don't affect the solution.
>
Thank you! I suspected it might not be a bug, but I felt I needed to be
sure of this. Does this mean that which solution is returned depends on
the implementation of the SVD, e.g. on the way the right and left
singular vectors are computed?
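
To make the sign argument in the quoted footnote concrete, here is a minimal sketch in plain Java (no Jama). The 2x2 factors, the class name SignFlipDemo and its helper methods are made up for illustration and are not taken from FSNLP or from any library. The sketch rebuilds A = U * diag(s) * V^T, then checks that negating the same column in both U and V leaves A unchanged, while negating it in U alone does not.

```java
final class SignFlipDemo {

    // Rebuilds A = U * diag(s) * V^T from the three SVD factors.
    static double[][] reconstruct(double[][] u, double[] s, double[][] v) {
        int m = u.length, n = v.length;
        double[][] a = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < s.length; k++)
                    a[i][j] += u[i][k] * s[k] * v[j][k];
        return a;
    }

    // Returns a copy of m with every entry in column col negated.
    static double[][] flipColumn(double[][] m, int col) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            out[i] = m[i].clone();
            out[i][col] = -out[i][col];
        }
        return out;
    }

    // Largest absolute entry-wise difference between two equally sized matrices.
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical factors: U and V have orthonormal columns, s holds singular values.
        double[][] u = {{0.6, -0.8}, {0.8, 0.6}};
        double[] s = {3.0, 1.0};
        double[][] v = {{0.8, -0.6}, {0.6, 0.8}};

        double[][] a = reconstruct(u, s, v);
        // Flip the first singular vector on both sides: the two minus signs cancel.
        double[][] paired = reconstruct(flipColumn(u, 0), s, flipColumn(v, 0));
        // Flip it on the left side only: this is no longer a factorization of A.
        double[][] oneSided = reconstruct(flipColumn(u, 0), s, v);

        System.out.println("paired flip, max difference:    " + maxAbsDiff(a, paired));
        System.out.println("one-sided flip, max difference: " + maxAbsDiff(a, oneSided));
    }
}
```

When comparing the output of an SVD routine against a textbook table, it is therefore probably safer to compare the reconstructed matrix, or the vectors up to a common sign per singular value, than the raw signs.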

> But, beyond that, I think you will find that you will have trouble 
> doing anything 'large scale' (i.e., text collections with vocabularies 
> of 20,000 words or things like that) using Jama, because it only 
> supports dense SVD calculations (that is, using 20,000x20,000 
> matrices, which require a lot of RAM).  For text applications, it's 
> usual to use something that supports doing SVD on sparse matrices, 
> like the classic SVDpack, Matlab, or, if you're using Java, you might 
> try MTJ:
>
>     http://rs.cipr.uib.no/mtj/
>
Thank you, I will check this out!
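
As a rough back-of-the-envelope check of the RAM remark quoted above, the following sketch compares the raw storage of a dense 20,000 x 20,000 matrix of doubles with a compressed-column sparse layout. The class name SvdMemoryEstimate and the figure of 2,000,000 non-zero entries (about 100 distinct terms in each of 20,000 documents) are assumptions chosen only for illustration. The estimate ignores JVM object overhead and any working storage the SVD itself needs.

```java
final class SvdMemoryEstimate {

    // Bytes needed for a dense rows x cols matrix of 8-byte doubles.
    static long denseBytes(long rows, long cols) {
        return rows * cols * 8L;
    }

    // Bytes for a compressed-column layout: one 8-byte value and one 4-byte
    // row index per non-zero entry, plus (cols + 1) 4-byte column pointers.
    static long sparseBytes(long cols, long nonZeros) {
        return nonZeros * (8L + 4L) + (cols + 1) * 4L;
    }

    public static void main(String[] args) {
        long rows = 20_000L;          // vocabulary size mentioned in the thread
        long cols = 20_000L;          // second dimension mentioned in the thread
        long nonZeros = 2_000_000L;   // hypothetical: ~100 distinct terms per document

        System.out.println("dense bytes:  " + denseBytes(rows, cols));
        System.out.println("sparse bytes: " + sparseBytes(cols, nonZeros));
    }
}
```

Under these assumptions the dense matrix alone takes about 3.2 GB, while the sparse layout takes about 24 MB.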

Regards,
--
Cecilie D. Widsteen
Department of Linguistics
University of Oslo


