[Corpora-List] Request for help concerning a LSA problem

Nitin Madnani nmadnani at gmail.com
Fri May 5 20:33:35 UTC 2006


I recommend that you look at TMG (Text-to-Matrix Generator) for
Matlab. Matlab has excellent support for sparse arrays and TMG uses
them natively. I can send you the code that I used in my own project, if
you need it.

Nitin

On 5/5/06, Christopher Manning <manning at cs.stanford.edu> wrote:
> Cecilie Desiree Widsteen wrote:
> > Hello all,
> >
> > I'm currently trying to implement Latent Semantic Analysis, as part of
> > an automatic classification system. I'm programming in Java, and using
> > the Jama Matrix package for the matrix stuff. I have stumbled over some
> > strange problems, and would be grateful if anyone on this list could
> > offer some help.
> > My problem is: I have implemented a class which takes care of building a
> > matrix representation of a corpus, and performs SVD over the
> > term-by-document matrix. Most of the operations are done by the Jama
> > class "Matrix". This works fine, except that when I ran
> > the program over various small test corpora (for instance, the one
> > from Chapter 15 of Manning and Schütze's book Foundations of Statistical
> > NLP), most of the right and left singular vectors contained the correct
> > values but with the wrong/reversed sign?! E.g. a vector that should have
> > the values [-0.75, -0.28, -0.20, ...] is assigned the values [0.75, 0.28,
> > ...]. Unfortunately, I have limited experience with linear algebra and
> > the like, so now I find myself completely at a loss in debugging this...
>
> This isn't a problem!!!  This is the content of fn. 2 on p.561 of
> anything-other-than-early printings of FSNLP:
>
>    For any given SVD solution, you can get additional non-identical
>    ones by flipping signs in corresponding left and right singular
>    vectors of $T$ and $D$, and, if there are two or more identical
>    singular values, then the subspace determined by the corresponding
>    singular vectors is unique, but can be described by any appropriate
>    orthonormal basis vectors.  But, apart from these cases, SVD is
>    unique.
>
> The minuses cancel out and so don't affect the solution.
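
To make the sign argument concrete, here is a minimal sketch in plain Python (no linear algebra library). The 2x2 factors T, s and D are hand-picked, hypothetical values, not the example from the book and not output from Jama. The point is only that negating the same column in both T and D leaves the product T * diag(s) * D^T unchanged, while negating it in just one of them does not.

```python
def reconstruct(T, s, D):
    """Return A = T * diag(s) * D^T, with singular vectors stored as columns."""
    k = len(s)
    return [[sum(T[i][r] * s[r] * D[j][r] for r in range(k))
             for j in range(len(D))]
            for i in range(len(T))]


def flip_column(M, r):
    """Return a copy of M with column r negated."""
    return [[-x if c == r else x for c, x in enumerate(row)] for row in M]


def close(A, B, tol=1e-12):
    """True if A and B agree entry by entry within tol."""
    return all(abs(a - b) <= tol
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))


# Hypothetical orthonormal factors (columns are singular vectors).
T = [[0.6, -0.8],
     [0.8,  0.6]]
s = [3.0, 1.0]
D = [[0.8, -0.6],
     [0.6,  0.8]]

A = reconstruct(T, s, D)

# Flip the first left AND the first right singular vector: same matrix.
A_both = reconstruct(flip_column(T, 0), s, flip_column(D, 0))

# Flip only the left one: a different matrix, so not a valid SVD of A.
A_one = reconstruct(flip_column(T, 0), s, D)

print(close(A, A_both))  # True
print(close(A, A_one))   # False
```

So when comparing a library's output against a textbook, it is probably safer to compare each singular vector up to sign, or to compare the reconstructed matrix, than to compare the raw entries.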
>
> But, beyond that, I think you will find that you will have trouble doing
> anything 'large scale' (i.e., text collections with vocabularies of 20,000
> words or things like that) using Jama, because it only supports dense SVD
> calculations (that is, using 20,000x20,000 matrices, which require a lot of
> RAM).  For text applications, it's usual to use something that supports
> doing SVD on sparse matrices, like the classic SVDpack, Matlab, or, if
> you're using Java, you might try MTJ:
>
>         http://rs.cipr.uib.no/mtj/
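
A rough back-of-the-envelope sketch of the memory point above, in plain Python. It assumes 8-byte doubles and, for the sparse case, a compressed-column layout with 4-byte indices. The figure of 100 distinct terms per document is a made-up assumption for illustration only, and real SVD routines need working space on top of this.

```python
BYTES_PER_DOUBLE = 8


def dense_bytes(rows, cols):
    """Storage for a dense matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def sparse_bytes(nonzeros, cols, index_bytes=4):
    """Storage for a compressed-column sparse matrix:
    one value and one row index per nonzero, plus cols + 1 column pointers."""
    return nonzeros * (BYTES_PER_DOUBLE + index_bytes) + (cols + 1) * index_bytes


terms, docs = 20_000, 20_000

# Hypothetical density: about 100 distinct terms per document.
nonzeros = 100 * docs

print(dense_bytes(terms, docs) / 1e9)             # 3.2 (GB)
print(sparse_bytes(nonzeros, docs) / 1e6)         # about 24 (MB)
```

Under these assumptions a single dense copy of the matrix is about 3.2 GB, while the sparse representation is on the order of tens of megabytes.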
>
> Chris.
>
>
>
>
> > As far as I can understand, this means that my vectors are pointing in
> > the opposite direction from the one they should, but why this is escapes
> > my understanding :)
> > Any help, hints, tricks and the like are extremely welcome! I can also
> > send over the source code on request.
> >
> > Regards,
> > --
> > Cecilie D. Widsteen
> > Department of Linguistics
> > University of Oslo
> >
> >
>
>


--
Got Blog?
http://greenideas.blogspot.com


