[Corpora-List] SVD on high-dimension data
Yannick Versley
versley at sfs.uni-tuebingen.de
Tue Mar 6 15:10:30 UTC 2007
Hi,
> I have large (1 million by 1 million) term-term matrices. What SVD
> packages work with such massive datasets? I have tried Matlab and
> SVDPACKC without much success.
Both Matlab and the Harwell-Boeing format used by SVDPACK(C) store matrices
sparsely, which means that the dimensionality (= number of terms) matters far
less than the number of non-zero entries, since only those are kept in memory.
To solve your problem, you could either:
- raise the constants in the SVDPACKC source code that set the maximum limits
for dimensionality and non-zero entries, recompile, and run the SVD on a
machine with lots of memory.
Ted Pedersen's SenseClusters software uses SVDPACKC and its documentation
gives good advice regarding the values that you need to tweak.
or
- try to reduce the number of terms and/or the number of non-zero entries.
A sensible first step would be to throw away terms that occur fewer than 5
times in your corpus and, if the matrix is still too big, to throw away all
entries below a certain threshold (e.g. all entries with a value of only 1).
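
As a rough illustration of the second option, here is a minimal sketch in
Python, assuming NumPy and SciPy are available (neither is part of the setup
discussed above). The function name prune_cooccurrence and its parameters
term_counts, min_term_count and min_entry are made up for this example, the
data is a random toy matrix, and the row sums only stand in for real corpus
frequencies. The final step uses SciPy's svds (a truncated sparse SVD), not
SVDPACKC.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def prune_cooccurrence(matrix, term_counts, min_term_count=5, min_entry=2):
    """Drop rare terms, then drop small entries (hypothetical helper).

    matrix      -- square sparse term-term matrix
    term_counts -- corpus frequency of each term (assumed to be available)
    """
    matrix = csr_matrix(matrix)
    # Keep only terms that occur at least min_term_count times.
    keep = np.flatnonzero(np.asarray(term_counts) >= min_term_count)
    pruned = matrix[keep][:, keep].copy()
    # Zero out entries below the threshold and remove them from storage.
    pruned.data[pruned.data < min_entry] = 0
    pruned.eliminate_zeros()
    return pruned, keep


# Toy data: a small random count matrix standing in for the real one.
rng = np.random.default_rng(0)
n_terms, n_samples = 500, 5000
rows = rng.integers(0, n_terms, size=n_samples)
cols = rng.integers(0, n_terms, size=n_samples)
vals = rng.integers(1, 6, size=n_samples).astype(np.float64)
counts = csr_matrix((vals, (rows, cols)), shape=(n_terms, n_terms))

# Assumption: row sums are used here in place of real corpus frequencies.
term_counts = np.asarray(counts.sum(axis=1)).ravel()

pruned, keep = prune_cooccurrence(counts, term_counts)
print("non-zero entries before/after:", counts.nnz, pruned.nnz)

# Truncated SVD: only the k largest singular triplets are computed.
u, s, vt = svds(pruned, k=10)
print("singular values:", s)
```

Memory use and running time depend mainly on the number of stored entries,
so it is worth checking the count after pruning before starting the
decomposition on the full matrix.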
Cheers,
Yannick
--
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352