[Corpora-List] SVD on high-dimension data

Yannick Versley versley at sfs.uni-tuebingen.de
Tue Mar 6 15:10:30 UTC 2007


Hi,

> I have large (1 million by 1 million) term-term matrices. What SVD
> packages work with such massive datasets? I have tried Matlab and
> SVDPACKC without much success.
Both Matlab and SVDPACK(C), which reads the Harwell-Boeing format, store
matrices sparsely. This means that the dimensionality (= number of terms)
does not really matter; what matters is the number of non-zero entries. To
solve your problem, you could either:
- adjust the constants in the SVDPACKC source code that set the maximum limits
for dimensionality and non-zero entries, and run the SVD on a machine with
lots of memory.
Ted Pedersen's SenseClusters software uses SVDPACKC, and its documentation
gives good advice on which values you need to tweak.
or
- try to reduce the number of terms and/or the number of non-zero entries.
A sensible thing to do would be to throw away terms that don't occur at
least 5 times in your corpus and, if the matrix is still too big, to throw
away all entries that are below a certain threshold (e.g. all entries with
a value of only 1).
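
A minimal sketch of the second option, assuming Python with SciPy rather
than Matlab or SVDPACKC. The function names prune_term_matrix and
truncated_svd, the parameters min_term_count and min_entry, and the toy
data are all made up for this illustration. It drops rare terms, removes
small entries, and then runs a truncated SVD on the sparse result:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def prune_term_matrix(matrix, term_counts, min_term_count=5, min_entry=2):
    """Drop rare terms, then drop small entries, from a term-term matrix.

    Hypothetical helper: `term_counts[i]` is assumed to be the corpus
    frequency of term i, and `matrix` a square co-occurrence matrix.
    """
    matrix = csr_matrix(matrix)
    # Keep only terms that occur at least min_term_count times.
    keep = np.flatnonzero(np.asarray(term_counts) >= min_term_count)
    pruned = matrix[keep][:, keep].tocsr()
    # Zero out entries below the threshold and remove them from storage.
    pruned.data[pruned.data < min_entry] = 0
    pruned.eliminate_zeros()
    return pruned, keep


def truncated_svd(matrix, k):
    """Compute the k largest singular triplets of a sparse matrix."""
    # svds needs a floating-point matrix and k < min(matrix.shape).
    u, s, vt = svds(csr_matrix(matrix, dtype=float), k=k)
    # Sort explicitly so the largest singular value comes first.
    order = np.argsort(s)[::-1]
    return u[:, order], s[order], vt[order, :]


# Toy 4-term example: term 1 occurs only 3 times and is dropped.
counts = [10, 3, 7, 5]
cooc = np.array([[0, 1, 4, 1],
                 [1, 0, 2, 0],
                 [4, 2, 0, 3],
                 [1, 0, 3, 0]])
pruned, kept = prune_term_matrix(cooc, counts)

# Larger random sparse matrix (about 5% non-zero) for the SVD step.
rng = np.random.default_rng(0)
big = rng.random((200, 200))
big[big < 0.95] = 0.0
u, s, vt = truncated_svd(csr_matrix(big), k=5)
```

Memory use here grows with the number of stored non-zero entries and with
k, not with the nominal 1 million by 1 million shape, which is why pruning
helps.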

Cheers,
Yannick
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352


