[Corpora-List] SVD on high-dimension data
Dominic Widdows
widdows at maya.com
Tue Mar 6 16:21:02 UTC 2007
Dear Jamie, David,
I'm delighted to hear about your success using Infomap, thanks!
However, I feel I should chime in with a couple of words of warning.
Infomap works by selecting a comparatively small number of "content
bearing words" as column labels. These are normally chosen based upon
frequency, e.g., we have typically used the 1000 most frequent
non-stop words as column labels. This is a far cry from your 1 million by
1 million matrix. If Infomap were configured to treat all these terms
as column labels, it would try to malloc a dense 1 million by 1 million
matrix, which (if your matrix entry type is a 4-byte float) comes to
about 4 terabytes of RAM! That's before you've even tried to
do anything computationally intensive with the matrix. By the time
you have a computer with that much memory, I practically guarantee
that 1 million terms will be considered a small dataset, so I believe
that the scalability of software like Infomap is always going to be
limited unless we make some radical changes to the way the software
works. I'm hoping to do this at some point, but in the meantime, if
you want to use Infomap your number of columns is limited.
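
The arithmetic behind that figure can be checked with a short sketch. It is only an illustration: the function name is hypothetical, and the "reduced" case assumes 1 million rows against the 1000 column labels mentioned above, a row count the post itself does not state.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to hold a dense rows x cols matrix in memory."""
    return rows * cols * bytes_per_entry

# Full term-term matrix: 1 million by 1 million entries, 4-byte floats.
full = dense_matrix_bytes(1_000_000, 1_000_000)

# Assumed reduced case: 1 million rows, 1000 content-bearing columns.
reduced = dense_matrix_bytes(1_000_000, 1_000)

print(f"full:    {full / 1e12:.1f} TB")
print(f"reduced: {reduced / 1e9:.1f} GB")
```

Under these assumptions, restricting the columns shrinks the dense matrix by a factor of 1000, which is why the column limit matters so much.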
We should probably use sparse matrices to count the co-occurrences in
the first place, but even if we could get this far, we'd run into
scaling issues with SVD computation at some point. I'm not sure which
weak link would break first - SVDPACKC does take advantage of some
sparseness in the matrix format but it certainly involves a huge
amount of number crunching for large matrices.
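
As a rough sketch of what sparse counting could look like (this is not Infomap's code), the example below picks the most frequent non-stop words as column labels and stores only the nonzero co-occurrence counts in a dictionary. The function names, the toy corpus, the stop list, and the window size are all assumptions made for illustration.

```python
from collections import Counter

def select_columns(tokens, stop_words, k):
    """Return the k most frequent tokens that are not stop words."""
    freq = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in freq.most_common(k)]

def count_cooccurrences(tokens, columns, window=2):
    """Count (row_word, column_word) pairs within +/- window positions.

    Only pairs that actually occur are stored, so memory grows with the
    number of nonzero entries rather than with rows * columns.
    """
    column_set = set(columns)
    counts = Counter()
    for i, row_word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in column_set:
                counts[(row_word, tokens[j])] += 1
    return counts

# Toy corpus and stop list, purely hypothetical.
tokens = "the cat sat on the mat the cat sat and the cat ate".split()
stop_words = {"the", "on", "and"}

columns = select_columns(tokens, stop_words, k=2)
counts = count_cooccurrences(tokens, columns, window=2)

for (row_word, col_word), n in sorted(counts.items()):
    print(row_word, col_word, n)
```

This only addresses the counting step; the SVD computation on the resulting matrix would still have its own scaling limits.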
Best wishes,
Dominic
On Mar 6, 2007, at 10:38 AM, David Reitter wrote:
> Jamie,
>
> On 6 Mar 2007, at 14:59, Jamie Smith wrote:
>
>> I have large (1 million by 1 million) term-term matrices. What SVD
>> packages work with such massive datasets? I have tried Matlab and
>> SVDPACKC without much success.
>
> Have a look at Infomap,
>
> http://infomap-nlp.sourceforge.net/
> http://infomap.stanford.edu/
>
> we've used it successfully on the Aquaint and DUC2005 data (100+
> million words).
>
>
> --
> David Reitter
> ICCS/HCRC, Informatics, University of Edinburgh
> http://www.david-reitter.com