[Corpora-List] SVD on high-dimension data

Tue Mar 6 16:21:02 UTC 2007

Dear Jamie, David,

I'm delighted to hear about your success using Infomap, thanks!

However, I feel I should chime in with a couple of words of warning.  
Infomap works by selecting a comparatively small number of "content  
bearing words" as column labels. These are normally chosen based upon
frequency, e.g., we have typically used the 1000 most frequent
non-stop words as column labels. This is a far cry from your 1 million
by 1 million matrix. If Infomap were configured to treat all these
terms as column labels, it would try to malloc a 1 million by 1 million
matrix, which (if your matrix entry type is a 4 byte float) comes to
something like 4 terabytes of RAM! That's before you've even tried to
do anything computationally intensive with the matrix. By the time
you have a computer with that much memory, I practically guarantee
that 1 million terms will be considered a small dataset. So I believe
the scalability of software like Infomap is always going to be
limited unless we make some radical changes to the way the software
works. I'm hoping to do this at some point, but in the meantime, if
you want to use Infomap, your number of columns is limited.
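
To make the arithmetic above concrete, here is a minimal Python sketch
of the memory estimate. The function name is hypothetical and is not
part of Infomap, and the 1 million row count in the second case is only
an assumption for comparison; the 4 byte entry size and the 1000 column
labels are the figures mentioned above.

```python
def dense_matrix_bytes(rows, cols, bytes_per_entry=4):
    """Bytes needed to hold a dense rows x cols matrix in memory."""
    return rows * cols * bytes_per_entry


# All terms used as column labels (the 1 million by 1 million case).
full = dense_matrix_bytes(1_000_000, 1_000_000)

# Assumed comparison: the same number of rows, but only 1000 column labels.
reduced = dense_matrix_bytes(1_000_000, 1_000)

print(f"1M x 1M dense matrix: {full / 1e12:.1f} TB")
print(f"1M x 1K dense matrix: {reduced / 1e9:.1f} GB")
```

Cutting the columns by a factor of 1000 cuts the dense allocation by
the same factor, which is why the number of column labels is the
limiting setting.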

We should probably use sparse matrices to count the co-occurrences in
the first place, but even if we could get this far, we'd run into
scaling issues with SVD computation at some point. I'm not sure which
weak link would break first - SVDPACKC does take advantage of some
sparseness in the matrix format, but it certainly involves a huge
amount of number crunching for large matrices.
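
As a rough illustration of the sparse counting idea, the sketch below
stores only the non-zero cells of a term-term matrix in a dictionary
keyed by (word, neighbour) pairs. This is not how Infomap currently
counts co-occurrences; the function name, the window size and the toy
sentence are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (word, neighbour) pairs within +/- window positions.

    Only pairs that actually occur get an entry, so memory grows with
    the number of non-zero cells rather than with vocabulary squared.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


# Toy input, assumed purely for illustration.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)

vocab = sorted(set(tokens))
dense_cells = len(vocab) ** 2
print(len(counts), "non-zero cells stored, versus", dense_cells, "dense cells")
```

This only addresses the counting step. The SVD itself would still have
to be computed on the resulting sparse matrix, which is the other weak
link mentioned above.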

Best wishes,
Dominic

On Mar 6, 2007, at 10:38 AM, David Reitter wrote:

> Jamie,
>
> On 6 Mar 2007, at 14:59, Jamie Smith wrote:
>
>> I have large (1 million by 1 million) term-term matrices. What SVD
>> packages work with such massive datasets? I have tried Matlab and
>> SVDPACKC without much success.
>
> Have a look at Infomap,
>
> http://infomap-nlp.sourceforge.net/
> http://infomap.stanford.edu/
>
> we've used it successfully on the Aquaint and DUC2005 data (100+
> million words).
>
>
> --
> David Reitter
> ICCS/HCRC, Informatics, University of Edinburgh
> http://www.david-reitter.com