[Corpora-List] Inquiry on the usage of rainbow for text classification

Sat Feb 23 17:29:31 UTC 2008

Dear all,

Does anybody have experience using rainbow for text classification? It 
is a toolkit written by Andrew McCallum, described at 
http://www.cs.cmu.edu/~mccallum/bow/rainbow/

I have read the usage document. It says that, in the basic setting, the 
text data should be in plain text files, one file per document.

But it also mentions "Finding `document' boundaries when there are 
multiple documents per file". This makes me believe that one file can 
also contain more than one document. However, I have not found the exact 
solution for this in the usage document.

My question is this: if a file contains more than one document (for 
example, many news items gathered in a single file), is it possible to 
apply the rainbow software directly? Or do I have to extract each news 
item and save it in a separate file? Of course I can preprocess the data 
in this way, but in our dataset each document is very short (around 10 
words) and we have more than 100,000 documents, so we would prefer to 
keep them in a single file.
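
For reference, the preprocessing I would like to avoid could be sketched 
roughly as below. This is only an illustration in Python, not something 
taken from the rainbow documentation: it assumes the combined file holds 
one document per line, and the function name split_documents and the 
docNNNNNN.txt naming scheme are hypothetical.

```python
from pathlib import Path


def split_documents(input_path, output_dir):
    """Write each non-empty line of input_path to its own file in output_dir.

    Assumes one document per line; returns the number of files written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(input_path, encoding="utf-8") as source:
        for line in source:
            text = line.strip()
            if not text:
                continue  # skip blank lines between documents
            count += 1
            # Zero-padded names keep the files in their original order.
            target = out / f"doc{count:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
    return count
```

With more than 100,000 very short documents this would create one tiny 
file per news item, which is the overhead we are hoping to avoid.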

Any guidance will be highly appreciated.

Thanks,
Fangzhong
