[Corpora-List] Inquiry on the usage of rainbow for text classification
F Su
fzsu at comp.leeds.ac.uk
Sat Feb 23 17:29:31 UTC 2008
Dear all,
Does anybody have experience using rainbow for text classification? It is a
toolkit written by Andrew McCallum, and the link to it is
http://www.cs.cmu.edu/~mccallum/bow/rainbow/
I have read the usage document. It says that in the basic setting the
text data should be in plain text files, one file per document.
But it also mentions "Finding `document' boundaries when there are
multiple documents per file". This makes me believe that one file can
contain more than one document. However, I have not found the exact
way to do this in the usage document.
My question is this: if a file contains more than one document (for
example, many news items gathered in a single file), is it
possible to apply the rainbow software directly? Or do I have to extract
each news item and save it in a separate file? Of course I can preprocess the
data in this way, but in our dataset each document is very short (around 10
words) and we have more than 100,000 documents, so we would prefer to keep them
in a single file.
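The fallback preprocessing described above (extracting each item and saving
it in a separate file) could be sketched roughly as follows. This is only an
illustration: it assumes the combined file holds one document per line, and
the function name split_documents and the file names are hypothetical and
not part of rainbow.

```python
import tempfile
from pathlib import Path


def split_documents(source: Path, out_dir: Path) -> int:
    """Write each non-empty line of `source` to its own file in `out_dir`.

    Assumption: one document per line. Returns the number of files written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with source.open(encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # skip blank lines between documents
            count += 1
            # Zero-padded names keep the files in their original order.
            target = out_dir / f"doc{count:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
    return count


if __name__ == "__main__":
    # Small demonstration on made-up data in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "news.txt"
        source.write_text(
            "first short news item\n\nsecond short news item\n",
            encoding="utf-8",
        )
        written = split_documents(source, Path(tmp) / "docs")
        print(f"wrote {written} files")
```

With more than 100,000 very short documents this produces as many small
files, which is exactly the overhead the question hopes to avoid.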
Any guidance will be highly appreciated.
Thanks,
Fangzhong
--
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora