[Corpora-List] Extracting Index for a part of a corpus

Wed Jan 8 07:22:55 UTC 2014

Hello,

I am trying to model the effect of corpus size on various statistical
properties, starting with term distributions and vocabulary size.

Is there a tool that would let me extract the index for just a part of my
corpus, say for 1000 documents at a time out of 100000 documents overall?
The corpus I am working with is large (~10^5 documents), and I also need
n-fold validation, so I do not want to split the corpus manually into parts
and then build a new index every time. Instead, I am looking for a tool
that lets me index the entire corpus in one go and then extract the
information related to specific documents.
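
To make the requirement concrete, here is a minimal sketch in plain Python
(not tied to any particular indexing tool) of what I mean by indexing once
and then querying arbitrary subsets. The function names build_index and
subset_statistics, the whitespace tokenization, and the toy corpus are all
illustrative assumptions, not part of an existing tool.

```python
from collections import Counter

def build_index(documents):
    """Tokenize every document once; keep one Counter of term counts per document."""
    # Assumption: lowercased whitespace tokenization stands in for real analysis.
    return [Counter(text.lower().split()) for text in documents]

def subset_statistics(index, doc_ids):
    """Aggregate term counts over the chosen documents only, without re-indexing."""
    totals = Counter()
    for doc_id in doc_ids:
        totals.update(index[doc_id])
    return totals

# Toy corpus standing in for the real ~10^5 documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird sang in the tree",
    "the cat chased the bird",
]
index = build_index(corpus)

# Vocabulary size as the sample grows, computed from the single index.
for size in range(1, len(corpus) + 1):
    totals = subset_statistics(index, range(size))
    print(size, len(totals))
```

Any list of document ids can be passed in place of range(size), so the
folds for n-fold validation would just be different id lists over the same
index.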

Indri does provide document vectors for individual documents, but in that
case the term ids are unique only within that particular document. So if I
extract document vectors for two different documents, term-id 1 might
refer to a different term in each. I want a tool that preserves the term
ids of the overall corpus.
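
The behaviour I am after could be sketched as follows, again in plain
Python rather than through Indri. The GlobalVocabulary class and the
document_vector helper are hypothetical names used only for illustration;
the point is that a term receives one id for the whole corpus, so vectors
from different documents are directly comparable.

```python
from collections import Counter

class GlobalVocabulary:
    """Assigns each distinct term a single integer id shared by the whole corpus."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def get_id(self, term):
        # A term gets a new id only the first time it is seen anywhere.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

def document_vector(text, vocabulary):
    """Return {global term id: count} for one document."""
    # Assumption: lowercased whitespace tokenization, for illustration only.
    counts = Counter(text.lower().split())
    return {vocabulary.get_id(term): count for term, count in counts.items()}

vocabulary = GlobalVocabulary()
vec_a = document_vector("corpus size affects vocabulary size", vocabulary)
vec_b = document_vector("vocabulary growth depends on corpus size", vocabulary)

# The id for "corpus" is the same key in both vectors.
corpus_id = vocabulary.term_to_id["corpus"]
print(corpus_id in vec_a, corpus_id in vec_b)
```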

Thanks
Parth Mehta
DA-IICT