[Corpora-List] Extracting Index for a part of a corpus

Adam Kilgarriff adam at lexmasterclass.com
Wed Jan 8 17:25:11 UTC 2014


Dear Parth Mehta

You can do this in the Sketch Engine in two steps: first, add a metadata
field to each document for each subset that you want it to belong to; then
build a subcorpus for each of the subsets.  You can then get quick access to
word lists etc. for each subset.
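
Independently of any particular tool, the idea can be sketched in a few
lines of Python.  This is only an illustration under assumptions, not the
Sketch Engine's interface: the names build_index, subset_counts and
doc_subsets are hypothetical, and tokenisation is naive whitespace
splitting.  The corpus is indexed once with corpus-wide term ids, each
document carries a set of subset labels (the "metadata field"), and word
lists for any subset are read off without re-indexing.

```python
from collections import Counter

def build_index(docs):
    """Index the whole corpus once: corpus-wide term ids plus per-document counts."""
    term_ids = {}      # term -> id, shared by every document in the corpus
    doc_vectors = []   # one Counter {term_id: frequency} per document
    for text in docs:
        counts = Counter()
        for token in text.lower().split():   # naive tokenisation (assumption)
            tid = term_ids.setdefault(token, len(term_ids))
            counts[tid] += 1
        doc_vectors.append(counts)
    return term_ids, doc_vectors

def subset_counts(doc_vectors, doc_subsets, name):
    """Sum term frequencies over the documents labelled with subset `name`."""
    total = Counter()
    for vec, subsets in zip(doc_vectors, doc_subsets):
        if name in subsets:
            total.update(vec)
    return total

# Toy corpus; doc_subsets plays the role of the per-document metadata field.
docs = ["the cat sat", "the dog sat", "a cat ran"]
doc_subsets = [{"fold1"}, {"fold2"}, {"fold1", "fold2"}]

term_ids, doc_vectors = build_index(docs)
fold1 = subset_counts(doc_vectors, doc_subsets, "fold1")

# Vocabulary size and a term frequency for the subset, using global term ids.
print(len(fold1), fold1[term_ids["cat"]])
```

Because a document may carry several labels, overlapping subsets (for
example the folds of an n-fold validation) need no extra copies of the data.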

If you want to go the Sketch Engine route, let us know and we can give you
more detailed instructions.

It sounds like a good research question - wishing you all the best for it

Adam

---------- Forwarded message ----------
From: Parth Mehta <parth.mehta126 at gmail.com>
Date: 8 January 2014 07:22
Subject: [Corpora-List] Extracting Index for a part of a corpus
To: corpora at uib.no


Hello,

I am trying to model the effect of corpus size on various statistical
properties, starting with term distributions and vocabulary size.

Is there any tool available that allows me to extract an index for a part of
my corpus, say an index of 1000 documents at a time out of 100000 documents
overall?  Since the corpus I am working with is large (~10^5 documents), and
taking into account the need for n-fold validation, I do not want to first
manually split the corpus into parts and then build a new index every time.
Instead I am looking for a tool that allows me to index the entire corpus in
a single go and then extract the information related to some specific
documents.

Indri does provide document vectors for individual documents, but in that
case the term ids are unique only within that particular document. So if I
extract document vectors for two different documents, the term with
term-id 1 might be different in each. I want a tool that maintains the term
ids of the overall corpus.


Thanks
Parth Mehta
DA-IICT

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora




-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================