<div dir="ltr">Dear Parth Mehta,<div><br></div><div>You can do this in the Sketch Engine in two steps: first, add a metadata field to each document for every subset you want it to belong to; then build a subcorpus for each of the subsets. You can then get quick access to word lists etc. for each subset.</div>
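<div><br></div><div>To illustrate the idea outside the Sketch Engine, here is a minimal Python sketch. It is not Sketch Engine code: the toy documents, the subsets field and the function names are all assumptions made up for the example. Term ids are assigned once over the whole corpus, so counts for different subsets stay comparable.</div>
<pre>
```python
from collections import Counter

# Toy corpus: every document carries a "subsets" metadata field
# (a hypothetical name) listing the subsets it belongs to.
docs = [
    {"id": "d1", "subsets": {"fold1"}, "text": "the cat sat on the mat"},
    {"id": "d2", "subsets": {"fold1", "fold2"}, "text": "the dog sat"},
    {"id": "d3", "subsets": {"fold2"}, "text": "a dog chased the cat"},
]


def tokenize(text):
    # Whitespace tokenisation is a placeholder for a real tokeniser.
    return text.lower().split()


def build_vocabulary(documents):
    # One pass over the whole corpus; each term gets one corpus-wide id.
    vocab = {}
    for doc in documents:
        for term in tokenize(doc["text"]):
            vocab.setdefault(term, len(vocab))
    return vocab


def subset_word_list(documents, vocab, subset):
    # Term frequencies for one subset, keyed by the corpus-wide term id.
    counts = Counter()
    for doc in documents:
        if subset in doc["subsets"]:
            counts.update(vocab[term] for term in tokenize(doc["text"]))
    return counts


vocab = build_vocabulary(docs)
fold1 = subset_word_list(docs, vocab, "fold1")
fold2 = subset_word_list(docs, vocab, "fold2")
```
</pre>
<div>Here fold1 and fold2 both map the same corpus-wide term ids to subset-level frequencies, and len(fold1) gives the vocabulary size of that subset.</div>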
<div><br></div><div>If you want to go the Sketch Engine route, let us know and we can give you more detailed instructions.</div><div><br></div><div>It sounds like a good research question. Wishing you all the best with it.</div>
<div><br></div><div>Adam<br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Parth Mehta</b> <span dir="ltr">&lt;<a href="mailto:parth.mehta126@gmail.com">parth.mehta126@gmail.com</a>&gt;</span><br>
Date: 8 January 2014 07:22<br>Subject: [Corpora-List] Extracting Index for a part of a corpus<br>To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br><br><br><div dir="ltr">Hello,<div><br></div><div>I am trying to model the effect of corpus size on various statistical properties, starting with term distributions and vocabulary size. </div>
<div><br></div><div>Is there a tool that lets me extract an index for part of my corpus, say for 1,000 documents at a time out of 100,000 documents overall? Since the corpus I am working with is large (~10^5 documents) and I need n-fold validation, I do not want to split the corpus manually into parts and then build a new index every time. Instead, I am looking for a tool that lets me index the entire corpus in one go and then extract the information for specific documents. <div>
<br></div><div>Indri does give me document vectors for individual documents, but the term ids in each vector are local to that document. If I extract document vectors for two different documents, term id 1 may refer to a different term in each. I want a tool that keeps term ids consistent across the whole corpus. </div>
<div><br></div><div><br></div><div>Thanks</div><span class="HOEnZb"><font color="#888888"><div>Parth Mehta</div><div>DA-IICT</div>
</font></span></div></div>
<br></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div>
</div></div>