<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" id="owaParaStyle"></style>

</head>

<body fpstyle="1" ocsi="0">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">>> <span style="font-family: 'Times New Roman'; font-size: 16.363636016845703px;"> I do not want to first manually split the corpus into parts and then build a new index every time.

 Instead I am looking for a tool that can allow me to index the entire corpus at a single go, and then I extract the information related to some specific documents. </span>

<div><br>

</div>

<div>This can also be done quite easily if the data is in a relational database. Assuming there is a textID field for each section of text, just create a JOIN between that table and the metadata table, and use WHERE to limit the metadata rows (e.g. "WHERE metadata.subcorpus = 'biology' AND metadata.textID = corpus.textID"). You'll then be searching only the desired section of the corpus, and whatever index(es) you've created for the "textual" corpus will still be used for just those rows.</div>
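<div><br>
</div>
<div>As a rough illustration of the JOIN described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (a metadata table and a one-word-per-row corpus table), the column names other than textID and subcorpus, the sample rows, and the word_counts helper are all assumptions made for the example; they are not the actual schema behind any particular corpus.</div>
<pre>
```python
import sqlite3

# Hypothetical schema: one metadata row per text, one corpus row per token.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (textID INTEGER PRIMARY KEY, subcorpus TEXT);
    CREATE TABLE corpus (textID INTEGER, position INTEGER, word TEXT);
    CREATE INDEX idx_corpus_word ON corpus (word);
""")
conn.executemany("INSERT INTO metadata VALUES (?, ?)",
                 [(1, "biology"), (2, "physics"), (3, "biology")])
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", [
    (1, 0, "cell"), (1, 1, "membrane"),
    (2, 0, "cell"), (2, 1, "battery"),
    (3, 0, "cell"), (3, 1, "division"),
])


def word_counts(conn, subcorpus):
    """Return (word, frequency) pairs for the texts in one subcorpus."""
    return conn.execute("""
        SELECT corpus.word, COUNT(*) AS freq
        FROM corpus
        JOIN metadata ON metadata.textID = corpus.textID
        WHERE metadata.subcorpus = ?
        GROUP BY corpus.word
        ORDER BY freq DESC, corpus.word
    """, (subcorpus,)).fetchall()


biology_counts = word_counts(conn, "biology")
print(biology_counts)
```
</pre>
<div>The corpus is loaded and indexed once; choosing a different section only changes the value passed to the WHERE clause, so nothing has to be split or re-indexed.</div>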

<div><br>

</div>

<div>This is the approach used, for example, with the BYU corpora (http://corpus.byu.edu), where you can easily limit by (and compare between) different sections of the corpora -- genre, dialect, or historical period (see <a href="http://corpus.byu.edu/variation.asp" target="_blank" style="font-size: 10pt;">http://corpus.byu.edu/variation.asp</a>).</div>

<div><br>

</div>

<div>Best,</div>

<div><br>

</div>

<div>Mark D.<br>

<div><br>

<div style="font-family:Tahoma; font-size:13px">

<div style="font-family:Tahoma; font-size:13px">

<p>============================================<br>

Mark Davies<br>

Professor of Linguistics / Brigham Young University<br>

<a tabindex="0" href="http://davies-linguistics.byu.edu/">http://davies-linguistics.byu.edu/</a></p>

<p>** Corpus design and use // Linguistic databases **<br>

** Historical linguistics // Language variation **<br>

** English, Spanish, and Portuguese **<br>

============================================<br>

</p>

</div>

</div>

</div>

<div style="font-family: Times New Roman; color: #000000; font-size: 16px">

<hr tabindex="-1">

<div id="divRpF755574" style="direction: ltr;"><font face="Tahoma" size="2" color="#000000"><b>From:</b> corpora-bounces@uib.no [corpora-bounces@uib.no] on behalf of Adam Kilgarriff [adam@lexmasterclass.com]<br>

<b>Sent:</b> Wednesday, January 08, 2014 10:25 AM<br>

<b>To:</b> Parth Mehta<br>

<b>Cc:</b> corpora@uib.no<br>

<b>Subject:</b> Re: [Corpora-List] Extracting Index for a part of a corpus<br>

</font><br>

</div>

<div></div>

<div>

<div dir="ltr">Dear Parth Mehta

<div><br>

</div>

<div>You can do this in the Sketch Engine by first adding a metadata field to each document for each subset that you want it to belong to, and then building a subcorpus for each of the subsets. You can then get quick access to word lists etc. for each subset.</div>

<div><br>

</div>

<div>If you want to go the Sketch Engine route, let us know and we can give you more detailed instructions.</div>

<div><br>

</div>

<div>It sounds like a good research question - wishing you all the best for it</div>

<div><br>

</div>

<div>Adam<br>

<br>

<div class="gmail_quote">---------- Forwarded message ----------<br>

From: <b class="gmail_sendername">Parth Mehta</b> <span dir="ltr"><<a href="mailto:parth.mehta126@gmail.com" target="_blank">parth.mehta126@gmail.com</a>></span><br>

Date: 8 January 2014 07:22<br>

Subject: [Corpora-List] Extracting Index for a part of a corpus<br>

To: <a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a><br>

<br>

<br>

<div dir="ltr">Hello,

<div><br>

</div>

<div>I am trying to model the effect of corpus size on various statistical properties, starting with term distributions and vocabulary size.</div>

<div><br>

</div>

<div>Is there any tool available that allows me to extract an index for a part of my corpus, say an index of 1000 documents at a time out of 100000 documents overall? Since the corpus I am working with is large (~10^5 documents), and taking into account the need for n-fold validation, I do not want to first manually split the corpus into parts and then build a new index every time. Instead I am looking for a tool that can allow me to index the entire corpus at a single go, and then I extract the information related to some specific documents.

<div><br>

</div>

<div>Indri does provide me with document vectors for individual documents, but in that case the term ids are local to that particular document. So if I extract document vectors for two different documents, the term with term-id 1 might be a different term in each. I want a tool that keeps term ids consistent across the overall corpus.</div>
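<div><br>
</div>
<div>To make the requirement concrete, here is a minimal sketch in plain Python (not Indri) of a corpus-wide vocabulary in which each term gets one id that is reused in every document vector. The function names, the whitespace tokenizer, and the three sample documents are hypothetical placeholders, not part of any existing tool.</div>
<pre>
```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct term one corpus-wide id, in order of first appearance."""
    term_to_id = {}
    for text in documents.values():
        for term in text.lower().split():
            term_to_id.setdefault(term, len(term_to_id))
    return term_to_id


def document_vector(text, term_to_id):
    """Return a mapping from global term id to count for one document."""
    counts = Counter(text.lower().split())
    return {term_to_id[term]: n for term, n in counts.items()}


def subset_statistics(documents, doc_ids, term_to_id):
    """Sum the document vectors of the chosen documents only."""
    totals = Counter()
    for doc_id in doc_ids:
        totals.update(document_vector(documents[doc_id], term_to_id))
    return totals


# Toy corpus standing in for the full collection.
documents = {
    "d1": "the cell divides",
    "d2": "the battery cell",
    "d3": "the cell membrane",
}
vocab = build_vocabulary(documents)          # built once for the whole corpus
sample = subset_statistics(documents, ["d1", "d3"], vocab)
vocabulary_size = len(sample)                # distinct terms in this sample
print(vocab, dict(sample), vocabulary_size)
```
</pre>
<div>Because the ids come from the single corpus-wide vocabulary, vectors drawn from any sample of documents can be compared or summed directly.</div>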

<div><br>

</div>

<div><br>

</div>

<div>Thanks</div>

<span class="HOEnZb"><font color="#888888">

<div>Parth Mehta</div>

<div>DA-IICT</div>

</font></span></div>

</div>

<br>


</div>

<br>

<br clear="all">

<div><br>

</div>

-- <br>

========================================<br>

<a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a>                  <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a>                                             <br>

Director                                    <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a>                <br>

Visiting Research Fellow                 <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a>     

<div><i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">

the Sketch Engine</a>                 </div>

<div>                        <i><a href="http://www.webdante.com" target="_blank">DANTE:

<font color="#009900">a lexical database for English</font></a><font color="#009900"> </font>                 </i>

<div>========================================</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</body>

</html>