[Corpora-List] Extracting Index for a part of a corpus

Mark Davies Mark_Davies at byu.edu
Wed Jan 8 19:07:42 UTC 2014


>>  I do not want to first manually split the corpus into parts and then build a new index every time. Instead I am looking for a tool that can allow me to index the entire corpus at a single go, and then I extract the information related to some specific documents.

This can also be done quite easily if the data is in a relational database. Assuming there is a textID field for each section of text, just create a JOIN between that table and the metadata table, and use WHERE to limit the metadata rows (e.g. "WHERE metadata.subcorpus = 'biology' AND metadata.textID = corpus.textID"). You'll then be searching just the desired section of the corpus, and whatever index(es) you've created for the "textual" corpus will still be used for those rows.
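
As a minimal sketch of the kind of query described above (using SQLite through Python purely for illustration; this is not the setup behind any particular corpus site): the table names corpus and metadata and the fields textID and subcorpus follow the description, while the word and position columns, the index names, the toy rows, and the two helper functions are assumptions made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical one-row-per-token layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (textID INTEGER PRIMARY KEY, subcorpus TEXT);
    CREATE TABLE corpus (textID INTEGER, position INTEGER, word TEXT);
    -- Indexes are built once, over the whole corpus.
    CREATE INDEX idx_corpus_word ON corpus (word);
    CREATE INDEX idx_corpus_text ON corpus (textID);
""")
conn.executemany("INSERT INTO metadata VALUES (?, ?)",
                 [(1, "biology"), (2, "physics"), (3, "biology")])
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", [
    (1, 1, "cell"), (1, 2, "membrane"),
    (2, 1, "quantum"), (2, 2, "cell"),
    (3, 1, "cell"), (3, 2, "protein"),
])


def subcorpus_frequencies(conn, subcorpus):
    """Word frequencies for the texts whose metadata matches `subcorpus`."""
    return conn.execute("""
        SELECT corpus.word, COUNT(*) AS freq
        FROM corpus
        JOIN metadata ON metadata.textID = corpus.textID
        WHERE metadata.subcorpus = ?
        GROUP BY corpus.word
        ORDER BY freq DESC, corpus.word
    """, (subcorpus,)).fetchall()


def frequencies_for_texts(conn, text_ids):
    """Word frequencies for an arbitrary list of textIDs (e.g. one fold)."""
    placeholders = ",".join("?" for _ in text_ids)
    return conn.execute(f"""
        SELECT word, COUNT(*) AS freq
        FROM corpus
        WHERE textID IN ({placeholders})
        GROUP BY word
        ORDER BY freq DESC, word
    """, list(text_ids)).fetchall()


print(subcorpus_frequencies(conn, "biology"))
print(frequencies_for_texts(conn, [2, 3]))
```

Because the words live in one table, the same word is the same key in every subset, so counts from different subsets can be compared directly without re-indexing.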

This is the approach used, for example, with the BYU corpora (http://corpus.byu.edu), where you can easily limit by (and compare between) different sections of the corpora -- genre, dialect, or historical period (see http://corpus.byu.edu/variation.asp )

Best,

Mark D.


============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Adam Kilgarriff [adam at lexmasterclass.com]
Sent: Wednesday, January 08, 2014 10:25 AM
To: Parth Mehta
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Extracting Index for a part of a corpus

Dear Parth Mehta

You can do this in the Sketch Engine by, first, adding a metadata field to each doc for each subset that you want it to belong to, and then, building a subcorpus for each of the subsets.  You can then get quick access to word lists etc for each subset.

If you want to go the Sketch Engine route, let us know and we can give you more detailed instructions.

It sounds like a good research question - wishing you all the best for it

Adam

---------- Forwarded message ----------
From: Parth Mehta <parth.mehta126 at gmail.com>
Date: 8 January 2014 07:22
Subject: [Corpora-List] Extracting Index for a part of a corpus
To: corpora at uib.no


Hello,

I am trying to model the effect of corpus size on various statistical properties, starting with term distributions and vocabulary size.

Is there any tool that allows me to extract the index for a part of my corpus, say the index of 1000 documents at a time out of 100000 documents overall? Since the corpus I am working with is large (~10^5 documents), and taking into account the need for n-fold validation, I do not want to first manually split the corpus into parts and then build a new index every time. Instead I am looking for a tool that lets me index the entire corpus in a single go and then extract the information related to some specific documents.

Indri does give me document vectors for individual documents, but in that case the term ids are unique only within that particular document. So if I extract document vectors for two different documents, term-id 1 might refer to a different term in each. I want a tool that maintains the term ids of the overall corpus.


Thanks
Parth Mehta
DA-IICT

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora




--
========================================
Adam Kilgarriff (http://www.kilgarriff.co.uk/)    adam at lexmasterclass.com
Director, Lexical Computing Ltd (http://www.sketchengine.co.uk/)
Visiting Research Fellow, University of Leeds (http://leeds.ac.uk)
Corpora for all with the Sketch Engine (http://www.sketchengine.co.uk)
DANTE: a lexical database for English (http://www.webdante.com)
========================================