[Corpora-List] Considering Distributions Across Texts

Adam Kilgarriff adam at lexmasterclass.com
Mon Mar 3 11:40:22 UTC 2014


Dear Brian,

Are the 300-400 texts from 300-400 different people?  If so, then using
document frequencies ("how many documents does this word/construction/...
occur in") rather than raw frequencies ("how many times does it occur")
will cancel out skews due to particular people.
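
To make the difference concrete, here is a minimal sketch in Python. The
text ids and token lists are invented for illustration, and the function
name is hypothetical; a real learner corpus would need proper tokenisation
first.

```python
from collections import Counter


def raw_and_document_frequency(texts):
    """Return (raw, doc) counters for a mapping of text id -> token list.

    raw[w] = total number of occurrences of w across all texts
    doc[w] = number of texts in which w occurs at least once
    """
    raw = Counter()
    doc = Counter()
    for tokens in texts.values():
        raw.update(tokens)       # every occurrence counts
        doc.update(set(tokens))  # each text counts at most once per word
    return raw, doc


# Toy data (made up): one writer overuses "however".
texts = {
    "learner_01": "however the result was good however it was late however".split(),
    "learner_02": "the result was good".split(),
    "learner_03": "the essay was late".split(),
}

raw, doc = raw_and_document_frequency(texts)
for word in ("however", "the"):
    print(word, "raw:", raw[word], "documents:", doc[word])
```

In this toy data "however" and "the" have the same raw frequency, but
"however" comes from a single writer while "the" is found in all three
texts, which is the kind of skew the document count exposes.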

If the texts are all the result of the same essay question, or a limited
number of essay questions, then of course you have the bias related to what
the students were being asked to write about.

I'm a sceptic about statistical significance testing (for the full argument
see <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.6901&rep=rep1&type=pdf>).
The main thing is to have a good understanding of the structure of your
sample, and of the ways it is likely to introduce bias.

Adam




On 3 March 2014 11:02, Don Tuggener <tuggener at cl.uzh.ch> wrote:

> Hi Brian,
>
> I'm guessing you're looking for tests that help you identify statistical
> significance of your query results?
> A good starting point may be:
> Gries, Stefan Th. 2010. Useful statistics for corpus linguistics. In
> Aquilino Sánchez & Moisés Almela (eds.), A mosaic of corpus linguistics:
> selected approaches, 269-291. Frankfurt am Main: Peter Lang.
> (http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html)
>
> Best,
> Don
>
> On Mon, 03 Mar 2014 11:28:35 +0100
> corpora-request at uib.no wrote:
>
> > Message: 3
> > Date: Fri, 28 Feb 2014 11:16:11 -0500
> > From: Brian Schanding <bschanding at gmail.com>
> > Subject: [Corpora-List] Considering Distributions Across Texts
> > To: corpora at uib.no
> >
> > Hello,
> >
> > I'm working on research with learner corpora. My corpora aren't that big
> > (approx. 250,000 wds with about 300-400 text files). I wonder what
> > research/textbook sources anyone can point me to that discuss the
> > importance of considering how many texts in the corpus a language feature
> > occurs in (as opposed to merely considering overall frequency of a
> > language feature within a corpus).
> >
> > Many Thanks!
> > Brian
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================