<html><body><div style="color:#000; background-color:#fff; font-family:arial, helvetica, sans-serif;font-size:10pt"><div style="font-size: 16px;"><span style="font-size: 13px;">Hi Brian,</span></div><div style="font-size: medium; background-color: transparent;"><span style="font-size: small;"><br></span></div><div style="font-size: medium; background-color: transparent;"><span style="font-size: 13px;">In our study, we normalised frequencies by both the number of words and the number of speakers. </span></div><div style="font-size: medium; background-color: transparent;"><span style="font-size: small;"><br></span></div><div style="font-size: 16px; background-color: transparent;"><span style="font-size: 13px;"><span><span><span style="text-align: justify; text-indent: -37.7952766418457px;">Torgersen, E., Gabrielatos, C., Hoffmann, S. & Fox, S. (2011). A corpus-based study of pragmatic markers in London English. </span><i
style="text-align: justify; text-indent: -37.7952766418457px;">Corpus Linguistics and Linguistic Theory</i><span style="text-align: justify; text-indent: -37.7952766418457px;">, 7(1), 93-118. </span></span></span><span style="line-height: 16.799999237060547px; white-space: nowrap;">dx.doi.org/10.1515/cllt.2011.005 </span></span></div><div style="background-color: transparent;"><span><br></span></div><div style="background-color: transparent;">Costas</div><div style="background-color: transparent;"><br></div><div class="yiv4534078945yui_3_7_2_18_1366706348531_71"
style="font-family: arial, helvetica, sans-serif; line-height: normal; color: rgb(67, 67, 67); font-size: 10px; background-color: transparent; font-style: normal;"><span style="font-size:10px;color:rgb(67, 67, 67);"><span style="font-size:10px;"></span></span></div><div><span></span></div><div><br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; margin-top: 5px; padding-left: 5px;"> <div style="font-family: arial, helvetica, sans-serif; font-size: 10pt;"> <div style="font-family: 'times new roman', 'new york', times, serif; font-size: 12pt;"> <div dir="ltr"> <hr size="1"> <font size="2" face="Arial"> <b><span style="font-weight:bold;">From:</span></b> Marc Brysbaert &lt;marc.brysbaert@ugent.be&gt;<br> <b><span style="font-weight: bold;">To:</span></b> corpora@uib.no <br> <b><span style="font-weight: bold;">Sent:</span></b> Monday, 3 March 2014, 12:06<br> <b><span style="font-weight: bold;">Subject:</span></b> Re:
[Corpora-List] Considering Distributions Across Texts<br> </font> </div> <div class="y_msg_container"><br><div id="yiv3556722553">
<style><!--
#yiv3556722553
_filtered #yiv3556722553 {font-family:SimSun;panose-1:2 1 6 0 3 1 1 1 1 1;}
_filtered #yiv3556722553 {font-family:Calibri;panose-1:2 15 5 2 2 2 4 3 2 4;}
_filtered #yiv3556722553 {font-family:Tahoma;panose-1:2 11 6 4 3 5 4 4 2 4;}
_filtered #yiv3556722553 {panose-1:2 1 6 0 3 1 1 1 1 1;}
#yiv3556722553
#yiv3556722553 p.yiv3556722553MsoNormal, #yiv3556722553 li.yiv3556722553MsoNormal, #yiv3556722553 div.yiv3556722553MsoNormal
{margin:0cm;margin-bottom:.0001pt;font-size:12.0pt;font-family:"Times New Roman", "serif";}
#yiv3556722553 a:link, #yiv3556722553 span.yiv3556722553MsoHyperlink
{color:blue;text-decoration:underline;}
#yiv3556722553 a:visited, #yiv3556722553 span.yiv3556722553MsoHyperlinkFollowed
{color:purple;text-decoration:underline;}
#yiv3556722553 span.yiv3556722553EmailStyle17
{font-family:"Calibri", "sans-serif";color:#1F497D;}
#yiv3556722553 .yiv3556722553MsoChpDefault
{font-family:"Calibri", "sans-serif";}
_filtered #yiv3556722553 {margin:70.85pt 70.85pt 70.85pt 70.85pt;}
#yiv3556722553 div.yiv3556722553WordSection1
{}
--></style><div><div class="yiv3556722553WordSection1"><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">James Adelman (University of Warwick) has argued that contextual diversity (the number of texts in which a word appears) is a better predictor than word frequency (the number of times a word occurs). You can find the paper here:</span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);"> </span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);"><a rel="nofollow" target="_blank" href="http://homepages.warwick.ac.uk/~pssgar/cd/cdps4.pdf">http://homepages.warwick.ac.uk/~pssgar/cd/cdps4.pdf</a></span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">
</span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">In our own work, we also find that contextual diversity predicts word recognition times better than word frequency does. </span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);"> </span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">That said, most of the difference is due to the nonlinear frequency function and to the fact that proper nouns seem to be unevenly distributed across texts. Once you take these two factors into account, not much of a difference remains. In an unpublished manuscript, Emmanuel Keuleers found that you still get a contextual diversity ‘advantage’ even if you randomly distribute the words over the files. This agrees with the fact
that much of the superiority is due to the nonlinear function of frequency (there is a floor effect for word frequency above 50 per million words; this tail is much shorter in contextual diversities than in word frequencies).</span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);"> </span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">So, in summary, you can use both contextual diversity and word frequency.</span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);"> </span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);">Best, mb</span></div><div class="yiv3556722553MsoNormal"><span style="font-size: 11pt; font-family:
Calibri, sans-serif; color: rgb(31, 73, 125);"> </span></div><div class="yiv3556722553MsoNormal"><b><span style="font-size: 10pt; font-family: Tahoma, sans-serif;">From:</span></b><span style="font-size: 10pt; font-family: Tahoma, sans-serif;"> corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>Adam Kilgarriff<br><b>Sent:</b> Monday 3 March 2014 12:40<br><b>To:</b> Don Tuggener<br><b>Cc:</b> corpora@uib.no<br><b>Subject:</b> Re: [Corpora-List] Considering Distributions Across Texts</span></div><div class="yiv3556722553MsoNormal"> </div><div><div class="yiv3556722553MsoNormal">Dear Brian,</div><div><div class="yiv3556722553MsoNormal"> </div></div><div><div class="yiv3556722553MsoNormal">Are the 300-400 texts from 300-400 different people? If so, then by using document frequencies ("how many documents does this word/construction/... occur in") rather than raw frequencies ("how many times does it occur") you will cancel
out skews based on particular people. </div></div><div><div class="yiv3556722553MsoNormal"> </div></div><div><div class="yiv3556722553MsoNormal">If the texts are all the result of the same essay question, or a limited number of essay questions, then of course you have the bias related to what the students were asked to write about.</div></div><div><div class="yiv3556722553MsoNormal"> </div></div><div><div class="yiv3556722553MsoNormal">I'm a sceptic about statistical significance testing (for the full argument see <a rel="nofollow" target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.6901&rep=rep1&type=pdf">here</a>) - the main thing is to have a good understanding of the structure of your sample, and the ways it is likely to introduce bias.</div></div><div><div class="yiv3556722553MsoNormal"> </div></div><div><div class="yiv3556722553MsoNormal">Adam</div></div><div><div
class="yiv3556722553MsoNormal"> </div></div><div><div class="yiv3556722553MsoNormal"> </div></div></div><div><div class="yiv3556722553MsoNormal" style="margin-bottom:12.0pt;"> </div><div><div class="yiv3556722553MsoNormal">On 3 March 2014 11:02, Don Tuggener &lt;<a rel="nofollow" ymailto="mailto:tuggener@cl.uzh.ch" target="_blank" href="mailto:tuggener@cl.uzh.ch">tuggener@cl.uzh.ch</a>&gt; wrote:</div><div class="yiv3556722553MsoNormal">Hi Brian,<br><br>I'm guessing you're looking for tests that help you identify the statistical significance of your query results?<br>A good starting point may be:<br>Gries, Stefan Th. 2010. Useful statistics for corpus linguistics. In Aquilino Sánchez & Moisés Almela (eds.), A mosaic of corpus linguistics: selected approaches, 269-291. Frankfurt am Main: Peter Lang.<br>(<a rel="nofollow" target="_blank"
href="http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html">http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html</a>)<br><br>Best,<br>Don<br><br>On Mon, 03 Mar 2014 11:28:35 +0100<br><a rel="nofollow" ymailto="mailto:corpora-request@uib.no" target="_blank" href="mailto:corpora-request@uib.no">corpora-request@uib.no</a> wrote:<br><br>> Message: 3<br>> Date: Fri, 28 Feb 2014 11:16:11 -0500<br>> From: Brian Schanding <<a rel="nofollow" ymailto="mailto:bschanding@gmail.com" target="_blank" href="mailto:bschanding@gmail.com">bschanding@gmail.com</a>><br>> Subject: [Corpora-List] Considering Distributions Across Texts<br>> To: <a rel="nofollow" ymailto="mailto:corpora@uib.no" target="_blank" href="mailto:corpora@uib.no">corpora@uib.no</a></div><div><div><div class="yiv3556722553MsoNormal">><br>> Hello,<br>><br>> I'm working on research with learner corpora. My corpora
aren't that big<br>> (approx. 250,000 wds with about 300-400 text files). I wonder what<br>> research/textbook sources anyone can point me to that discuss the<br>> importance of considering how many texts in the corpus a language feature<br>> occurs in (as opposed to merely considering overall frequency of a language<br>> feature within a corpus).<br>><br>> Many Thanks!<br>> Brian<br></div></div></div></div></div></div></div></div></div> </div> </div> </blockquote></div> </div></body></html>