<div dir="ltr"><div class="gmail_default" style="font-family:garamond,serif">Sure, it is difficult to talk about concrete varieties of language to include or exclude, because they can differ greatly. To reduce complexity, it would be more appropriate to talk about particular categories of varieties, for example grammatical ones.</div>
</div><div class="gmail_extra"><br clear="all"><div><div dir="ltr">--<div>Alexander Osherenko, Dr. rer. nat.<div><a href="http://www.humboldt-innovation.de/" target="_blank">Humboldt Innovation</a></div><div>Humboldt-Universität zu Berlin<br>
</div><div>Senior HCI architect</div><div><br></div><div><a href="http://www.socioware.de/" target="_blank">Socioware Development</a><div>Founder and R&amp;D</div></div></div></div></div>
<br><br><div class="gmail_quote">2014-01-26 Adam Kilgarriff <span dir="ltr">&lt;<a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">I'd say it's wildly optimistic to talk about representativeness (of "general English/French/Chinese/ .. /Swahili/Luo/Welsh/ ..."). It assumes you know the population to be represented. What corners and niches of the language in question do we want to include, and which do we want to exclude? We don't even know what they are yet, let alone how to collect them, and how much of each we want. The best we can do is to stay open to all the varieties of a language that there are, gather data for them where we can, and explore how they relate to each other. That's my research agenda.<div>
<br></div><div>Adam</div></div><div class="gmail_extra"><div><div class="h5"><br><br><div class="gmail_quote">On 26 January 2014 07:15, Xu Jiajin <span dir="ltr">&lt;<a href="mailto:ustcxujj@gmail.com" target="_blank">ustcxujj@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">
<p class="MsoNormal"><span lang="EN-GB">Hi Kevin,</span></p>
<p class="MsoNormal"><span lang="EN-GB"> </span></p>
<p class="MsoNormal"><span lang="EN-GB">Lexical Closure is a nice idea. But
representativeness is meant to be defined by criteria at two levels: one external
and the other internal. The external criterion depends on how good a taxonomy
of text categories/genres we have, which has proven to be extremely
difficult, if not impossible, to formulate. Lexical Closure or Saturation
(Belica 1996) only concerns the internal criterion of the breadth or coverage of linguistic
(i.e. lexical) features. The genre criterion aims at textual heterogeneity, and
the closure measure at linguistic homogeneity. At this point, I'm reminded of
stratified random sampling in general statistical sampling. Likewise, a
genre-taxonomy-based text collection plus a snowballing lexical closure test might
lead to a more balanced corpus.</span></p>
<p class="MsoNormal"><span lang="EN-GB"> </span></p>
<p class="MsoNormal"><span lang="EN-GB">Cf. Lexical closure (McEnery and Wilson,
2001: 173-176); Part-of-speech closure (ibid.: 176-180); Parsing closure
(ibid.: 180-183).</span></p>
<p class="MsoNormal"><span lang="EN-GB"> </span></p>
<p class="MsoNormal"><span lang="EN-GB">Jiajin</span></p>
<p class="MsoNormal"><span lang="EN-GB"> </span></p><p class="MsoNormal"><span lang="EN-GB">--<br></span></p>
<p class="MsoNormal"><span lang="EN-GB">Jiajin XU</span></p>
<p class="MsoNormal"><span lang="EN-GB">Ph.D., Professor</span></p>
<p class="MsoNormal"><span lang="EN-GB">National Research Centre for Foreign
Language Education</span></p>
<p class="MsoNormal"><span lang="EN-GB">Beijing</span><span lang="EN-GB"> Foreign Studies
University</span></p>
<p class="MsoNormal"><span lang="EN-GB">Beijing</span><span lang="EN-GB"> 100089</span></p>
<p class="MsoNormal"><span lang="EN-GB">China</span></p>
</div><div><div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Jan 25, 2014 at 10:37 PM, Kevin B. Cohen <span dir="ltr">&lt;<a href="mailto:kevin.cohen@gmail.com" target="_blank">kevin.cohen@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Part of the problem with determining how representative a corpus is lies in the lack of good definitions in corpus linguistics of either representativeness or balance. I think that we all sort of think that we know them when we see them, but in looking recently at a number of textbooks on corpus linguistics trying to find definitions of either of these terms, I didn't come up with much. This is a big difference from some other quantitative sciences, where the notion of representativeness has a reasonably clear statistical definition.<br>
<br></div>My colleague Irina Temnikova and I have tried recently to come at the question of representativeness from the opposite angle. We used the work of McEnery and Wilson on closure properties of language to build a tool that looks at the extent to which a corpus represents a sublanguage; if it doesn't look like a sublanguage at all, then we suggest that it looks like it's representative. The paper will appear at LREC this year.<br>
<br></div>Kev<br><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jan 22, 2014 at 12:51 PM, Matías Guzmán Naranjo <span dir="ltr">&lt;<a href="mailto:mortem.dei@gmail.com" target="_blank">mortem.dei@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Dear all,<br></div><div><br>A colleague (not involved in corpus linguistics) expressed his concerns to me about corpus linguistics, mainly that he thought oral corpora are not really representative of spoken language, and that results of investigations that use oral corpora are therefore not really reliable as a reflection of the wider picture of how people speak and use language. My question is whether there have been studies of how representative, say, phone recordings or semi-guided interviews are of actual spoken language.<br>
<br>I use oral corpora for my work but just assume that semi-guided interviews are somewhat representative of spoken language outside semi-guided interviews, and that the results do generalize to some degree to other situations, but I had never really thought about testing this assumption.<br>
<br></div>Best,<br><br></div>Matías Guzmán Naranjo<br></div>
<br>_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><span><font color="#888888"><br><br clear="all"><br>-- <br>Kevin Bretonnel Cohen, PhD<br>Biomedical Text Mining Group Lead, Computational Bioscience Program, <br>U. Colorado School of Medicine<br>
<a href="tel:303-916-2417" value="+13039162417" target="_blank">303-916-2417</a> (cell) <a href="tel:303-377-9194" value="+13033779194" target="_blank">303-377-9194</a> (home)<br>
<a href="http://compbio.ucdenver.edu/Hunter_lab/Cohen" target="_blank">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a><br><br><br><br>
</font></span></div>
<br></blockquote></div><br></div>
</div></div><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br></div></div>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div>
</div>
<br></blockquote></div><br></div>