<div dir="ltr"><div><div>Part of the problem with determining how representative a corpus is, is that corpus linguistics lacks good definitions of either representativeness or balance. I think we all feel that we know them when we see them, but when I recently looked through a number of corpus linguistics textbooks for definitions of either term, I didn't come up with much.  This is a big difference from some other quantitative sciences, where the notion of representativeness has a reasonably clear statistical definition.<br>


<br></div>My colleague Irina Temnikova and I have recently tried to come at the question of representativeness from the opposite angle.  We used the work of McEnery and Wilson on closure properties of language to build a tool that looks at the extent to which a corpus represents a sublanguage; if the corpus doesn't look like a sublanguage at all, then we suggest that it looks representative.  The paper will appear at LREC this year.<br>


<br></div>Kev<br><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jan 22, 2014 at 12:51 PM, Matías Guzmán Naranjo <span dir="ltr">&lt;<a href="mailto:mortem.dei@gmail.com" target="_blank">mortem.dei@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Dear all,<br></div><div><br>A colleague (not involved in corpus linguistics) expressed his concerns to me about corpus linguistics, mainly that he thinks oral corpora are not really representative of spoken language, and that results of investigations using oral corpora therefore do not reliably reflect the wider picture of how people speak and use language. My question is whether there have been studies of how representative, say, phone recordings or semi-guided interviews are of actual spoken language. <br>


<br>I use oral corpora for my work but just assume that semi-guided interviews are somewhat representative of spoken language outside semi-guided interviews, and that the results generalize to some degree to other situations, but I had never really thought about testing this assumption.<br>


<br></div>Best,<br><br></div>Matías Guzmán Naranjo<br></div>


<br></blockquote></div><br><br clear="all"><br>-- <br>Kevin Bretonnel Cohen, PhD<br>Biomedical Text Mining Group Lead, Computational Bioscience Program, <br>U. Colorado School of Medicine<br>303-916-2417 (cell) 303-377-9194 (home)<br>


<a href="http://compbio.ucdenver.edu/Hunter_lab/Cohen">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a><br><br><br><br>

</div>