[Corpora-List] Testing how representative a particular corpus is

Kevin B. Cohen kevin.cohen at gmail.com
Sat Jan 25 14:37:25 UTC 2014


Part of the problem with determining how representative a corpus is lies
in the fact that corpus linguistics has no good definition of either
representativeness or balance. I think that we all sort of think that we
know them when we see them, but when I looked recently through a number of
textbooks on corpus linguistics for definitions of either term, I didn't
come up with much.  This is a big difference from some other quantitative
sciences, where the notion of representativeness has a reasonably clear
statistical definition.

My colleague Irina Temnikova and I have tried recently to come at the
question of representativeness from the opposite angle.  We used the work
of McEnery and Wilson on closure properties of language to build a tool
that looks at the extent to which a corpus represents a sublanguage; if it
doesn't look like a sublanguage at all, then we suggest that it looks like
it's representative.  The paper will appear at LREC this year.
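
For readers unfamiliar with closure, here is a minimal sketch of one way
lexical closure could be checked. It is not the tool described above. It
assumes closure is approximated by the type/token growth curve: count the
distinct word types seen as tokens accumulate, and check whether the curve
flattens. The function names, the step size, and the regex tokenizer are
all illustrative assumptions.

```python
import re


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def type_growth(tokens, step=1000):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens
    and once more at the end of the token list."""
    seen = set()
    curve = []
    total = len(tokens)
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve


def tail_growth_rate(curve):
    """New types per token over the last sampled interval.

    A value near 0 means the vocabulary has stopped growing, which is
    what a tendency toward lexical closure would look like. A value
    near 1 means almost every new token is a new type.
    """
    if len(curve) < 2:
        raise ValueError("need at least two sample points")
    (n0, t0), (n1, t1) = curve[-2], curve[-1]
    return (t1 - t0) / (n1 - n0)
```

Usage would be along the lines of
tail_growth_rate(type_growth(tokenize(text))). What counts as "flat
enough" is a judgment call that this sketch does not settle, and the same
curve could be computed over part-of-speech tags or sentence types rather
than word forms.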

Kev



On Wed, Jan 22, 2014 at 12:51 PM, Matías Guzmán Naranjo <
mortem.dei at gmail.com> wrote:

> Dear all,
>
> A colleague (not involved in corpus linguistics) expressed his concerns to
> me about corpus linguistics, mainly that he thought oral corpora are not
> really representative of spoken language, and that results of
> investigations that use oral corpora are therefore not reliable as a
> reflection of the wider picture of how people speak and use language. My
> question is whether there have been studies of how representative, say,
> phone recordings or semi-guided interviews are of actual spoken language.
>
> I use oral corpora for my work but just assume that semi-guided interviews
> are somewhat representative of spoken language outside semi-guided
> interviews, and that the results generalize to some degree to other
> situations, but I had never really thought about testing this assumption.
>
> Best,
>
> Matías Guzmán Naranjo
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program,
U. Colorado School of Medicine
303-916-2417 (cell) 303-377-9194 (home)
http://compbio.ucdenver.edu/Hunter_lab/Cohen