[Corpora-List] Testing how representative a particular corpus is

Michal Kren michal.kren at ff.cuni.cz
Mon Jan 27 20:53:47 UTC 2014


On Mon 27 Jan 2014 08:51:08, Mike Scott wrote:
> 
> On 27/01/2014 01:33, Angus Grieve-Smith wrote:
> >     Right.  Here's what I don't get: Why hasn't anyone followed even a 
> > single speaker around, let alone a representative sample, to see what 
> > proportion of registers and genres they're exposed to on a daily 
> > basis?  Or has this been done?
> 
> I think the Czech National Corpus people did (something like) that for 
> both written and spoken Czech, in order to help them build up their 
> corpus. Anyone from Prague able to confirm that?
> 
> Cheers -- Mike
> 
> 

What was done here in Prague some 15 years ago was a couple of
surveys of what people read (written language only, nothing spoken),
intended to estimate the proportions of the written language varieties
that people are exposed to.

Personally, I don't think the results were very convincing (for a number
of reasons already mentioned in this thread, and especially from today's
point of view), but it was at least an attempt to establish some solid
ground for the design of the balanced written corpora being compiled at
the time.

Regards

Michal Křen
Institute of the Czech National Corpus
Faculty of Arts
Charles University in Prague
Nám. Jana Palacha 2
116 38 Praha 1
Czech Republic
http://www.korpus.cz

