Corpora: Re: What is a Corpus

Vladimir Rykov rykov at iling.msk.su
Mon Feb 7 06:32:10 UTC 2000


     I found the "What is a corpus" discussion very interesting to
read.
     There is a real problem here: what is a corpus, and is it balanced
and/or representative?
     Take a corpus of proverbs as an example. Who can say that it is a
corpus and not an archive, a set, or a dump of proverbs? We can find
many interesting things in a dump, but what is the value of our
findings? If we did no pre-processing (filtering) when creating our
set of proverbs, then what is the value of the following statement:
"There are no Italian proverbs about unlucky marriages"?
     This statement is reliable, or scientific, only for a
representative proverb corpus. Otherwise: "dump as input, dump as
output (dust to dust)". Is there a quasi-logical procedure for deciding
whether a given collection (dump) of textual data is a representative
corpus? This is the starting point for all the activity that follows:
is it a scientific one or a paid hobby?




---
    YS Vladimir Rykov, PhD in Computational Linguistics
 www.blkbox.com/~gigawatt/rykov.html        Linguistic Institute
  WWW.GOL.RU/~iling                   1/12 B.Kislovsky per., Moscow, 103009
 M_M_M_M_M_M_M_M_M_M_M_M_M
 KREMLIN WALL IS WHERE YOU MAKE IT !!!
 Please - do NOT send Internet (attached, multimedia etc) files - we can read ASCII files ONLY
 Please - send us *.html, *.doc, other non-ASCII files to the addr: ILING at GOL.RU with RE: For Rykov


