[Corpora-List] Free text corpora?

Yannick Versley versley at sfs.uni-tuebingen.de
Wed Mar 3 09:53:25 UTC 2010


Dear Geoffrey,

I think we should acknowledge here that corpus linguistics has come to mean 
two things:
* one is the use of corpora to gain quantitative data on phenomena that are
   otherwise hard to pin down, in part or wholly because they are
   dependent on genre or time frame.
* the other is the use of corpora as an additional information source for 
   linguistic questions which are not meant to be genre-dependent and
   additionally are corroborated with other sources of information (e.g.,
   introspection/thought experiments, elicitation of speaker judgements,
   or experiments related to processing effects).
Your claim that
> Dumps are about rubbish, either legal or fly dumping. The result is still
> rubbish in rubbish out. Corpus linguistics has to have a benchmark,
> otherwise it would cease to exist.
is tantamount to a dialectologist saying that the way most people elicit 
speaker judgments is rubbish because they only ever ask people from one
city.

It should be clear that there are linguistic issues which are dependent on 
dialect, genre, time frame, and other factors, and that having a 
well-balanced corpus can help reduce the risk of overgeneralizing a
phenomenon that is actually limited to some sub-part of the language.
Then again, you should always look at empirical data with a modicum 
of common sense, as even carefully balanced corpora tend not to be 
representative of the finest genre distinctions - e.g., car repair manuals 
standing in as representative for all kinds of repair manuals, one magazine 
(possibly containing the idiolect of only a small group of people) standing 
in for magazines in general, or one kind of spoken interaction (people making
small talk with a linguist nearby) standing in for all "oral" language.
That kind of "perfection" in balancing corpora would not merely be difficult
to achieve; it is purely utopian.
This means that any statistical regularities, even ones gained from a
carefully curated corpus, have to be taken with a large grain of salt.
NLP practitioners using corpora are used to taking in large amounts of salt in
that way, because statistical models work best if you throw lots and lots of
data at them; the same holds for lexicographers and for linguists
investigating rare phenomena that may occur only once every million words.
If you're the first kind of corpus linguist, though, careful composition beats
quantity (and Biber et al.'s Longman Grammar of Spoken and Written English is
a good example of how this may make sense), but you shouldn't overgeneralize
your own requirements to the population at large, or monopolize terms such as
"corpus linguistics".
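
To put rough numbers on the rare-phenomena point, here is a
back-of-the-envelope sketch. It assumes that occurrences are independent, so
that a Poisson approximation applies; real text is bursty, so the figures are
on the optimistic side. The function names and corpus sizes are made up for
illustration only.

```python
import math

def expected_hits(rate_per_million, corpus_words):
    """Expected number of occurrences of a phenomenon in a corpus."""
    return rate_per_million * corpus_words / 1_000_000

def prob_no_hits(rate_per_million, corpus_words):
    """Chance of not seeing the phenomenon at all (Poisson approximation)."""
    return math.exp(-expected_hits(rate_per_million, corpus_words))

# A phenomenon occurring about once every million words, in corpora of
# hypothetical sizes: 1 million, 100 million and 1 billion words.
for size in (1_000_000, 100_000_000, 1_000_000_000):
    hits = expected_hits(1.0, size)
    miss = prob_no_hits(1.0, size)
    print(f"{size:>13,} words: ~{hits:.0f} expected hits, "
          f"P(no hit) = {miss:.3g}")
```

Under these assumptions, a one-million-word corpus has a better than one in
three chance of not containing the phenomenon at all, which is why quantity
matters for this kind of work.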

Best,
Yannick

>
> However corpus linguistics does not have a monopoly on the word corpus
> which is somewhat polysemic. If a Wikipedia or other dump works for
> testing some commercial application or NLP project, then why not use it.
> On the other hand, don't say that corpus linguistics is being done.
>
> Corpora list is wealthy through the experience of many who use corpora in
> different ways. Not all are corpus linguists, neither should they be. It
> is essential however that those disciplines involved know their research
> protocols. This is what Martin's timely reminder is about.
>
> Perfection is difficult to achieve, but that does not make it a less
> worthwhile goal.
>
> Best
>
> Geoffrey
>
> >> ...and a "dump" such as this couldn't be further from qualifying as a
> >> corpus, if defined as "a collection of pieces of language, selected and
> >> ordered according to explicit linguistic criteria in order to be used
> >> as a sample of the language."
> >
> > Sorry, Martin, but your definition of 'corpus' reads like it's designed
> > to strangle the life out of corpus linguistics. It begs questions about
> > selection and ordering (?? how does ordering come into it?) and explicit
> > linguistic criteria, and demotes many things that people refer to as
> > 'corpora' to a lower form of life.   Lexicographically, bad.
> >
> > I think it's a dream of some corpus linguists as to what they think a
> > corpus should be, not a fact about how the word is used.  But, delete
> > the middle clause and we're in agreement:
> >    "a collection of pieces of language, used as a sample of the language"
> >
> > Adam
> >
> > On 2 March 2010 22:55, Martin Wynne <martin.wynne at oucs.ox.ac.uk> wrote:
> >> Francis Tyers wrote:
> >>> On Tue, 02 Mar 2010 at 12:38 +0100, Xin Yan wrote:
> >>>> Hello,
> >>>>
> >>>> can anyone tell me if there are any free text corpora for commercial
> >>>> purposes?
> >>>> Thank you in advance!
> >>>
> >>> You can download dumps of Wikipedia from http://download.wikimedia.org
> >>> -- they are licensed under the CC-BY-SA or GFDL -- both of which allow
> >>> commercial use, providing changes made are redistributed under the same
> >>> licence.
> >>>
> >>> Best regards,
> >>>
> >>> Fran
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Corpora mailing list
> >>> Corpora at uib.no
> >>> http://mailman.uib.no/listinfo/corpora
> >>
> >> Dumps of Wikipedia may be an interesting electronic text collection that
> >> can be used to help address various linguistic research questions, but I
> >> think that the request was for a corpus...and a "dump" such as this
> >> couldn't be further from qualifying as a corpus, if defined as "a
> >> collection of pieces of language, selected and ordered according to
> >> explicit linguistic criteria in order to be used as a sample of the
> >> language."
> >>
> >> The good news is that corpora are available. If you let us know what
> >> sort of corpus you are looking for and for what sort of commercial uses
> >> you intend to put them to, I am sure that there are plenty of people
> >> here on the mailing list who can help you.
> >>
> >> Martin
> >> Oxford Text Archive
> >
> > --
> > ================================================
> > Adam Kilgarriff
> > http://www.kilgarriff.co.uk
> > Lexical Computing Ltd                   http://www.sketchengine.co.uk
> > Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> > Universities of Leeds and Sussex       adam at lexmasterclass.com
> > ================================================



-- 
SFB 833 "Bedeutungskonstitution"
Nauklerstr. 35 - D-72074 Tübingen
Tel.: +49-7071-77155; Fax: +49-7071-5830
