[Corpora-List] Free text corpora?

Adam Kilgarriff adam at lexmasterclass.com
Wed Mar 3 10:25:11 UTC 2010


Sinclair was wrong.

For the argument in more detail, see the opening section of Kilgarriff and
Grefenstette 2003, "Introduction to the Special Issue on the Web as Corpus",
*Computational Linguistics* 29 (3)
<http://kilgarriff.co.uk/Publications/2003-KilgGrefenstette-WACIntro.pdf>.
I wrote that before reading Sinclair's revised version; since reading it,
I use Sinclair as my protagonist. He is much more suitable than McEnery and
Wilson, whom I quote but who, unlike Sinclair, don't really say anything I
disagree with.

Adam


On 3 March 2010 06:45, Geoffrey Williams <geoffrey.williams at univ-ubs.fr> wrote:

> Dear Adam,
>
> Watering down of a discipline in order to take on board all comers is not
> a good thing. Far from strangling corpus linguistics, Martin is reaffirming
> its very basis. Sinclair et al's OSTI report laid down the basics. These
> have been improved over the years, gaining greater clarity, until the 1996
> EAGLES definition was published. This was a benchmark until Sinclair's
> 2005 revised version, discussed in detail in the book edited by Martin.
>
> Dumps are about rubbish, either legal or fly dumping. The result is still
> rubbish in, rubbish out. Corpus linguistics has to have a benchmark;
> otherwise it would cease to exist.
>
> However corpus linguistics does not have a monopoly on the word corpus
> which is somewhat polysemic. If a Wikipedia or other dump works for
> testing some commercial application or NLP project, then why not use it?
> On the other hand, don't say that corpus linguistics is being done.
>
> Corpora list is wealthy through the experience of many who use corpora in
> different ways. Not all are corpus linguists, neither should they be. It
> is essential, however, that those disciplines involved know their research
> protocols. This is what Martin's timely reminder is about.
>
> Perfection is difficult to achieve, but that does not make it a less
> worthwhile goal.
>
> Best
>
> Geoffrey
>
>
> >> ...and a "dump" such as this couldn't be further from qualifying as a
> >> corpus, if defined as "a collection of pieces of language, selected and
> >> ordered according to explicit linguistic criteria in order to be used as
> >> a sample of the language."
> >
> > Sorry, Martin, but your definition of 'corpus' reads like it's designed to
> > strangle the life out of corpus linguistics. It begs questions about
> > selection and ordering (?? how does ordering come into it?) and explicit
> > linguistic criteria, and demotes many things that people refer to as
> > 'corpora' to a lower form of life. Lexicographically, bad.
> >
> > I think it's a dream of some corpus linguists as to what they think a
> > corpus should be, not a fact about how the word is used. But, delete the
> > middle clause and we're in agreement:
> >    "a collection of pieces of language, used as a sample of the language"
> >
> > Adam
> >
> > On 2 March 2010 22:55, Martin Wynne <martin.wynne at oucs.ox.ac.uk> wrote:
> >
> >> Francis Tyers wrote:
> >>
> >>> On Tue, 02 Mar 2010 at 12:38 +0100, Xin Yan wrote:
> >>>
> >>>
> >>>
> >>>> Hello,
> >>>>
> >>>> Can anyone tell me whether there are any free text corpora available
> >>>> for commercial purposes?
> >>>> Thank you in advance!
> >>>>
> >>>>
> >>>
> >>> You can download dumps of Wikipedia from http://download.wikimedia.org
> >>> -- they are licensed under the CC-BY-SA or GFDL -- both of which allow
> >>> commercial use, providing changes made are redistributed under the same
> >>> licence.
> >>>
> >>> Best regards,
> >>>
> >>> Fran
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Corpora mailing list
> >>> Corpora at uib.no
> >>> http://mailman.uib.no/listinfo/corpora
> >>>
> >>>
> >>
> >> Dumps of Wikipedia may be an interesting electronic text collection that
> >> can be used to help address various linguistic research questions, but I
> >> think that the request was for a corpus...and a "dump" such as this
> >> couldn't be further from qualifying as a corpus, if defined as "a
> >> collection of pieces of language, selected and ordered according to
> >> explicit linguistic criteria in order to be used as a sample of the
> >> language."
> >>
> >> The good news is that corpora are available. If you let us know what sort
> >> of corpus you are looking for and what sort of commercial uses you intend
> >> to put it to, I am sure that there are plenty of people here on the
> >> mailing list who can help you.
> >>
> >> Martin
> >> Oxford Text Archive
> >>
> >>
> >>
> >
> >
> >
> > --
> > ================================================
> > Adam Kilgarriff
> > http://www.kilgarriff.co.uk
> > Lexical Computing Ltd                   http://www.sketchengine.co.uk
> > Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> > Universities of Leeds and Sussex       adam at lexmasterclass.com
> > ================================================
> >
>
>
> --
> Prof. Geoffrey Williams,
> Vice Président des Relations Internationales
> Professeur des universités en sciences du langage
> Directeur du département d'ingénierie du document
> UFR de Lettres, Sciences Humaines et Sociales
> Université de Bretagne-Sud
> 4 rue Jean Zay,
> BP92116
> F-56321 LORIENT Cedex
>
> tél: +33 (0) 2 97 87 29 20
> fax: +33 (0) 2 97 87 65 25
>
>


-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================