[Corpora-List] Free text corpora?

Thu Mar 4 06:16:22 UTC 2010

It is a bold statement that a single article in NLP can overwrite 50 
years of thought by the founding father of the discipline. Could you not 
simply accept that there are different views on what constitutes a 
corpus? Neither NLP, nor computational linguistics, nor corpus 
cognitive linguistics - a contradiction in terms - IS corpus 
linguistics. In research there is room for everyone, and all can bring 
important advances, without arrogantly writing off all that has come before.

We owe a lot to John Sinclair. Now that he is no longer with us to argue 
the case, it is all too easy to pursue false gods. Has the Corpora list 
been taken over by a single-issue group pursuing its own agenda? I 
hope not. Personally I can work with many different approaches and 
bring, I hope, useful insights, because I know the debt I owe to my 
forefathers. Such attacks seem to come from those who realise their 
position is weak, which is surely not the case for the father of Sketch 
Engine?

Best

Geoffrey

Adam Kilgarriff wrote:
> Sinclair was wrong.
>
> For the argument in more detail see the opening section of Kilgarriff 
> and Grefenstette 2003, Introduction to the Special Issue on Web as 
> Corpus, /Computational Linguistics/ 29 (3) 
> <http://kilgarriff.co.uk/Publications/2003-KilgGrefenstette-WACIntro.pdf> 
> (which I wrote before reading Sinclair's revised version: since 
> reading that, I use Sinclair as my protagonist (much more suitable 
> than McEnery and Wilson, whom I quote but who, unlike Sinclair, don't 
> really say anything I disagree with)). 
>
> Adam
>
>
> On 3 March 2010 06:45, Geoffrey Williams 
> <geoffrey.williams at univ-ubs.fr <mailto:geoffrey.williams at univ-ubs.fr>> 
> wrote:
>
>     Dear Adam,
>
>     Watering down a discipline in order to take on board all comers is
>     not a good thing. Far from strangling corpus linguistics, Martin is
>     reaffirming its very basis. Sinclair et al.'s OSTI report laid down
>     the basics. These have been improved over the years, gaining greater
>     clarity, until the 1996 EAGLES definition was published. This was a
>     benchmark until Sinclair's 2005 revised version, discussed in detail
>     in the book edited by Martin.
>
>     Dumps are about rubbish, either legal or fly dumping. The result is
>     still rubbish in, rubbish out. Corpus linguistics has to have a
>     benchmark, otherwise it would cease to exist.
>
>     However, corpus linguistics does not have a monopoly on the word
>     corpus, which is somewhat polysemic. If a Wikipedia or other dump
>     works for testing some commercial application or NLP project, then
>     why not use it? On the other hand, don't say that corpus linguistics
>     is being done.
>
>     The Corpora list is rich in the experience of many who use corpora
>     in different ways. Not all are corpus linguists, nor should they be.
>     It is essential, however, that the disciplines involved know their
>     research protocols. This is what Martin's timely reminder is about.
>
>     Perfection is difficult to achieve, but that does not make it a less
>     worthwhile goal.
>
>     Best
>
>     Geoffrey
>
>
>     >> ...and a "dump" such as this couldn't be further from qualifying
>     >> as a corpus, if defined as "a collection of pieces of language,
>     >> selected and ordered according to explicit linguistic criteria in
>     >> order to be used as a sample of the language."
>     >
>     > Sorry, Martin, but your definition of 'corpus' reads like it's
>     > designed to strangle the life out of corpus linguistics. It begs
>     > questions about selection and ordering (?? how does ordering come
>     > into it?) and explicit linguistic criteria, and demotes many things
>     > that people refer to as 'corpora' to a lower form of life.
>     > Lexicographically, bad.
>     >
>     > I think it's a dream of some corpus linguists as to what they think
>     > a corpus should be, not a fact about how the word is used.  But,
>     > delete the middle clause and we're in agreement:
>     >    "a collection of pieces of language, used as a sample of the
>     >    language"
>     >
>     > Adam
>     >
>     > On 2 March 2010 22:55, Martin Wynne <martin.wynne at oucs.ox.ac.uk
>     <mailto:martin.wynne at oucs.ox.ac.uk>> wrote:
>     >
>     >> Francis Tyers wrote:
>     >>
>     >>> On Tue, 02 Mar 2010 at 12:38 +0100, Xin Yan wrote:
>     >>>
>     >>>
>     >>>
>     >>>> Hello,
>     >>>>
>     >>>> Can anyone tell me whether there are any free text corpora for
>     >>>> commercial purposes?
>     >>>> Thank you in advance!
>     >>>>
>     >>>>
>     >>>
>     >>> You can download dumps of Wikipedia from
>     >>> http://download.wikimedia.org -- they are licensed under the
>     >>> CC-BY-SA or GFDL -- both of which allow commercial use, providing
>     >>> changes made are redistributed under the same licence.
>     >>>
>     >>> Best regards,
>     >>>
>     >>> Fran
>     >>>
>     >>>
>     >>>
>     >>> _______________________________________________
>     >>> Corpora mailing list
>     >>> Corpora at uib.no <mailto:Corpora at uib.no>
>     >>> http://mailman.uib.no/listinfo/corpora
>     >>>
>     >>>
>     >>
>     >> Dumps of Wikipedia may be an interesting electronic text collection
>     >> that can be used to help address various linguistic research
>     >> questions, but I think that the request was for a corpus...and a
>     >> "dump" such as this couldn't be further from qualifying as a corpus,
>     >> if defined as "a collection of pieces of language, selected and
>     >> ordered according to explicit linguistic criteria in order to be
>     >> used as a sample of the language."
>     >>
>     >> The good news is that corpora are available. If you let us know what
>     >> sort of corpus you are looking for and for what sort of commercial
>     >> uses you intend to put them to, I am sure that there are plenty of
>     >> people here on the mailing list who can help you.
>     >>
>     >> Martin
>     >> Oxford Text Archive
>     >>
>     >>
>     >
>     >
>     >
>     > --
>     > ================================================
>     > Adam Kilgarriff
>     > http://www.kilgarriff.co.uk
>     > Lexical Computing Ltd             http://www.sketchengine.co.uk
>     > Lexicography MasterClass Ltd      http://www.lexmasterclass.com
>     > Universities of Leeds and Sussex  adam at lexmasterclass.com
>     > ================================================
>     >
>
>
>     --
>     Prof. Geoffrey Williams,
>     Vice Président des Relations Internationales
>     Professeur des universités en sciences du langage
>     Directeur du département d'ingénierie du document
>     UFR de Lettres, Sciences Humaines et Sociales
>     Université de Bretagne-Sud
>     4 rue Jean Zay,
>     BP92116
>     F-56321 LORIENT Cedex
>
>     tél: +33 (0) 2 97 87 29 20
>     fax: +33 (0) 2 97 87 65 25
>
>
>
>
> -- 
> ================================================
> Adam Kilgarriff                   http://www.kilgarriff.co.uk
> Lexical Computing Ltd             http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex  adam at lexmasterclass.com
> ================================================

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora