[Corpora-List] Free text corpora?

Magnar Brekke Magnar.Brekke at nhh.no
Wed Mar 3 15:41:21 UTC 2010


Without stepping directly into this controversy I would like to argue for a Wittgensteinian position: whether or not a given object qualifies as a corpus depends not on its material constitution but on the uses to which it is put - and the rest follows from that. Thus a person traveling from town to town giving recitals of (excerpts from) the complete works of Jane Austen would most likely be not a corpus linguist but an actor. Should someone wish to offer a generalization on (say) "Anomalies in Jane Austen's use of emotive adjectives in postmodifying position" based on Austen's complete works, that person would most likely be a corpus linguist.
 
Regards,
Magnar Brekke

________________________________

From: corpora-bounces at uib.no on behalf of Adam Kilgarriff
Sent: Wed 03.03.2010 11:25
To: Geoffrey Williams
Cc: Xin Yan; corpora at uib.no
Subject: Re: [Corpora-List] Free text corpora?


Sinclair was wrong. 

For the argument in more detail see the opening section of Kilgarriff and Grefenstette 2003 Introduction to the Special Issue on Web as Corpus. <http://kilgarriff.co.uk/Publications/2003-KilgGrefenstette-WACIntro.pdf>  Computational Linguistics 29 (3)
(which I wrote before reading Sinclair's revised version; since reading that, I use Sinclair as my protagonist, much more suitable than McEnery and Wilson, whom I quote but who, unlike Sinclair, don't really say anything I disagree with).

Adam


On 3 March 2010 06:45, Geoffrey Williams <geoffrey.williams at univ-ubs.fr> wrote:


	Dear Adam,
	
	Watering down of a discipline in order to take on board all comers is not
	a good thing. Far from strangling corpus linguistics Martin is reaffirming
	its very basis. Sinclair et al's OSTI report laid down the basics. These
	have been improved over the years, gaining greater clarity, until the 1996
	EAGLES definition was published. This was a benchmark until Sinclair's
	2005 revised version, discussed in detail in the book edited by Martin.
	
	Dumps are about rubbish, either legal or fly dumping. The result is still
	rubbish in, rubbish out. Corpus linguistics has to have a benchmark,
	otherwise it would cease to exist.
	
	However corpus linguistics does not have a monopoly on the word corpus
	which is somewhat polysemic. If a Wikipedia or other dump works for
	testing some commercial application or NLP project, then why not use it?
	On the other hand, don't say that corpus linguistics is being done.
	
	The Corpora list is enriched by the experience of many who use corpora in
	different ways. Not all are corpus linguists, neither should they be. It
	is essential however that those disciplines involved know their research
	protocols. This is what Martin's timely reminder is about.
	
	Perfection is difficult to achieve, but that does not make it a less
	worthwhile goal.
	
	Best
	
	Geoffrey
	


	>> ...and a "dump" such as this couldn't be further from qualifying as a
	>> corpus, if defined as "a collection of pieces of language, selected and
	>> ordered according to explicit linguistic criteria in order to be used as
	>> a sample of the language."
	>
	> Sorry, Martin, but your definition of 'corpus' reads like it's designed to
	> strangle the life out of corpus linguistics. It begs questions about
	> selection and ordering (?? how does ordering come into it?) and explicit
	> linguistic criteria, and demotes many things that people refer to as
	> 'corpora' to a lower form of life.   Lexicographically, bad.
	>
	> I think it's a dream of some corpus linguists as to what they think a
	> corpus should be, not a fact about how the word is used.  But, delete the
	> middle clause and we're in agreement:
	>    "a collection of pieces of language, used as a sample of the language"
	>
	> Adam
	>
	> On 2 March 2010 22:55, Martin Wynne <martin.wynne at oucs.ox.ac.uk> wrote:
	>
	>> Francis Tyers wrote:
	>>
	>>> On Tue, 02.03.2010 at 12:38 +0100, Xin Yan wrote:
	>>>
	>>>
	>>>
	>>>> Hello,
	>>>>
	>>>> can anyone tell me, if there are some free text corpora for commercial
	>>>> purpose?
	>>>> Thank you in advance!
	>>>>
	>>>>
	>>>
	>>> You can download dumps of Wikipedia from http://download.wikimedia.org
	>>> -- they are licensed under the CC-BY-SA or GFDL -- both of which allow
	>>> commercial use, providing changes made are redistributed under the same
	>>> licence.
	>>>
	>>> Best regards,
	>>>
	>>> Fran
	>>>
	>>>
	>>>
	>>> _______________________________________________
	>>> Corpora mailing list
	>>> Corpora at uib.no
	>>> http://mailman.uib.no/listinfo/corpora
	>>>
	>>>
	>>
	>> Dumps of Wikipedia may be an interesting electronic text collection that
	>> can be used to help address various linguistic research questions, but I
	>> think that the request was for a corpus...and a "dump" such as this
	>> couldn't be further from qualifying as a corpus, if defined as "a
	>> collection of pieces of language, selected and ordered according to
	>> explicit linguistic criteria in order to be used as a sample of the
	>> language."
	>>
	>> The good news is that corpora are available. If you let us know what sort
	>> of corpus you are looking for and what sort of commercial uses you intend
	>> to put it to, I am sure that there are plenty of people here on the
	>> mailing list who can help you.
	>>
	>> Martin
	>> Oxford Text Archive
	>>
	>>
	>>
	>
	>
	>
	> --
	> ================================================
	> Adam Kilgarriff
	> http://www.kilgarriff.co.uk
	> Lexical Computing Ltd                   http://www.sketchengine.co.uk
	> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
	> Universities of Leeds and Sussex       adam at lexmasterclass.com
	> ================================================
	>
	
	
	
	--
	Prof. Geoffrey Williams,
	Vice Président des Relations Internationales
	Professeur des universités en sciences du langage
	Directeur du département d'ingénierie du document
	UFR de Lettres, Sciences Humaines et Sociales
	Université de Bretagne-Sud
	4 rue Jean Zay,
	BP92116
	F-56321 LORIENT Cedex
	
	tél: +33 (0) 2 97 87 29 20
	fax: +33 (0) 2 97 87 65 25
	
	




-- 
================================================
Adam Kilgarriff                        http://www.kilgarriff.co.uk
Lexical Computing Ltd                  http://www.sketchengine.co.uk
Lexicography MasterClass Ltd           http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

