[Corpora-List] Do we still need language corpora?

Martin Mueller martin.mueller at mac.com
Fri Feb 4 14:30:21 UTC 2011


Diachronic corpora pose problems and opportunities of their own. As you go back in time, orthographic and typographical variance increases. At the same time, past corpora are in principle finite, and once you move back in time beyond 1800 the rates of production decrease sharply. Thus it is imaginable to have corpora of past epochs that include much or all of what has survived. It is worthwhile and for many purposes necessary to do a lot of data curation to make the textual data fully comparable and explorable.

A big question about corpora is "more of what?"  If storage space and bandwidth are cheap, it is tempting to just get more stuff of whatever it is. And there are certainly scenarios where more is always better because some rare phenomenon might turn up for the first time in the next batch of "more." 

But for other inquiries, it is likely to remain true that you have to do a lot to your data before you can do useful things with them. Data curation is as tedious as prepping your house before painting. I suspect that just getting more stuff is often an excuse for avoiding the boring work of getting your stuff into the shape that makes it yield results.

Is that too cynical?

On Feb 3, 2011, at 10:43 AM, Martin Wynne wrote:

> Do we still need language corpora? Is the selection and preparation of the carefully crafted corpus a waste of time and money these days when large amounts of language data are freely available on the web?
> 
> If you are interested in this question then come along to a debate, organised by Martin Wynne and Ylva Berglund (Oxford) on the afternoon of June 1st 2011, and as a pre-conference event at the ICAME conference in Oslo.
> 
> More details are available at:
> 
> http://www.hf.uio.no/ilos/english/research/conferences/2011/icame2011/workshops.html
> 
> The session will be a formal debate, with two speakers for and two against the motion, with the opportunity for questions from the floor, and a summing up by the speakers, ending with a vote by the audience. The motion will be:
> 
> "Language corpora are no longer necessary for linguistic research."
> 
> All participants in the ICAME conference are warmly encouraged to come along and participate in what promises to be an entertaining and important debate on a key question confronting corpus linguistics today.
> 
> Time and place
> 
> The workshop will be held in the afternoon between 14.00 and 16.00 on Wednesday 1 June at the conference hotel and main venue, the Clarion Royal Christiania Hotel, located in Oslo city centre. After the workshop, the conference proper will start at 17.00 with the opening plenary by David Crystal in the Old Ceremonial Theatre of the University of Oslo (also in the city centre). Participants are required to register for the main conference, but attendance at the workshops is free. Please indicate on the registration form if you intend to attend the conference. The conference website is at:
> 
> http://www.hf.uio.no/ilos/english/research/conferences/2011/icame2011/index.html
> 
> Discussion of the proposition is also welcome on this list!
> 
> -- 
> Martin Wynne
> Research Technologies Service &
> Oxford e-Research Centre
> 
> Oxford University Computing Services
> 7-19 Banbury Road
> Oxford
> UK - OX2 6NN
> Tel: +44 1865 283299 or +44 1865 610677
> Fax: +44 1865 273275
> martin.wynne at oucs.ox.ac.uk
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora




