[Corpora-List] Do we still need language corpora?

Martin Reynaert reynaert at uvt.nl
Thu Feb 3 20:52:05 UTC 2011


Dear Corpora List,

As one who is actively engaged in building a more 'old-style' corpus for 
Dutch, I do not particularly feel the need to enquire whether this is 
an open invitation for debate.

A very important consideration in this regard is, in my opinion, the open 
or not-so-open availability, at least for research purposes, of the 
compiled corpus. This is ultimately a question of replicability of 
research findings.

In the Netherlands and the Netherlandic part of Belgium, i.e. Flanders, 
we are currently building a transnational corpus of contemporary written 
Dutch. Its size is to be five times that of the BNC, i.e. 500 million 
word tokens. While the intent is 'old-style', its contents are squarely 
oriented towards 'new media' as well as the traditional text types.

The corpus is being built under the aegis of the Dutch Language Union 
and is meant to fill an important gap in the "Basic Language Resource 
Kit" for Dutch. I will not repeat this term's acronym here, as I 
personally regard it as a very good candidate for the ugliest word in 
English (more info and interesting links can be found here: 
http://lands.let.ru.nl/~strik/research/BLaRK.html).

SoNaR, the reference corpus I am helping to build, is also meant to help 
safeguard the future of Dutch as such in light of the current 
encroachment of English on our language and, arguably, culture.

In an LREC paper last year we tried to list our preliminary 
findings as regards the differences between our endeavours and e.g. web 
corpus building efforts. In our approach, IPR-issues are cleared for 
every document incorporated in the corpus, which ensures open 
availability of the corpus and will help to ensure that results can be 
independently verified and/or duplicated.

Our paper is available here:

http://ilk.uvt.nl/publications/

Look under publications, 2010 for: 'Balancing SoNaR: IPR versus 
Processing Issues in a 500-Million-Word Written Dutch Reference Corpus'.

I post this in the hope of fostering and perhaps helping to guide 
further discussion about this on the list.

Yours,

Martin Reynaert
Coordinator SoNaR, work package Corpus Building
ILK
Tilburg University
The Netherlands

More info about the STEVIN programme of the Dutch Language Union: 
http://taalunieversum.org/taal/technologie/stevin/english/

A little more info about SoNaR (not very usefully, only in Dutch...): 
http://lands.let.ru.nl/projects/SoNaR/intro.html


Martin Wynne wrote:
> Do we still need language corpora? Is the selection and preparation of 
> the carefully crafted corpus a waste of time and money these days when 
> large amounts of language data are freely available on the web?
>
> If you are interested in this question then come along to a debate, 
> organised by Martin Wynne and Ylva Berglund (Oxford) on the afternoon 
> of June 1st 2011, and as a pre-conference event at the ICAME 
> conference in Oslo.
>
> More details are available at:
>
>  http://www.hf.uio.no/ilos/english/research/conferences/2011/icame2011/workshops.html 
>
>
> The session will be a formal debate, with two speakers for and two 
> against the motion, with the opportunity for questions from the floor, 
> and a summing up by the speakers, ending with a vote by the audience. 
> The motion will be:
>
> "Language corpora are no longer necessary for linguistic research."
>
> All participants in the ICAME conference are warmly encouraged to come 
> along and participate in what promises to be an entertaining and 
> important debate on a key question confronting corpus linguistics today.
>
> Time and place
>
> The workshop will be held in the afternoon between 14.00 and 16.00 on 
> Wednesday 1 June at the conference hotel and main venue, the Clarion 
> Royal Christiania Hotel, located in Oslo city centre. After the 
> workshop, the conference proper will start at 17.00 with the opening 
> plenary by David Crystal in the Old Ceremonial Theatre of the 
> University of Oslo (also in the city centre). Participants are 
> required to register for the main conference, but attendance at the 
> workshops is free. Please indicate on the registration form if you 
> intend to attend the conference. The conference website is at:
>
>  http://www.hf.uio.no/ilos/english/research/conferences/2011/icame2011/index.html 
>
>
> Discussion of the proposition is also welcome on this list!
>

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


