[Corpora-List] Community-driven corpus building

Thu Apr 14 13:26:24 UTC 2011

Dear list,

In the thread 'Spellchecker evaluation corpus', Stefan Bordag just 
described a plug-in which strikes me as having far greater potential 
than the use he envisages, hence this new thread.

Stefan wrote: "perhaps producing such a corpus wouldn't be so difficult 
after all. Perhaps all it takes is a custom plugin for Open Office which 
people can use when they review documents they write in OO for errors. 
In this plugin, simply by klicking some accept button provided by the 
plugin they'd consent to have both the original version and the revised 
version sent to some database known to the plugin. With some time 
perhaps a sizeable collection of all sorts of corrections in all sorts 
of languages could be produced by this."
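
To make the idea concrete, here is a minimal sketch in Python of what 
the receiving side could do with one such pair of versions. It is not 
Stefan's plug-in: the function name extract_corrections is hypothetical, 
the comparison is a plain word-level diff from the standard library, and 
whitespace splitting stands in for real tokenization.

```python
import difflib


def extract_corrections(original, revised):
    """Return (operation, original words, revised words) for each edit.

    Hypothetical helper: compares the two versions word by word.
    """
    a, b = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replacements, insertions, deletions
            corrections.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return corrections


pairs = extract_corrections(
    "simply by klicking some accept button",
    "simply by clicking some accept button",
)
```

Each resulting tuple pairs an erroneous span with its correction, which 
is the raw material a spellchecker evaluation corpus needs.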

What Stefan defines here appears to me to be a killer application for 
corpus building.

Setting up this kind of system implies that people donate their texts 
and their texts' editing history. The manner in which this is done would 
in fact allow for the fully automatic, community-driven building of 
corpora of contemporary written text. For any language, for any kind of 
corpus research.

This would solve the two major bottlenecks we encounter daily in 
building a large reference corpus of contemporary written Dutch: 
IPR settlement and metadata/text processing.

Who better than the author, at the time of donation, to supply the 
necessary metadata?

- personal: allowing the author to determine what level of personal 
information (s)he wishes to be associated with the particular text
- text: information about encoding, text type, register, style
- language: with the possibility of indicating his/her level of proficiency
- processing: whether spelling/grammar checking was applied, using which 
particular tools...
- etc.

Casually as they are mentioned here, some of the types of information 
listed above are not, and cannot be, collected for our corpus today.

All this metadata could then be incorporated automatically into a 
suitable metadata scheme (e.g. CMDI). The text itself, properly 
segmented into sections, paragraphs, etc., with proper identification 
of headers/footers, tables, pictures, etc., could be saved in a 
suitable XML format and sent on. Compare this to what one currently 
obtains when automatically converting from PDF...
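
As a rough illustration of such a donation package, the Python sketch 
below bundles author-supplied metadata with a paragraph-segmented text 
in one XML document. Everything named here is an assumption made for 
the example: the DonationMetadata fields, the build_donation function 
and the element names are hypothetical and do not follow the actual 
CMDI schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class DonationMetadata:
    # Hypothetical fields; a real scheme such as CMDI defines its own.
    text_type: str
    language: str
    proficiency: str = "unspecified"
    disclosure_level: str = "anonymous"  # how much personal info to attach
    tools_used: list[str] = field(default_factory=list)


def build_donation(meta, paragraphs):
    """Wrap metadata and segmented text in one XML element tree."""
    root = ET.Element("donation")
    md = ET.SubElement(root, "metadata")
    ET.SubElement(md, "textType").text = meta.text_type
    ET.SubElement(md, "language", proficiency=meta.proficiency).text = meta.language
    ET.SubElement(md, "disclosure").text = meta.disclosure_level
    for tool in meta.tools_used:
        ET.SubElement(md, "tool").text = tool
    body = ET.SubElement(root, "text")
    for number, paragraph in enumerate(paragraphs, start=1):
        ET.SubElement(body, "p", n=str(number)).text = paragraph
    return root


meta = DonationMetadata("email", "en", proficiency="non-native",
                        tools_used=["spellchecker"])
donation = build_donation(meta, ["Dear list,", "I have a dream..."])
xml_string = ET.tostring(donation, encoding="unicode")
```

Because the author fills in the record at the moment of donation, the 
structure arrives intact and nothing has to be recovered afterwards.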

The receiving web service would then incorporate the text into the 
appropriate subcorpus according to, e.g., text type, assign it the 
proper file name with the appropriate file number, and make it 
available to other web services for further linguistic enrichment: 
tokenization, POS tagging, automatic correction/normalization, 
syntactic parsing, etc. Given the included edit histories, this would 
of course also entail gathering immensely valuable information on the 
writing process itself.
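
The routing step could be sketched in Python along the following 
lines. This is only an illustration under stated assumptions: the 
CorpusStore class, the "CORPUS" prefix, the file-name pattern and the 
enrich helper are all hypothetical, storage is in memory, and 
lowercasing plus whitespace splitting merely stand in for real 
enrichment services.

```python
from collections import defaultdict


class CorpusStore:
    """In-memory stand-in for the receiving web service (hypothetical)."""

    def __init__(self, prefix="CORPUS"):
        self.prefix = prefix
        self.counters = defaultdict(int)  # last file number per subcorpus
        self.files = {}                   # file name -> text

    def ingest(self, text, text_type):
        """File the text under its subcorpus and return the assigned name."""
        subcorpus = text_type.strip().lower() or "unclassified"
        self.counters[subcorpus] += 1
        name = f"{self.prefix}-{subcorpus}-{self.counters[subcorpus]:06d}.xml"
        self.files[name] = text
        return name


def enrich(text, steps):
    """Apply enrichment steps in order; each step transforms the result."""
    result = text
    for step in steps:
        result = step(result)
    return result


store = CorpusStore()
first = store.ingest("Dear list, ...", "Email")
second = store.ingest("Another message", "email")
tokens = enrich("Dear list , hello", [str.lower, str.split])
```

Keeping one counter per subcorpus means file numbers stay consecutive 
within each text type, whatever order the donations arrive in.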

I have a dream... To which I might add an adjective denoting a high 
level of humidity. In which case, when donating this very text through 
the service outlined above, I would naturally attach a low level of 
divulgence of personal information within the corpus ;0)

Martin Reynaert
Coordinator Work Package Corpus Building SoNaR
ILK
UvT
The Netherlands
