[Corpora-List] Community-driven corpus building
Martin Reynaert
reynaert at uvt.nl
Thu Apr 14 13:26:24 UTC 2011
Dear list,
In the thread 'Spellchecker evaluation corpus', Stefan Bordag just
described a plugin which strikes me as having far greater potential
than the use he envisages, hence this new thread.
Stefan wrote: "perhaps producing such a corpus wouldn't be so difficult
after all. Perhaps all it takes is a custom plugin for Open Office which
people can use when they review documents they write in OO for errors.
In this plugin, simply by clicking some accept button provided by the
plugin they'd consent to have both the original version and the revised
version sent to some database known to the plugin. With some time
perhaps a sizeable collection of all sorts of corrections in all sorts
of languages could be produced by this."
What Stefan defines here appears to me to be a killer application for
corpus building.
Setting up this kind of system implies that people donate their texts
along with those texts' editing histories. Done in this manner, it
would in fact allow for the fully automatic, community-driven building
of corpora of contemporary written text, for any language and for any
kind of corpus research.
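
To make the idea concrete, here is the kind of thing I imagine the
plugin sending off, sketched in Python. The endpoint and all field
names are pure invention on my part, nothing more than a sketch:

    # A minimal sketch of the donation such a plugin might send off.
    # The endpoint URL and all field names are hypothetical.
    import json
    import urllib2

    original_text = "Teh cat sat on teh mat."  # before the revisions
    revised_text = "The cat sat on the mat."   # after the corrections

    donation = {
        "original": original_text,
        "revised": revised_text,
        "language": "en",
        "consent": True,  # the author clicked the plugin's accept button
    }
    request = urllib2.Request(
        "http://corpus-donations.example.org/donate",  # hypothetical
        json.dumps(donation),
        {"Content-Type": "application/json"})
    urllib2.urlopen(request)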
This would solve the two major bottlenecks we encounter daily in
building a large reference corpus of contemporary written Dutch:
IPR settlement and metadata/text processing.
Who better than the author, at the time of donation, to supply the
necessary metadata? For instance:
- personal: allowing the author to determine what level of personal
information (s)he wishes to be associated with the particular text
- text: information about encoding, text type, register, style
- language: with the possibility of indicating his/her level of
proficiency
- processing: whether spelling/grammar checking was applied, using which
particular tools...
- etc.
However casually mentioned, some of the types of information listed
above are not collected in our corpus today, and cannot be: once the
author is out of the loop, that information is effectively lost.
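
By way of illustration only, the donor-supplied record might look
something like the sketch below; the field names are mine and do not
follow any existing scheme:

    # Illustrative metadata record as the author might supply it at
    # donation time; all field names are invented.
    metadata = {
        "personal": {
            "divulgence_level": "low",  # how much the author reveals
        },
        "text": {
            "encoding": "UTF-8",
            "text_type": "e-mail",
            "register": "informal",
        },
        "language": {
            "code": "en",
            "proficiency": "non-native",
        },
        "processing": {
            "spellchecked": True,
            "tool": "OpenOffice.org 3.3",
        },
    }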
All this metadata could then be incorporated automatically into a
suitable metadata scheme (e.g. CMDI), and the text itself, properly
segmented into sections, paragraphs, etc., with proper identification
of headers/footers, tables, pictures and the like, saved in a suitable
XML format and sent on. Compare this to what one currently obtains when
converting automatically from PDF...
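
A toy version of that packaging step, with the emphasis on toy: what
follows is not valid CMDI, merely a placeholder to show the shape of
the output:

    # Toy packaging step: wrap metadata and the segmented text in XML.
    # This is NOT valid CMDI; the element names are placeholders.
    import xml.etree.ElementTree as ET

    donation = ET.Element("donation")
    md = ET.SubElement(donation, "metadata", scheme="CMDI-like")
    ET.SubElement(md, "textType").text = "e-mail"
    ET.SubElement(md, "language").text = "en"
    body = ET.SubElement(donation, "text")
    section = ET.SubElement(body, "section")
    ET.SubElement(section, "head").text = "A header, properly identified"
    ET.SubElement(section, "p").text = "A first paragraph of running text."
    print ET.tostring(donation)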
The receiving web service would then incorporate the text into the
appropriate subcorpus according to, e.g., text type, assign it the
proper file name with the appropriate file number, and make it
available to other web services for further linguistic enrichment:
tokenization, POS tagging, automatic correction/normalization,
syntactic parsing, etc.
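
In rough outline, and with every name below (the corpus layout, the
enrichment steps) invented by me, the receiving end might amount to
little more than this:

    # Sketch of the receiving end: file the donation in the right
    # subcorpus and hand it on. All paths and names are invented.
    import os

    def incorporate(xml_string, text_type, corpus_root="/corpora/donated"):
        subcorpus = os.path.join(corpus_root, text_type)
        if not os.path.isdir(subcorpus):
            os.makedirs(subcorpus)
        file_number = len(os.listdir(subcorpus)) + 1
        path = os.path.join(subcorpus,
                            "%s-%06d.xml" % (text_type, file_number))
        with open(path, "w") as f:
            f.write(xml_string)
        # Downstream, one hypothetical web service per enrichment step:
        # tokenize(path); pos_tag(path); normalize(path); parse(path)
        return path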
Since the edit histories are included, this would also entail gathering
immensely valuable information on the writing process itself.
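
To show how little machinery the edit histories would need, a unified
diff from Python's standard difflib already captures a revision; the
sample sentence is, again, made up:

    # The writing process captured as a plain unified diff; difflib is
    # in the standard library, the example sentences are invented.
    import difflib

    original = ["Teh researcher send the draft yesterday.\n"]
    revised = ["The researcher sent the draft yesterday.\n"]
    for line in difflib.unified_diff(original, revised,
                                     fromfile="original", tofile="revised"):
        print line,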
I have a dream... To which I might add an adjective denoting a high
level of humidity. In which case, were I to donate this very text
through the service outlined above, I would naturally attach a low
level of divulgence of personal information within the corpus ;0)
Martin Reynaert
Coordinator Work Package Corpus Building SoNaR
ILK
UvT
The Netherlands