[Corpora-List] Corpus-building for minority languages

Kevin Patrick Scannell scannell at slu.edu
Fri Mar 19 16:01:12 UTC 2004


I've developed some simple web crawling software
that is designed to build corpora for minority
languages quickly and inexpensively.  See:

http://borel.slu.edu/crubadan/

Thus far it has been deployed in earnest only for Welsh
(now approaching 50 million words) and Irish
(15 million words).   The Welsh corpus is being
used by the lexicographers at the University of Wales
Dictionary of the Welsh Language:

http://www.aber.ac.uk/~gpcwww/

Of course the texts harvested in this way are
not statistically representative in any sense.
Nevertheless, they are useful for lexicography and for
number-crunching in natural language processing.
And extracting useful subsets shouldn't be hard;
I've done some of this for the Irish corpus
already.

The software has proved to be quite portable
across languages; it (very roughly) bootstraps
the language model from some initial "seed" texts
(or, even better, an initial word list).
I've done some experimentation with several other
languages: Catalan, Swahili, Maori, Faroese,
Scottish Gaelic, Walloon, Breton, Cebuano, and Manx
Gaelic.   You can see some results on the
status page:

http://borel.slu.edu/crubadan/stadas.html
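
As a rough illustration of the bootstrapping idea described above (this
is not the crawler's actual code, which the post does not show), here is
a minimal Python sketch.  It assumes a character-trigram profile as the
"language model"; the function names, the 300-trigram cutoff, and the
0.5 acceptance threshold are all hypothetical choices made for the
example.

```python
from collections import Counter


def _trigrams(text):
    """Lower-case the text, collapse whitespace, and list its character trigrams."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def trigram_profile(text, top_n=300):
    """Return the set of the top_n most frequent character trigrams in text."""
    counts = Counter(_trigrams(text))
    return {gram for gram, _ in counts.most_common(top_n)}


def overlap_score(profile, candidate):
    """Fraction of the candidate's trigrams that also occur in the profile."""
    grams = _trigrams(candidate)
    if not grams:
        return 0.0
    return sum(gram in profile for gram in grams) / len(grams)


def grow_corpus(seed_text, candidates, threshold=0.5):
    """Accept candidates that resemble the corpus so far.

    Each accepted text is appended to the corpus, so the profile is
    rebuilt from a larger sample before the next candidate is scored.
    """
    corpus = seed_text
    accepted = []
    for doc in candidates:
        if overlap_score(trigram_profile(corpus), doc) >= threshold:
            accepted.append(doc)
            corpus += " " + doc
    return accepted
```

In a sketch like this, the seed text would be a small sample in the
target language and the candidates would be the text of downloaded
pages; a real crawler would also need to handle fetching, encodings,
and pages in closely related languages.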

Please send me an email if you'd be interested
in helping develop one of these corpora or in
trying a new language.

-Kevin


