Kevin Patrick Scannell: Celtic language corpora

Elizabeth J. Pyatt ejp10 at psu.edu
Fri Mar 19 18:07:16 UTC 2004


Delivered-To: celtling at listserv.linguistlist.org
From: Kevin Patrick Scannell <scannell at slu.edu>
Reply-To: scannell at slu.edu
To: CELTLING at LISTSERV.LINGUISTLIST.ORG
Subject: Celtic language corpora
Date: Fri, 19 Mar 2004 09:59:24 -0600

I've developed some simple web crawling software
that is designed to build corpora for minority
languages quickly and inexpensively.  See:

http://borel.slu.edu/crubadan/

Thus far it has been deployed in earnest only for Welsh
(now approaching 50 million words) and Irish
(15 million words).   The Welsh corpus is being
used by the lexicographers at the University of Wales
Dictionary of the Welsh Language:

http://www.aber.ac.uk/~gpcwww/

Of course the texts harvested in this way are
not statistically representative in any sense.
Nevertheless they are good for lexicography and
number-crunching for natural language processing.
And extracting useful subsets shouldn't be hard;
I've done some of this for the Irish corpus
already.

The software has proved to be quite portable
across languages; it (very roughly) bootstraps
the language model from some initial "seed" texts
(or, even better, an initial word list).
I've done some experimentation with several other
languages: Catalan, Swahili, Maori, Faroese,
Scottish Gaelic, Walloon, Breton, Cebuano, and Manx
Gaelic.   You can see some results on the
status page:

http://borel.slu.edu/crubadan/stadas.html
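
The bootstrapping idea mentioned above (start from seed texts or a
word list, then grow the model from what the crawler accepts) might be
sketched roughly as follows. This is only an illustration and not the
actual crawler code: the function names, the coverage threshold, and
the rule for extending the word list are all assumptions.

```python
import re

# Runs of Unicode letters, allowing internal apostrophes and hyphens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def coverage(text, lexicon):
    """Fraction of tokens in `text` that are already in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


def bootstrap(seed_texts, candidates, threshold=0.6):
    """Grow a word list from seed texts by accepting candidate documents.

    Hypothetical rule: a candidate is accepted when at least `threshold`
    of its tokens are already known; every token of an accepted document
    is then added to the word list.
    """
    lexicon = set()
    for seed in seed_texts:
        lexicon.update(tokenize(seed))

    accepted = []
    for doc in candidates:
        if coverage(doc, lexicon) >= threshold:
            accepted.append(doc)
            lexicon.update(tokenize(doc))
    return accepted, lexicon


if __name__ == "__main__":
    seeds = ["mae hi yn braf heddiw"]
    pages = [
        "mae hi yn oer heddiw",
        "the weather is cold today",
        "mae yn oer iawn",
    ]
    accepted, lexicon = bootstrap(seeds, pages)
    print(accepted)
```

In this toy run the third page is accepted only because the first
page added "oer" to the word list, which is the bootstrapping effect
in miniature. A real crawler would need a much stricter rule, since
adding every token of an accepted page lets foreign words leak in.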

Please send me an email if you'd be interested
in helping develop one of these corpora or in
trying a new language.

-Kevin




--
o.o.o.o.o.o.o.o.o.o

CELTLING
Post: celtling at lists.linguistlist.org OR celtling at listserv.linguistlist.org
Archives: <http://listserv.linguistlist.org/archives/celtling.html>
Subscribe/Unsubscribe - Go to Archives, then click "Join or leave" link

Website: <http://www.personal.psu.edu/ejp10/celtling>


