Online etymological databases

Mon Aug 23 00:34:11 UTC 1999

On Thu, 12 Aug 1999, Jon Patrick wrote:

> Our method should be useful to appraise competing reconstructions of earlier
> languages, say Indoeuropean, however to date we have not been able to find the
> necessary data compiled in one place to easily apply it. Should anyone have a
> good database of appropriate data we would be happy to submit it to our
> methods.

It doesn't exist yet.  I'm slowly working on one for Germanic, and there's
another effort which is working on one for all of Indo-European.
This latter effort is mostly being carried out by people in Holland.  I
don't have their URL handy, but a web search should turn it up.

I think that the general idea of the Holland group is excellent, but I
think the way they're carrying it out is unfortunate (and I've discussed
this with them): they're assigning each branch to a specialist in that
branch, and they're not planning to release anything to the public until
they have a product in a fairly advanced stage of completion.  I think
they could take a real lesson from free software efforts such as Linux,
where the slogan is "release early, release often".  They could get a lot
of enthusiastic volunteer help and a lot of useful feedback and
contribution if they'd open the project up to the public.

For example, when I recently asked for volunteers to each type a few pages
of a glossary of Old English whose copyright has expired, so that there
could be a free online glossary of OE, I got a strong response.  The whole
glossary was covered in a little over a month.  Many of the responses were
from enthusiastic non-specialists (altho some specialists took some pages
too).  Sharing the work this way gets things done quickly.

I'm poking at making a free online etymological database of the older
Germanic languages; this database will be totally free, so that others can
freely create useful derived works from it.  As a preliminary, I'm taking
old glossaries whose copyright has expired and putting them online.  I've
done this with a Gothic glossary and am working on the one for Old
English; I'll probably do Old High German next and then work on actually
creating the database.

The next step toward that end will be to mark these glossaries up,
probably with SGML tags of some sort, so that all the information can
automatically be folded into one big database.  This markup can probably
be largely done by program, followed by hand correction.
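
As a rough illustration of "markup by program, followed by hand
correction", here is a minimal Python sketch.  Everything specific in it
is an assumption made for the example: the one-line entry layout
("headword, part-of-speech. gloss"), the tag names (entry, hw, pos,
gloss), and the function names.  Real glossaries will be messier, and the
actual tag set is still undecided.  The point is only that lines which fit
the pattern get tagged automatically, and lines which don't are set aside
for a human to fix.

```python
import re
from html import escape

# Hypothetical entry layout, assumed for this sketch only:
#   "headword, part-of-speech. gloss"
ENTRY_RE = re.compile(r"^(?P<head>[^,]+),\s*(?P<pos>[^.]+)\.\s*(?P<gloss>.+)$")


def mark_up(line):
    """Return an SGML-style tagged entry, or None if the line doesn't fit."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    # Escape &, < and > so the glossary text cannot break the markup.
    head, pos, gloss = (
        escape(m.group(name).strip(), quote=False)
        for name in ("head", "pos", "gloss")
    )
    return f"<entry><hw>{head}</hw><pos>{pos}</pos><gloss>{gloss}</gloss></entry>"


def mark_up_glossary(lines):
    """Tag every line that fits; collect the rest for hand correction."""
    tagged, needs_review = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        result = mark_up(line)
        if result is None:
            needs_review.append((number, line))  # keep line number for the editor
        else:
            tagged.append(result)
    return tagged, needs_review
```

Calling mark_up_glossary on the lines of a typed-in glossary gives back
the tagged entries plus a list of (line number, text) pairs that still
need a person to look at them.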

My long-term dream is to have all the references in this database which
anyone could need: concordance-style references to instances of a word in
the text, full conjugations and declensions, alternate spellings, pointers
between cognates in related languages, pointers between compounds and
their constituent elements, etc.  Once there is such a database, one of
the projects I'd like to tinker with is automated language reconstruction;
I've got some ideas about how one might write a program to do this.  The
database has to come first, tho.
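
To make the kinds of cross-references listed above a little more
concrete, here is one possible in-memory layout, again in Python.  This is
only a sketch: the class names (Lexeme, EtymDatabase), the field names,
and the choice to key entries by a (language, headword) pair are all
assumptions for illustration, not a settled design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (language, headword); an assumed way to identify entries


@dataclass
class Lexeme:
    language: str
    headword: str
    gloss: str
    spellings: List[str] = field(default_factory=list)     # alternate spellings
    attestations: List[str] = field(default_factory=list)  # concordance-style refs
    cognates: List[Key] = field(default_factory=list)      # pointers to related words
    constituents: List[Key] = field(default_factory=list)  # parts of a compound

    @property
    def key(self) -> Key:
        return (self.language, self.headword)


class EtymDatabase:
    def __init__(self) -> None:
        self.entries: Dict[Key, Lexeme] = {}

    def add(self, lexeme: Lexeme) -> None:
        self.entries[lexeme.key] = lexeme

    def link_cognates(self, a: Key, b: Key) -> None:
        """Record a cognate pointer in both directions, without duplicates."""
        for src, dst in ((a, b), (b, a)):
            if dst not in self.entries[src].cognates:
                self.entries[src].cognates.append(dst)

    def cognates_of(self, key: Key) -> List[Lexeme]:
        return [self.entries[k] for k in self.entries[key].cognates]


db = EtymDatabase()
db.add(Lexeme("Gothic", "dags", "day"))
db.add(Lexeme("Old English", "dæg", "day"))
db.link_cognates(("Gothic", "dags"), ("Old English", "dæg"))
```

With the pointers stored both ways, db.cognates_of(("Gothic", "dags"))
returns the Old English entry and vice versa.  A real database would
presumably live on disk rather than in memory, but the same pointer
structure would carry over.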

What bits I've got online are at
http://www.ling.upenn.edu/~kurisuto/language_resources.html

I'd be quite open to working with others on developing these resources.
One of my main priorities is to produce resources which are free, both in
terms of cost and in terms of freedom from any intellectual property
encumbrances.

  \/ __ __    _\_     --Sean Crist  (kurisuto at unagi.cis.upenn.edu)
 ---  |  |    \ /     http://www.ling.upenn.edu/~kurisuto/
  _| ,| ,|   -----
  _| ,| ,|    [_]
   |  |  |    [_]