Corpora: New Bookmark site for Corpus-based Linguists

David Lee david_lee00 at
Sun Dec 2 23:43:30 UTC 2001

Dear All,

I would like to announce to the list that I've just created a web site with
links for corpus-based linguistics. The web pages started as a resource for
the MA students here at Lancaster doing the corpus-based linguistics course,
but it may have a wider appeal. Before I go into any more detail,
however, I wish to thank the many corpus linguists who 'test-drove' the site
and took the time to give me valuable feedback. They've helped make the site
more accessible to all and more complete.

There are already a number of sites out there with similar content, but here
are a few key things about my site:

(1) it's up-to-date (I've checked practically all the links to make sure
there are no dead ones, except those which are permanently dead, i.e. no
known new URL!); (2) it focuses on links for linguists and lg teachers (not
NLP/lg engineering); (3) my listings are mostly annotated (i.e. have
descriptions of the links, so you don't always have to click a link to find
out what it's about); (4) I hope I've brought together in *one* place all
the information on corpora, software tools, bibliographies, references,
electronic papers, mailing lists, on-line courses, conferences, etc. that
people doing corpus work will possibly need.

The web site is, I believe, fairly exhaustive (for English corpora, tools,
and references, at least), but I would make the usual plea for people to
contact me with more links and resources that I've missed, if they spot any
mistakes (dead links, non-existent sites, wrong information, etc.), and
especially if they have written papers/notes/squibs which are available
*on-line* for downloading.

Please take the trouble to let me know about anything you'd like to share
with the rest of the research community (links, papers and resources... e.g.
if you've collected a (small) corpus or collection of materials which is
available on-line or could be made available). The usefulness of the site
will be increased if people actively participate to keep it current and

The URL alias for my site is:

which I think is rather nice ;-). Please bookmark this web address rather
than anything else, as this is permanent, whereas other page/frame addresses
may well change their names without warning. The downside of using this
mnemonic alias is that a little advert window pops up... just ignore it and
close it down immediately.

As I've said, I would appreciate it if people could have a look and give me
feedback (e.g. "I think it's great!" or if you have any suggestions on how
to improve the structure/organisation of information, or if there are any
glaring omissions), bearing in mind that this site is meant primarily for
*linguists* or lg teachers (and, secondarily, for humanities scholars) who
happen to work with corpora, not speech technologists or NLP people
(although I've also provided the most important 'technical' links, so that
people who wish to get more info in that direction may do so).

I think that this site is more organised and more complete than most of the
other sites that I've seen, which are geared more towards *NLP*/lg
technology, and also tend to lump everything on one page (so that you have
to wade through lots of undifferentiated stuff to find what you want).
I've tried not to replicate other bookmark sites (e.g. Mike Barlow's and
Manuel Barbera's, whose links for *non-English* corpora I have not even
tried to duplicate), but at the same time I've deliberately repeated some of
the main links for the sake of convenience (there is no point in continually
sending people to other people's web sites!), so that as much info as
possible is provided on-site.

Hope this will be of use to some.


David Lee

P.S. Some people may be interested to know that I've now produced a new
version of my BNC Index, for the BNC World Edition (BNC version 2). You
might want to download this new version if the World Edition is the corpus
you're now using (which you should do... the World Edition has fewer tagging
and text classification errors than BNC 1...). My new web address for this
is:   (the old address is now permanently gone and
no longer valid).
