[Corpora-List] Legal aspects of compiling corpora

Jason Eisner jason at cs.jhu.edu
Sat Jun 14 19:03:36 UTC 2003


Larry Spitz writes:

> Aside from the legal aspect of collecting text are the legal aspects of
> collecting scanned images of documents. For those of us who are interested
> in the analysis of document images obtaining databases of images is quite
> difficult, particularly generally available databases where the results of
> individual research can be compared.
>
> Since the University of Washington and the University of Nevada, Las Vegas
> have stopped publishing such databases, I do not know of anyone who is in
> the process of doing so.

Larry,

The ACL Anthology at http://www.aclweb.org/anthology is such a
database, containing about 44,000 pages so far.  It is a fairly
comprehensive archive of articles from the major computational
linguistics conferences, journals, and workshops since they began in
1979.  Choose the US mirror to get the most up-to-date version.

The anthology's editors may wish to jump in and correct me here, but I
believe that all of the 20th-century papers were scanned from paper
copies, as no electronic proceedings were available.  The scans
were done recently and are of high quality.  The documents are
provided as PDF image files that also seem to contain an OCR'd copy of
the text, allowing the text to be highlighted and searched.  The OCR
has occasional mistakes, particularly on formulas, but generally seems
excellent.

> one of the real problems is getting copyright permission on document images.

The notice on the anthology says:

  COPYRIGHT: These materials are Copyright (C) 1979-2003
  ACL. Permission is granted to make copies for the purposes of
  teaching and research.

Also note:

  The ACL requests your help to support this effort financially. The
  total cost of digitizing past publications will be approximately
  $50,000. All other activities associated with the project are being
  done with free labor.  All the resulting materials will be available
  for free on the web.

Cheers,
Jason Eisner
Johns Hopkins University


