[Corpora-List] license question

P Resnik psresnik at gmail.com
Fri Aug 18 18:59:37 UTC 2006


> Unfortunately, URLs are subject to two limitations:
>
>   1. They become "broken" whenever the web site or the
>      directory structure is changed.
>
>   2. Even when the URL is live, the content can be updated
>      and changed at any time.
>
> These two points make a collection of URLs a highly unstable
> way to assemble or distribute a corpus.  They make it impossible
> for any analysis performed at one instant of time to be compared
> with any analysis performed at another time.

One potential solution to these problems is to distribute URLs that
point to snapshots in the Internet Archive's "Wayback Machine"
(www.archive.org) rather than to the live pages.  If the URLs of
interest are for pages that are present in the archive, locating a
snapshot and confirming that its content is the same as your stored
page should be relatively straightforward.  The Internet Archive is
not always reliable, since pages are sometimes unavailable or may
never have been included in its snapshots in the first place, but in
my experience it's not too terrible.
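
The check described above might be sketched roughly as follows in
Python.  This is only an illustration, not the procedure used to build
any particular corpus.  It assumes the commonly seen snapshot address
form http://web.archive.org/web/<timestamp>/<original URL>; the
function names, the example timestamp, and the choice to compare
whitespace-normalized SHA-256 digests are all assumptions made for the
example.

```python
import hashlib
import re
import urllib.request


def wayback_url(original_url, timestamp):
    """Build a snapshot URL for a capture at `timestamp`.

    `timestamp` is a digit string such as "20030715000000"
    (YYYYMMDDhhmmss); shorter prefixes such as "200307" are accepted.
    """
    if not re.fullmatch(r"\d{4,14}", timestamp):
        raise ValueError("timestamp must be 4 to 14 digits")
    return "http://web.archive.org/web/%s/%s" % (timestamp, original_url)


def content_digest(page_bytes):
    """Hash a page after collapsing runs of whitespace, so that
    trivial whitespace differences do not count as a change."""
    normalized = b" ".join(page_bytes.split())
    return hashlib.sha256(normalized).hexdigest()


def matches_stored(stored_bytes, snapshot_url):
    """Fetch the snapshot (requires network access) and report whether
    its normalized content equals the locally stored copy."""
    with urllib.request.urlopen(snapshot_url) as response:
        fetched = response.read()
    return content_digest(fetched) == content_digest(stored_bytes)


if __name__ == "__main__":
    # Hypothetical page and capture date, for illustration only.
    print(wayback_url("http://example.com/page.html", "20030715000000"))
```

One caveat: an archived page as served may differ from the original
(for example, links rewritten to point back into the archive), so an
exact match may require stripping such changes before comparing, or a
looser similarity measure instead of a digest.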

I adopted this solution because it's a lot safer than just ignoring
copyright issues and distributing the pages, a lot easier than hunting
down copyright permissions for a zillion Web pages, and generally
better than using original URLs for the reasons noted above. For an
example, take a look at the July 2003 Chinese-English corpus I made
available in this way, at http://umiacs.umd.edu/~resnik/strand/.

  Philip


