[Corpora-List] license question

Steven Bird sb at csse.unimelb.edu.au
Fri Aug 18 20:29:11 UTC 2006


There are a couple of workarounds:

Use an archive:
a) try to find all the URLs in the Internet Archive or Google's cache
b) submit missing URLs to such repositories (I think this can even be
     done for Google's cache, by setting a very large expiry time).
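
To illustrate option (a), here is a minimal sketch of pinning each corpus
URL to an Internet Archive snapshot. It assumes the Wayback Machine's
address form web.archive.org/web/<timestamp>/<original URL>, which
redirects to the capture nearest the requested time. The function name and
the example URLs are made up for illustration, and the sketch does not
check that a capture actually exists.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine URL for the capture nearest `timestamp`.

    `timestamp` is a digit string in YYYYMMDDhhmmss form, or a prefix of it
    such as YYYYMMDD. This only constructs the address; it makes no network
    request and cannot tell whether the page was ever archived.
    """
    if not (timestamp.isascii() and timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 ASCII digits")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical corpus list, pinned to the date the corpus was collected.
corpus_urls = [
    "http://example.com/articles/1.html",
    "http://example.com/articles/2.html",
]
pinned = [wayback_url(u, "20060818") for u in corpus_urls]
```

Distributing the pinned list rather than the raw URLs means everyone is
pointed at the same capture, provided one exists.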

Create an archive:
a) "mirror" a superset of the material on your own public website
b) publish URLs local to this site
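
For the mirror option, one possible way to derive the local URLs is
sketched below. Everything here is an assumption for illustration: the
mirror base address, the naming scheme (host name plus a hash of the
original URL), and the manifest layout. Recording a checksum of the
mirrored content next to each URL also lets a user tell whether what they
fetched matches what was analysed.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical public location of the mirror.
MIRROR_BASE = "https://example.org/corpus-mirror"


def mirror_url(original_url: str) -> str:
    """Map an original URL to a stable address on the mirror site."""
    host = urlsplit(original_url).hostname or "unknown-host"
    # A hash of the full URL gives a fixed-length, filesystem-safe name.
    name = hashlib.sha256(original_url.encode("utf-8")).hexdigest()[:16]
    return f"{MIRROR_BASE}/{host}/{name}"


def manifest_entry(original_url: str, content: bytes) -> dict:
    """Describe one mirrored document: source, mirror address, checksum."""
    return {
        "source": original_url,
        "mirror": mirror_url(original_url),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


entry = manifest_entry("http://example.com/articles/1.html", b"<html>...</html>")
```

The published corpus would then be the list of "mirror" addresses, with
the "source" field kept so the provenance of each document stays visible.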

On 8/19/06, John F. Sowa <sowa at bestweb.net> wrote:
> There is a serious problem with that approach:
>
> SS> This is why I advocate the procedure of distributing an
>  > Internet-derived corpus as a list of URLs.
>
> Unfortunately, URLs are subject to two limitations:
>
>   1. They become "broken" whenever the web site or the
>      directory structure is changed.
>
>   2. Even when the URL is live, the content can be updated
>      and changed at any time.
>
> These two points make a collection of URLs a highly unstable
> way to assemble or distribute a corpus.  They make it impossible
> for any analysis performed at one instant of time to be compared
> with any analysis performed at another time.


