[Corpora-List] license question

Serge Sharoff S.Sharoff at leeds.ac.uk
Mon Aug 21 09:30:09 UTC 2006


> > Unfortunately, URLs are subject to two limitations:
> >
> >   1. They become "broken" whenever the web site or the
> >      directory structure is changed.
> >
> >   2. Even when the URL is live, the content can be updated
> >      and changed at any time.
> >
> > These two points make a collection of URLs a highly unstable
> > way to assemble or distribute a corpus.  They make it impossible
> > for any analysis performed at one instant of time to be compared
> > with any analysis performed at another time.
I mentioned these two problems in my original message, but I also mentioned the need to measure the rate of change.  Your reference to "highly unstable" implies that you know how unstable webpages are.  They are indeed not very stable.  Out of the original set of 1000 English URLs obtained from Google in September 2005, 972 were available online at that time.  Since then about 6-8 URLs from the original list have disappeared each month, so that this August we have 868 pages available from the original list.  The dropout rate is more or less the same for Chinese, German and Russian URLs (the language here refers to the main language used on a page, not to the location of its server).  Another experiment we haven't done yet is to measure statistical differences in the frequency of words and n-grams between the pages as retrieved now and as retrieved a year earlier.
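
The frequency comparison has not been run, so the following is only a rough sketch of what it might look like.  It assumes a naive regular-expression tokenizer and uses the log-likelihood (G2) statistic to rank words or n-grams by how much their frequency differs between two retrievals of a page; both choices, and all the function names, are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n=1):
    # Count contiguous n-grams, represented as tuples of tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def log_likelihood(a, b, total_a, total_b):
    # G2 statistic for one item seen a times in sample A and b times in sample B.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def compare_snapshots(old_text, new_text, n=1):
    # Rank n-grams by how strongly their frequency differs between snapshots.
    old = ngram_counts(tokenize(old_text), n)
    new = ngram_counts(tokenize(new_text), n)
    total_old, total_new = sum(old.values()), sum(new.values())
    scores = {
        gram: log_likelihood(old[gram], new[gram], total_old, total_new)
        for gram in set(old) | set(new)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling compare_snapshots(old_text, new_text, n=2) would do the same for bigrams.  Identical texts give a score of zero for every item, so large scores point at the words whose usage has shifted most.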

 
> One potential solution to these problems is to distribute URLs on the
> Internet Archive's "wayback machine" (www.archive.org).  If the URLs
> of interest are for pages that are present in the archive, locating a
> snapshot and confirming that the content is the same as your stored
> page should be relatively straightforward.   The Internet Archive is
> not always the most reliable option, since pages are sometimes
> unavailable or may not have been included in their snapshots in the
> first place, but in my experience it's not too terrible.
I agree.  Unfortunately, the Wayback Machine is down at the moment.  I'll try to run a comparison with its archive once it's back.
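
For the check itself, a minimal sketch might be the following.  It assumes the plain text has already been extracted from both the stored page and the archived snapshot (how to fetch the snapshot is left out), and it treats two pages as the same when their whitespace-normalized text has the same SHA-1 digest.  The function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    # Collapse whitespace runs and trim so layout-only changes are ignored.
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text):
    # SHA-1 hex digest of the normalized text, encoded as UTF-8.
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def same_content(stored_text, snapshot_text):
    # True when the two texts are identical after normalization.
    return fingerprint(stored_text) == fingerprint(snapshot_text)
```

This is a strict test: a single changed word makes the pages count as different, so a similarity measure may be more useful if small edits are to be tolerated.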

Serge


