[Corpora-List] license question

John F. Sowa sowa at bestweb.net
Fri Aug 18 18:13:28 UTC 2006


There is a serious problem with that approach:

SS> This is why I advocate the procedure of distributing an
 > Internet-derived corpus as a list of URLs.

Unfortunately, URLs are subject to two limitations:

  1. They become "broken" whenever the web site or the
     directory structure is changed.

  2. Even when the URL is live, the content can be updated
     and changed at any time.

These two points make a collection of URLs a highly unstable
way to assemble or distribute a corpus.  They make it impossible
to compare an analysis performed at one time with an analysis
performed at another, since the two may not have been run on
the same text.
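
As an illustration of the two limitations, here is a minimal sketch of
how drift in a URL-list corpus could be detected.  It assumes the
distributor publishes a SHA-256 digest alongside each URL; that
manifest, and the names fetch_bytes and check_corpus, are hypothetical
and not part of any existing corpus tool.  The sketch only reports the
problem; it does not recover the original text.

```python
import hashlib
import http.client
import urllib.request
from typing import Callable, Dict, Optional


def fetch_bytes(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the raw body at url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed URLs.
        return None


def check_corpus(
    manifest: Dict[str, str],
    fetch: Callable[[str], Optional[bytes]] = fetch_bytes,
) -> Dict[str, str]:
    """Classify each URL as 'broken', 'changed', or 'unchanged'.

    manifest maps a URL to the hex SHA-256 digest recorded when the
    corpus was assembled (a hypothetical format for this sketch).
    """
    status = {}
    for url, recorded_digest in manifest.items():
        body = fetch(url)
        if body is None:
            status[url] = "broken"       # limitation 1: link no longer resolves
        elif hashlib.sha256(body).hexdigest() == recorded_digest:
            status[url] = "unchanged"
        else:
            status[url] = "changed"      # limitation 2: content was edited
    return status
```

Passing a different fetch function (for example, a dictionary lookup)
lets the check run without network access.  A page that embeds a
timestamp or a rotating advertisement would be reported as "changed"
on every run, which is one more reason a byte-level check is only a
rough indicator.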

John Sowa


