[Corpora-List] license question
John F. Sowa
sowa at bestweb.net
Fri Aug 18 18:13:28 UTC 2006
There is a serious problem with that approach:
SS> This is why I advocate the procedure of distributing an
> Internet-derived corpus as a list of URLs.
Unfortunately, URLs are subject to two limitations:
1. They become "broken" whenever the web site or the
directory structure is changed.
2. Even when the URL is live, the content can be updated
and changed at any time.
These two points make a collection of URLs a highly unstable
way to assemble or distribute a corpus. Because the underlying
texts may have disappeared or changed, there is no guarantee
that an analysis performed at one time can be compared with an
analysis performed at another.
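
To make the two limitations concrete, here is a minimal sketch of how
someone receiving a URL-list corpus might test each entry. It assumes
something the proposal quoted above does not mention: that a SHA-256
digest of each page was recorded when the corpus was compiled. The
names fingerprint, check_url, and fetch are hypothetical, and fetch
stands in for whatever retrieval routine is actually used.

```python
import hashlib
from typing import Callable, Optional


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of a page's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def check_url(url: str, recorded_digest: str,
              fetch: Callable[[str], Optional[bytes]]) -> str:
    """Classify a corpus entry as 'broken', 'changed', or 'unchanged'.

    `fetch` is a hypothetical callable that returns the page bytes,
    or None when the URL can no longer be retrieved.
    """
    content = fetch(url)
    if content is None:
        return "broken"       # limitation 1: the link is dead
    if fingerprint(content) != recorded_digest:
        return "changed"      # limitation 2: live, but the text differs
    return "unchanged"


# Usage with an in-memory stand-in for the web (no network access).
web = {"http://example.org/a": b"original text"}
digest = fingerprint(web["http://example.org/a"])
status = check_url("http://example.org/a", digest, web.get)
```

Such a check only detects the instability. It cannot restore the
original text, so the two analyses still cannot be compared once an
entry comes back as broken or changed.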
John Sowa