[Corpora-List] license question
Dominic Widdows
widdows at maya.com
Fri Aug 18 21:10:00 UTC 2006
As you say, these are workarounds, and I don't think they answer the
substance of John's objections. The Internet Archive might not have
the data, and Google's cache makes no long-term commitment to
keep dated corpora for public use. (They might do so one day - their
publication of the 5-gram corpus is a big step in the right
direction, I hope.) Once you've copied the material to your own
website, you are effectively back to copying the
whole corpus rather than using references - and there is no
guarantee that your copy will stay synchronized with the original.
At the risk of beating a drum, I believe that we have prototyped the
long-term solution to these problems at MAYA Design with an
extensible peer-to-peer database. The idea of using this technology
for language corpora is described in our LREC paper at
http://www.maya.com/local/widdows/papers/lrec-distributed-corpora.pdf
Represent individual texts as objects in a peer-to-peer network, and
represent larger corpora as collections of universal references to these
texts. But don't use location-dependent URLs, because they're
brittle, and they place the hosting costs on the worthy individuals
who put the effort into gathering the corpus in the first place.
Instead, use location-independent universal identifiers (as the Free
Software Foundation has done for years, with a scheme that is now part
of the official URN namespace), and encourage replication at the point of
use. Use digital signatures to make sure that the data hasn't
changed. If the publishing organization wants to go the whole way and
make sure that the contents can never change, incorporate part of the
digital signature into the identifier of each object. You can also
use this as the core data for sharing standoff annotation,
collaborative filtering, etc.
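To make this concrete, here is a minimal sketch in Python of the
identifier idea. It is not MAYA's actual scheme: it uses a plain
cryptographic hash where the paragraph above talks about digital
signatures, and the urn:example:corpus namespace and the function names
are made up for illustration. It builds a location-independent
identifier that embeds a digest of the text, and checks a retrieved
copy against it:

    import hashlib

    NAMESPACE = "urn:example:corpus"   # hypothetical namespace, illustration only

    def make_identifier(text):
        """Derive a location-independent identifier embedding a content digest."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return "%s:%s" % (NAMESPACE, digest)

    def verify(identifier, text):
        """Check that a retrieved copy still matches the digest in its identifier."""
        digest = identifier.rsplit(":", 1)[-1]
        return hashlib.sha256(text.encode("utf-8")).hexdigest() == digest

    doc = "Call me Ishmael. Some years ago ..."
    ident = make_identifier(doc)
    print(ident)                     # urn:example:corpus:<64 hex characters>
    print(verify(ident, doc))        # True
    print(verify(ident, doc + "!"))  # False - the content has changed

A corpus release is then just a list of such identifiers, and anyone
who obtains the texts - from any replica - can check that nothing has
drifted.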
And then you're more or less done - provided you can solve the peer-
to-peer routing problem, and make sure that the economics of the
system works well enough to encourage individuals and organizations
to take part. These aren't trivial problems, of course - but the
reliability will surely be better than that of URLs in the long run, and the
economics will surely be more encouraging than "make the provider pay
the bandwidth cost."
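As a rough illustration of why the reliability should beat plain URLs,
here is a hypothetical consumer-side resolver in the same style as the
sketch above: it tries whatever replicas it knows about and keeps the
first copy whose digest matches the identifier, so a single dead host
doesn't break the reference. The replica list and the use of plain HTTP
are assumptions made for the example, not part of the MAYA design:

    import hashlib
    import urllib.request

    def digest_matches(identifier, data):
        """True if the bytes match the digest embedded in the identifier."""
        return hashlib.sha256(data).hexdigest() == identifier.rsplit(":", 1)[-1]

    def resolve(identifier, replica_urls):
        """Fetch an object from any replica serving content that verifies."""
        for url in replica_urls:              # e.g. a list of peer mirrors (hypothetical)
            try:
                data = urllib.request.urlopen(url, timeout=10).read()
            except OSError:
                continue                      # unreachable replica: try the next one
            if digest_matches(identifier, data):
                return data                   # verified copy; cache it at the point of use
        raise LookupError("no replica returned a verifiable copy of " + identifier)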
Best wishes,
Dominic
On Aug 18, 2006, at 4:29 PM, Steven Bird wrote:
> There's a couple of workarounds:
>
> Use an archive:
> a) try to find all the URLs in the Internet Archive or Google's cache
> b) submit missing URLs to such repositories (I think this can even be
> done for Google's cache, by setting a very large expiry time.)
>
> Create an archive:
> a) "mirror" a superset of the material on your own public website
> b) publish URLs local to this site
>
> On 8/19/06, John F. Sowa <sowa at bestweb.net> wrote:
>
>> There is a serious problem with that approach:
>>
>> SS> This is why I advocate the procedure of distributing an
>> > Internet-derived corpus as a list of URLs.
>>
>> Unfortunately, URLs are subject to two limitations:
>>
>> 1. They become "broken" whenever the web site or the
>> directory structure is changed.
>>
>> 2. Even when the URL is live, the content can be updated
>> and changed at any time.
>>
>> These two points make a collection of URLs a highly unstable
>> way to assemble or distribute a corpus. They make it impossible
>> for any analysis performed at one instant of time to be compared
>> with any analysis performed at another time.