[Corpora-List] license question
Dominic Widdows
widdows at maya.com
Fri Aug 18 21:10:00 UTC 2006
As you say, these are workarounds, and I don't think they answer the
substance of John's objections. The Internet Archive might not have
the data, and Google's cache makes no long-term commitment to
keep dated corpora for public use. (They might do so one day - their
publication of the 5-gram corpus is a big step in the right
direction, I hope.) Once you've copied the material to your own
website, you are effectively back to copying the
whole corpus rather than using references - and there is no
guarantee that your copy will stay synchronized with the original.
At the risk of beating a drum, I believe that we have prototyped the
long-term solution to these problems at MAYA Design with an
extensible peer-to-peer database. The idea of using this technology
for language corpora is described in our LREC paper at
http://www.maya.com/local/widdows/papers/lrec-distributed-corpora.pdf
Represent individual texts as objects in a peer-to-peer network, and
represent larger corpora as collections of universal references to these
texts. But don't use location-dependent URLs, because they're
brittle, and they place the hosting costs on the worthy individuals
who put the effort into gathering the corpus in the first place.
Instead, use location-independent universal identifiers (as the Free
Software Foundation has done for years, with a scheme that is now part
of the official URN namespace), and encourage replication at the point of
use. Use digital signatures to make sure that the data hasn't
changed. If the publishing organization wants to go the whole way and
make sure that the contents can never change, incorporate part of the
digital signature into the identifier of each object. You can also
use this as the core data for sharing standoff annotation,
collaborative filtering, etc.
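To make this concrete, here is a minimal sketch in Python of the
identifier idea. It is not MAYA's actual scheme: it uses a plain
cryptographic hash where the paragraph above talks about digital
signatures, and the urn:example:corpus namespace and the function names
are made up for illustration. It builds a location-independent
identifier that embeds a digest of the text, and checks a retrieved
copy against it:

    import hashlib

    NAMESPACE = "urn:example:corpus"   # hypothetical namespace, illustration only

    def make_identifier(text):
        """Derive a location-independent identifier embedding a content digest."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return "%s:%s" % (NAMESPACE, digest)

    def verify(identifier, text):
        """Check that a retrieved copy still matches the digest in its identifier."""
        digest = identifier.rsplit(":", 1)[-1]
        return hashlib.sha256(text.encode("utf-8")).hexdigest() == digest

    doc = "Call me Ishmael. Some years ago ..."
    ident = make_identifier(doc)
    print(ident)                     # urn:example:corpus:<64 hex characters>
    print(verify(ident, doc))        # True
    print(verify(ident, doc + "!"))  # False - the content has changed

A corpus release is then just a list of such identifiers, and anyone
who obtains the texts - from any replica - can check that nothing has
drifted.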
And then you're more or less done - provided you can solve the peer-
to-peer routing problem, and make sure that the economics of the
system works well enough to encourage individuals and organizations
to take part. These aren't trivial problems, of course - but the
reliability will surely be better than that of URLs in the long run, and the
economics will surely be more encouraging than "make the provider pay
the bandwidth cost."
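As a rough illustration of why the reliability should beat plain URLs,
here is a hypothetical consumer-side resolver in the same style as the
sketch above: it tries whatever replicas it knows about and keeps the
first copy whose digest matches the identifier, so a single dead host
doesn't break the reference. The replica list and the use of plain HTTP
are assumptions made for the example, not part of the MAYA design:

    import hashlib
    import urllib.request

    def digest_matches(identifier, data):
        """True if the bytes match the digest embedded in the identifier."""
        return hashlib.sha256(data).hexdigest() == identifier.rsplit(":", 1)[-1]

    def resolve(identifier, replica_urls):
        """Fetch an object from any replica serving content that verifies."""
        for url in replica_urls:              # e.g. a list of peer mirrors (hypothetical)
            try:
                data = urllib.request.urlopen(url, timeout=10).read()
            except OSError:
                continue                      # unreachable replica: try the next one
            if digest_matches(identifier, data):
                return data                   # verified copy; cache it at the point of use
        raise LookupError("no replica returned a verifiable copy of " + identifier)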
Best wishes,
Dominic
On Aug 18, 2006, at 4:29 PM, Steven Bird wrote:
> There's a couple of workarounds:
>
> Use an archive:
> a) try to find all the URLs in the Internet Archive or Google's cache
> b) submit missing URLs to such repositories (I think this can even be
> done for Google's cache, by setting a very large expiry time.)
>
> Create an archive:
> a) "mirror" a superset of the material on your own public website
> b) publish URLs local to this site
>
> On 8/19/06, John F. Sowa <sowa at bestweb.net> wrote:
>
>> There is a serious problem with that approach:
>>
>> SS> This is why I advocate the procedure of distributing an
>> > Internet-derived corpus as a list of URLs.
>>
>> Unfortunately, URLs are subject to two limitations:
>>
>> 1. They become "broken" whenever the web site or the
>> directory structure is changed.
>>
>> 2. Even when the URL is live, the content can be updated
>> and changed at any time.
>>
>> These two points make a collection of URLs a highly unstable
>> way to assemble or distribute a corpus. They make it impossible
>> for any analysis performed at one instant of time to be compared
>> with any analysis performed at another time.