[Corpora-List] SourceForge as a corpus

Thu Jan 24 22:43:33 UTC 2008

While I do appreciate this suggestion, there is more to a corpus 
repository than storage space. Granted, LinguistList has wonderful 
programmers who keep everything running and the software current. I like 
the SourceForge idea. Linguistic software is already being developed 
there, and the version control system in their repository lets everyone 
quickly see what has changed. SourceForge has mirror sites around the 
world to ensure that every release is redundantly available with fast 
download times, and they keep their hardware up to date. (This costs 
considerably more than merely a few hard drives.)

Now, I do see the need for an independent and open corpus repository. 
On one hand, that's what the Oxford Text Archive was envisioned to be. 
But it is not designed for open corpora, where people can contribute, 
etc. Perhaps we can found a separate organization whose sole purpose is 
to provide a repository for data, and which would also support 
commercial corpora, much like the LDC. I would envision a user system 
like the one SourceForge has, where users can subscribe to open corpora 
(to be notified of changes) and can also be given rights to download 
corpora that require licensing agreements. Once the license agreement is 
signed, which could be done electronically (much like a EULA) provided 
there is no fee involved, the user would have open access to the corpus.

In order for this to work, I'd like to see a license agreement for open 
corpora. We could start with Creative Commons licensing, and move on to 
a license unique to open corpora. In addition, there could be several 
commercial corpus licenses. The days when written corpora were 
expensive to create are mostly over. We're seeing new corpora based on 
web data arising, including monitor corpora such as the one that 
accompanies the ANC. 
If the GPL works for FOSS, we can also work something out that works for 
corpora. By all pulling together and collaborating to create useful 
corpora, we can create a new frontier in linguistics.

Best,

Klaus Guenther
Universität Bamberg

James L. Fidelholtz wrote:
> Hi, all,
>
> The list owners for the Linguist Network might not appreciate this
> suggestion (although they might), but the LinguistList seems like a
> prime candidate for corpus storage (assuming they have the hardware
> capability). They already host just about every list having anything
> to do with linguistics or language (often, as with this list, as a
> mirror site), so we know they have decent storage capacity. We could
> even all contribute to their annual fund drive, so they could buy a
> few terabyte storage drives.
>
> Jim
>
> On 1/24/08, Chris Jordan <chris.jordan at acm.org> wrote:
>   
>> I think that in terms of academic utility, papers which present work
>> that uses a given corpus should also be made available when possible.
>> The combination of accessibility to both data and academic literature
>> will reduce the effort required in both accessing a body of work and
>> reproducing experimental results. Facilitating the reproducibility of
>> others' work should be one of the main goals in such a public
>> repository as they will form the baselines for future research.
>>
>> --
>> Chris Jordan
>>
>>
>> On 24-Jan-08, at 11:56 AM, radev at umich.edu wrote:
>>
>>     
>>> We could start by creating a page on the ACL wiki:
>>>
>>> http://aclweb.org/aclwiki/
>>>
>>> with a list of candidate corpora and contact people for each of
>>> them. Here are some examples: Google n-grams, Enron email, GENIA, etc.
>>>
>>> Drago
>>>
>>>       
>>>> On Jan 24, 2008 8:45 AM,  <radev at umich.edu> wrote:
>>>>         
>>>>> We need a public corpus repository.  Perhaps something worth
>>>>> starting
>>>>> a discussion about.
>>>>>           
>>>> I agree that a discussion is a good idea.  To kick it off: one of the
>>>> things that I know about any corpus on my SourceForge site is that
>>>> all
>>>> copyright issues are in order.  One of the things that you know when
>>>> LDC hosts your corpus for you is that they will make sure that all
>>>> copyright issues are in order.  What would be a mechanism for
>>>> ensuring
>>>> this in a public corpus repository?  One option would be to control
>>>> deposition of data in the same way that any SourceForge project vets
>>>> its participants; the people with the responsibility for doing this
>>>> would then be tasked with exercising due diligence in verifying that
>>>> the corpus builders themselves had cleared all copyright issues.  On
>>>> this model, responsibility for dealing with copyright issues stays
>>>> with the corpus builders, not the SourceForge project coordinators.
>>>> However, that doesn't reduce the project coordinators' work to zero,
>>>> and it's not clear how that work could be funded in the long term.
>>>> Thoughts?
>>>>
>>>> Kev
>>>>
>>>> --
>>>> K. B. Cohen
>>>> Biomedical Text Mining Group Lead
>>>> Center for Computational Pharmacology
>>>> 303-916-2417 (cell) 303-377-9194 (home)
>>>> http://compbio.uchsc.edu/Hunter_lab/Cohen
>>>>
>>>>
>>>>         
>>> --
>>> Dragomir R. Radev                    Associate Professor
>>> SI, CSE, Ling                     U. Michigan, Ann Arbor
>>> http://www.eecs.umich.edu/~radev         radev at umich.edu
>>>
>>> _______________________________________________
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>       
>
>
>   
