[Corpora-List] SourceForge as a corpus

Chris Jordan chris.jordan at acm.org
Thu Jan 24 16:15:19 UTC 2008


I think that, in terms of academic utility, papers presenting work that
uses a given corpus should also be made available when possible. Making
both the data and the associated academic literature accessible will
reduce the effort required to survey a body of work and to reproduce
experimental results. Facilitating the reproducibility of others' work
should be one of the main goals of such a public repository, since those
results will form the baselines for future research.

-- 
Chris Jordan


On 24-Jan-08, at 11:56 AM, radev at umich.edu wrote:

> We could start by creating a page on the ACL wiki:
>
> http://aclweb.org/aclwiki/
>
> with a list of candidate corpora and contact people for each of
> them. Here are some examples: Google n-grams, Enron email, GENIA, etc.
>
> Drago
>
>>
>> On Jan 24, 2008 8:45 AM,  <radev at umich.edu> wrote:
>>> We need a public corpus repository.  Perhaps something worth starting
>>> a discussion about.
>>
>> I agree that a discussion is a good idea.  To kick it off: one of the
>> things that I know about any corpus on my SourceForge site is that all
>> copyright issues are in order.  One of the things that you know when
>> LDC hosts your corpus for you is that they will make sure that all
>> copyright issues are in order.  What would be a mechanism for ensuring
>> this in a public corpus repository?  One option would be to control
>> deposition of data in the same way that any SourceForge project vets
>> its participants; the people with the responsibility for doing this
>> would then be tasked with exercising due diligence in verifying that
>> the corpus builders themselves had cleared all copyright issues.  On
>> this model, responsibility for dealing with copyright issues stays
>> with the corpus builders, not the SourceForge project coordinators.
>> However, that doesn't reduce the project coordinators' work to zero,
>> and it's not clear how that work could be funded in the long term.
>> Thoughts?
>>
>> Kev
>>
>> -- 
>> K. B. Cohen
>> Biomedical Text Mining Group Lead
>> Center for Computational Pharmacology
>> 303-916-2417 (cell) 303-377-9194 (home)
>> http://compbio.uchsc.edu/Hunter_lab/Cohen
>>
>>
>
>
> -- 
> Dragomir R. Radev                    Associate Professor
> SI, CSE, Ling                     U. Michigan, Ann Arbor
> http://www.eecs.umich.edu/~radev         radev at umich.edu
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora




