[Corpora-List] SourceForge as a corpus
Chris Jordan
chris.jordan at acm.org
Thu Jan 24 16:15:19 UTC 2008
I think that in terms of academic utility, papers which present work
that uses a given corpus should also be made available when possible.
The combination of accessibility to both data and academic literature
will reduce the effort required in both accessing a body of work and
reproducing experimental results. Facilitating the reproducibility of
others' work should be one of the main goals of such a public
repository, since those results will form the baselines for future research.
--
Chris Jordan
On 24-Jan-08, at 11:56 AM, radev at umich.edu wrote:
> We could start by creating a page on the ACL wiki:
>
> http://aclweb.org/aclwiki/
>
> with a list of candidate corpora and contact people for each of
> them. Here are some examples: Google n-grams, Enron email, GENIA, etc.
>
> Drago
>
>>
>> On Jan 24, 2008 8:45 AM, <radev at umich.edu> wrote:
>>> We need a public corpus repository. Perhaps something worth
>>> starting a discussion about.
>>
>> I agree that a discussion is a good idea. To kick it off: one of the
>> things that I know about any corpus on my SourceForge site is that all
>> copyright issues are in order. One of the things that you know when
>> LDC hosts your corpus for you is that they will make sure that all
>> copyright issues are in order. What would be a mechanism for ensuring
>> this in a public corpus repository? One option would be to control
>> deposition of data in the same way that any SourceForge project vets
>> its participants; the people with the responsibility for doing this
>> would then be tasked with exercising due diligence in verifying that
>> the corpus builders themselves had cleared all copyright issues. On
>> this model, responsibility for dealing with copyright issues stays
>> with the corpus builders, not the SourceForge project coordinators.
>> However, that doesn't reduce the project coordinators' work to zero,
>> and it's not clear how that work could be funded in the long term.
>> Thoughts?
>>
>> Kev
>>
>> --
>> K. B. Cohen
>> Biomedical Text Mining Group Lead
>> Center for Computational Pharmacology
>> 303-916-2417 (cell) 303-377-9194 (home)
>> http://compbio.uchsc.edu/Hunter_lab/Cohen
>>
>>
>
>
> --
> Dragomir R. Radev Associate Professor
> SI, CSE, Ling U. Michigan, Ann Arbor
> http://www.eecs.umich.edu/~radev radev at umich.edu
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora