[Corpora-List] SourceForge as a corpus

James L. Fidelholtz fidelholtz at gmail.com
Thu Jan 24 20:47:48 UTC 2008


Hi, all,

The list owners for the Linguist Network might not appreciate this
suggestion (although they might), but the LinguistList seems like a
prime candidate for corpus storage (assuming they have the hardware
capability). They already host just about every list having anything
to do with linguistics or language (often, as with this list, as a
mirror site), so we know they have decent storage capacity. We could
even all contribute to their annual fund drive, so they could buy a
few terabyte storage drives.

Jim

On 1/24/08, Chris Jordan <chris.jordan at acm.org> wrote:
> I think that in terms of academic utility, papers which present work
> that uses a given corpus should also be made available when possible.
> The combination of accessibility to both data and academic literature
> will reduce the effort required in both accessing a body of work and
> reproducing experimental results. Facilitating the reproducibility of
> others' work should be one of the main goals of such a public
> repository, as those results will form the baselines for future research.
>
> --
> Chris Jordan
>
>
> On 24-Jan-08, at 11:56 AM, radev at umich.edu wrote:
>
> > We could start by creating a page on the ACL wiki:
> >
> > http://aclweb.org/aclwiki/
> >
> > with a list of candidate corpora and contact people for each of
> > them. Here are some examples: Google n-grams, Enron email, GENIA, etc.
> >
> > Drago
> >
> >>
> >> On Jan 24, 2008 8:45 AM,  <radev at umich.edu> wrote:
> >>> We need a public corpus repository.  Perhaps something worth starting
> >>> a discussion about.
> >>
> >> I agree that a discussion is a good idea.  To kick it off: one of the
> >> things that I know about any corpus on my SourceForge site is that all
> >> copyright issues are in order.  One of the things that you know when
> >> LDC hosts your corpus for you is that they will make sure that all
> >> copyright issues are in order.  What would be a mechanism for ensuring
> >> this in a public corpus repository?  One option would be to control
> >> deposition of data in the same way that any SourceForge project vets
> >> its participants; the people with the responsibility for doing this
> >> would then be tasked with exercising due diligence in verifying that
> >> the corpus builders themselves had cleared all copyright issues.  On
> >> this model, responsibility for dealing with copyright issues stays
> >> with the corpus builders, not the SourceForge project coordinators.
> >> However, that doesn't reduce the project coordinators' work to zero, and
> >> it's not clear how that work could be funded in the long term.
> >> Thoughts?
> >>
> >> Kev
> >>
> >> --
> >> K. B. Cohen
> >> Biomedical Text Mining Group Lead
> >> Center for Computational Pharmacology
> >> 303-916-2417 (cell) 303-377-9194 (home)
> >> http://compbio.uchsc.edu/Hunter_lab/Cohen
> >>
> >>
> >
> >
> > --
> > Dragomir R. Radev                    Associate Professor
> > SI, CSE, Ling                     U. Michigan, Ann Arbor
> > http://www.eecs.umich.edu/~radev         radev at umich.edu
> >
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>
>


-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y
     Humanidades
Benemérita Universidad Autónoma de
     Puebla, MÉXICO
