[Corpora-List] Copyright question again

Mark Davies Mark_Davies at byu.edu
Tue Jan 6 13:36:03 UTC 2015


Marc Brysbaert wrote:


>> For what it is worth, in my experience word frequency lists and N-gram lists are not a problem.

I agree. I've distributed COCA/COHA word frequency data (http://www.wordfrequency.info) and n-gram data (http://www.ngrams.info) for several years now, and I've never had any issues.


>> The big problem we are encountering is that currently there is no guidance about whether corpora can be shared. As a result, nearly all corpora assembled remain next to inaccessible, meaning that everyone has to collect their own corpus. This is a lot of needless work and also means that little cumulative work can be done.


I've also been distributing "full-text" data from the 450-million-word COCA and the 1.9-billion-word GloWbE (http://corpus.byu.edu/glowbe) for a while now, and again there have been no problems so far. There is a "twist", though, in that the full-text data has been slightly altered to avoid copyright problems:


http://corpus.byu.edu/full-text/limitations.asp


Best,


Mark D.


============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================