[Corpora-List] Seeking publicly available corpora representing multiple document formats and languages

Allison, Timothy B. tallison at mitre.org
Wed Jul 23 18:57:35 UTC 2014


All,

   This may be slightly off topic, but I'm writing on behalf of the Apache Software Foundation's Tika project.  As part of https://issues.apache.org/jira/browse/TIKA-1302, we're in search of publicly available and hostable corpora that cover a wide variety of document formats.
  
  Govdocs1 (http://digitalcorpora.org/corpora/files) is a fantastic resource, and we're looking into getting a slice of CommonCrawl, but is anyone aware of other corpora that we should consider?  The goal is to have a large set (size to be defined) of documents to use in regression testing as part of continuous integration.

   Thank you.

   Best,

   Tim