[Corpora-List] Seeking publicly available corpora representing multiple document formats and languages
Allison, Timothy B.
tallison at mitre.org
Wed Jul 23 18:57:35 UTC 2014
All,
This may be slightly off topic, but I'm writing on behalf of the Apache Software Foundation's Tika project. As part of https://issues.apache.org/jira/browse/TIKA-1302, we're looking for publicly available and hostable corpora that cover a wide variety of document formats.
Govdocs1 (http://digitalcorpora.org/corpora/files) is a fantastic resource, and we're looking into getting a slice of CommonCrawl, but is anyone aware of other corpora that we should consider? The goal is to have a large set (to be defined) of docs to use in regression testing as part of continuous integration.
Thank you.
Best,
Tim