[Corpora-List] Compiling an engineering (paper mill) corpus

Tue Apr 4 07:30:24 UTC 2006

We'd like to compile a medium-sized corpus of texts related to engineering
and paper mills (their design, maintenance, installation, etc.). We would
like to do this mostly automatically, by collecting relevant documents
from the Web.

The corpus will then be used for automatic terminology mining.

The document format does not matter: HTML, PDF, and DOC are all fine, as
long as text can be extracted from them.

Are there established methods for gathering such document collections
reasonably quickly and without too much manual effort?

Thanks,

Jaakko

--

Jaakko Nyrölä
Student at the Helsinki University of Technology
jnyrola at cc.hut.fi