[Corpora-List] Compiling an engineering (paper mill) corpus
Jaakko Nyrölä
jnyrola at cc.hut.fi
Tue Apr 4 07:30:24 UTC 2006
We'd like to compile a medium-sized corpus of texts related to engineering
and paper mills: their design, maintenance, installation, etc., and would
like to do so (mostly) automatically, by collecting relevant documents
from the Web.
The corpus will then be used for automatic terminology mining.
We don't mind which format the documents are in; HTML, PDF and DOC should
all be fine, as long as text can be extracted from them.
Are there established methods for gathering such document collections
reasonably quickly and without too much manual effort?
Thanks,
Jaakko
--
Jaakko Nyrölä
Student at the Helsinki University of Technology
jnyrola at cc.hut.fi