[Corpora-List] Compiling an engineering (paper mill) corpus
Eric Atwell
eric at comp.leeds.ac.uk
Tue Apr 4 09:27:38 UTC 2006
Jaakko,
You could use Google to track down the websites of Web-as-Corpus practitioners,
eg Marco Baroni, Delphine Bernhard, Adam Kilgarriff, Jan Pomikalek,
Antoinette Renouf, Serge Sharoff etc ...
... and then use the methods and tools they advocate for this task.
BootCaT etc can cope with PDF, doc etc, as it uses Google (or Yahoo) to trawl
the web, and these search engines convert such files to text automatically.
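If you end up downloading HTML pages yourself rather than relying on a search engine's text version, something like the following might do for stripping markup. It is only a rough sketch using Python's standard library, not what BootCaT does internally; the names TextExtractor and html_to_text are made up for illustration, and PDF or doc files would need a separate converter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content (hypothetical helper)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling html_to_text on a downloaded page gives one long string of text, which is usually enough as input for a term-extraction step.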
I have just set a student coursework exercise to collect a web-corpus on
a specific domain, using their tools; so you could just use the
Coursework Instructions as a "crib-sheet" telling you what to do:
http://www.comp.leeds.ac.uk/eric/db32cw.doc
The big challenge is to identify the websites which represent your
domain. You could "manually" (eg using Google) identify some likely
websites which you think relate to paper mill engineering, and then
"mine" these. Or you could try to identify some key terminology
specific to paper mill engineering, and then use BootCaT or similar
to find other websites containing these terms.
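To give a flavour of the second route: the usual trick is to combine a handful of seed terms into random tuples and send each tuple to a search engine as a query. The sketch below only builds the query strings; it does not call any search engine, and it is not BootCaT's own code. The function name build_queries, its parameters, and the seed terms are all my own illustrative assumptions.

```python
import itertools
import random


def build_queries(seed_terms, tuple_size=3, n_queries=10, rng_seed=0):
    """Return up to n_queries distinct random combinations of seed terms.

    Multi-word terms are wrapped in double quotes so a search engine
    would treat them as phrases.
    """
    combos = list(itertools.combinations(sorted(set(seed_terms)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the selection is repeatable
    rng.shuffle(combos)
    return [
        " ".join(f'"{term}"' if " " in term else term for term in combo)
        for combo in combos[:n_queries]
    ]


# Illustrative seed terms only (assumed, not taken from a real term list).
seeds = ["paper machine", "headbox", "calender", "pulp", "dryer section", "fourdrinier"]
queries = build_queries(seeds, tuple_size=3, n_queries=5)
for query in queries:
    print(query)
```

Each printed line is one query; the pages returned for these queries would then be downloaded and converted to text to form the corpus.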
Have fun! (my students did!)
Eric Atwell, Leeds University
On Tue, 4 Apr 2006, Jaakko Nyrölä wrote:
> We'd like to compile a medium-sized corpus of texts related to engineering
> and paper mills: their design, maintenance, installation, etc., and would
> like to do so (mostly) automatically, by collecting relevant documents from
> the Web.
>
> The corpus will then be used for the purpose of automatic mining of
> terminology.
>
> We don't care in which format the documents are; html, pdf, doc, all should
> be ok, as long as text can be extracted from them.
>
> Are there established methods for gathering such collections of documents
> reasonably quickly and with not too much manual effort?
>
> Thanks,
>
> Jaakko
>
> --
>
> Jaakko Nyrölä
> Student at the Helsinki University of Technology
> jnyrola at cc.hut.fi
--
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-3435430 FAX: +44-113-3435468 http://www.comp.leeds.ac.uk/eric