[Corpora-List] Compiling an engineering (paper mill) corpus

Eric Atwell eric at comp.leeds.ac.uk
Tue Apr 4 09:27:38 UTC 2006


Jaakko,

You could use Google to track down the websites of Web-as-Corpus practitioners,
e.g. Marco Baroni, Delphine Bernhard, Adam Kilgarriff, Jan Pomikalek,
Antoinette Renouf, Serge Sharoff etc ...

... and then use the methods and tools they advocate for this task.
BootCaT etc. can cope with PDF, doc etc., as it uses Google (or Yahoo) to trawl
the web, and these search engines convert such documents to text automatically.

I have just set a student coursework exercise to collect a web-corpus on
a specific domain, using their tools; so you could just use the 
Coursework Instructions as a "crib-sheet" telling you what to do:

http://www.comp.leeds.ac.uk/eric/db32cw.doc

The big challenge is to identify the websites which represent your
domain.  You could "manually" (e.g. using Google) identify some likely
websites which you think relate to paper mill engineering, and then
"mine" these.  Or you could try to identify some key terminology
specific to paper mill engineering, and then use BootCaT or similar
to find other websites containing these terms.
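
To give a feel for the second route: as far as I understand it, BootCaT-style
tools combine your seed terms at random into small tuples and send each tuple
to the search engine as one query.  Below is a minimal Python sketch of just
that query-building step.  It is not BootCaT's own code; the function name
build_queries, its parameters, and the seed terms are all made up for
illustration, and you would still have to feed the queries to a search engine
yourself.

```python
import itertools
import random


def build_queries(seed_terms, tuple_size=3, n_queries=10, rng_seed=0):
    """Return up to n_queries distinct queries, each joining tuple_size seed terms.

    Hypothetical helper for illustration only; not part of BootCaT.
    """
    # De-duplicate and sort so the result is reproducible for a given rng_seed.
    terms = sorted(set(seed_terms))
    # Every unordered combination of tuple_size different seed terms.
    combos = list(itertools.combinations(terms, tuple_size))
    # Shuffle with a fixed seed, then keep the first n_queries combinations.
    random.Random(rng_seed).shuffle(combos)
    queries = []
    for combo in combos[:n_queries]:
        # Quote multi-word terms so a search engine treats them as phrases.
        parts = ['"%s"' % t if " " in t else t for t in combo]
        queries.append(" ".join(parts))
    return queries


if __name__ == "__main__":
    # Illustrative seed terms only; choose your own domain terminology.
    seeds = ["paper machine", "headbox", "calender", "pulp", "press section",
             "dryer section"]
    for query in build_queries(seeds, tuple_size=3, n_queries=5):
        print(query)
```

Small tuples (two or three terms) tend to be a sensible starting point: single
terms pull in too much off-topic material, and long tuples may return very few
pages.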


Have fun! (my students did!)

Eric Atwell, Leeds University




On Tue, 4 Apr 2006, Jaakko Nyrölä wrote:

> We'd like to compile a medium-sized corpus of texts related to engineering 
> and paper mills: their design, maintenance, installation, etc., and would 
> like to do so (mostly) automatically, by collecting relevant documents from 
> the Web.
>
> The corpus will then be used for the purpose of automatic mining of 
> terminology.
>
> We don't care in which format the documents are; html, pdf, doc, all should 
> be ok, as long as text can be extracted from them.
>
> Are there established methods for gathering such collections of documents 
> reasonably quickly and with not too much manual effort?
>
> Thanks,
>
> Jaakko
>
> --
>
> Jaakko Nyrölä
> Student at the Helsinki University of Technology
> jnyrola at cc.hut.fi
>
>
>
>

-- 
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-3435430  FAX: +44-113-3435468  http://www.comp.leeds.ac.uk/eric

