[Corpora-List] Criteria for an ESP Vocabulary List

Tue Apr 29 09:06:22 UTC 2008

Michael,

You ask:
> Can the WebBootCaT tool you mention be used independently of
> SketchEngine ...
No, but the price is affordable. BootCaT is available for free, so it may well
suit people with the skills to run Perl scripts. WebBootCaT handles the
processes of cleaning up the data, removing duplicates, POS-tagging and
lemmatising (for quite a few languages) and loading into the corpus tool, and
hosts the data, which, even for some people with Perl skills, will be worth
a couple of cups of coffee a month.
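
To give a rough idea of what the cleaning and duplicate-removal steps involve,
here is a minimal Python sketch. It is an illustration only and not the actual
BootCaT or WebBootCaT code: the function names and the whitespace/case
normalisation are assumptions made for the example, and it only catches
duplicates that are identical after normalisation, not near-duplicates.

```python
import hashlib
import re


def normalise(text):
    """Collapse runs of whitespace and lowercase the text (assumed normalisation)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_duplicates(documents):
    """Drop empty documents and exact duplicates (after normalisation).

    The first copy of each document is kept, unchanged and in original order.
    """
    seen = set()
    unique = []
    for doc in documents:
        key = normalise(doc)
        if not key:
            continue  # nothing left after cleaning
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    # Hypothetical downloaded pages, already stripped of markup.
    pages = [
        "Aspirin reduces  fever.",
        "aspirin reduces fever.",
        "",
        "Ibuprofen is an NSAID.",
    ]
    for page in remove_duplicates(pages):
        print(page)
```

A real pipeline would also need boilerplate stripping and near-duplicate
detection, which is a large part of what the hosted service saves you.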

Filtering out texts where there is evidence that they are not written in good
English is a topic of current research. I'm not sure if that fits what you mean
by unauthoritative sources. There is usually a tradeoff between "getting
exactly what you want" and taking too narrow a view of the language type you
are seeking.

The other trouble with 'authoritative sources' is that it implies checking them
one by one, with corpora correspondingly much smaller and slower to
produce. So people are often stuck with a choice: get a corpus that is
large, quick to build, and on target but without knowing exactly what is in
it, OR make do with one that is much smaller and/or doesn't really fit their
research agenda or teaching plan.

adam

2008/4/28 <M.I.Friedbichler at uibk.ac.at>:

> Michael Friedbichler wrote on Sat, 26 Apr 2008 11:21:27 +0200:
> > > You should be aware, though, that this is not a project you can
> > > complete within a few weeks.
>
> Adam Kilgarriff wrote on Mon, 28 Apr 2008 07:58:07 +0100:
> > This kind of corpus-building can be done very quickly using
> > BootCaT and related tools, eg WebBootCaT (available at
> > http://www.sketchengine.co.uk ).
> > The basic process takes a few minutes, and a series of
> > iterations, to refine and improve the corpus, may be a day or two's
> > work. We also build in lemmatising, POS-tagging and loading into a
> > corpus query tool.
>
> Adam, dear corpora list members:
>
> If one doesn't mind the noise in corpora derived from the web, this is
> indeed an elegant solution. Getting rid of all the unauthoritative
> sources, however, might be a time-consuming task lurking behind the
> seemingly instant harvest from the web.
>
> Whether WaC-tools (Web as Corpus) like WebBootCaT -- which represent a
> great step forward in compiling DIY corpora for computer-assisted
> translation (isn't this where BootCaT got its name?) -- are also ideal for
> the purpose at hand, is open to question. For teaching purposes, esp. in
> ESP, I think I'd rather have authoritative sources. After all,
> distinguishing between professional language use and unreliable, poorly
> edited sources is evidently not a task for language learners. You're not
> going to get clear water from a mudpot!
>
> Another point of interest in this context: Can the WebBootCaT tool you
> mention be used independently of SketchEngine or is it accessible only for
> those who have purchased the corpus query tool?
>
> Best,
> Michael Friedbichler
> Innsbruck Medical University
>
>
>

-- 
================================================
Adam Kilgarriff http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora