<div>Michael,</div>
<div> </div>
<div>You ask</div>
<div><font size="2">&gt; Can the WebBootCaT tool you mention be used independently of SketchEngine ...</font><br></div>
<div>No, but the price is affordable. BootCaT itself is available for free, so it may well suit people with the skills to run Perl scripts. WebBootCaT handles the processes of cleaning up the data, removing duplicates, POS-tagging and lemmatising (for quite a few languages) and loading the result into the corpus tool, and it hosts the data. Even for some people with Perl skills, that will be worth a couple of cups of coffee a month.</div>
<div> </div>
<div>Filtering out texts where there is evidence that they are not written in good English is a topic of current research. I'm not sure whether that fits what you mean by unauthoritative sources. There is usually a tradeoff between "getting exactly what you want" and taking too narrow a view of the language type you are seeking.</div>
<div> </div>
<div>The other trouble with 'authoritative sources' is that it implies checking them one by one, so the corpora are correspondingly much smaller and slower to produce. So people are often stuck with a choice: get a corpus that is large, quick to build, and on target, but without knowing exactly what is in it, OR make do with one that is much smaller and/or doesn't really fit their research agenda or teaching plan.</div>
<div> </div>
<div>adam </div>
<div><br> </div>
<div class="gmail_quote">2008/4/28 &lt;<a href="mailto:M.I.Friedbichler@uibk.ac.at" target="_blank">M.I.Friedbichler@uibk.ac.at</a>&gt;:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Michael Friedbichler wrote on Sat, 26 Apr 2008 11:21:27 +0200: </span></font></div>
<div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> > You should be aware, though, that this is not a project you can </i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> > complete within a few weeks.</i></span></font></div>
<div align="left"><br></div></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Adam Kilgarriff wrote on Mon, 28 Apr 2008 07:58:07 +0100:</span></font></div>
<div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> This kind of corpus-building can be done very quickly using</i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> BootCaT and related tools, eg WebBootCaT (available at</i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> <a href="http://www.sketchengine.co.uk/" target="_blank">http://www.sketchengine.co.uk</a> ).</i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> The basic process takes a few minutes, and a series of</i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> iterations, to refine and improve the corpus, may be a day or two's work. We also</i></span></font></div>
<div align="left"><font face="Arial" color="#7f0000"><span style="FONT-SIZE: 14pt"><i>> build in lemmatising, POS-tagging and loading into a corpus query tool.</i></span></font></div>
<div align="left"><br></div></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Adam, dear corpora list members:</span></font></div>
<div align="left"><br></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">If one doesn't mind the noise in corpora derived from the web, this is indeed an elegant solution. Getting rid of all the unauthoritative sources, however, might be a time-consuming task lurking behind the seemingly instant harvest from the web. </span></font></div>
<div align="left"><br></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Whether WaC-tools (Web as Corpus) like WebBootCaT -- which represent a great step forward in compiling DIY corpora for computer-assisted translation (isn't this where BootCaT got its name?) -- are also ideal for the purpose at hand, is open to question. For teaching purposes, esp. in ESP, I think I'd rather have authoritative sources. After all, distinguishing between professional language use and unreliable, poorly edited sources is evidently not a task for language learners. You're not going to get clear water from a mudpot!</span></font></div>
<div align="left"><br></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Another point of interest in this context: Can the WebBootCaT tool you mention be used independently of SketchEngine or is it accessible only for those who have purchased the corpus query tool? </span></font></div>
<div align="left"><br></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Best,</span></font></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Michael Friedbichler</span></font></div>
<div align="left"><font face="Arial"><span style="FONT-SIZE: 14pt">Innsbruck Medical University</span></font></div>
<div align="left"><br></div>
<div align="left"><br></div>
<div align="left"></div></div></blockquote></div><br><br clear="all"><br>-- <br>================================================<br>Adam Kilgarriff <a href="http://www.kilgarriff.co.uk/" target="_blank">http://www.kilgarriff.co.uk</a> <br>
Lexical Computing Ltd <a href="http://www.sketchengine.co.uk/" target="_blank">http://www.sketchengine.co.uk</a><br>Lexicography MasterClass Ltd <a href="http://www.lexmasterclass.com/" target="_blank">http://www.lexmasterclass.com</a><br>
Universities of Leeds and Sussex <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a><br>================================================