<div>Thank you, guys. Your responses are very helpful!</div>
<div> </div>
<div>Best,</div>
<div> </div>
<div>Lushan<br><br></div>
<div class="gmail_quote">On Tue, Jul 17, 2012 at 1:42 AM, Adam Kilgarriff <span dir="ltr"><<a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a>></span> wrote:<br>
<blockquote style="BORDER-LEFT:#ccc 1px solid;MARGIN:0px 0px 0px 0.8ex;PADDING-LEFT:1ex" class="gmail_quote">UKWaC is bigger and available to download without bureaucracy - see <a href="http://wacky.sslmit.unibo.it/doku.php" target="_blank">http://wacky.sslmit.unibo.it/doku.php</a>
<div><br></div>
<div>Others are available with a bit more bureaucracy (to cover legal concerns).</div>
<div><br></div>
<div>> <span style="FONT-FAMILY:arial,sans-serif;COLOR:rgb(34,34,34);FONT-SIZE:13px"> a well-balanced corpus </span> </div>
<div><br></div>
<div>This is the million-dollar question. No-one really knows what it means. BNC and COCA are balanced according to their designers' opinion of what it means (mixed with the pragmatics of what was accessible). UKWaC and other web-crawled corpora are balanced according to the balance of the language as found on the web. Which is best depends on what you want. (Not that you are ever likely to find out which would have been better, since the science of the question is in its infancy.)</div>
<div><br></div>
<div>BNC and COCA have the advantage that the different text types they contain are given to you in the metadata. For web crawls, working them out is a big and current research question (there's great work being done in Leeds by Serge Sharoff and Richard Sutcliffe; I've seen the talk, but it's not published yet).</div>
<div><br></div>
<div>Adam<br><br>
<div class="gmail_quote">
<div>
<div class="h5">On 16 July 2012 23:42, Lushan Han <span dir="ltr"><<a href="mailto:lushan1@umbc.edu" target="_blank">lushan1@umbc.edu</a>></span> wrote:<br></div></div>
<blockquote style="BORDER-LEFT:#ccc 1px solid;MARGIN:0px 0px 0px 0.8ex;PADDING-LEFT:1ex" class="gmail_quote">
<div>
<div class="h5">
<div>COCA looks like a good one. But could I have a copy of the corpus and run my own programs on it? The web interface cannot meet my requirements.</div>
<div> </div>
<div>Thanks,</div>
<div> </div>
<div>Lushan Han</div>
<div>
<div>
<div><br><br> </div>
<div class="gmail_quote">On Mon, Jul 16, 2012 at 5:48 PM, Mark Davies <span dir="ltr"><<a href="mailto:Mark_Davies@byu.edu" target="_blank">Mark_Davies@byu.edu</a>></span> wrote:<br>
<blockquote style="BORDER-LEFT:#ccc 1px solid;MARGIN:0px 0px 0px 0.8ex;PADDING-LEFT:1ex" class="gmail_quote">
<div>>> Does anyone know where or how I can get a well-balanced corpus of modern English, such as BNC, but with a much larger size? I hope it can have at least 1 billion words<br><br></div>It's only 450 million words, but you might try: <a href="http://corpus.byu.edu/coca" target="_blank">http://corpus.byu.edu/coca</a> (COCA)<br>
<br>It is divided evenly into spoken, fiction, popular magazines, newspapers, and academic, each with 90-95 million words.<br><br>It is also much more recent than the BNC. COCA has 20 million words each year, 1990-2012 (compared to the 1993 end date of the BNC).<br>
<br>Finally, it has the same genre balance each year, which makes it nice for looking at recent changes in English; see:<br><br>Davies, Mark. (2011) "The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English". Literary and Linguistic Computing 25: 447-65.<br>
<br>Best,<br><br>Mark Davies<br><br>============================================<br>Mark Davies<br>Professor of Linguistics / Brigham Young University<br><a href="http://davies-linguistics.byu.edu/" target="_blank">http://davies-linguistics.byu.edu/</a><br>
** Corpus design and use // Linguistic databases **<br>** Historical linguistics // Language variation **<br>** English, Spanish, and Portuguese **<br>============================================<br><br><br><br><br>From: <a href="mailto:corpora-bounces@uib.no" target="_blank">corpora-bounces@uib.no</a> [<a href="mailto:corpora-bounces@uib.no" target="_blank">corpora-bounces@uib.no</a>] on behalf of Lushan Han [<a href="mailto:lushan1@umbc.edu" target="_blank">lushan1@umbc.edu</a>]<br>
Sent: Monday, July 16, 2012 1:10 PM<br>To: <a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a><br>Subject: [Corpora-List] ask for very large, well-balanced corpus<br>
<div>
<div><br><br>Dear all,<br><br>Does anyone know where or how I can get a well-balanced corpus of modern English, such as BNC, but with a much larger size? I hope it can have at least 1 billion words. I tried to assemble a corpus from Wikipedia articles but it turned out that such a corpus is not balanced. Wikipedia contains many articles of the same type, for example, on films or birds.<br>
<br>A Web corpus should be okay for my purpose as long as it was harvested from balanced domains.<br><br><br>Thanks,<br><br>Lushan Han </div></div></blockquote></div><br></div></div><br></div></div>
</blockquote></div><span class="HOEnZb"><font color="#888888"><br>
<br clear="all">
<div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk/" target="_blank">University of Leeds</a>
<div><i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk/" target="_blank">the Sketch Engine</a> </div>
<div> <i><a href="http://www.webdante.com/" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i>
<div>========================================</div></div><br></font></span></div></blockquote></div><br>