Everything depends on text type.<br><br><div>BNC-spoken overall has more 'the' than 'I', but that's because half of it is meetings, lectures and sermons. If you look only at the conversational part (obscurely called "demographic"), 'I' is more common, in keeping with the kinds of language that James Pennebaker works with (as I recall from a fascinating talk of his I went to).</div>
<div><br></div><div>Asking for a more representative corpus won't help, because we all have different ideas about what it should be representative of.</div><div><br></div><div>Adam</div><div><br><div class="gmail_quote">
On 13 September 2011 15:33, Mike Scott <span dir="ltr">&lt;<a href="mailto:mike@lexically.net">mike@lexically.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
On page 45 of the 3 September issue of New Scientist, there is a table giving frequencies of "the 20 most frequently used words in the English language, across both spoken and written texts". The first is I, then THE, AND, TO, A, OF, THAT... ME, ON, BUT.<br>
I wrote to the author, James Pennebaker of the University of Texas, about this, expressing my surprise at the pronoun I having greater frequency than THE, as even in the spoken-only section of the BNC (10m words) we find I occurring only just over half as often as THE. His data contains a mix of spoken and written text, with a large amount of blog data. He reports that with all his studies in the USA and Mexico, "people always use more I more than THE. It's never close."<br>
Can anyone help clear this up? Perhaps someone with access to a really top-quality corpus, more up to date and more representative than the BNC?<br>
<br>
Mike<br>
<br>
-- <br>
Mike Scott<br>
<br>
***<br>
If you publish research which uses WordSmith, do let me know so I can include it at<br>
<a href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm" target="_blank">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a><br>
***<br>
University of Aston and Lexical Analysis Software Ltd.<br>
<a href="mailto:mike.scott@aston.ac.uk" target="_blank">mike.scott@aston.ac.uk</a><br>
<a href="http://www.lexically.net" target="_blank">www.lexically.net</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div><br>
</div>