How best to sample an American-English corpus?

Francis Hult francis.hult at utsa.edu
Fri Jun 12 01:48:38 UTC 2009


There is 'Word Frequencies in Written and Spoken English' published by Longman:
http://ucrel.lancs.ac.uk/bncfreq/
 
'A Frequency Dictionary of American English' is soon to be released by Routledge:
http://www.routledgelinguistics.com/books/A-Frequency-Dictionary-of-American-English-isbn9780415490641
 
You might find these to be helpful resources.
 
FMH
 
--
Francis M. Hult, Ph.D.
Assistant Professor
Department of Bicultural-Bilingual Studies
University of Texas at San Antonio
 
Web: http://faculty.coehd.utsa.edu/fhult/

________________________________

From: edling-bounces at lists.sis.utsa.edu on behalf of Daniel A. Vogel, Psy.D.
Sent: Thu 6/11/2009 6:10 PM
To: edling at lists.sis.utsa.edu
Cc: info at vogelpsychological.com
Subject: [Edling] How best to sample an American-English corpus?




 I am not a linguist but a clinical psychologist, and I joined this
 listserv for a limited time to obtain some advice.  I am trying to
 build my own list of American English words with corresponding
 frequency metrics.

 I need it because, as part of a psychological research project, I want to
 correlate the frequency of use of a small selection of the words uttered
 by my research participants with another variable that is not linguistic
 in nature.  I am trying to figure out the best way to sample words to
 obtain the required frequencies.

 1) It seems, as is true with all research, that the sample is key.  I
 would assume the findings would be both different and less meaningful if I
 sampled the corpus from a popular newspaper or a few dozen popular books.
 After all, I want to compare the word use of subjects from a very average
 American population, average in the sense of reflecting the normal
 distribution of intellectual abilities and educational backgrounds that we
 find in the U.S.A.  Thus, I suspect the words used by journalists or
 popular writers would not be the most representative sample.  I wondered
 whether simply sampling hundreds of bulletin boards and newsgroups would
 be the best way to capture the words used by the American population,
 ranging from the most basic to the most complex.  I want the sample to
 reflect the full range of Americans, from those with lower intellectual
 abilities (yes, I realize many of them will not be on the internet, a
 confounding variable which I do want to get around somehow - any ideas?)
 to those with highly sophisticated vocabulary who are in the intellectual
 top 1%.

 2) Also, bulletin boards and newsgroups would make it easy to copy and
 paste large amounts of text into a program.  I assume a program exists
 that will break down all the words sampled into a list of frequencies.  I
 wondered whether such software is already part of existing word
 processors such as Word, or whether something more professional is
 required.  If so, what recommendations would you have?
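
 As a rough illustration of what such a program does, here is a minimal
 Python sketch using only the standard library.  It is not a recommendation
 of any particular tool.  The function names (word_frequencies, per_million)
 and the sample sentence are hypothetical, and the definition of a "word"
 (a run of letters, with internal apostrophes allowed) is an assumption that
 a real study would need to settle deliberately.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its raw count."""
    # Assumption: a "word" is a run of letters a-z, optionally with
    # internal apostrophes (so "didn't" counts as one word).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


def per_million(counts):
    """Convert raw counts to occurrences per million words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n * 1_000_000 / total for word, n in counts.items()}


# Hypothetical sample text; in practice this would be the pasted corpus.
sample = "The cat sat on the mat. The dog didn't sit."
counts = word_frequencies(sample)
rates = per_million(counts)

# Print the most frequent words with raw and normalized frequencies.
for word, n in counts.most_common(3):
    print(word, n, rates[word])
```

 Normalizing to occurrences per million words makes counts comparable
 across text samples of different sizes.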

 I would be deeply grateful for any responses to my two questions!

 I was also advised by this list's moderator to send this question to a
corpus listserv, which I will also do.

Thanks in advance for any responses!

Daniel


_______________________________________________
Edling mailing list
Edling at lists.sis.utsa.edu
https://lists.sis.utsa.edu/mailman/listinfo/edling
List Manager: Francis M. Hult




