How best to sample an American-English corpus?

Daniel A. Vogel, Psy.D. info at vogelpsychological.com
Thu Jun 11 23:10:54 UTC 2009


I am a clinical psychologist, not a linguist, and I have joined this
listserv for a limited time to obtain some advice.  I am trying to
build my own list of American English words with corresponding
frequency metrics.

I need it because, as part of a psychological research project, I want to
correlate the frequency of use of a small selection of the words uttered
by my research participants with another variable that is not linguistic
in nature.  I am trying to figure out the best way to sample words in
order to obtain those frequencies.

1) It seems, as is true of all research, that the sample is key.  I
assume the findings would be both different and less meaningful if I
drew the corpus from a popular newspaper or a few dozen popular books.
After all, I want to compare the word use of subjects from a very average
American population, average in the sense of reflecting the normal
distribution of intellectual abilities and educational backgrounds that we
find in the U.S.A.  Thus, the words used by journalists or popular writers
would presumably not be the most representative sample.  Would simply
sampling hundreds of bulletin boards and newsgroups be the best way to
capture the words used by the American population, ranging from the most
basic to the most complex?  I want the sample to reflect the full range of
Americans, from those with lower intellectual abilities (yes, I realize
many of them will not be on the internet, a confounding variable that I
would like to get around somehow; any ideas?) to those with highly
sophisticated vocabularies who are in the intellectual top 1%.

2) Also, bulletin boards and newsgroups would make it easy to copy and
paste large amounts of text into a program.  I assume a program exists
that will break down all the words sampled into a list of frequencies.
Is such a function already part of existing word processors such as Word,
or is something more professional required?  If so, what would you
recommend?
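
To make question 2 concrete, here is a minimal sketch in Python of the
kind of word-frequency count described above.  It is only an illustration
under stated assumptions, not a recommendation of a particular tool: the
function names (word_frequencies, per_million) and the sample sentence are
hypothetical, and the rule for what counts as a "word" (runs of letters,
optionally with internal apostrophes, all lowercased) is one simple choice
among many.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its raw count."""
    # Assumption: a "word" is a run of letters, optionally with internal
    # apostrophes (e.g. "didn't"). Digits and punctuation are ignored.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


def per_million(counts):
    """Convert raw counts to occurrences per million words."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


# Hypothetical sample text; in practice this would be the pasted corpus.
sample = "The dog saw the cat. The cat didn't see the dog's bone."
counts = word_frequencies(sample)

# Print the three most frequent words with their raw counts.
for word, n in counts.most_common(3):
    print(word, n)
```

Scaling the counts to occurrences per million words, as per_million does,
would make frequencies comparable across samples of different sizes.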

 I would be deeply grateful for any responses to my two questions!

 I was also advised by this list's moderator to send this question to a
corpus listserv, which I will also do.

Thanks in advance for any responses!

Daniel


_______________________________________________
Edling mailing list
Edling at lists.sis.utsa.edu
https://lists.sis.utsa.edu/mailman/listinfo/edling
List Manager: Francis M. Hult


