[Corpora-List] Training Corpus for Readability Difficulty

Sat Oct 18 16:55:15 UTC 2008

Some of Bormuth's passages along with their readability scores are available
for free in ERIC documents (at least 32 of the passages).

Here is a brief overview of the passages:

Bormuth's (1971) corpus of 32 academic reading texts is drawn from school
instructional material and includes passages from biology, chemistry,
civics, current affairs, economics, geography, history, literature,
mathematics, and physics. The mean length of the texts was 269.28 words
(SD = 16.27) and the mean number of sentences per hundred words was 7.10
(SD = 2.81).
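
For anyone who wants to compute the same kind of descriptive statistics on
their own passages, here is a minimal sketch in Python. It is not Bormuth's
procedure: the tokenizer and the sentence splitter are naive assumptions
(words are runs of letters, digits and apostrophes; sentences end at runs of
".", "!" or "?"), and the names passage_stats and corpus_summary are
hypothetical.

```python
import re
import statistics


def passage_stats(text):
    """Return (word_count, sentences_per_100_words) for one passage."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    per_100 = 100.0 * len(sentences) / n_words if n_words else 0.0
    return n_words, per_100


def corpus_summary(passages):
    """Mean and sample SD of length and sentence rate (needs >= 2 passages)."""
    stats = [passage_stats(p) for p in passages]
    lengths = [s[0] for s in stats]
    rates = [s[1] for s in stats]
    return {
        "mean_length": statistics.mean(lengths),
        "sd_length": statistics.stdev(lengths),
        "mean_sentences_per_100": statistics.mean(rates),
        "sd_sentences_per_100": statistics.stdev(rates),
    }
```

Abbreviations and decimal numbers will inflate the sentence count with a
splitter this simple, so a proper sentence segmenter would be the first
thing to swap in.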

The problem is the small number of passages, which constrains the number of
variables you can analyze statistically without overfitting the model.

Here are the references:

Bormuth, J. R. (1969). Development of readability analyses (Final Report,
Project No. 7-0052, Contract No. 1, OEC-3-7-070052-0326). Washington, DC: U.
S. Office of Education. 

Bormuth, J. R. (1971). Development of standards of readability: Toward a
rational criterion of passage performance. U. S. Department of Health,
Education and Welfare (ERIC Doc. No. ED 054 233).

Let me know if that helps.

Scott Crossley, Ph.D.
Linguistics/TESOL

Department of English
Mississippi State University
http://www.msstate.edu/dept/english/tesol/tesolfaculty.html
(662) 325-2355

Institute for Intelligent Systems
University of Memphis
http://mnemosyne.csl.psyc.memphis.edu/iis/

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Albretch Mueller
Sent: Saturday, October 18, 2008 11:42 AM
To: corpora at uib.no
Subject: [Corpora-List] Training Corpus for Readability Difficulty

> We are looking for a training corpus to study readability difficulty.
~
> ... Unfortunately he was unable to share it the last time I asked
~
 this is what I would do:
~
 1) search educational and children's web sites, or ask "harry
potter" himself ;-), to find out what kinds of books children (of a
certain culture and age) read. You should finish this step with a long
list and be ready to be oddly amazed by it; today's children's universe
has changed quite a bit (what is one of the best-selling video games in
America? One whose theme is gunning down poor, powerless (, and black)
Haitian people in Miami ...), then
~
 2) go to http://www.gutenberg.org and search for "children" (I got 131
hits, all of them books in the public domain); you may find some of the
same books or similar ones, and I am sure you could find a whole lot more
~
 3) try to define "readability" in a more functional and perhaps
measurable way. I would quickly think of a number of features you can
easily (with some not that complicated code) and syntactically get at:
~
 3.1) length of texts (as number of words that are and/or are not content
words)
~
 3.2) length of texts' sentences and/or paragraphs
~
 3.3) dependency and "carried-over" sense among paragraphs
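
 A rough sketch of how 3.1 to 3.3 could be extracted, in Python. Everything
here is an assumption made for illustration: the tiny FUNCTION_WORDS list,
the naive sentence and paragraph splitting, and above all the use of
content-word overlap between adjacent paragraphs as a crude stand-in for
"carried-over" sense. The name readability_features is hypothetical.

```python
import re

# Assumption: a tiny illustrative function-word list; a real project
# would use a fuller one for the language at hand.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "is",
    "are", "was", "were", "it", "that", "this", "with", "for", "as",
    "at", "by",
}


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def readability_features(text):
    # Paragraphs are separated by blank lines; sentences end at . ! or ?
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    content = [w for w in words if w not in FUNCTION_WORDS]

    # 3.3 proxy: share of a paragraph's content-word types that already
    # appeared in the paragraph just before it.
    overlaps = []
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        prev_set = set(tokenize(prev)) - FUNCTION_WORDS
        cur_set = set(tokenize(cur)) - FUNCTION_WORDS
        if cur_set:
            overlaps.append(len(prev_set & cur_set) / len(cur_set))

    return {
        "n_words": len(words),                                # 3.1
        "n_content_words": len(content),                      # 3.1
        "n_function_words": len(words) - len(content),        # 3.1
        "words_per_sentence":                                 # 3.2
            len(words) / len(sentences) if sentences else 0.0,
        "words_per_paragraph":                                # 3.2
            len(words) / len(paragraphs) if paragraphs else 0.0,
        "paragraph_overlap":                                  # 3.3
            sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }
```

 Lexical overlap only catches repeated words, not pronouns or synonyms, so
it understates how much sense is really carried over between paragraphs.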
~
 You/the code monkey you hire should try to stratify this information
and define some metrics. Without defining "readability" first, building
the type of corpus you have in mind would be an aimless project
~
 I have thought about these same kinds of things, but more in an "X as a
second language" way: say, if you speak L1, there are certain syntactic
structures and false cognates in L2 you want to be exposed to. Both
"syntactic structures" and "false cognates" can be measurably accounted
for and parametrized
~
<blatantly_off_topic_ad>
 I have theorized about and coded such things already and I would work
for food ;-) provided it is an open source project. Also, I speak
English, German and Spanish (willing to learn any language, especially
non-Western ones)
</blatantly_off_topic_ad>
~
 See you
 lbrtchx

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
