[Corpora-List] Training Corpus for Readability Difficulty

Albretch Mueller lbrtchx at gmail.com
Sat Oct 18 16:42:17 UTC 2008


> We are looking for a training corpus to study readability difficulty.
~
> ... Unfortunately he was unable to share it the last time I asked
~
 This is what I would do:
~
 1) search educational and children's web sites, or ask "harry
potter" himself ;-), to find out what kinds of books children (of a
certain culture and age) read. You should finish this step with a long
list, and be ready to be amazed by it; today's children's universe has
changed quite a bit (what is one of the best-selling video games in
America? One whose theme is gunning down poor, powerless (and black)
Haitian people in Miami ...), then
~
 2) go to http://www.gutenberg.org and search for "children" (I got 131
hits, all of them books in the public domain). You may find some of the
books on your list, or similar ones, and I am sure you could find a
whole lot more
~
 3) try to define "readability" in a more functional and perhaps
measurable way. Offhand I can think of a number of features you can
get at easily and syntactically (with some not-that-complicated code):
~
 3.1) length of texts (as the number of words, counting content words and non-content words together or separately)
~
 3.2) length of texts' sentences and/or paragraphs
~
 3.3) dependency and "carried-over" sense among paragraphs
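~
 A minimal sketch of how features 3.1 to 3.3 could be measured, offered
as one possible reading rather than a settled method. Everything named
here is an assumption: the FUNCTION_WORDS list is a tiny hand-made
stand-in for "non-content words", sentences are split naively on
end punctuation, and the "carried-over" sense between paragraphs is
approximated by the overlap of content words in adjacent paragraphs.

```python
import re

# Assumption: a tiny hand-made list of non-content (function) words.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "it", "he", "she", "they",
}

def tokenize(text):
    """Lower-cased alphabetic tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def surface_features(text):
    """Return crude per-text measurements for 3.1, 3.2 and 3.3."""
    # Paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = tokenize(text)
    content = [w for w in words if w not in FUNCTION_WORDS]

    # 3.3 proxy (assumption): Jaccard overlap of content words between
    # each pair of adjacent paragraphs, averaged over the text.
    overlaps = []
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        a = set(tokenize(prev)) - FUNCTION_WORDS
        b = set(tokenize(cur)) - FUNCTION_WORDS
        if a | b:
            overlaps.append(len(a & b) / len(a | b))

    return {
        "words": len(words),
        "content_words": len(content),
        "function_words": len(words) - len(content),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "paragraph_overlap": sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }
```

 A real project would want a proper tokenizer and stop list for the
language at hand; the point is only that each feature reduces to a
number per text.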
~
 You (or the code monkey you hire) should try to stratify this
information and define some metrics. Without defining "readability"
first, building the type of corpus you have in mind would be an aimless
project
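~
 One way to read "stratify" is to rank the texts by a single metric
and cut the ranking into bands of roughly equal size. The sketch below
assumes exactly that; the function name, the number of bands and the
idea of using one score per text are all illustrative choices, not
something prescribed above.

```python
def stratify(scores, n_strata=3):
    """Map each title to a band 0..n_strata-1 by rank of its score.

    `scores` is a dict of title -> number (hypothetical input, e.g. a
    mean sentence length). Band 0 holds the lowest-scoring texts.
    """
    ranked = sorted(scores, key=scores.get)
    return {
        title: rank * n_strata // len(ranked)
        for rank, title in enumerate(ranked)
    }
```

 Rank-based bands avoid having to pick absolute thresholds before any
data has been looked at; fixed cut-offs could replace them once the
metric is better understood.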
~
 I have thought about these same kinds of things, but more in an "X as
a second language" way: say, if you speak L1, there are certain
syntactic structures and false cognates in L2 you want to be exposed
to. Both "syntactic structures" and "false cognates" can be measurably
accounted for and parametrized
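~
 As a sketch of what "measurably accounted for" could mean for false
cognates: count how often words from a false-friend list occur in an
L2 text. The list below is a small illustrative sample for a Spanish
(L1) reader of English (L2), and the function name and the
per-100-words rate are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative sample only: English words that mislead Spanish speakers
# (actualmente, embarazada, libreria, sensible).
FALSE_COGNATES = {"actually", "embarrassed", "library", "sensible"}

def false_cognate_profile(text, lexicon=FALSE_COGNATES):
    """Return (hits per word, hits per 100 words) for an L2 text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    hits = Counter(w for w in words if w in lexicon)
    rate = 100.0 * sum(hits.values()) / len(words) if words else 0.0
    return hits, rate
```

 The lexicon is the parameter: swapping in a different L1/L2 list
gives a different profile for the same text.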
~
<blatantly_off_topic_ad>
 I have theorized about and coded such things already and I would work
for food ;-) provided it is an open source project. Also, I speak
English, German and Spanish (and am willing to learn any language,
especially a non-Western one)
</blatantly_off_topic_ad>
~
 See you
 lbrtchx

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


