[Corpora-List] Where can I find an English for Children corpus?

Thu Mar 10 16:18:35 UTC 2011

On 3/10/2011 8:46 AM, Michael Israel wrote:
> There is also a great deal of research based on this data showing that
> the words and grammatical constructions which children learn are in many
> (but not all) respects highly correlated with the frequency with which
> these occur in the spoken input that the children hear. So, CHILDES might be
> more relevant than you think.

An analysis of the stages of language learning may provide some useful
clues to the underlying mechanisms.

But stories written for children are notoriously difficult to interpret.
The major problem is that they depend very heavily on background
knowledge that is not easy to verbalize.

Charniak discovered that point 40 years ago:

Charniak 1972: Eugene Charniak, “Toward a Model Of Children's Story
Comprehension,” PhD thesis 1972, MIT, MIT Artificial Intelligence
Laboratory Technical Report TR-266. Also at
ftp://publications.ai.mit.edu/ai-publications/pdf/AITR-266.pdf

A notorious example is the first story in the Dick & Jane series.
Every page is filled with a picture and one line of text,
such as "Oh, look." and "Oh, Oh, Oh."  Eventually it reaches
the level of "See Spot run."

A machine learning system might learn simple grammatical patterns.
But if it can't interpret pictures, it won't learn semantics.

John Sowa
