[Corpora-List] Data-Driven Learning materials

Fri Apr 18 09:58:49 UTC 2008

Serge,

A write-up of what we have done so far is:

(Kilgarriff, Milos Husák, Katy McAdam, Michael Rundell, Pavel Rychlý) GDEX:
Automatically finding good dictionary examples in a corpus
<http://www.kilgarriff.co.uk/Publications/2008-KilgEtAl-euralex-gdex.doc>.
Proc EURALEX 2008, Barcelona, Spain.

The easiest heuristics are sentence length and word frequency.  Lots
of uppercase and/or punctuation is bad news (especially for text from the
web).  We are exploring more grammar (using a parser and/or the low-tech
alternative: penalising sentences with long noun sequences or lots of
tensed verbs) and also language modelling (since readable sentences will
tend to have high probability in a probabilistic language model; thanks to
Claudia Leacock for this idea, she is doing similar things at Butler Hill
Group).
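
To make the simpler heuristics concrete, here is a minimal sketch in Python.
It is not the GDEX implementation: the word list, the length thresholds and
the way the three signals are combined are all assumptions chosen for
illustration, and the names score_sentence and rank_sentences are
hypothetical.

```python
import re

# Hypothetical toy list of frequent words; a real system would derive
# frequencies from a large corpus instead.
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "and", "in", "on", "is", "was", "it",
    "he", "she", "they", "cat", "sat", "mat", "sun", "dog", "house",
}

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def score_sentence(sentence, common_words=COMMON_WORDS, min_len=5, max_len=20):
    """Return a score in [0, 1]; higher means a more readable-looking sentence.

    Thresholds and weighting are illustrative assumptions only.
    """
    tokens = WORD_RE.findall(sentence)
    if not tokens:
        return 0.0
    n = len(tokens)

    # Sentence length: full marks inside the preferred range, decaying outside.
    if n < min_len:
        length_score = n / min_len
    elif n > max_len:
        length_score = max_len / n
    else:
        length_score = 1.0

    # Word frequency: share of tokens found in the common-word list.
    freq_score = sum(t.lower() in common_words for t in tokens) / n

    # Noise: penalise a high share of uppercase letters and punctuation.
    letters = [c for c in sentence if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters)
    visible = [c for c in sentence if not c.isspace()]
    punct_ratio = sum(not c.isalnum() for c in visible) / len(visible)
    noise_score = max(0.0, 1.0 - 2.0 * (upper_ratio + punct_ratio))

    return length_score * freq_score * noise_score


def rank_sentences(sentences):
    """Sort sentences 'best first' according to score_sentence."""
    return sorted(sentences, key=score_sentence, reverse=True)
```

Multiplying the three signals means a sentence that fails badly on any one of
them (for example, all capitals) sinks to the bottom; a weighted sum would be
an equally reasonable choice.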

There's good current work on readability by Isahara's group in Nara, Japan,
see e.g.

Kotani, K., T. Yoshimi, T. Kutsumi, I. Sata, and H. Isahara 2008.  EFL
Learner Reading Time Model for Evaluating Reading Proficiency
<http://www.gelbukh.com/cicling/2008/FirstPages/Paper8723.pdf>.
Proc CICLING, Haifa, Israel.

There's also the tradition going back to Flesch and others in the early 20th
century (see discussion in the paper).

Do say more about what you did (though maybe not on the list).

Adam

2008/4/16 Serge Sharoff <s.sharoff at leeds.ac.uk>:

> Adam,
> I wonder which method you are using for ranking examples.  We were
> trying to do something similar, but for whole webpages (and a
> variety of languages).  For example, we ranked the English Wikipedia and
> my I-EN corpus by their coverage by GSL words:
> http://corpus.leeds.ac.uk/teaching/i-en-gsl.csv.bz2
> http://corpus.leeds.ac.uk/teaching/wiki-en-gsl.csv.bz2
>
> The problem is that many pages with low lexical coverage by GSL contain
> words that are known anyway, e.g., computer or construction.  On the
> other hand, many phrasal verbs, e.g. 'give up', or constructions, e.g. 'go
> the extra mile', do contribute to the lexical count, but are not understood
> by students.  Problems of this sort are not accidental (we found little
> correlation between the GSL coverage and understanding); a much better
> model of difficulty is needed to find texts/examples suitable for
> language learners.
> Serge
>
> On Wed, 2008-04-16 at 12:53 +0100, Adam Kilgarriff wrote:
> > Dear Alex,
> >
> > you say
> > >  Is there really so little out there? Why?
> >
> >
> > I think the reason is simple: Concordances are too tough for learners.
> > So DDL has not taken off.  After 20 years, it remains a tiny minority
> > interest.
> >
> > Our response is to select corpus sentences according to readability.
> > The beta version of the Sketch Engine now has an option to sort
> > concordances "best first", from a learner's point of view, and we are
> > working on other ways of using corpora in language learning in which
> > we only show users sentences which they are likely to be able to read
> > and understand.
> >
> > Adam
> >
> > 2008/4/15 Alex Boulton <Alex.Boulton at univ-nancy2.fr>:
> >         Dear all
> >
> >         I recently requested information on any published materials or
> >         on-line materials adopting a data-driven learning approach. My
> >         thanks to the following for their replies:
> >
> >               * Adam Turner
> >               * Chris Tribble
> >               * Mike Barlow
> >               * Brett Reynolds
> >               * Stéphanie O'Riordan
> >               * Antoinette Renouf
> >               * James Thomas
> >               * Linda Bawcom
> >               * Marcia Veirano Pinto
> >               * Przemek Kaszubski
> >               * Simon Smith
> >               * John Milton
> >
> >         Unfortunately (if unsurprisingly), there were no real
> >         additions to the publications I listed in the original mail.
> >         Is there really so little out there? Why?
> >
> >
> > ...
> >
> > --
> > ================================================
> > Adam Kilgarriff http://www.kilgarriff.co.uk
> > Lexical Computing Ltd http://www.sketchengine.co.uk
> > Lexicography MasterClass Ltd http://www.lexmasterclass.com
> > Universities of Leeds and Sussex adam at lexmasterclass.com
> > ================================================
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>

-- 
================================================
Adam Kilgarriff http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================