27.3923, FYI: LAMBADA test set release

Tue Oct 4 17:01:47 UTC 2016

LINGUIST List: Vol-27-3923. Tue Oct 04 2016. ISSN: 1069 - 4875.

Subject: 27.3923, FYI: LAMBADA test set release

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry,
                                   Robert Coté, Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================

Date: Tue, 04 Oct 2016 13:00:25
From: Denis Paperno [denis_paperno at mail.ru]
Subject: LAMBADA test set release

We are happy to announce the release of the test portion of the LAMBADA
dataset (LAnguage Modeling Broadened to Account for Discourse Aspects).
LAMBADA aims at testing computational models of natural language on their
ability to integrate information from a larger context than a single sentence
or an n-gram window. Current models have a very hard time with discourse
context in general, and with the LAMBADA task specifically (as shown in
Paperno et al. 2016). By releasing the test set, we hope to encourage research
in this area, to help move AI towards real natural language understanding.

In a Nutshell:

LAMBADA website: http://clic.cimec.unitn.it/lambada.

Reference: D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S.
Pezzelle, M. Baroni, G. Boleda and R. Fernández. The LAMBADA dataset: Word
prediction requiring a broad discourse context. Proceedings of ACL 2016 (54th
Annual Meeting of the Association for Computational Linguistics), East
Stroudsburg PA: ACL, pages 1525-1534.

Details:

LAMBADA is a collection of narrative passages sharing the characteristic that
human subjects are able to guess their last word if they are exposed to a long
passage, but not if they only see the last sentence preceding the target word.
For example, this is a sample data point in the dataset:

Context: "Yes, I thought I was going to lose the baby." "I was scared
too," he stated, sincerity flooding his eyes. "You were?" "Yes, of course.
Why do you even ask?" "This baby wasn't exactly planned for."
Target sentence: "Do you honestly think that I would want you to have a
________?"
Target word: miscarriage

The LAMBADA task consists in predicting the target word given the whole
passage (i.e., the context plus the target sentence). For more information and
download, visit the dataset’s website: http://clic.cimec.unitn.it/lambada.
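
As a rough illustration of the task setup, the sketch below scores a
predictor by last-word accuracy. It is not the authors' evaluation code. It
assumes each passage is a whitespace-tokenized string whose final token is
the target word, which may not match the released file format. The names
split_passage, accuracy and most_frequent_word are hypothetical, and the
frequency baseline is only a placeholder for a real model.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def split_passage(passage: str) -> Tuple[str, str]:
    """Split a passage into (text before the last word, last word).

    Assumption: the target word is the final whitespace-separated token.
    """
    prefix, _, target = passage.strip().rpartition(" ")
    return prefix, target


def accuracy(passages: Iterable[str], predict: Callable[[str], str]) -> float:
    """Fraction of passages whose last word the predictor guesses exactly."""
    items = list(passages)
    if not items:
        return 0.0
    correct = 0
    for passage in items:
        prefix, target = split_passage(passage)
        if predict(prefix) == target:
            correct += 1
    return correct / len(items)


def most_frequent_word(prefix: str) -> str:
    """Placeholder baseline: guess the most frequent word seen so far."""
    words = prefix.split()
    if not words:
        return ""
    return Counter(words).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy passages for demonstration only; not taken from the dataset.
    toy = ["a b a c a", "x y z"]
    print(accuracy(toy, most_frequent_word))
```

Any function that maps the passage prefix to a single word can be passed as
predict, so a language model's top-ranked next word could be scored the same
way.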

Acknowledgements: This project has received funding from the European Union's
Horizon 2020 research and innovation programme under the Marie
Sklodowska-Curie grant agreement No 655577 (LOVe); ERC 2011 Starting
Independent Research Grant n. 283554 (COMPOSES); NWO VIDI grant n. 276-89-008
(Asymmetry in Conversation). LAMBADA passages were extracted from the
BookCorpus (http://www.cs.toronto.edu/~mbweb).

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

