[Corpora-List] "Cargo cult" NLP?

yversley at gmail.com
Wed Apr 9 06:41:49 UTC 2014


That’s why it’s so important to provide easy access to both standardized datasets and
automatic, standardized scorers.

As an example, in coreference there was considerable variation in how people evaluated their systems: there was a race to the bottom in terms of meaningful evaluation (people wanted to see improvements from noisy world knowledge, so they’d evaluate on gold mentions, because that’s the only setting in which it helps a lot), and even the people writing ACL papers about how everyone else was doing it wrong weren’t above “choosing not to impute certain errors”.

These are the same issues that also plagued parsing evaluation ca. 1997, and in coreference the SemEval-2010 and CoNLL shared tasks mean that we now have a dataset that’s fully accessible to everyone (unlike the ACE data, where the test data was not distributed to participants), with a standardized scorer (written by Emili Sapena for SemEval, with many corrections and improvements from Sameer Pradhan and the others who organized the CoNLL shared task).

So it’s definitely possible to measure the same thing for everyone, even if it takes some effort.
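
To make that concrete, here is a minimal, purely illustrative sketch in Python (toy data and a toy pairwise F1, not the actual SemEval/CoNLL coreference scorer): the point is simply that every system is run through the same scoring function on the same gold annotations, so the resulting numbers can be compared directly.

    # Minimal sketch (hypothetical data and metric, not the real CoNLL/SemEval
    # coreference scorer): every system is scored by the *same* function on the
    # *same* gold annotations, which is what makes the numbers comparable.

    def f1(gold: set, predicted: set) -> float:
        """Balanced F1 over a set of predicted items."""
        if not gold or not predicted:
            return 0.0
        tp = len(gold & predicted)
        if tp == 0:
            return 0.0
        precision = tp / len(predicted)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    # One shared gold standard (here: toy mention-pair links), fixed for everyone.
    gold_links = {("m1", "m2"), ("m2", "m3"), ("m4", "m5")}

    # Outputs from two different (hypothetical) systems; only the predictions differ.
    system_outputs = {
        "system_A": {("m1", "m2"), ("m4", "m5")},
        "system_B": {("m1", "m2"), ("m2", "m3"), ("m1", "m5")},
    }

    for name, links in system_outputs.items():
        print(f"{name}: F1 = {f1(gold_links, links):.3f}")
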

In NLP, you want not only to measure the same thing for everyone, but also to measure the right thing: normally people don't want to use a parser because they want to find out exciting new things about PTB section 23, but because they want to use it on 18th-century German, or on Arabic blogs, or on the next exciting thing. Which is why, once you have one point of reference firmly down, you want to get to another one to see whether your assumptions still hold.


So, yes, it’s perfectly possible to do “cargo cult” style NLP, which is why standardized evaluations and people actually replicating others’ experiments are both important. And I picked established tasks here because earlier mistakes are more visible and well understood, not because I couldn’t come up with more egregious examples from new and exciting tasks.


-Yannick





From: Noah A Smith
Sent: Wednesday, 9 April 2014, 03:59
To: Kevin B. Cohen
Cc: corpora





What are the "unknown ways" that one NLP researcher's conditions might differ from another NLP researcher's?  If you're empirically measuring runtime, you might have a point.  But if you're using a standardized dataset and automatic evaluation, it seems reasonable to report others' results for comparison.  Since NLP is much more about methodology than scientific hypothesis testing, it's not clear what the "experimental control" should be.  Is it really better to run your own implementation of the competing method?  (Some reviewers would likely complain that you might not have replicated the method properly!)  What about running the other researcher's code yourself?  I don't think that's fundamentally different from reporting others' results, unless you don't trust what they report.  Must I reannotate a Penn Treebank-style corpus every time I want to build a new parser?



--
Noah Smith
Associate Professor
School of Computer Science
Carnegie Mellon University



On Tue, Apr 8, 2014 at 6:57 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:




I was recently reading the Wikipedia page on "cargo cult science," a concept attributed to no lesser a light than Richard Feynman.  I found this on the page:

"An example of cargo cult science is an experiment that uses another researcher's results in lieu of an experimental control. Since the other researcher's conditions might differ from those of the present experiment in unknown ways, differences in the outcome might have no relation to the independent variable under consideration. Other examples, given by Feynman, are from educational research, psychology (particularly parapsychology), and physics. He also mentions other kinds of dishonesty, for example, falsely promoting one's research to secure funding."


If we all had a dime for every NLP paper we've read that used "another researcher's results in lieu of an experimental control," we wouldn't have to work for a living.  

What do you think?  Are we all cargo cultists in this respect?

http://en.wikipedia.org/wiki/Cargo_cult_science

Kev


-- 

Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program, 
U. Colorado School of Medicine
303-916-2417
http://compbio.ucdenver.edu/Hunter_lab/Cohen





_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

