<div dir="ltr">Feynman's piece really is great, and the links between the things he points out and common practice in NLP seem pretty valid to me. Although it is common practice to lift the figures from one paper and compare them to newer work, the basis for doing this seems shaky. <div>

<br></div><div>For example, take the findings described in a paper shortlisted for best paper at ACL last year, <a href="http://aclweb.org/anthology/P/P13/P13-1166.pdf">http://aclweb.org/anthology/P/P13/P13-1166.pdf</a>. In this paper, researchers were unable to reproduce the results of a relatively uncomplicated and well-described system - even with the help of the original authors. I think this variability is pretty common in practical NLP; each tokeniser, ML implementation, evaluation implementation and so on has its own quirks, and sometimes they're even (gasp) non-deterministic. I am sure that many of us have, after giving an NLP assignment to students, seen solutions that are correct yet span a range of different performance scores.<div>

<div><br></div><div>Why should you trust what others report as a one-off experiment based on closed data and closed code? To me that sounds unscientific; to err is human, and trusting others' results absolutely is to assume perfection from every author in every experiment and over every dataset. There are definitely problems out there, as the above paper demonstrates. I think it is desirable to at least attempt to implement the competing method using as much toolkit overlap as possible when comparing to your own - and, even better, to reduce the emphasis on relatively small performance increases in cases where one could not reproduce prior results.</div>
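<div><br></div><div>One way to put a small reported gain in context is to look at how much a system's score moves between runs before reading anything into the difference. The sketch below is only illustrative: <code>run_experiment</code> is a hypothetical stand-in that simulates a non-deterministic system with made-up Gaussian noise (it is not any real toolkit), and the two-standard-deviation threshold is an arbitrary assumption rather than an established rule.</div>

```python
import random
import statistics


def run_experiment(seed, true_score=0.80, noise=0.01):
    """Hypothetical stand-in for training and scoring a system once.

    A real pipeline would train and evaluate a model here; this version
    only simulates run-to-run variation with Gaussian noise.
    """
    rng = random.Random(seed)
    return true_score + rng.gauss(0.0, noise)


def summarise_runs(seeds):
    """Return (mean, sample standard deviation) of the score over the seeds."""
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


def exceeds_noise(reported_gain, spread, factor=2.0):
    """Crude check: is a gain larger than `factor` times the run-to-run spread?"""
    return abs(reported_gain) > factor * spread


if __name__ == "__main__":
    mean, spread = summarise_runs(range(10))
    print(f"mean={mean:.4f} spread={spread:.4f}")
    print("0.5-point gain exceeds noise:", exceeds_noise(0.005, spread))
```

<div>If a claimed improvement is no larger than the spread across seeds of a single system, it probably should not carry much weight in a comparison against figures lifted from another paper.</div>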

<div><br></div><div>I'm not sure reimplementation is always feasible unless you have the code to hand; unfortunately the code often just isn't there, for understandable reasons (we are busy enough). Are any of our publication venues moving toward a "code & data" requirement for publication? One step in that direction is the explicit declaration of reproducibility by a committee, as done at VLDB: <a href="http://www.vldb.org/2013/experimental_reproducibility.html">http://www.vldb.org/2013/experimental_reproducibility.html</a>. A preference for drawing comparisons against reproducible baselines rewards those who make the effort to share their code.</div>

</div></div><div><br></div><div>All the best,</div><div><br></div><div><br></div><div>Leon</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 9 April 2014 06:16, Jason Eisner <span dir="ltr"><<a href="mailto:jason@cs.jhu.edu" target="_blank">jason@cs.jhu.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><a href="http://calteches.library.caltech.edu/51/02/CargoCult.pdf" target="_blank">Feynman's piece</a> is great, and I often recommend it to students, as well as quoting my favorite line ("The first principle is that you must not fool yourself -- and you are the easiest person to fool.")<br>

<br></div>But I think Noah is right: the particular problem Kevin mentions is not usually an issue in our community.  Our equivalent to comparing two implementations "in the same laboratory" is to compare their accuracy on the same dataset, using the same metric.  It's a reasonable presumption in computational experiments that other differences between labs won't affect the accuracy.  We count on portability of the implementation, and assume that if your implemented method is beating mine, it's not because my lab has inferior machines (lossy memory, smaller word size, Pentium floating-point division bug, improper cooling, buggy compiler ...).  Rather, we figure that my code would have done precisely as badly if run in your lab.  <br>

<br>Yes, there are situations where this argument doesn't apply:<br><br></div><div>* You're comparing speed rather than accuracy.  Speed isn't portable, so speed comparisons should indeed be done on the same machine under the same workload.<br>

</div><div>* You're comparing the accuracy not of two implementations, but of (e.g.) two feature sets.  Then it's important to ensure that the two implementations are matched in all respects other than the feature sets.<br>

</div><div><br></div><div>But people generally seem to recognize these issues and get the comparisons right.<br></div><div>

<br></div>I think a more pressing problem is that we tend to overinterpret our results.  Someone reports that implementation AA of method A does significantly better than implementation BB of method B, when both are trained on dataset D.  The comparison was performed on n samples from test distribution P, using metric M.  But this hardly shows that A will do better than B in other settings.  The advantage might not carry over to other pairs of implementations, other training sets, other test distributions, or other evaluation metrics.  The statistical significance shows only that n was big enough 

(to reject the null hypothesis) when everything else was held fixed.  <br>
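<br>To illustrate what such a significance claim does and does not cover, here is a minimal sketch of a paired approximate randomization test, one common choice for comparing two systems scored on the same test items. The function name, its parameters and the per-item score lists are assumptions made for illustration; they do not come from any particular paper or library.<br>

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test on per-item scores of two systems.

    Returns an estimated p-value for the null hypothesis that the two
    systems are interchangeable on this particular test set.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(trials):
        # Under the null, swapping the two systems' outputs on any item is
        # equally likely, which flips the sign of that item's difference.
        shuffled = sum(d * rng.choice((1, -1)) for d in diffs)
        if abs(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

<br>Everything other than the item-level differences is held fixed here: the two implementations, the training data, the test sample and the metric. With only a handful of test items the returned p-value stays high however consistent the advantage is, and a low p-value on a large test set still says nothing about other datasets, metrics or implementations.<br>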

<br>This latter concern is more closely related to the traditional demand that studies be <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028" target="_blank">replicated</a>.  Replication involves some degree of generalization, to determine whether the claimed causes are still able to produce the effect in a new setting.  This is different from merely expecting results to be reproducible (which they are if the code and data are saved).<br>

<br></div>regards, jason<div><div class="h5"><br><div><div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 8, 2014 at 9:59 PM, Noah A Smith <span dir="ltr"><<a href="mailto:nasmith@cs.cmu.edu" target="_blank">nasmith@cs.cmu.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">What are the "unknown ways" that one NLP researcher's conditions might differ from another NLP researcher's?  If you're empirically measuring runtime, you might have a point.  But if you're using a standardized dataset and automatic evaluation, it seems reasonable to report others' results for comparison.  Since NLP is much more about methodology than scientific hypothesis testing, it's not clear what the "experimental control" should be.  Is it really better to run your own implementation of the competing method?  (Some reviewers would likely complain that you might not have replicated the method properly!)  What about running the other researcher's code yourself?  I don't think that's fundamentally different from reporting others' results, unless you don't trust what they report.  Must I reannotate a Penn Treebank-style corpus every time I want to build a new parser?</div>

<div class="gmail_extra"><br clear="all"><div>--<br>Noah Smith<br>Associate Professor<br>School of Computer Science<br>Carnegie Mellon University</div>

<br><br><div class="gmail_quote"><div><div>On Tue, Apr 8, 2014 at 6:57 PM, Kevin B. Cohen <span dir="ltr"><<a href="mailto:kevin.cohen@gmail.com" target="_blank">kevin.cohen@gmail.com</a>></span> wrote:<br>

</div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div>

<div dir="ltr"><div><div>I was recently reading the 

Wikipedia page on "cargo cult science," a concept attributed to no 

lesser a light than Richard Feynman.  I found this on the page:<br><br>"An example of cargo cult science is an experiment that uses another researcher's results in lieu of an <a href="http://en.wikipedia.org/wiki/Experimental_control" title="Experimental control" target="_blank">experimental control</a>.

 Since the other researcher's conditions might differ from those of the 

present experiment in unknown ways, differences in the outcome might 

have no relation to the <a href="http://en.wikipedia.org/wiki/Independent_variable" title="Independent variable" target="_blank">independent variable</a> under consideration. Other examples, given by Feynman, are from <a href="http://en.wikipedia.org/wiki/Educational_research" title="Educational research" target="_blank">educational research</a>, <a href="http://en.wikipedia.org/wiki/Psychology" title="Psychology" target="_blank">psychology</a> (particularly <a href="http://en.wikipedia.org/wiki/Parapsychology" title="Parapsychology" target="_blank">parapsychology</a>), and <a href="http://en.wikipedia.org/wiki/Physics" title="Physics" target="_blank">physics</a>. He also mentions other kinds of dishonesty, for example, falsely promoting one's research to secure funding."<br>

<br></div>If we all had a dime for every NLP paper we've read that used "another researcher's results in lieu of an 

experimental control," we wouldn't have to work for a living.  <br><br>What do you think?  Are we all cargo cultists in this respect?<br><br><a href="http://en.wikipedia.org/wiki/Cargo_cult_science" target="_blank">http://en.wikipedia.org/wiki/Cargo_cult_science</a><br>

<br></div>Kev<span><font color="#888888"><br><br clear="all"><br>-- <br><div dir="ltr">Kevin Bretonnel Cohen, PhD<br>Biomedical Text Mining Group Lead, Computational Bioscience Program, <br>U. Colorado School of Medicine<br>

<a href="tel:303-916-2417" value="+13039162417" target="_blank">303-916-2417</a><br><a href="http://compbio.ucdenver.edu/Hunter_lab/Cohen" target="_blank">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a><br>

<br><br><br></div>

</font></span></div>

<br></div></div>_______________________________________________<br>

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br></blockquote></div><br></div>


<br></blockquote></div><br></div></div></div></div></div></div></div></div>


<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Leon R A Derczynski<br>Research Associate, NLP Group<div><br></div><div>Department of Computer Science</div><div>University of Sheffield, UK<br>

<br><a href="http://www.dcs.shef.ac.uk/~leon/" target="_blank">http://www.dcs.shef.ac.uk/~leon/</a></div></div>

</div>