[Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Ted Pedersen tpederse at d.umn.edu
Sun Jul 3 18:10:05 UTC 2011


Greetings all,

I made a few remarks during the ACL 2011 business meeting in favor of this
year's innovation of allowing data and code to be submitted along with
papers. I suggested this is something we want to continue and
encourage, particularly for papers submitted to the empirical track at ACL
(which is the majority of papers these days), so that we might be able to
reproduce results more easily. I had prepared some slides that I didn't use,
but I've posted them here; they summarize at least part of what I said (I
forgot a few points, but the gist is fairly consistent, I guess...):

http://www.slideshare.net/duluthted/pedersen-acl2011businessmeeting

There were quite a few comments thereafter and I took a few notes, and I
guess I thought it would be possibly useful to preserve these "for the
record" at least, since I think that discussion raised many of the common
concerns about this issue. It might also be an opportunity for folks to
follow up or at least continue thinking.

Below are the comments, approximately in the order made. Note that I'm
trying here to simply reproduce the gist of comments, and not offer any
opinion on them. I think it was great there was such an extensive
discussion, and I guess I just wanted to note that and preserve it as best I
could. If anyone feels like they have been misquoted, forgotten, or
misunderstood, please feel free to jump in and elaborate.

0) Speaker was in support of encouraging more submissions of code and
data, and noted that he was happy to see quite a few presentations at ACL
where code and data were being made available.

1) Data is sometimes expensive to create (especially speech data) and
releasing it after one publication may not be in the best interests of the
creators.

2) Reviewing code is time-consuming (and another concern raised during the
business meeting was reviewer overload, so this certainly fit into that
theme).

3) It is often hard or impossible for people in industrial settings to
release code - the licensing issues are sometimes very complex and would
need to be resolved before any code was submitted.

4) There could be a prize offered for the best code / best data submitted.

5) It is hard to know how to review software.

6) Maybe software could be made available on an ACL cloud, in order to solve
some licensing concerns (especially those of industry).

7) Code at submission time is very hard to anonymize - maybe we
need separate reviewers for code and data (distinct from the paper's reviewers).

8) Simply releasing or submitting code isn't necessarily useful (if it is
bad code). How do we make sure the code is of high quality and/or useful?

9) There is a tension between having new and exciting ideas and producing
well engineered code. Put another way, there's a tension between pushing the
envelope and playing it safe. The speaker was concerned we might be moving
too far away from encouraging new ideas.

10) Releasing code will in the end help the impact of work. If you look at
high impact work in our field, it often centers around a resource (e.g., Penn
Treebank). Releasing code can also help people in industry, because
sometimes publishing code is the only way that it will ever get out (e.g.,
sentence alignment code from CL in 1993 by Gale and Church).

11) Have a retroactive prize after a few years for software systems that are
released and are proven to have some impact.

12) During the discussion of the new journal, it was mentioned that maybe
that could be a vehicle for releasing code and data.

I'm grateful that the ACL opened up the business meeting to these kinds of
remarks, and really appreciate both the opportunity to say a few words, and
also hear all these different views. It's given me a lot to think about, and
I just wanted to pass along my notes in the hopes of encouraging others to
do the same. Keep talking. :)

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse