[Corpora-List] Source code corpora
Klaus Guenther
klaus.guenther at split.uni-bamberg.de
Thu Nov 20 19:21:10 UTC 2008
The difficulty with large and important projects such as the Linux
kernel is that few people are allowed to commit code. If someone has
code to submit, they provide it for review, and once it is accepted,
someone with sufficient karma commits it. So it is not possible to
simply parse the commit emails by sender to determine the author.
In addition, many of the changes are very small, and coding standards
are rigorously enforced. Therefore, formatting will hardly differ
between developers. Instead, it might be helpful to isolate the
comments and consider their frequency and style, as these generally do
not follow any standard other than, perhaps, the requirement that they
be frequent enough to explain to an unfamiliar programmer what each
piece of code does and why it is coded the way it is.
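As a rough illustration of the comment idea, a minimal Python sketch
might look like the following. It assumes C-style comments (as in the
kernel), and it ignores the case of comment markers inside string
literals. The function name comment_profile and the measures it reports
are my own hypothetical choices, not an established feature set.

```python
import re

# Matches /* ... */ block comments and // line comments.
# Simplification: comment markers inside string literals are not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def comment_profile(source: str) -> dict:
    """Return simple frequency and style measures for the comments in source."""
    comments = COMMENT_RE.findall(source)
    total_lines = len(source.splitlines()) or 1  # avoid division by zero
    block_count = sum(1 for c in comments if c.startswith("/*"))
    comment_chars = sum(len(c) for c in comments)
    return {
        # How many comments were found.
        "comments": len(comments),
        # Frequency: comments per line of source.
        "comments_per_line": len(comments) / total_lines,
        # Style: share of block comments versus line comments.
        "block_share": block_count / len(comments) if comments else 0.0,
        # Style: average comment length in characters, delimiters included.
        "mean_length": comment_chars / len(comments) if comments else 0.0,
    }


sample = """\
/* Add two numbers. */
int add(int a, int b)
{
    return a + b; // no overflow check
}
"""
profile = comment_profile(sample)
```

Such profiles could then be compared across files of known authorship,
although whether they discriminate well between authors remains to be
tested.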
Smaller projects may be more interesting, especially where standards are
less tightly enforced. Indeed, SourceForge and other open repositories
provide masses of code that are often written by individuals or small
teams, where each programmer commits code directly. I have experience
working with one particular open source endeavor, the PEAR project
(http://pear.php.net/). The code there is often very diverse, even with
a coding standard. Older code is not necessarily updated to reflect
changes to the coding standard, and code reuse is very popular. Each
individual module (package) is controlled by one or more programmers who
are fully responsible for its development. Yet many patches are
submitted by developers who merely use the packages, and these are
committed by the packages' developers, often after being edited.
So the main issue is finding code that can reliably be attributed to an
author in an unmodified form and discovering details that are not
attributable to the project's coding standard. I know of no such corpus.
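To make the first half of that criterion concrete, here is a small
Python sketch. It assumes the commit metadata has already been
extracted into records that distinguish the person who wrote a change
from the person who committed it. The Commit record, its fields, and
the sample names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    commit_id: str
    author: str     # who wrote the change (hypothetical field)
    committer: str  # who applied it to the repository (hypothetical field)


def self_committed_by_author(commits):
    """Group commit ids by author, keeping only self-committed changes."""
    by_author = defaultdict(list)
    for c in commits:
        # A change committed by someone else may have been edited on the
        # way in, so only keep changes the author committed personally.
        if c.author == c.committer:
            by_author[c.author].append(c.commit_id)
    return dict(by_author)


history = [
    Commit("c1", "alice", "alice"),
    Commit("c2", "bob", "alice"),  # bob's patch, committed by alice
    Commit("c3", "alice", "alice"),
]
usable = self_committed_by_author(history)
```

This only narrows the candidates. It says nothing about how much of the
remaining code merely reflects the project's coding standard.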
Regards,
Klaus
---
Klaus Guenther, M.A.
University of Bamberg, Germany
Alexandre Rafalovitch wrote:
> Wouldn't any source code repository with version control system give
> you that automatically? They all tell you exactly which code was
> contributed and by whom.
>
> E.g. SourceForge, Apache or Linux Kernel collections.
>
> http://www.koders.com/ might be a good way to search, if you are
> trying to narrow down to a particular area.
>
> Regards,
> Alex.
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
>
>
>
> On Thu, Nov 20, 2008 at 1:28 AM, <sdb at cs.rmit.edu.au> wrote:
>
>> Dear colleagues,
>>
>> My research relates to authorship attribution of source code (that is,
>> determining the owner of anonymous work samples based upon other work
>> samples where authors are known).
>>
>> I'm looking for recommendations for source code corpora for this task
>> for any programming language. For the corpora to be useful, authorship
>> has to be identified.
>>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>