[Corpora-List] Source code corpora

Klaus Guenther klaus.guenther at split.uni-bamberg.de
Thu Nov 20 19:21:10 UTC 2008


The difficulty with large and important projects such as the Linux 
kernel is that only a few people are allowed to commit code. If 
someone has code to submit, they provide it for review, and once it is 
accepted, someone with sufficient karma commits it. So it is not 
possible to simply parse the commit emails by sender to determine the 
author.

In addition, many of the changes are very small, and coding standards 
are rigorously enforced. Therefore, formatting will not differ 
between developers. Instead, it might be helpful to isolate the 
comments and consider their frequency and style, as these generally do 
not follow any standard other than, perhaps, the requirement that they 
be frequent enough to explain to an unfamiliar programmer what each 
piece of code does and why it is coded the way it is.
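
As a rough illustration of isolating comments, here is a minimal Python 
sketch. It assumes C-style comment syntax (/* ... */ and //), and the 
function names and the particular measures (comment counts, comments 
per non-blank line, mean words per comment) are my own illustrative 
choices, not an established feature set for authorship attribution.

```python
import re

# Matches either a string/char literal (group 1) or a comment (group 2).
# Literals are matched first so that comment markers inside strings are skipped.
_TOKEN = re.compile(
    r'("(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\')'  # string or char literal
    r'|(/\*.*?\*/|//[^\n]*)',                     # block or line comment
    re.DOTALL,
)


def extract_comments(source):
    """Return the text of every C-style comment in source, in order."""
    return [m.group(2) for m in _TOKEN.finditer(source) if m.group(2)]


def comment_profile(source):
    """Compute a few simple, illustrative comment measures for one file."""
    comments = extract_comments(source)
    nonblank = [line for line in source.splitlines() if line.strip()]
    words = [w for c in comments for w in re.findall(r"[A-Za-z']+", c)]
    return {
        "comments": len(comments),
        "block_comments": sum(c.startswith("/*") for c in comments),
        "line_comments": sum(c.startswith("//") for c in comments),
        "comments_per_line": len(comments) / len(nonblank) if nonblank else 0.0,
        "mean_words_per_comment": len(words) / len(comments) if comments else 0.0,
    }
```

Profiles like this could then be compared across files with known 
authors, although whether they discriminate well between authors is an 
open question.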

Smaller projects may be more interesting, especially where standards are 
less tightly enforced. Indeed, SourceForge and other open repositories 
provide masses of code that are often written by individuals or small 
teams, where each programmer commits code directly. I have experience 
working with one particular open source endeavor, the PEAR project 
(http://pear.php.net/). The code there is often very diverse, even with 
a coding standard. Older code is not necessarily updated to reflect 
changes to the coding standard, and code reuse is very popular. Each 
individual module (package) is controlled by one or more programmers who 
are fully responsible for its development. Yet many patches are 
submitted by developers who merely use the packages, and these are 
committed by the packages' developers, often after being edited.

So the main issue is finding code that can reliably be attributed to an 
author in an unmodified form and discovering details that are not 
attributable to the project's coding standard. I know of no such corpus.

Regards,
Klaus


---
Klaus Guenther, M.A.
University of Bamberg, Germany

Alexandre Rafalovitch schrieb:
> Wouldn't any source code repository with version control system give
> you that automatically? They all tell you exactly which code was
> contributed and by whom.
>
> E.g. SourceForge, Apache or Linux Kernel collections.
>
> http://www.koders.com/ might be a good way to search, if you are
> trying to narrow down to a particular area.
>
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
>
>
>
> On Thu, Nov 20, 2008 at 1:28 AM,  <sdb at cs.rmit.edu.au> wrote:
>   
>> Dear colleagues,
>>
>> My research relates to authorship attribution of source code (that is,
>> determining the owner of anonymous work samples based upon other work
>> samples where authors are known).
>>
>> I'm looking for recommendations for source code corpora for this task
>> for any programming language. For the corpora to be useful, authorship
>> has to be identified.
>>     
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   




