Rishab Aiyer Ghosh on Mon, 8 May 2000 16:52:53 +0200 (CEST)



<nettime> OFSS01: First Orbiten Free Software Survey


OFSS01: The Orbiten Free Software Survey, 1st edition, May 2000

Copyright (C)2000 Orbiten Research - http://orbiten.org

May be distributed freely without modification.
Press contact, or for more information: ofss@orbiten.org

HEADLINE: Over 12,000 authors, 25 million lines of code analysed

Inside: FINDINGS
        DATA
        SCOPE AND METHOD
        CONTEXT: ORBITEN
        REFERENCES

The Free Software (or Open Source) "Community" is much talked about,
though little hard data on this community and its activities is available.
Here, for the first time, Orbiten Research (see CONTEXT) provides a body
of empirical data and analysis to explain what this community actually is. 

Simple facts, such as the number of developers contributing to free
software projects, the number of such projects and their size, have until
now been unknown. The Orbiten Free Software Survey discovers these facts,
and aims with them to provide a foundation for empirical research on the
free software community. 

Building on the release of CODD[1] over a year ago, the Survey will
measure and track over time several aspects of the free software economy
including: the concentration (or diversity) of contributions and
contributors; the degree of intersection between projects and sharing of
code; the participation of developers in different projects; volatility of
changes to the code base and the developer base. 

The survey process will also yield some basic statistics and data, such as
the total size of free software available, the amount of free software
being released and/or modified each month, and a compendium of
developers. 

Hopefully the survey will be regular, prompt and gradually more
comprehensive, providing an important source of information for academic
researchers, free software users and developers alike. 

Rishab Aiyer Ghosh & Vipul Ved Prakash:  May 7, 2000


FINDINGS


The primary findings of OFSS01 were basic: the number of developers
authoring projects included in the survey (12706), the size of the free
software code base (1.04 Gigabytes, or roughly 25 million lines), and the
number of identifiable free software projects (3149). Given the total lack of
data on the free software economy, rough indicators as to its size
(limited by the initial scope of the survey) are, we believe, a good
start. 

Secondary findings relate to the degree of contribution to the code base
by individual authors, defined for the purposes of this survey as the
smallest identifiable grouping claiming credit for development of a
software project. Unsurprisingly, the Free Software Foundation came out
far ahead of anyone else, credited with 11% (124 Mb) of the entire
surveyed code base and involved in 17% (546) of all identifiable projects. 
However, as with some other well-known (and highly ranked in the survey)
Unix authors, such as Sun Microsystems and the Regents of the University
of California, the FSF's position in our charts stems largely from the
lack of credit given to individual programmers. A list of the top few
contributors sorted by code and involvement in projects is given below
(see DATA). 

Further findings relate to the distribution of authors among projects, and
code base contribution. The top 1271 authors, 10% of the total, accounted
for 72.3% of the total code base. The top 10 authors alone (0.08% of the
total) are credited for 19.8% of the code base. Free software development
may be distributed, but it is most certainly very top heavy. 
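
Concentration figures of this kind can be derived from per-author byte
totals. The following is a minimal Python sketch, not the survey's own
code: the function name top_share, its parameters and the sample numbers
are all hypothetical and chosen only for illustration.

```python
def top_share(bytes_by_author, top_n, total_bytes=None):
    """Percent of the code base credited to the top_n largest contributors.

    total_bytes defaults to the sum of credited bytes; pass the size of the
    whole code base to include uncredited code in the denominator.
    """
    ranked = sorted(bytes_by_author.values(), reverse=True)
    if total_bytes is None:
        total_bytes = sum(ranked)
    if total_bytes == 0:
        return 0.0
    return 100.0 * sum(ranked[:top_n]) / total_bytes


# Hypothetical per-author byte counts (NOT survey data).
sample = {
    "author_01": 5000, "author_02": 2000, "author_03": 1000,
    "author_04": 700, "author_05": 500, "author_06": 300,
    "author_07": 200, "author_08": 150, "author_09": 100,
    "author_10": 50,
}

decile = max(1, len(sample) // 10)           # size of the top 10% of authors
top_decile_share = top_share(sample, decile)
top_three_share = top_share(sample, 3)
# Same top decile, measured against a larger (assumed) total that also
# counts uncredited bytes.
with_uncredited = top_share(sample, decile, total_bytes=12500)

print(top_decile_share)   # 50.0
print(top_three_share)    # 80.0
print(with_uncredited)    # 40.0
```

Whether uncredited code belongs in the denominator is a policy choice;
the optional total_bytes argument leaves it to the caller.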

What goes for lines of code written goes for involvement in projects too.
Only the top 25 authors (0.19% of the total) were credited with
participation in more than 25 projects. The top 250 authors were credited
with participation in over 5 projects, and the vast majority (over 77%) of
authors were only involved in a single project. Our conclusion: Free
software development is less a bazaar of several developers involved in
several projects, more a collation of projects developed single-mindedly
by a large number of authors.


DATA

Number of identifiable authors: 12706
Uncredited/unidentifiable authors: 790
% of code base uncredited: 8.37%
Size of code base: 1116500467 Bytes or 1067 Mb.
Number of identifiable projects: 3149


Table 1: Top 10 authors ranked by contribution of code


Author                                      % of total
free software foundation, inc                  11.231
sun microsystems, inc                           1.848
the regents of the university of california     1.359
gordon matzigkeit                               1.216
paul houle                                      1.042
thomas g. lane                                  0.782
the massachusetts institute of technology       0.762
ulrich drepper                                  0.559
lyle johnson                                    0.528
peter miller                                    0.525



Table 2: Author contribution by decile

Authors           % of total
top 10 authors       19.854
top decile (1271)    72.320
2nd decile            8.928
3rd decile            4.062
4th decile            2.384
5th decile            1.515
6th decile            1.008
7th decile            0.672
8th decile            0.440
9th decile            0.239
10th decile           0.060
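
The decile figures in Table 2 can be cross-checked against the uncredited
share listed at the top of this section. A short Python sketch using only
the published numbers (the variable names are illustrative): the ten
deciles sum to about 91.6%, and adding the 8.37% of uncredited code brings
the total to roughly 100%, which suggests that the percentages are
expressed against the whole code base, uncredited code included.

```python
# Percentages copied from Table 2 (top decile first).
decile_shares = [72.320, 8.928, 4.062, 2.384, 1.515,
                 1.008, 0.672, 0.440, 0.239, 0.060]
# "% of code base uncredited" from the summary figures.
uncredited_share = 8.37

credited_total = sum(decile_shares)
grand_total = credited_total + uncredited_share

print(f"credited to identifiable authors: {credited_total:.3f}%")  # 91.628%
print(f"credited plus uncredited:         {grand_total:.3f}%")     # 99.998%
```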


Table 3: Top 10 authors ranked by participation in projects

Author                                      Projects
free software foundation, inc                   546
gordon matzigkeit                               267
the regents of the university of california     156
ulrich drepper                                  142
roland mcgrath                                   99
sun microsystems, inc                            66
rsa data security, inc                           59
martijn pieterse                                 50
eric young                                       48
login-vern                                       47


Table 4: Author participation in projects

Projects    Authors
> 25             25
6 - 24          211
3 - 5           928
Only 2         1924
Only 1         9617

Note: 211 authors participated in 6 to 24 projects, etc.


Further data, graphics and complete tables available at
orbiten.org


SCOPE AND METHOD

The first Orbiten Free Software Survey has been prepared based on over 18
months of work in identifying, tracking and modeling interaction in the
free software economy. Clearly this was not enough time, and the scope and
methodology of the first survey are far from ideal. 

The technical task of identifying credits in poorly documented source code
was complex, especially given the vast and changing nature of the code
base. Credits are often unavailable and rarely follow a set format, so
various heuristics have been applied and "policy" decisions made on,
for example, how to divide credit among multiple listed authors. Details
can be found in the documentation for CODD[1]. 

The code base itself was limited. It is far from a complete set of all
code ever released without payment on the Internet, which is our ideal,
eventual goal, but we believe we have used a fairly representative sample
of software projects (released under the GNU General Public License and
its variants) developed in recent years. 

The source code base for OFSS01 is:
 *  Red Hat Linux v6.1 source RPMs, including Linux kernel 2.2.14
 *  Munitions cryptography/security archive as on January 11,
       2000 [http://munitions.vipul.net]
 *  Approximately 50% of source code available through Freshmeat
       as on January 5, 2000. Explanation: source code is not
       easily available for all projects on Freshmeat, at least
       when accessed through an automated script with simple
       intelligence. [http://freshmeat.net]
       
For each module or package analysed, source code is broken into projects
identified according to the package distribution.  Source code and some
documentation files are scanned for authorship, credit or copyright
information, from which author names are identified. The data collected
includes, for each identified author, the number of bytes of code authored
and the number and names of projects authored. From this the degree of
contribution, in terms of bytes of code, can be calculated for any given
project. Project data is collated to form a broader picture of authorship
distribution, which can be examined at several levels. 
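
The scanning step described above can be sketched in a few lines of
Python. This is only an illustration and not CODD's implementation, whose
actual heuristics are described in its documentation[1]. The credit
pattern, the equal split of a file's bytes among its listed authors, the
"(uncredited)" label and all function names are assumptions.

```python
import re
from collections import defaultdict

UNCREDITED = "(uncredited)"  # assumed label for files with no credit line

# Assumed credit formats: "Copyright (C) 1999 Name", "Author: Name" or
# "Written by Name". Real credits rarely follow a set format.
CREDIT_RE = re.compile(
    r"^\W*(?:copyright\b\s*(?:\(c\)\s*)?(?:\d{4}[\s,-]*)*"
    r"|(?:written\s+by|authors?)\b\s*:?\s*)"
    r"(?P<name>[^\d<(\n]+)",
    re.IGNORECASE,
)


def credited_authors(text):
    """Return the distinct, lower-cased author names credited in one file."""
    names = []
    for line in text.splitlines():
        match = CREDIT_RE.match(line)
        if match:
            # Trim whitespace, punctuation and comment markers.
            name = match.group("name").strip(" .,\t*/#").lower()
            if name and name not in names:
                names.append(name)
    return names


def tally(files_by_project):
    """files_by_project maps project name -> {file path: file contents}."""
    bytes_by_author = defaultdict(float)
    projects_by_author = defaultdict(set)
    for project, files in files_by_project.items():
        for text in files.values():
            size = len(text.encode("utf-8"))
            authors = credited_authors(text) or [UNCREDITED]
            # Assumed policy: split the file's bytes equally among authors.
            for author in authors:
                bytes_by_author[author] += size / len(authors)
                if author != UNCREDITED:
                    projects_by_author[author].add(project)
    return bytes_by_author, projects_by_author


# Hypothetical input, not taken from the surveyed code base.
demo = {
    "libfoo": {
        "foo.c": "/* Copyright (C) 1999 Free Software Foundation, Inc. */\nint x;\n",
        "bar.c": "/* Author: Jane Example <jane@example.org> */\nint y;\n",
    },
    "tool": {"main.c": "int main(void) { return 0; }\n"},
}

bytes_by_author, projects_by_author = tally(demo)
for author, size in sorted(bytes_by_author.items()):
    print(author, int(size), sorted(projects_by_author.get(author, ())))
```

A pattern this simple will both miss credits and produce false matches,
which is why the survey relies on a larger set of heuristics.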

In this survey, very basic analysis has been performed. The next survey
will broaden the scope of analysis to include features such as the degree
of cross-participation between projects and groups of authors. 

The next survey - planned for June - will also use a bigger code base. At
the very least the code base will expand to include Sourceforge
[http://sourceforge.net], OpenBSD [http://openbsd.org] and Perl CPAN
libraries [http://cpan.org]. 

As the survey continues and becomes more frequent, we plan to track
changes in the code base over time (including historical perspectives
using older versions of, say, the Linux kernel) and monitor movement
between projects and groups. 


CONTEXT: ORBITEN

Orbiten Research is devoted to the practical understanding of Cooking-pot
networks[2], the economic model for trans-monetary phenomena on the
Internet. A special focus is on developing tools of measurement and
generating data on the production, use and trade in free ("open source")
software. 

Modelling communities and economic activity usually depends on
measurement, which is why it seems very hard to model cooking-pot
networks - such as the community of free software developers. Orbiten
plans to develop and use various methods of getting around the "problems"
of cooking-pot networks, of modelling and understanding them so that their
benefits can be truly appreciated and worked with. A summary of these
methods can be found[3] on the Orbiten web site. 

REFERENCES
[1] CODD documentation, Orbiten. 
    http://orbiten.org/codd/
[2] "Cooking-pot markets" by Rishab Aiyer Ghosh, First Monday,
    Volume 3, Issue 3, March 1998.
    http://www.firstmonday.org/issues/issue3_3/ghosh/
[3] "Identifying, tracking and measuring activity in cooking-pot
    networks" by Rishab Aiyer Ghosh, Orbiten.
    http://orbiten.org/summary.html
    
    

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: majordomo@bbs.thing.net and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: nettime@bbs.thing.net