Some comments and comparisons

The numbers offered in the previous section are no more than estimates. They give us at least orders of magnitude and allow for comparisons, but they should not be taken as exact data: there are too many sources of error and too much room for interpretation. In this section, we discuss some of the more important assumptions made and the possible sources of error. We also compare our SLOC counts with those for other systems, with the aim of giving the reader some context to interpret the numbers.

What is a source line of code

Since we rely on David Wheeler's sloccount tool for counting SLOC, we also rely on his definition of "physical source lines of code". In practice, therefore, we identify a SLOC when sloccount identifies a SLOC. However, sloccount has been carefully programmed to honor the usual definition of physical SLOC: "A physical source line of code is a line ending in a newline or end-of-file marker, and which contains at least one non-whitespace non-comment character."

There is another similar measure, the "logical" SLOC, which is sometimes preferred. For instance, a line written in ANSI C with two semicolons would be counted as two logical SLOC, while it would be counted as one physical SLOC. However, for the purposes of this paper (and for almost any purpose), the differences between the two definitions of SLOC are negligible, especially when compared to other sources of error and interpretation.
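The difference between the two measures can be illustrated with a minimal sketch. It is not sloccount's algorithm: it assumes C-like comment syntax, ignores comment markers inside string literals, and approximates logical SLOC simply by counting semicolons (which would miscount constructs such as `for (;;)`). The function names are hypothetical.

```python
import re


def strip_c_comments(source: str) -> str:
    """Remove /* ... */ and // comments, keeping newlines so lines still align.

    Simplification: comment markers inside string literals are not handled.
    """
    def keep_newlines(match: re.Match) -> str:
        return "\n" * match.group(0).count("\n")

    source = re.sub(r"/\*.*?\*/", keep_newlines, source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def physical_sloc(source: str) -> int:
    """Count lines with at least one non-whitespace, non-comment character."""
    return sum(1 for line in strip_c_comments(source).splitlines() if line.strip())


def logical_sloc_approx(source: str) -> int:
    """Crude approximation of logical SLOC: one per semicolon outside comments."""
    return strip_c_comments(source).count(";")


sample = """/* swap two ints */
int t = a; a = b;   // two statements on one line
b = t;

"""

print(physical_sloc(sample), logical_sloc_approx(sample))  # prints "2 3"
```

In the sample, the comment-only line and the blank line are not counted; the line with two statements counts once as physical SLOC and twice as logical SLOC.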

Sources of inaccuracy in the SLOC counts

The counts of lines of code presented in this paper are no more than estimates. By no means do we imply that they are exact, especially when they refer to aggregates of packages. Several factors cause this inaccuracy, some due to the tools used for counting, others due to the selection of packages:

Estimation of effort and cost

Current estimation models, and specifically COCOMO, only consider classical, proprietary development models. Libre software development models are rather different, and therefore those estimation models are not directly applicable. Thus, we can only estimate the cost the system would have had if it had been developed using classical development models, not the actual cost (in effort or in money) of developing the software included in Debian 2.2.

Some of the differences that make it impossible to use those estimation models are:

Some of these factors increase the effort needed to build the software, while some others decrease it. Without analyzing in detail the impact of these (and other) factors, the estimation models in general, and COCOMO in particular, are not directly applicable to libre software development.
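To make concrete what an estimate under a classical development model looks like, the following sketch applies the Basic COCOMO equations in organic mode, using Boehm's published coefficients (effort = 2.4 × KSLOC^1.05 person-months, schedule = 2.5 × effort^0.38 months). Whether these are exactly the parameters behind the figures in the previous section is an assumption, and the function names are hypothetical.

```python
def basic_cocomo_effort(ksloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Estimated effort in person-months (Basic COCOMO; organic-mode defaults)."""
    return a * ksloc ** b


def basic_cocomo_schedule(effort_pm: float, c: float = 2.5, d: float = 0.38) -> float:
    """Estimated development time in months (Basic COCOMO; organic-mode defaults)."""
    return c * effort_pm ** d


# Illustrative input: a single package of 21,300 SLOC, i.e. 21.3 KSLOC.
effort = basic_cocomo_effort(21.3)
print(f"effort:   {effort:.1f} person-months ({effort / 12:.1f} person-years)")
print(f"schedule: {basic_cocomo_schedule(effort):.1f} months")
```

Because the exponent on size is greater than one, the model is superlinear: estimating one large aggregate gives a higher effort than summing the estimates of its packages taken separately, so the way packages are grouped affects the result.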

Comparison with size estimations for other systems

To put the numbers shown above into context, we offer here estimates of the size of some operating systems, and a more detailed comparison with the estimates for the Red Hat Linux distribution.

As reported in "From NT OS/2 to Windows 2000 and Beyond. A Software-Engineering Odyssey" (for Windows 2000), "More Than a Gigabuck: Estimating GNU/Linux's Size" (for Red Hat Linux), and "Software Complexity and Security" (for the rest of the systems), this is the estimated size for several operating systems, in lines of code (all numbers are just approximations):

Most of these estimates (in fact, all of them except the one for Red Hat Linux) are not detailed, and it is difficult to know what they consider a line of code. However, the estimates should be close enough to SLOC counts to be suitable for comparison.

Note also that, while both Red Hat and Debian include many applications, in many cases even several applications in the same category, the Microsoft and Sun operating systems are much more limited in this respect. If the applications most commonly used in those environments were counted together with them, their size would be much larger. However, it is also true that those applications are neither developed nor put together by the same team of developers, as is the case in Linux-based distributions.

From these numbers, it can be seen that Linux-based distributions in general, and Debian 2.2 in particular, are some of the largest pieces of software ever put together by a group of developers.

Comparing with Red Hat Linux

The only operating system for which we have found detailed counts of source lines is Red Hat Linux (see "Estimating Linux's Size" and "More Than a Gigabuck: Estimating GNU/Linux's Size"). Since it is also a Linux-based distribution, and the software packages included in the Debian and Red Hat distributions are rather similar, the comparison can be illustrative. In addition, since Red Hat Linux is very common, and probably the best-known Linux-based distribution, comparing with it can provide good context for readers already familiar with it.

The first thing that surprised us when we counted Debian 2.2 was its size compared to Red Hat 6.2 (released in March 2000) and Red Hat 7.1 (released in April 2001). Debian 2.2 was released in August 2000, and is roughly twice the size of Red Hat 7.1 (released about eight months later) and more than three times the size of Red Hat 6.2 (released five months earlier). Some of these differences could be due to different decisions about which packages to include when counting, but the figures give a good idea of the relative sizes even taking this into account.

The main factor causing these differences is the number of packages included in each distribution: in the case of Debian we have considered 2630 source packages (with a mean of about 21,300 SLOC per package), while Red Hat 7.1 includes only 612 packages (about 49,000 SLOC per package).
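As a quick consistency check, the package counts and mean sizes just quoted can be multiplied back into approximate totals. The sketch below uses only the figures in the preceding paragraph; since the means are rounded, the totals are approximations as well.

```python
# Figures quoted in the text (mean SLOC per package is rounded).
debian_packages, debian_mean_sloc = 2630, 21_300
redhat_packages, redhat_mean_sloc = 612, 49_000

debian_total = debian_packages * debian_mean_sloc   # 56,019,000
redhat_total = redhat_packages * redhat_mean_sloc   # 29,988,000

print(f"Debian 2.2:  about {debian_total:,} SLOC")
print(f"Red Hat 7.1: about {redhat_total:,} SLOC")
print(f"size ratio:    {debian_total / redhat_total:.2f}")        # 1.87
print(f"package ratio: {debian_packages / redhat_packages:.2f}")  # 4.30
```

Debian 2.2 has more than four times as many packages, but its packages are on average less than half the size, which yields the overall factor of roughly two.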

When comparing the largest packages in both distributions, we find in Debian all those included in Red Hat. The same is not true the other way around: several packages that contribute a good quantity of SLOC to Debian are not present in Red Hat. For instance, among the 12 largest packages in Debian 2.2, the following are missing from Red Hat 7.1: PM3 (about 1,115,000 SLOC), OSKit (about 859,000 SLOC), Stalin (805,000), GNAT (688,000), and NCBI (591,000). Conversely, among the 12 largest packages in Red Hat 7.1, none is missing from Debian 2.2.

However, there is a large collection of software packages which is missing from Debian 2.2 but present in Red Hat 7.1: the KDE desktop environment and related utilities. Due to license problems, Debian decided not to include KDE software until after Debian 2.2, when the license for Qt changed to GPL. Therefore, we can say that Debian 2.2 is larger even though it lacks such a large piece of code as KDE. Just to give an idea, the largest KDE packages in Red Hat 7.1 are kdebase, kdelibs, koffice, and kdemultimedia, which together amount to about 1,000,000 SLOC. All of them are missing from Debian 2.2. This suggests that, had the measurements been made on the current Debian archive (still not officially released), the differences would have been greater.

The differences in size for the same package in each distribution are attributable to the different releases included. For instance, the Linux kernel amounts to 1,780,000 SLOC (release 2.2.19) in Debian 2.2, while it amounts to 2,437,000 SLOC (release 2.4.2) in Red Hat 7.1. Similarly, XFree amounts to 1,270,000 SLOC (release 3.3.6) in Debian 2.2, while the release included in Red Hat 7.1 (XFree 4.0.3) amounts to 1,838,000 SLOC. These differences in releases make it difficult to compare the figures for Red Hat and Debian directly.

The reader should also note that there is a methodological difference between the study on Red Hat and ours on Debian. The former extracts all the source code and uses MD5 checksums to avoid counting duplicates across the source code of the whole distribution. In the case of Debian, we have extracted the packages one by one, checking for duplicates only within each package. However, the total count should not be much affected by this difference.
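The effect of the two duplicate-detection scopes can be sketched as follows. This is a minimal illustration, not the code used in either study: packages are modelled as in-memory dictionaries mapping file names to contents, and the package names, file names, and function names are all hypothetical.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Hex MD5 digest of a file's content, used as its identity."""
    return hashlib.md5(content).hexdigest()


def files_counted_global(packages: dict[str, dict[str, bytes]]) -> int:
    """Duplicates removed across the whole distribution."""
    digests = {
        md5_of(content)
        for files in packages.values()
        for content in files.values()
    }
    return len(digests)


def files_counted_per_package(packages: dict[str, dict[str, bytes]]) -> int:
    """Duplicates removed only within each package."""
    return sum(
        len({md5_of(content) for content in files.values()})
        for files in packages.values()
    )


# Hypothetical example: the same getopt.c is bundled in two packages,
# and twice inside the first one.
packages = {
    "pkg-a": {"getopt.c": b"shared", "lib/getopt.c": b"shared", "main.c": b"a"},
    "pkg-b": {"getopt.c": b"shared", "main.c": b"b"},
}

print(files_counted_global(packages))       # prints 3
print(files_counted_per_package(packages))  # prints 4
```

With per-package checking, a file shared by several packages is counted once per package, so the per-package total can only be equal to or larger than the distribution-wide total.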