6 Analyzing software-measurement data


A. Törn

6.3 Examples of simple analysis techniques, box plots

Since the measurements may not be on a ratio scale, it is more appropriate to use the median and quartiles to define central location and spread. These robust statistics can be presented in a visual form called a box plot.

In a box plot the data set is split into four parts using the median m, the upper quartile u, and the lower quartile l (see Figure 6.7). The median and the quartiles can be computed as follows. If the number of items in the data set is odd, the median is the middle value of the sorted data. If the number of items is even, the median is the mean of the two middle values. In both cases the lower and upper quartiles are computed as the medians of the data below and above the median, respectively.
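The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names `median` and `quartiles` are chosen here, and it assumes that for an odd number of items the middle value itself is excluded from both halves, which is one reading of "the data below and above the median".

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middle value.
        return sorted_values[mid]
    # Even count: the mean of the two middle values.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (l, m, u): lower quartile, median, upper quartile."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    m = median(data)
    # Assumption: with an odd count the median itself belongs to neither half.
    lower_half = data[:half]
    upper_half = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower_half), m, median(upper_half)
```

For the seven values 1, 2, ..., 7 this gives l = 2, m = 4, u = 6. Other quartile conventions exist and can give slightly different results on small data sets.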

The length of the box is defined as d = u - l. Further, the upper tail is defined as u + 1.5d and the lower tail as l - 1.5d. If the lower tail computed in this way is less than the lowest possible value, it must be truncated to this lowest value. Attribute values outside the interval from the lower tail to the upper tail are called outliers; they point to entities needing special attention.
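The tail and outlier rules can be written down directly. The sketch below takes the quartiles l and u as given; the function names and the `lowest_possible` parameter are illustrative choices made here, and the sample numbers in the usage note are made up.

```python
def box_plot_tails(l, u, lowest_possible=None):
    """Return (lower_tail, upper_tail) for quartiles l and u."""
    d = u - l                      # length of the box
    lower_tail = l - 1.5 * d
    upper_tail = u + 1.5 * d
    # Truncate the lower tail if it falls below the lowest possible value,
    # e.g. 0 for a count such as number of faults.
    if lowest_possible is not None and lower_tail < lowest_possible:
        lower_tail = lowest_possible
    return lower_tail, upper_tail


def outliers(values, lower_tail, upper_tail):
    """Values lying outside the interval [lower_tail, upper_tail]."""
    return [v for v in values if v < lower_tail or v > upper_tail]
```

With l = 2 and u = 6 the box length is d = 4, so the tails are -4 and 12; if the attribute cannot be negative, the lower tail is truncated to 0, and a value such as 40 is reported as an outlier.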

Three box plots for different attributes of one data set are shown in Figure 6.8. MOD is the average module size in LOC, and FD is the number of faults found per KLOC. The outlier data confirm the widely held belief that a system should be composed of modules that are neither too small nor too large.

6.3 Examples of simple analysis techniques, scatter plots

When items are characterized by two or more attributes, a scatter plot may reveal associations between the attributes (see Figure 6.11).

We can see that the number of faults seems to increase with module size. We also see that module 17 (size=549, faults=16) has fewer faults than would be expected by extrapolating from the trend.

To quantify the association, regression techniques can be used.
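The text does not say which regression technique is meant. As one possible illustration, the sketch below fits a straight line by ordinary least squares; the function name `fit_line` is an assumption made here, and any data passed to it in the usage note is hypothetical, not taken from Figure 6.11.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    xs and ys are equally long sequences, e.g. module sizes and fault counts.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Given a fitted line, the expected number of faults for a module of a given size is slope * size + intercept; a module whose observed fault count lies well below that value falls under the trend in the same way as module 17 does in the scatter plot.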