Lab 2: Describing and Comparing Two or More Data Sets

Often an experiment or observation is important because of its relationship to other measurements. This lab will present some statistical and graphical tools for comparing two or more data sets. In learning about about these techniques, several different types of data will be used as examples.

The most important concept is the distinction between experimental data and observational data. Experimental data are usually based on a careful design where some variables (factors) of interest are controlled or preset by the experimenter. Observational data are collected without the benefit of a controlled experimental environment. Most meteorological data are observational because there is no way to control weather variables. In this lab you will learn to recognize both types of data and to summarize their important features.

Experimental Data

The data set drill.bit contains the results of testing two types of drill bits in the manufacture of compressors. There were two brands considered (made by Besley and Cleveland), and the measurements are the number of holes drilled until the bit breaks. The tests were done under the same manufacturing conditions, and the influence on performance due to factors other than the brand were minimized. Perhaps the simplest summary for a small data set is just plotting the values versus brand. Since the plot function cannot handle character values, we use the function lplot which is set up to plot numbers versus a character vector of categories:

lplot(drill.bit$brand,drill.bit$holes)

A better shaped plot often results from first using set.panel(2,1). Also in order to save typing we may first copy drill.bit to z

drill.bit -> z

and then just type

lplot(z$brand,z$holes)

From this plot it is clear that the Cleveland bits, on the average, tend to drill about half as many holes before breaking. But this relationship is not perfect: the best Cleveland bit actually drilled more holes than the poorest Besly bit.

One reason these data were collected was to evaluate the cost tradeoffs for these two brands. The prices for the bits are different and are reported as the third column of drill.bit (Besly $9.64 and Cleveland $4.48). In order to make a cost comparison it is also useful to consider the price per hole achieved by each drill bit. The dot plot above of the cost per hole was made by

lplot(z$brand,z$price/z$holes, ylab='Price per Hole')
title('Price per Hole for Drill Bits')

By taking cost into account, it is not as obvious which brand is better. The Cleveland brand appears to be about one cent higher, but there is also a substantial amount of variability. For these plots it appears that the Cleveland brand had one or two bits with unusually large costs. To get a more precise description of the distributions, one can calculate some summary statistics of the costs for each brand.

stats(z$price/z$holes,by=z$brand)

                    besly   cleveland
             N 6.000000000 8.00000000
          mean 0.028507448 0.03769031
      Std.Dev. 0.006755421 0.01646174
           min 0.021809955 0.01709924
            Q1 0.023319190 0.03090796
        median 0.026783969 0.03556452
            Q3 0.032786746 0.03971490
           max 0.038714859 0.07111111
missing values 0.000000000 0.00000000

The mean and medians measure the center of the distribution and thus correspond to locations of the middle of points in the dot diagrams. These estimates are reasonable judging from where points tend to cluster on the dot diagrams. The larger scatter for the Cleveland bits seen on the dot diagrams is confirmed by the larger standard deviation from stats.

Which is the better method for summarizing these data? A graphical presentation with dot diagrams or the table of statistics from stats ? The answer is that both are important and when used together give much more information than either one alone. The difference in means (.0377-.0285 = .009 dollars or .9 cents) provides an estimate of the cost advantage. However the dot diagrams indicate that the mean of the Cleveland bits may not be the right measure since its estimate can be influenced by a few large values. In fact in both cases the median costs are smaller than the means, but the difference in median costs is still about .9 cents.

Boxplots

For large data sets dot diagrams become difficult to interpret. Except for the extreme values, the plotted dots tend to pile up and obscure features of the data's distribution. Since summary statistics such as the mean or the percentiles have the same interpretation for a large data set as well as for a small one, we use boxplots to display some of the most important statistics. Boxplots show basic features of a distribution without plotting every point. The function bplot is designed to produce boxplots for several data sets lined up on the same scale. For the drill bit data one can use the by parameter to divide the costs into the two brands (note similarity to the by argument of stats):

bplot(z$price/z$holes,by=z$brand, ylab='Price per Hole')
title('Price per Hole for Drill Bits')

Each box (see top of next page) is drawn between the 25th and 75th percentiles of the data, and a line is drawn at the median. Thus the box depicts the middle part of the distribution. In fact we know that only 25% of the values will lie above the box and 25% will lie below. To give a sense of the range of the data, lines called whiskers are drawn from the ends of the boxes to the 5th and 95th percentiles. The last step is to plot the values in the data set that lie beyond the range of the whiskers since the extreme values in a data set are often important. The drill bit data have been used as an example because we have already examined its distribution. The boxplots reflect the same information that we obtained from the dot diagram. The tight clustering of the middle four values for the Cleveland brand is reflected in the boxplot by the small box and large whiskers.

The bplot function can take several different types of data sets and produce sensible results. The example above shows how to describe the different groups using a by option. If zork is the name of a data frame or list, then bplot(zork) will line up boxplots for the different columns or components in zork using a common scale. The data sets can also be fed to bplot separately. If z1, z2, and z3 are individual data sets, then

bplot(z1,z2,z3)

will line up three boxplots on a common scale.


A Factorial Design

The next example is an experiment that has more complicated structure. Small propulsion units, called actuators, are used to maneuver space craft once they are in space. In order to control these motions accurately, the actuator needs to produce a precise amount of force. The data set actuator is an experiment to understand what factors effect the variability of the force produced by an actuator. The actuator is fired using compressed air, and the factors studied are the actuator used (act), the amount of pressure used (press ), the length of the air supply line ( line) and the nozzle type ( nozzle). The experimental results are formatted as a data frame.

actuator

   act  press line    nozzle    force order
 1  A2  30psi 20ft  straight 0.099202    11
 2  A1  30psi 20ft  straight 0.070762     5
 3  A2  30psi 40ft  straight 0.075433    12
 4  A1  30psi 40ft  straight 0.068982     8
 5  A2 100psi 20ft  straight 0.487464     3
 6  A1 100psi 20ft  straight 0.442802     2
 7  A2 100psi 40ft  straight 0.503937     9
 8  A1 100psi 40ft  straight 0.370022     6
 9  A2  30psi 20ft right.ang 0.083024     7
10  A1  30psi 20ft right.ang 0.058082    15
11  A2  30psi 40ft right.ang 0.067684     1
12  A1  30psi 40ft right.ang 0.049787     4
13  A2 100psi 20ft right.ang 0.486403    10
14  A1 100psi 20ft right.ang 0.390816    16
15  A2 100psi 40ft right.ang 0.486725    14
16  A1 100psi 40ft right.ang 0.372554    13

Each of these factors has two different levels, and all 16 possible combinations were studied. Good experimental procedure is to make these runs in a random order to reduce any bias due to a possible change in lab conditions over time. The actual order that the data was generated is reported in the last column of the data frame. For example, the first run in this experiment was made on the second actuator, at a pressure of 30 psi, using a 40 foot air supply line and a right angle nozzle (see the row). The resulting force was .067684 pounds.

A simple summary of these data is to make a series of boxplots by dividing the data based on each factor. For example, to look at the actuator factor:

bplot(actuator$force,by=actuator$act)

Now repeat this for each of the factors press, line, nozzle in order to see which ones make a difference on the response variable force.

Since we need to examine 4 factors, one could put all the plots on a single page using set.panel(2,2). This setting will produce a table of graphs.

This data set may seem confusing to analyze because all of the runs are different. The sequence of boxplots only considers one variable at a time, ignoring the levels of the other factors. Even though runs 11 and 5 use the same supply line pressure they use different actuators. How can one make a comparison if some of the factors are always at different levels? This property is actually a strength of the design and with the correct statistical techniques makes it possible to investigate how these factors interact with one another.

An interaction between two variables occurs when the observed response is not the sum of the effects from the individual factors. Although a thorough statistical analysis of this type of data will be deferred to Lab 14, one can still present the interactions among factors using plots. Based on a more complete statistical analysis, an interaction was identified between the actuator and the pressure. To examine this case we need to plot the force measurements against one of the factors but label the plotted points using the other factor:

lplot(actuator$act,actuator$force, actuator$press)

Here the force is plotted against the two actuator categories but the points are labeled with the pressure categories.

Based on this graph the difference in force between the two actuators depends strongly on what pressure level is being used. At high pressure levels there is a much larger discrepancy between the two actuators. Since consistency among actuators is desirable, this interaction is important but came as surprise to the experimenter. Before this data was collected, it was assumed that the actuators functioned in a very similar manner. A simple conclusion from this experiment might be that more consistent forces can be obtained across different actuators if one sticks to low pressures.


Observational Data

The data set quake contains the strengths of earthquakes measured at the earths continental plates. This data set has a very different character than testing actuators! The experimenter cannot specify where or when earthquakes will happen and cannot control the environment to make the experimental conditions similar. Once an earthquake occurs in a region, the geology of the region has changed slightly and subsequent earthquakes may be effected by this new orientation. Data that are not the result of a designed experiment are termed observational.

To see the structure of the data set quake, type

view(quake)

The function view is used here instead of just typing quake because this data set has 496 rows, and it is painful to watch 496 rows of data printed to the screen. view prints out the first 10 rows of a data set or the first k if you specify view(filename,k).

Much of the earth's seismic activity is due to motion of the large plates that make up the crust of the earth. Earthquakes occur when a buildup in tension between two layers of rock is suddenly released. For this reason many earthquakes occur at plate boundaries. The component strength in this data set gives the strength of a particular earthquake. The components lon and lat give the location of the event, and plate.ID is a numerical code for the plate boundary. To view some of the plate boundaries one can just add the locations of the earthquakes in this data set to a map of the world.

set.panel(1,1)
world()
points(quake$lon,quake$lat)

Note that the plate boundaries often occur in the middle of the

oceans but not surprising one is along the California coast. To put the plate ID on the map instead of just a point use text:

world()
text(quake$lon,quake$lat,quake$plate.ID)

A basic issue with these data is whether the strength of an earthquake depends on the plate boundary. Perhaps some plates are more dynamic than others. Lining up boxplots is an efficient way to summarize a large number of groups.

bplot(quake$strength,by=quake$plate.ID)

From these boxplots we see that the range of the strengths (whiskers) is much more variable that the center (median line) across different plate boundaries. Also the distributions tend to be asymmetric. The part of the box and the whisker above the median is longer than these features above the median. Thus, relatively speaking, smaller earthquakes are closer to the median than the largest ones.

In viewing these boxplots, keep in mind that the sample sizes differ among the plate groups. In fact when groups have less than 5 observations, a boxplot is not drawn and only the individuals points are plotted.

One feature of these data is the relationship between the scatter of the strengths and the mean level. One measure of scatter is the standard deviation, and here are the steps to make a scatterplot of the standard deviation versus the means:

stats(quake$strength,by=quake$plate.ID) -> hold
plot(hold[2,],hold[3,],xlab='Mean Strength',
ylab='Standard Dev. of Strength')

This plot is made using the fact that stats returns a matrix with means and standard deviations in the the second and third rows. From the scatter plot there appears to be an increasing relationship. Higher mean levels are associated with higher amounts of scatter.

Meteorological measurements are a rich source of observational data sets. Here we analyze for illustration the data set raleigh.temp, a time series containing 40 years of average monthly temperatures for Raleigh, North Carolina, from 1949 to 1988. Recall that you can use view(raleigh.temp) to see the first 10 rows of the data set. The first step with any time series data is to plot the data over time.

set.panel(3,1)
plot(raleigh.temp$temp,type='l')

(Note: type='p' for ``points'' is the default, type='l' is for lines. It is easy to confuse the 'l' for a '1' which returns an error message. Type ex(plot) to see examples.) The plot window was divided up in to 3 parts so that the plot would come out long and skinny. This format helps to spot trends over time. It is also useful to join the points with line segments using type='l' in the plot command.

Having a temperate climate, Raleigh has a clear seasonal cycle for its monthly temperatures. The rest of this section will suggest some ways of studying the seasonal cycle more closely. A good start is to use dot diagrams and boxplots to compare the temperatures across different months.

lplot(raleigh.temp$month,raleigh.temp$temp,
ylab='Monthly Average Temperatures')
bplot(raleigh.temp$temp,by=raleigh.temp$month,
ylab='Monthly Average Temperatures')

Each month has 40 values, so that the simple dot diagram ( lplot) has too many points to be clear. The sequence of boxplots, however, helps to track the annual variation in temperature. Note that this cycle not only involves the mean level but also the amount of scatter about the center (e.g., consider the changes in whisker and box lengths). To reduce these distributions to a smaller number of statistics, one can just consider the means and standard deviations by month. The following sequence of plots considers the means and standard deviations separately and then looks for a possible relationship between average temperature for a month and the variability about this average. The plots will be organized on a single page to make comparisons easier.

set.panel(3,1)
stats(raleigh.temp$temp,by=raleigh.temp$month)-> hold
plot(hold[2,],ylab='Monthly Mean')
plot(hold[3,],ylab='Monthly S.D.')
plot(hold[2,],hold[3,],xlab='Mean',ylab='S.D.')

Does the pattern of standard deviations reflect your experience about variability in temperatures throughout the year?

What about trends over time? The large seasonal cycle makes it difficult to see smaller but steady trends over the 40 years. One way to do this is to just compute yearly averages and then plot them.

stats(raleigh.temp$temp,by=raleigh.temp$year) -> hold
plot(hold[2,])

Do you see any trends? Where were the periods of unusually hot weather?


On Your Own

  1. The data in ozone.sw are the monthly mean concentration (in Dobson units) of the ozone layer at Arosa, Switzerland, from 1926 to 1971 (from Andrews and Herzberg, 1985, p. 75-76). These data are in the form of a data frame just like the raleigh.temp data set discussed in the lab. Type view(ozone.sw) to see the first 10 rows. Note that there are some missing measurements denoted by NA in the listing.

    Describe how the ozone layer varies from month to month (e.g., bplot(ozone.sw$ozone,
    by=ozone.sw$month)
    ). Is the variability approximately the same from month to month (e.g., does the standard deviation of mean ozone thickness change from month to month)? (Just copy the code used in the analysis of raleigh.temp.) Is there evidence that the ozone layer is changing over time? (The missing values bias a few of the points in the early years.)

  2. Does an especially hot August follow a hot July in Raleigh? Investigate this question by making a scatterplot of the August and July temperatures. First break the data into months using:

    cat.to.list(raleigh.temp$temp,raleigh.temp$month) -> look

    The list data set look will have components jan, feb, etc., that contain each of the month's temperatures (in the correct time order). Then use plot(look$jul,look$aug). From the scatterplot make a rough estimate of how often Raleigh gets a summer where both July and August are scorchers. Make a similar analysis of the months January and February.

  3. The first three rows of the data set college are as follows:

          pop  inc undgrad  grad  fees loc
    ark  2399 15.4    85.7   7.0 1.540   s
    ala  4136 16.2   200.3  20.9 1.699   s
    geo  6751 18.1   237.3  31.8 1.763   s
    This is a sample of fifteen states with certain statistics taken from the Chronicle of Higher Education. All entries are in thousands so that Arkansas has a population of 2,399,000, a yearly per capita income of $15,400, 85,700 undergraduates students, 7,000 graduate students, and average cost of tuition and fees at public universities of $1,540, and is located in the south (s for south). Make a boxplot to compare the tuition and fees by region ( bplot(college$fees,by=college$loc)). Where is the cheapest region to go to school? Are the tuition and fees related to the per capita income? (Make a plot.) Is the per capita income related to the ratio of undergraduates to population? (Hint: for plots with the points labeled, use lplot with the 3rd argument row.names(college).)

  4. Suppose that a team of engineering students is attempting design a high quality paper airplane. They already have a basic design but are concerned about the weighting of the front nose. They consider planes with no weight (treatment 1), one paper clip (treatment 2), two paper clips (treatment 3), and three paper clips (treatment 4). They make 24 identical airplanes and randomly choose 6 for treatment 1, 6 for treatment 2, etc. They throw the planes and measure the distance flown. The results are in airplane.

    airplane

      treat 1 treat 2 treat 3 treat 4
    1    13.2    14.5    10.2     8.5
    2    14.2    16.3     8.4     6.7
    3    12.3    11.4     7.6     6.5
    4    14.3    17.3     9.1     7.4
    5    11.3    18.4    12.1     9.3
    6    10.3    15.3    13.2    10.5

    This type of designed experiment is called a completely randomized design (CRD) because the 24 experimental units were randomly assigned to the four treatments. Using the methods presented in this lab, comment on the differences among these four treatments.

  5. The data set michelson consists of 100 measurements made by Michelson in 1879 on the velocity of light in air. The data set has been split into five sets of 20 observations each (V1 is the first 20, V2 is the second 20, etc.). Use plots or statistics and comment on the change in mean and standard deviation over time.

    Working backwards from the current ``true value'' of the velocity of light in vacuum (299,792.5 km/sec) and using Michelson's own adjustment for the effect of air, the comparable ``true value'' for these data is 734.5. Comment on the accuracy and precision of Michelson's experiment. (``Accuracy'' relates to how close your measurements and estimate are to the true value you are trying to estimate, and ``precision'' relates to the variablility or repeatability of your measurements around your estimate.)

  6. The data set monarch.dat from Computer-Active Data Analysis by Lunn and McNeil (1991) contains the years lived after inauguration, election, or coronation of U.S. presidents, popes, and British monarchs from 1690 to roughly the present (Nixon, Carter, Reagan, etc., are not included). Do the groups differ? Make side-by-side boxplots and stats summaries. What do you conclude?

  7. Were Etruscans native Italians or immigrants from another land? The data set etruscan.dat contains the maximum skull width ( width, in mm) for 84 skulls of Etruscan males and 70 modern Italian males (from Medical Biology and Etruscan Origins, p. 136). Make boxplots (etruscan.dat -> z and bplot(z$width,by=z$group)) and numerical summaries. What do you conclude?