M-Lab 2: Describing and Comparing Two or More Data Sets

This lab will present some statistical and graphical tools for comparing two or more data sets. In learning about about these techniques, several different types of data will be used as examples.

The most important concept is the distinction between experimental data and observational data. Experimental data are usually based on a careful design where some variables (factors) of interest are controlled or preset by the experimenter. Observational data are collected without the benefit of a controlled experimental environment. Most meteorological data are observational because there is no way to control weather variables. In this lab you will learn to recognize both types of data and to summarize their important features.

Remember that from within MATLAB you can view the most important M-Lab functions by typing

help mlab

Or you might want to have a netscape window open to the appendix listing of the functions. Only a few new functions are introduced in this lab.

Experimental Data

The data set drill contains the results of testing two types of drill bits in the manufacture of compressors. There were two brands considered (made by Besley and Cleveland), and the measurements are the number of holes drilled until the bit breaks. The tests were done under the same manufacturing conditions, and the influence on performance due to factors other than the brand were minimized. Perhaps the simplest summary for a small data set is just plotting the values versus brand. Since the plot command cannot handle character values, we have created a new function lplot that can accept a character variable in its first argument:

lplot(drill.brand,drill.holes)

To add some labels, use the xlabel() and ylabel() functions:

xlabel('Brand')
ylabel('Number of Holes Drilled')


From this plot it is clear that the Cleveland bits, on the average, tend to drill about half as many holes before breaking. But this relationship is not perfect: the best Cleveland bit actually drilled more holes than the poorest Besly bit.

One reason these data were collected was to evaluate the cost tradeoffs for these two brands. The prices for the bits are different and are reported as the third column of drill (viewdata(drill) shows that Besly is $9.64 and Cleveland is $4.48). In order to make a cost comparison it is also useful to consider the price per hole achieved by each drill bit. For ease of typing let's first copy drill to z,

z=drill;

Then plot price per hole versus brand:

lplot(z.brand,z.price ./ z.holes)
xlabel('Brand')
ylabel('Price per Hole')
title('Price per Hole for Drill Bits')


Note how we took advantage of MATLAB's element-wise matrix division operator ./ to produce the cost per hole vector for the plot.

By taking cost into account, it is not as obvious which brand is better. The Cleveland brand appears to be about one cent higher, but there is also a substantial amount of variability. For these plots it appears that the Cleveland brand had one or two bits with unusually large costs. To get a more precise description of the distributions, one can calculate some summary statistics of the costs for each brand. Note that the second argument of the following stats command gives the summary statistics by brand:

>> stats(z.price ./ z.holes,z.brand)

               besly  cleveland
N          6.0000000   8.000000
Mean       0.0285070   0.037690
Std. Dev.  0.0067554   0.016462

Q1         0.0223450   0.025858
Median     0.0267840   0.035565
Q3         0.0355000   0.045702

Min        0.0218100   0.017099
Max        0.0387150   0.071111
Range      0.0169050   0.054012

The mean and medians measure the center of the distribution and thus correspond to locations of the middle of points in the plots. These estimates are reasonable judging from where points tend to cluster on the plots. The larger scatter for the Cleveland bits seen on the plots is confirmed by the larger standard deviation from stats.

Which is the better method for summarizing these data? A graphical presentation or a table of statistics? The answer is that both are important and when used together give much more information than either one alone. The difference in means (.0377-.0285 = .009 dollars or .9 cents) provides an estimate of the cost advantage. However the plot indicates that the mean of the Cleveland bits may not be the right measure since its estimate can be influenced by a few large values. In fact in both cases the median costs are smaller than the means, but the difference in median costs is still about .9 cents.

Boxplots

For large data sets data plots become difficult to interpret. Except for the extreme values, the plotted points tend to pile up and obscure features of the data's distribution. Since summary statistics such as the mean or the percentiles have the same interpretation for a large data set as well as for a small one, we use boxplots to display some of the most important statistics. Boxplots show basic features of a distribution without plotting every point. The function bplot is designed to produce boxplots for several data sets lined up on the same scale.

bplot(z.price ./ z.holes,z.brand)
xlabel('Brand')
ylabel('Price per Hole')
title('Price per Hole for Drill Bits')


Each box is drawn between the 25th and 75th percentiles of the data, and a line is drawn at the median. (The definition of the sample 25th and 75th percentiles are slightly different from those in stats.) Thus the box depicts the middle part of the distribution. The drill bit data have been used as an example because we have already examined their distribution. The boxplots reflect the same information that we obtained from the original plot. The tight clustering of the middle four values for the Cleveland brand is reflected in the boxplot by the small box and large whiskers.


A Factorial Design

The next example is an experiment that has more complicated structure. Small propulsion units, called actuators, are used to maneuver space craft once they are in space. In order to control these motions accurately, the actuator needs to produce a precise amount of force. The data set actuator is an experiment to understand what factors affect the variability of the force produced by an actuator. The actuator is fired using compressed air, and the factors studied are the actuator used (act), the amount of pressure used (press ), the length of the air supply line (line) and the nozzle type (nozzle). The experimental results are formatted as a structure array. So type viewdata(actuator) to see the full data set (you can learn more about the function viewdata by typing help viewdata).

viewdata(actuator)

Obs   act    press   line     nozzle      force   order

  1    A2    30psi   20ft   straight   0.099202      11
  2    A1    30psi   20ft   straight   0.070762       5
  3    A2    30psi   40ft   straight   0.075433      12
  4    A1    30psi   40ft   straight   0.068982       8
  5    A2   100psi   20ft   straight   0.487460       3
  6    A1   100psi   20ft   straight   0.442800       2
  7    A2   100psi   40ft   straight   0.503940       9
  8    A1   100psi   40ft   straight   0.370020       6
  9    A2    30psi   20ft   rightang   0.083024       7
 10    A1    30psi   20ft   rightang   0.058082      15
 11    A2    30psi   40ft   rightang   0.067684       1
 12    A1    30psi   40ft   rightang   0.049787       4
 13    A2   100psi   20ft   rightang   0.486400      10
 14    A1   100psi   20ft   rightang   0.390820      16
 15    A2   100psi   40ft   rightang   0.486730      14
 16    A1   100psi   40ft   rightang   0.372550      13

Each of these factors has two different levels, and all 16 possible combinations were studied. Good experimental procedure is to make these runs in a random order to reduce any bias due to a possible change in lab conditions over time. The actual order in which the data were generated is reported in the last column of the data frame. For example, the first run in this experiment was made on the second actuator, at a pressure of 30 psi, using a 40 foot air supply line and a right angle nozzle (see the row). The resulting force was .067684 pounds.

A simple summary of these data is to make a series of boxplots by dividing the data based on each factor. For example,

subplot(2,2,1)
bplot(actuator.force,actuator.act)
title('Actuator Force by Actuator')


Now repeat this for each of the factors press, line, nozzle in order to see which ones make a difference on the response variable force. For example, the next subplot would be

subplot(2,2,2)
bplot(actuator.force,actuator.press)
title('Actuator Force by Pressure')


This data set may seem confusing to analyze because all of the runs are different. The sequence of boxplots only considers one variable at a time, ignoring the levels of the other factors. Even though runs 11 and 5 use the same supply line pressure they use different actuators. How can one make a comparison if some of the factors are always at different levels? This property is actually a strength of the design and with the correct statistical techniques makes it possible to investigate how these factors interact with one another. We will look at this data again in a later lab.


Observational Data

The data set quake contains the strengths of earthquakes measured at the earths continental plates. This data set has a very different character than testing actuators! The experimenter cannot specify where or when earthquakes will happen and cannot control the environment to make the experimental conditions similar. Once an earthquake occurs in a region, the geology of the region has changed slightly and subsequent earthquakes may be affected by this new orientation. Data that are not the result of a designed experiment are termed observational.

To see the structure of the data set quake, type

viewdata(quake,5)

Much of the earth's seismic activity is due to motion of the large plates that make up the crust of the earth. Earthquakes occur when a buildup in tension between two layers of rock is suddenly released. For this reason many earthquakes occur at plate boundaries. The component strength in this data set gives the strength of a particular earthquake. The components lon and lat give the location of the event, and plateid is a numerical code for the plate boundary. A basic issue with these data is whether the strength of an earthquake depends on the plate boundary. Perhaps some plates are more dynamic than others. One simple plot is a sequence of boxplots of strength for each plate boundary:

bplot(quake.strength,quake.plateid)

Lining up boxplots is an efficient way to summarize a large number of groups. From these boxplots we see that the range of the strengths (height of the boxes) is much more variable that the center (median line) across different plate boundaries.

Meteorological measurements are also a rich source of observational data sets. Here we analyze for illustration the data set raleigh.temp, a time series containing 40 years of average monthly temperatures for Raleigh, North Carolina, from 1949 to 1988. The first step with any time series data is to plot the data over time.

plot(raleigh.temp.temp)

Having a temperate climate, Raleigh has a clear seasonal cycle for its monthly temperatures, but apparently no obvious long term trend. To look at the seasonal trend in a different way, let's make a sequence of boxplots for each month (this may take a few minutes on slower machines, alternatively just click to see the figure figure):

bplot(raleigh.temp.temp,raleigh.temp.month)

Note that this cycle not only involves the mean level but also the amount of scatter about the center (e.g., consider the changes in box and whisker lengths).


On Your Own

  1. The data in ozone are the monthly mean concentration (in Dobson units) of the ozone layer at Arosa, Switzerland, from 1926 to 1971 (from Andrews and Herzberg, 1985, p. 75-76). These data are in the form of a structure array just like the raleigh.temp data set discussed in the lab. Type viewdata(ozone,10) to see the first 10 rows. Describe how the ozone layer varies from month to month (e.g., bplot(ozone.ozone,ozone.month), etc., but be patient, bplot and stats are a bit slow here). Is the variability approximately the same from month to month ( look at the length of the boxes)? Is there evidence that the ozone layer is changing over time?

  2. The first three rows of the data set college are as follows:

    >> viewdata(college,3)
    
    Obs   school    pop    inc   undergrad   graduate    fees   loc
    
      1      ark   2399   15.4        85.7        7.0   1.540     s
      2      ala   4136   16.2       200.3       20.9   1.699     s
      3      geo   6751   18.1       237.3       31.8   1.763     s
    
    This is a sample of fifteen states with certain statistics taken from the Chronicle of Higher Education. All entries are in thousands so that Arkansas has a population of 2,399,000, a yearly per capita income of $15,400, 85,700 undergraduates students, 7,000 graduate students, and average cost of tuition and fees at public universities of $1,540, and is located in the south (s for south). Make a boxplot to compare the tuition and fees by region (e.g., bplot(college.fees,college.loc)). Where is the cheapest region to go to school? Use the stats function get the means of the fees by region.

  3. Suppose that a team of engineering students is attempting design a high quality paper airplane. They already have a basic design but are concerned about the weighting of the front nose. They consider planes with no weight (treatment 1), one paper clip (treatment 2), two paper clips (treatment 3), and three paper clips (treatment 4). They make 24 identical airplanes and randomly choose 6 for treatment 1, 6 for treatment 2, etc. They throw the planes and measure the distance flown. The results are in airplane.

    viewdata(airplane,8)

    >> viewdata(airplane,8)
    
    Obs   distance   treatment
    
      1       13.2      treat1
      2       14.2      treat1
      3       12.3      treat1
      4       14.3      treat1
      5       11.3      treat1
      6       10.3      treat1
      7       14.5      treat2
      8       16.3      treat2
    

    This type of designed experiment is called a completely randomized design (CRD) because the 24 experimental units were randomly assigned to the four treatments. Using the methods presented in this lab, comment on the differences among these four treatments.

  4. The data set michelson consists of 100 measurements made by Michelson in 1879 on the velocity of light in air. The data set has been split into five sets of 20 observations each (V1 is the first 20, V2 is the second 20, etc.). Use plots or statistics and comment on the change in mean and standard deviation over time. Working backwards from the current ``true value'' of the velocity of light in vacuum (299,792.5 km/sec) and using Michelson's own adjustment for the effect of air, the comparable ``true value'' for these data is 734.5. Comment on the accuracy and precision of Michelson's experiment. (``Accuracy'' relates to how close your measurements and estimate are to the true value you are trying to estimate, and ``precision'' relates to the variablility or repeatability of your measurements around your estimate.)

  5. The data set monarch from Computer-Active Data Analysis by Lunn and McNeil (1991) contains the years lived after inauguration, election, or coronation of U.S. presidents, popes, and British monarchs from 1690 to roughly the present (Nixon, Carter, Reagan, etc., are not included). Do the groups differ? Make side-by-side boxplots and stats summaries. What do you conclude?

  6. Were Etruscans native Italians or immigrants from another land? The data set etruscan contains the maximum skull width ( width, in mm) for 84 skulls of Etruscan males and 70 modern Italian males (from Medical Biology and Etruscan Origins, p. 136). Make boxplots and numerical summaries. What do you conclude?

  7. The data set draft contains average lottery numbers by month for the 1970 Draft Lottery (randomly drawn in Dec. 1969). These numbers determined the order by which draft age youth were drafted in 1970. For example, a person whose birthday received number 63 was drafted fairly early in 1970; a person with number 300 was not drafted at all. Recall that you can see the data by typing viewdata(draft). One of the variables, draft.order, breaks the months into two groups: "first" is for January through June and "second" is for July through December. To graphically display the two groups of monthly average lottery numbers type lplot(draft.order,draft.lottnum) and bplot(draft.lottnum,draft.order). If the draft numbers (1-366) were actually random, then monthly averages should be close to 183.5. What do these displays suggest? You could also use numerical summaries via stats(draft.lottnum,draft.order). What can you conclude about the randomness of the lottery numbers?