The most important concept is the distinction between experimental data and observational data. Experimental data are usually based on a careful design where some variables (factors) of interest are controlled or preset by the experimenter. Observational data are collected without the benefit of a controlled experimental environment. Most meteorological data are observational because there is no way to control weather variables. In this lab you will learn to recognize both types of data and to summarize their important features.
Remember that from within MATLAB you can view the most important M-Lab functions by typing
help mlab
Or you might want to have a netscape window open to the appendix listing of the functions. Only a few new functions are introduced in this lab.
The data set
drill
contains the results of testing two types
of drill bits in the manufacture of
compressors. There were two brands considered (made by Besley and Cleveland), and
the measurements are the number of holes drilled until the bit
breaks. The tests were done under the same manufacturing conditions,
and the influence on performance due to factors other than the brand were minimized.
Perhaps the simplest summary for a small data set is just plotting the values
versus brand. Since the
plot
command cannot handle character values, we have created a new function
lplot
that can accept a character variable in its first argument:
lplot(drill.brand,drill.holes)
To add some labels, use the
xlabel() and
ylabel()
functions:
xlabel('Brand')
ylabel('Number of Holes Drilled')
From this plot it is clear that the Cleveland bits, on the average,
tend to drill about half as many holes before breaking. But this
relationship is not perfect: the best Cleveland bit actually drilled
more holes than the poorest Besly bit.
One reason these data were collected was to evaluate the cost
tradeoffs for these two brands. The prices for the bits are different
and are reported as the third column of
drill
(viewdata(drill)
shows that
Besly is $9.64 and
Cleveland is $4.48). In order to make a cost comparison it is also useful
to consider the price per hole achieved by each drill bit. For
ease of typing let's first copy
drill to
z,
z=drill;
Then plot price per hole versus brand:
lplot(z.brand,z.price ./ z.holes)
xlabel('Brand')
ylabel('Price per Hole')
title('Price per Hole for Drill Bits')
Note how we took advantage of MATLAB's element-wise matrix division
operator ./
to produce the cost per hole vector for the plot.
By taking cost into account, it is not as obvious which brand is better. The Cleveland brand appears to be about one cent higher, but there is also a substantial amount of variability. For these plots it appears that the Cleveland brand had one or two bits with unusually large costs. To get a more precise description of the distributions, one can calculate some summary statistics of the costs for each brand. Note that the second argument of the following stats command gives the summary statistics by brand:
>> stats(z.price ./ z.holes,z.brand)
besly cleveland
N 6.0000000 8.000000
Mean 0.0285070 0.037690
Std. Dev. 0.0067554 0.016462
Q1 0.0223450 0.025858
Median 0.0267840 0.035565
Q3 0.0355000 0.045702
Min 0.0218100 0.017099
Max 0.0387150 0.071111
Range 0.0169050 0.054012
The mean and medians measure the center of the distribution and thus correspond to locations of the middle of points in the plots. These estimates are reasonable judging from where points tend to cluster on the plots. The larger scatter for the Cleveland bits seen on the plots is confirmed by the larger standard deviation from stats.
Which is the better method for summarizing these data? A graphical presentation or a table of statistics? The answer is that both are important and when used together give much more information than either one alone. The difference in means (.0377-.0285 = .009 dollars or .9 cents) provides an estimate of the cost advantage. However the plot indicates that the mean of the Cleveland bits may not be the right measure since its estimate can be influenced by a few large values. In fact in both cases the median costs are smaller than the means, but the difference in median costs is still about .9 cents.
For large data sets data plots become difficult to interpret. Except
for the extreme values, the plotted points tend to pile up and obscure
features of the data's distribution. Since summary statistics such as
the mean or the percentiles have the same interpretation for a large
data set as well as for a small one, we use boxplots to display some
of the most important statistics. Boxplots show basic features of a
distribution without plotting every point. The function
bplot is
designed to produce boxplots for several data sets lined up on the
same scale.
bplot(z.price ./ z.holes,z.brand)
xlabel('Brand')
ylabel('Price per Hole')
title('Price per Hole for Drill Bits')
Each box is drawn between the 25th and 75th
percentiles of the data, and a line is drawn at the median.
(The definition of the sample 25th and 75th percentiles are slightly
different from those in
stats.)
Thus the
box depicts the middle part of the distribution.
The drill bit data have been used as an example because we
have already examined their distribution.
The boxplots reflect the same
information that we obtained from the original plot. The tight
clustering of the middle four values for the Cleveland brand is
reflected in the boxplot by the small box and large whiskers.
The next example is an experiment that has more complicated structure. Small propulsion units, called actuators, are used to maneuver space craft once they are in space. In order to control these motions accurately, the actuator needs to produce a precise amount of force. The data set actuator is an experiment to understand what factors affect the variability of the force produced by an actuator. The actuator is fired using compressed air, and the factors studied are the actuator used (act), the amount of pressure used (press ), the length of the air supply line (line) and the nozzle type (nozzle). The experimental results are formatted as a structure array. So type viewdata(actuator) to see the full data set (you can learn more about the function viewdata by typing help viewdata).
viewdata(actuator)
Obs act press line nozzle force order 1 A2 30psi 20ft straight 0.099202 11 2 A1 30psi 20ft straight 0.070762 5 3 A2 30psi 40ft straight 0.075433 12 4 A1 30psi 40ft straight 0.068982 8 5 A2 100psi 20ft straight 0.487460 3 6 A1 100psi 20ft straight 0.442800 2 7 A2 100psi 40ft straight 0.503940 9 8 A1 100psi 40ft straight 0.370020 6 9 A2 30psi 20ft rightang 0.083024 7 10 A1 30psi 20ft rightang 0.058082 15 11 A2 30psi 40ft rightang 0.067684 1 12 A1 30psi 40ft rightang 0.049787 4 13 A2 100psi 20ft rightang 0.486400 10 14 A1 100psi 20ft rightang 0.390820 16 15 A2 100psi 40ft rightang 0.486730 14 16 A1 100psi 40ft rightang 0.372550 13
Each of these factors has two different levels, and all 16
possible combinations were studied. Good experimental procedure is to
make these runs in a random order to reduce any bias due to a possible change
in lab conditions over time. The actual order in which the data were
generated is reported in the last column of the data frame.
For
example, the first run in this experiment was made on the second
actuator, at a pressure of 30 psi, using a 40 foot air supply line and
a right angle nozzle (see the
row). The resulting force
was .067684 pounds.
A simple summary of these data is to make a series of boxplots by
dividing the data based on each factor. For example,
subplot(2,2,1)
bplot(actuator.force,actuator.act)
title('Actuator Force by Actuator')
Now repeat this for each of the factors press, line, nozzle in order
to see which ones make a difference on the response variable force.
For example, the next subplot would be
subplot(2,2,2)
bplot(actuator.force,actuator.press)
title('Actuator Force by Pressure')
This data set may seem confusing to analyze because all of the runs are different. The sequence of boxplots only considers one variable at a time, ignoring the levels of the other factors. Even though runs 11 and 5 use the same supply line pressure they use different actuators. How can one make a comparison if some of the factors are always at different levels? This property is actually a strength of the design and with the correct statistical techniques makes it possible to investigate how these factors interact with one another. We will look at this data again in a later lab.
The data set quake contains the strengths of earthquakes measured at the earths continental plates. This data set has a very different character than testing actuators! The experimenter cannot specify where or when earthquakes will happen and cannot control the environment to make the experimental conditions similar. Once an earthquake occurs in a region, the geology of the region has changed slightly and subsequent earthquakes may be affected by this new orientation. Data that are not the result of a designed experiment are termed observational.
To see the structure of the data set quake, type
viewdata(quake,5)
Much of the earth's seismic activity is due to motion of the large
plates that make up the crust of the earth. Earthquakes occur when a
buildup in tension between two layers of rock is suddenly released.
For this reason many earthquakes occur at plate boundaries. The
component strength
in this data set gives the strength of a
particular earthquake. The components
lon and
lat give the
location of the event, and
plateid is a numerical code for the
plate boundary.
A basic issue with these data is whether the strength of an earthquake
depends on the plate boundary. Perhaps some plates are more dynamic than
others. One simple plot is a sequence of boxplots of strength for each plate
boundary:
bplot(quake.strength,quake.plateid)
Lining up boxplots is an efficient way to summarize a large number of
groups.
From these boxplots we see that the range of the strengths (height
of the boxes)
is much more variable that the center (median line) across different
plate boundaries.
Meteorological measurements are also a rich source of observational data
sets. Here we analyze for illustration the data set
raleigh.temp,
a time series containing 40 years of average monthly temperatures for
Raleigh, North Carolina, from 1949 to 1988. The first step with any
time series data is to plot the data over time.
plot(raleigh.temp.temp)
Having a temperate climate, Raleigh has a clear seasonal cycle for its
monthly temperatures, but apparently no obvious long term trend.
To look at the seasonal trend in a different way, let's make a sequence
of boxplots for each month (this may take a few minutes on slower
machines, alternatively just click to see the figure
figure):
bplot(raleigh.temp.temp,raleigh.temp.month)
Note that this cycle not
only involves the mean level but also the amount of scatter about the
center (e.g., consider the changes in box and whisker lengths).
>> viewdata(college,3) Obs school pop inc undergrad graduate fees loc 1 ark 2399 15.4 85.7 7.0 1.540 s 2 ala 4136 16.2 200.3 20.9 1.699 s 3 geo 6751 18.1 237.3 31.8 1.763 sThis is a sample of fifteen states with certain statistics taken from the Chronicle of Higher Education. All entries are in thousands so that Arkansas has a population of 2,399,000, a yearly per capita income of $15,400, 85,700 undergraduates students, 7,000 graduate students, and average cost of tuition and fees at public universities of $1,540, and is located in the south (s for south). Make a boxplot to compare the tuition and fees by region (e.g., bplot(college.fees,college.loc)). Where is the cheapest region to go to school? Use the stats function get the means of the fees by region.
viewdata(airplane,8)
>> viewdata(airplane,8) Obs distance treatment 1 13.2 treat1 2 14.2 treat1 3 12.3 treat1 4 14.3 treat1 5 11.3 treat1 6 10.3 treat1 7 14.5 treat2 8 16.3 treat2
This type of designed experiment is called a completely randomized design (CRD) because the 24 experimental units were randomly assigned to the four treatments. Using the methods presented in this lab, comment on the differences among these four treatments.