The most important concept is the distinction between experimental data and observational data. Experimental data are usually based on a careful design where some variables (factors) of interest are controlled or preset by the experimenter. Observational data are collected without the benefit of a controlled experimental environment. Most meteorological data are observational because there is no way to control weather variables. In this lab you will learn to recognize both types of data and to summarize their important features.
The data set drill.bit contains the results of testing two types of drill bits in the manufacture of compressors. There were two brands considered (made by Besley and Cleveland), and the measurements are the number of holes drilled until the bit breaks. The tests were done under the same manufacturing conditions, and the influence on performance due to factors other than the brand were minimized. Perhaps the simplest summary for a small data set is just plotting the values versus brand. Since the plot function cannot handle character values, we use the function lplot which is set up to plot numbers versus a character vector of categories:
lplot(drill.bit$brand,drill.bit$holes)
A better shaped plot often results from first using set.panel(2,1). Also in order to save typing we may first copy drill.bit to z
drill.bit -> z
and then just type
lplot(z$brand,z$holes)
From this plot it is clear that the Cleveland bits, on the average, tend to drill about half as many holes before breaking. But this relationship is not perfect: the best Cleveland bit actually drilled more holes than the poorest Besly bit.
One reason these data were collected was to evaluate the cost tradeoffs for these two brands. The prices for the bits are different and are reported as the third column of drill.bit (Besly $9.64 and Cleveland $4.48). In order to make a cost comparison it is also useful to consider the price per hole achieved by each drill bit. The dot plot above of the cost per hole was made by
lplot(z$brand,z$price/z$holes,
ylab='Price per Hole')
title('Price per Hole for Drill Bits')
By taking cost into account, it is not as obvious which brand is better. The Cleveland brand appears to be about one cent higher, but there is also a substantial amount of variability. For these plots it appears that the Cleveland brand had one or two bits with unusually large costs. To get a more precise description of the distributions, one can calculate some summary statistics of the costs for each brand.
stats(z$price/z$holes,by=z$brand)
besly cleveland
N 6.000000000 8.00000000
mean 0.028507448 0.03769031
Std.Dev. 0.006755421 0.01646174
min 0.021809955 0.01709924
Q1 0.023319190 0.03090796
median 0.026783969 0.03556452
Q3 0.032786746 0.03971490
max 0.038714859 0.07111111
missing values 0.000000000 0.00000000
The mean and medians measure the center of the distribution and thus correspond to locations of the middle of points in the dot diagrams. These estimates are reasonable judging from where points tend to cluster on the dot diagrams. The larger scatter for the Cleveland bits seen on the dot diagrams is confirmed by the larger standard deviation from stats.
Which is the better method for summarizing these data? A graphical presentation with dot diagrams or the table of statistics from stats ? The answer is that both are important and when used together give much more information than either one alone. The difference in means (.0377-.0285 = .009 dollars or .9 cents) provides an estimate of the cost advantage. However the dot diagrams indicate that the mean of the Cleveland bits may not be the right measure since its estimate can be influenced by a few large values. In fact in both cases the median costs are smaller than the means, but the difference in median costs is still about .9 cents.
For large data sets dot diagrams become difficult to interpret. Except for the extreme values, the plotted dots tend to pile up and obscure features of the data's distribution. Since summary statistics such as the mean or the percentiles have the same interpretation for a large data set as well as for a small one, we use boxplots to display some of the most important statistics. Boxplots show basic features of a distribution without plotting every point. The function bplot is designed to produce boxplots for several data sets lined up on the same scale. For the drill bit data one can use the by parameter to divide the costs into the two brands (note similarity to the by argument of stats):
bplot(z$price/z$holes,by=z$brand,
ylab='Price per Hole')
title('Price per Hole for Drill Bits')
Each box (see top of next page) is drawn between the 25th and 75th percentiles of the data, and a line is drawn at the median. Thus the box depicts the middle part of the distribution. In fact we know that only 25% of the values will lie above the box and 25% will lie below. To give a sense of the range of the data, lines called whiskers are drawn from the ends of the boxes to the 5th and 95th percentiles. The last step is to plot the values in the data set that lie beyond the range of the whiskers since the extreme values in a data set are often important. The drill bit data have been used as an example because we have already examined its distribution. The boxplots reflect the same information that we obtained from the dot diagram. The tight clustering of the middle four values for the Cleveland brand is reflected in the boxplot by the small box and large whiskers.
The bplot function can take several different types of data sets and produce sensible results. The example above shows how to describe the different groups using a by option. If zork is the name of a data frame or list, then bplot(zork) will line up boxplots for the different columns or components in zork using a common scale. The data sets can also be fed to bplot separately. If z1, z2, and z3 are individual data sets, then
bplot(z1,z2,z3)
will line up three boxplots on a common scale.
The next example is an experiment that has more complicated structure. Small propulsion units, called actuators, are used to maneuver space craft once they are in space. In order to control these motions accurately, the actuator needs to produce a precise amount of force. The data set actuator is an experiment to understand what factors effect the variability of the force produced by an actuator. The actuator is fired using compressed air, and the factors studied are the actuator used (act), the amount of pressure used (press ), the length of the air supply line ( line) and the nozzle type ( nozzle). The experimental results are formatted as a data frame.
actuator
act press line nozzle force order 1 A2 30psi 20ft straight 0.099202 11 2 A1 30psi 20ft straight 0.070762 5 3 A2 30psi 40ft straight 0.075433 12 4 A1 30psi 40ft straight 0.068982 8 5 A2 100psi 20ft straight 0.487464 3 6 A1 100psi 20ft straight 0.442802 2 7 A2 100psi 40ft straight 0.503937 9 8 A1 100psi 40ft straight 0.370022 6 9 A2 30psi 20ft right.ang 0.083024 7 10 A1 30psi 20ft right.ang 0.058082 15 11 A2 30psi 40ft right.ang 0.067684 1 12 A1 30psi 40ft right.ang 0.049787 4 13 A2 100psi 20ft right.ang 0.486403 10 14 A1 100psi 20ft right.ang 0.390816 16 15 A2 100psi 40ft right.ang 0.486725 14 16 A1 100psi 40ft right.ang 0.372554 13
Each of these factors has two different levels, and all 16
possible combinations were studied. Good experimental procedure is to
make these runs in a random order to reduce any bias due to a possible change
in lab conditions over time. The actual order that the data was
generated is reported in the last column of the data frame. For
example, the first run in this experiment was made on the second
actuator, at a pressure of 30 psi, using a 40 foot air supply line and
a right angle nozzle (see the
row). The resulting force was .067684 pounds.
A simple summary of these data is to make a series of boxplots by dividing the data based on each factor. For example, to look at the actuator factor:
bplot(actuator$force,by=actuator$act)
Now repeat this for each of the factors press, line, nozzle in order to see which ones make a difference on the response variable force.
Since we need to examine 4 factors, one could
put all the plots on a single page using
set.panel(2,2).
This setting will produce a
table of graphs.
This data set may seem confusing to analyze because all of the runs are different. The sequence of boxplots only considers one variable at a time, ignoring the levels of the other factors. Even though runs 11 and 5 use the same supply line pressure they use different actuators. How can one make a comparison if some of the factors are always at different levels? This property is actually a strength of the design and with the correct statistical techniques makes it possible to investigate how these factors interact with one another.
An interaction between two variables occurs when the observed response is not the sum of the effects from the individual factors. Although a thorough statistical analysis of this type of data will be deferred to Lab 14, one can still present the interactions among factors using plots. Based on a more complete statistical analysis, an interaction was identified between the actuator and the pressure. To examine this case we need to plot the force measurements against one of the factors but label the plotted points using the other factor:
lplot(actuator$act,actuator$force, actuator$press)
Here the force is plotted against the two actuator categories but the points are labeled with the pressure categories.
Based on this graph the difference in force between the two actuators depends strongly on what pressure level is being used. At high pressure levels there is a much larger discrepancy between the two actuators. Since consistency among actuators is desirable, this interaction is important but came as surprise to the experimenter. Before this data was collected, it was assumed that the actuators functioned in a very similar manner. A simple conclusion from this experiment might be that more consistent forces can be obtained across different actuators if one sticks to low pressures.
The data set quake contains the strengths of earthquakes measured at the earths continental plates. This data set has a very different character than testing actuators! The experimenter cannot specify where or when earthquakes will happen and cannot control the environment to make the experimental conditions similar. Once an earthquake occurs in a region, the geology of the region has changed slightly and subsequent earthquakes may be effected by this new orientation. Data that are not the result of a designed experiment are termed observational.
To see the structure of the data set quake, type
view(quake)
The function view is used here instead of just typing quake because this data set has 496 rows, and it is painful to watch 496 rows of data printed to the screen. view prints out the first 10 rows of a data set or the first k if you specify view(filename,k).
Much of the earth's seismic activity is due to motion of the large plates that make up the crust of the earth. Earthquakes occur when a buildup in tension between two layers of rock is suddenly released. For this reason many earthquakes occur at plate boundaries. The component strength in this data set gives the strength of a particular earthquake. The components lon and lat give the location of the event, and plate.ID is a numerical code for the plate boundary. To view some of the plate boundaries one can just add the locations of the earthquakes in this data set to a map of the world.
set.panel(1,1)
world()
points(quake$lon,quake$lat)
Note that the plate boundaries often occur in the middle of the
oceans but not surprising one is along the California coast. To put the plate ID on the map instead of just a point use text:
world()
text(quake$lon,quake$lat,quake$plate.ID)
A basic issue with these data is whether the strength of an earthquake depends on the plate boundary. Perhaps some plates are more dynamic than others. Lining up boxplots is an efficient way to summarize a large number of groups.
bplot(quake$strength,by=quake$plate.ID)
From these boxplots we see that the range of the strengths (whiskers) is much more variable that the center (median line) across different plate boundaries. Also the distributions tend to be asymmetric. The part of the box and the whisker above the median is longer than these features above the median. Thus, relatively speaking, smaller earthquakes are closer to the median than the largest ones.
In viewing these boxplots, keep in mind that the sample sizes differ among the plate groups. In fact when groups have less than 5 observations, a boxplot is not drawn and only the individuals points are plotted.
One feature of these data is the relationship between the scatter of the strengths and the mean level. One measure of scatter is the standard deviation, and here are the steps to make a scatterplot of the standard deviation versus the means:
stats(quake$strength,by=quake$plate.ID)
-> hold
plot(hold[2,],hold[3,],xlab='Mean Strength',
ylab='Standard Dev. of Strength')
This plot is made using the fact that stats returns a matrix with means and standard deviations in the the second and third rows. From the scatter plot there appears to be an increasing relationship. Higher mean levels are associated with higher amounts of scatter.
Meteorological measurements are a rich source of observational data sets. Here we analyze for illustration the data set raleigh.temp, a time series containing 40 years of average monthly temperatures for Raleigh, North Carolina, from 1949 to 1988. Recall that you can use view(raleigh.temp) to see the first 10 rows of the data set. The first step with any time series data is to plot the data over time.
set.panel(3,1)
plot(raleigh.temp$temp,type='l')
(Note: type='p' for ``points'' is the default, type='l' is for lines. It is easy to confuse the 'l' for a '1' which returns an error message. Type ex(plot) to see examples.) The plot window was divided up in to 3 parts so that the plot would come out long and skinny. This format helps to spot trends over time. It is also useful to join the points with line segments using type='l' in the plot command.
Having a temperate climate, Raleigh has a clear seasonal cycle for its monthly temperatures. The rest of this section will suggest some ways of studying the seasonal cycle more closely. A good start is to use dot diagrams and boxplots to compare the temperatures across different months.
lplot(raleigh.temp$month,raleigh.temp$temp,
ylab='Monthly Average Temperatures')
bplot(raleigh.temp$temp,by=raleigh.temp$month,
ylab='Monthly Average Temperatures')
Each month has 40 values, so that the simple dot diagram ( lplot) has too many points to be clear. The sequence of boxplots, however, helps to track the annual variation in temperature. Note that this cycle not only involves the mean level but also the amount of scatter about the center (e.g., consider the changes in whisker and box lengths). To reduce these distributions to a smaller number of statistics, one can just consider the means and standard deviations by month. The following sequence of plots considers the means and standard deviations separately and then looks for a possible relationship between average temperature for a month and the variability about this average. The plots will be organized on a single page to make comparisons easier.
set.panel(3,1)
stats(raleigh.temp$temp,by=raleigh.temp$month)-> hold
plot(hold[2,],ylab='Monthly Mean')
plot(hold[3,],ylab='Monthly S.D.')
plot(hold[2,],hold[3,],xlab='Mean',ylab='S.D.')
Does the pattern of standard deviations reflect your experience about variability in temperatures throughout the year?
What about trends over time? The large seasonal cycle makes it difficult to see smaller but steady trends over the 40 years. One way to do this is to just compute yearly averages and then plot them.
stats(raleigh.temp$temp,by=raleigh.temp$year) -> hold
plot(hold[2,])
Do you see any trends? Where were the periods of unusually hot weather?
Describe how the ozone layer varies from month to month (e.g.,
bplot(ozone.sw$ozone,
by=ozone.sw$month)).
Is the variability approximately the same from month to
month (e.g., does the standard deviation of mean ozone thickness
change from month to month)?
(Just copy the code used in the analysis of
raleigh.temp.)
Is there evidence
that the ozone layer is changing over time? (The missing values bias
a few of the points in the early years.)
cat.to.list(raleigh.temp$temp,raleigh.temp$month) -> look
The list data set look will have components jan, feb, etc., that contain each of the month's temperatures (in the correct time order). Then use plot(look$jul,look$aug). From the scatterplot make a rough estimate of how often Raleigh gets a summer where both July and August are scorchers. Make a similar analysis of the months January and February.
pop inc undgrad grad fees loc ark 2399 15.4 85.7 7.0 1.540 s ala 4136 16.2 200.3 20.9 1.699 s geo 6751 18.1 237.3 31.8 1.763 sThis is a sample of fifteen states with certain statistics taken from the Chronicle of Higher Education. All entries are in thousands so that Arkansas has a population of 2,399,000, a yearly per capita income of $15,400, 85,700 undergraduates students, 7,000 graduate students, and average cost of tuition and fees at public universities of $1,540, and is located in the south (s for south). Make a boxplot to compare the tuition and fees by region ( bplot(college$fees,by=college$loc)). Where is the cheapest region to go to school? Are the tuition and fees related to the per capita income? (Make a plot.) Is the per capita income related to the ratio of undergraduates to population? (Hint: for plots with the points labeled, use lplot with the 3rd argument row.names(college).)
airplane
treat 1 treat 2 treat 3 treat 4 1 13.2 14.5 10.2 8.5 2 14.2 16.3 8.4 6.7 3 12.3 11.4 7.6 6.5 4 14.3 17.3 9.1 7.4 5 11.3 18.4 12.1 9.3 6 10.3 15.3 13.2 10.5
This type of designed experiment is called a completely randomized design (CRD) because the 24 experimental units were randomly assigned to the four treatments. Using the methods presented in this lab, comment on the differences among these four treatments.
Working backwards from the current ``true value'' of the velocity of light in vacuum (299,792.5 km/sec) and using Michelson's own adjustment for the effect of air, the comparable ``true value'' for these data is 734.5. Comment on the accuracy and precision of Michelson's experiment. (``Accuracy'' relates to how close your measurements and estimate are to the true value you are trying to estimate, and ``precision'' relates to the variablility or repeatability of your measurements around your estimate.)