Suppose that pairs of measurements are made where the mean of
one variable is a linear function of the other. For example,
suppose that Y is the average minimum January (AMJ) temperature
for a city in the US, and X is its latitude. These two variables
have been collected at the locations of the 50 largest cities in
the US and are in the S data sets
climate$jan and
climate$lat. Examining a scatter plot of AJM temperature versus
latitude, one can see that these variables are related, and moreover
the relationship appears to be linear. A reasonable model for AMJ
temperature is that on the average, it is equal to
where X is the latitude.
One problem with this fairly vague description is that it does
not provide enough information to estimate
and
.
Here is a more precise statistical model. Let
,
...,
denote the 50 pairs of latitude and AJM
temperatures. Then we will assume that
The random components
are assumed to be independent random variables with
zero mean (
) and a constant variance
(
). This model explains
the variation in the AJM
temperatures as having two components: it is a linear
function of the latitude plus a random perturbation. The
combination of the two will give a scatterplot that appears roughly
along a line but with some random departures from one case to
another. This viewpoint can be exploited to estimate the linear
relationship in an objective manner.
For any choice of slope and intercept (
and
),
one can predict the AMJ temperature from the latitude value. The
idea of least squares is to find the choice of slope and intercept
that give the best fit among the observed AMJ temperatures and
those predicted from the latitudes. Here is how this works in
mathematical notation. Let
Then the least squares estimates of
and
are the
values that minimize this sum of squares. For simple models and
small data sets, the estimates of
and
have a
formula that can be evaluated using a calculator.
However, our interest is in using S to do the
calculations so that we can look at larger data sets and eventually
fit more complicated curves.
The S function
lm takes as its main argument a formula for
the data. For variables Y and X and the model
the function call would be
lm(Y ~ X).
There are three significant differences between this formula and the
model:
~) is used in place of an equal
sign.
lm(climate$jan
~ climate$lat) -> zork
plot(climate$lat,climate$jan)
seq(21,48,,30) -> x
zork$coef[1] + zork$coef[2]*x -> y
lines(x,y)
The command seq(21,48,,30) -> x puts 30 points between 21 and 48 into the vector x. Then the points from the least squares line are put in y. lines(x,y) adds the line to the scatterplot. If you want to know what the components of zork are, just type names(zork). (In the above plot a simpler way to add the fitted line to the scatterplot is to replace the last three commands by lines(climate$lat,zork$fitted.values).)
The linear models function lm returns a fairly complicated list of things as the result of the least squares fit. The most important of these is perhaps the vector of estimated parameters in the component coef. Remember that the constant function is automatically included in the model. The other parameters in the model are in the order that they are specified. Thus, zork$coef has two values, the estimated intercept first and then the estimated slope corresponding to latitude.
The S function summary is used to summarize this output. In this case summary(zork) will list the parameters in a table along with error limits indicating their accuracy. The results of summarizing this regression are as follows:
summary(zork)
Call: lm(formula = climate$jan ~ climate$lat)
Residuals:
Min 1Q Median 3Q Max
-12.59 -5.69 -2.327 6.515 24.26
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 99.9955 8.6087 11.6157 0.0000
climate$lat -1.8898 0.2310 -8.1804 0.0000
Residual standard error: 8.467 on 48 degrees of freedom
Multiple R-Squared: 0.5823
Correlation of Coefficients:
(Intercept)
climate$lat -0.9903
There are several
other useful components in zork
.
In addition to plotting the estimated curve on top of the data in a
scatter plot, it is important to plot the residuals. The residuals
estimate the random component of the model. If the model actually
fits the data well, the residuals should appear randomly
distributed and not have any patterns. In this case the standard
deviation of the residuals estimates
.
If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In this case, one should not try to draw any conclusion about the estimated curve because the simple least squares procedure is not appropriate. Note, though, that the residuals always have a mean of zero due to the least squares fitting process, and therefore we often draw a reference horizontal line at y=0. An example with the climate data is as follows:
lplot(climate$lat,zork$residuals,1:50)
title("Residuals Coded by the Order of the Data Points")
yline(0)
Note that lplot in place of plot numbers the data points in the order in which they come.
Next let us try two artificial examples with ``funny'' residuals. The first is an attempt to fit a linear function to one that is really a quadratic polynomial.
seq(0,2,,30) -> x
rnorm(30,mean = 0,sd = .2) -> error
x + x**2 + error -> y
plot(x,y)
lm(y ~ x) -> out
plot (x,out$residuals)
yline(0)
The resulting residual plot will look far from random and exhibit an obvious curve.
Another case is when one or more values are unusual relative to the others. To illustrate this situation, first just use the previous x and error values to generate a data set that follows the linear model correctly.
1.0 + 3.0*x + error -> y
lm(y ~ x) -> out
plot(x,out$residuals)
yline(0)
This residual plot should look random. There will be no obvious pattern to the scatter in the values of the residuals. Now add an ``outlier:'' replace the 15th y value with something much larger, say 10.2.
10.2 -> y[15]
Repeating the lm and plot steps above gives a residual plot where the 15th value deviates from zero significantly more than the others. Obviously by our construction, it should not have been there in the first place! Type plot(x,y) to see how ``out of place'' the point looks.
Sometimes you would like to recalculate the least squares fit having omitted a data point. This process is now illustrated with the climate data.
climate$jan[c(-38,-45)] ->
jan.new
climate$lat[c(-38,-45)] -> lat.new
lm(jan.new ~ lat.new) -> zork.new
Here the 38th and 45th points have been omitted. What is special about these cities? ( climate$city lists the names.) Can you give some geographic or climatological reasons for why these data points are unusual?
insulate.dat -> z
lplot(z$temp,z$gas,z$insulation)
Then create two data sets, z[1:26,]->z1, z[27:56,]->z2, and fit simple linear models of gas versus temp for each data set. Overlay the fitted lines on the above plot with the lines command. Also plot the residuals. Are there any patterns or unusual points? What conclusions can you draw about insulating a house?