Lab 4: Estimating a Linear Relationship

Much of the practical use of statistics involves modeling and estimating the relationship among two or more variables. We will concentrate here on simple linear regression which is the case of just two variables, Y and X. We first discuss the statistical model which relates the dependent variable Y to the independent variable X, and then introduce the curve fitting method called least squares.

A Statistical Model for a Linear Relationship

Suppose that pairs of measurements are made where the mean of one variable is a linear function of the other. For example, suppose that Y is the average minimum January (AMJ) temperature for a city in the US, and X is its latitude. These two variables have been collected at the locations of the 50 largest cities in the US and are in the S data sets climate$jan and climate$lat. Examining a scatter plot of AJM temperature versus latitude, one can see that these variables are related, and moreover the relationship appears to be linear. A reasonable model for AMJ temperature is that on the average, it is equal to where X is the latitude.

One problem with this fairly vague description is that it does not provide enough information to estimate and . Here is a more precise statistical model. Let , ..., denote the 50 pairs of latitude and AJM temperatures. Then we will assume that

The random components are assumed to be independent random variables with zero mean ( ) and a constant variance ( ). This model explains the variation in the AJM temperatures as having two components: it is a linear function of the latitude plus a random perturbation. The combination of the two will give a scatterplot that appears roughly along a line but with some random departures from one case to another. This viewpoint can be exploited to estimate the linear relationship in an objective manner.



Least Squares Estimates

For any choice of slope and intercept ( and ), one can predict the AMJ temperature from the latitude value. The idea of least squares is to find the choice of slope and intercept that give the best fit among the observed AMJ temperatures and those predicted from the latitudes. Here is how this works in mathematical notation. Let

Then the least squares estimates of and are the values that minimize this sum of squares. For simple models and small data sets, the estimates of and have a formula that can be evaluated using a calculator. However, our interest is in using S to do the calculations so that we can look at larger data sets and eventually fit more complicated curves.



The S Function lm

The S function lm takes as its main argument a formula for the data. For variables Y and X and the model the function call would be lm(Y ~ X). There are three significant differences between this formula and the model:

Now let us see how to use lm. Here are the S commands to estimate a least squares line for the climate data and then add this line to a scatterplot.

lm(climate$jan ~ climate$lat) -> zork
plot(climate$lat,climate$jan)
seq(21,48,,30) -> x
zork$coef[1] + zork$coef[2]*x -> y
lines(x,y)

The command seq(21,48,,30) -> x puts 30 points between 21 and 48 into the vector x. Then the points from the least squares line are put in y. lines(x,y) adds the line to the scatterplot. If you want to know what the components of zork are, just type names(zork). (In the above plot a simpler way to add the fitted line to the scatterplot is to replace the last three commands by lines(climate$lat,zork$fitted.values).)

The linear models function lm returns a fairly complicated list of things as the result of the least squares fit. The most important of these is perhaps the vector of estimated parameters in the component coef. Remember that the constant function is automatically included in the model. The other parameters in the model are in the order that they are specified. Thus, zork$coef has two values, the estimated intercept first and then the estimated slope corresponding to latitude.

The S function summary is used to summarize this output. In this case summary(zork) will list the parameters in a table along with error limits indicating their accuracy. The results of summarizing this regression are as follows:

summary(zork)

Call: lm(formula = climate$jan ~ climate$lat)
Residuals:
    Min    1Q Median    3Q   Max
 -12.59 -5.69 -2.327 6.515 24.26

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  99.9955   8.6087    11.6157   0.0000
climate$lat  -1.8898   0.2310    -8.1804   0.0000

Residual standard error: 8.467 on 48 degrees of freedom
Multiple R-Squared: 0.5823

Correlation of Coefficients:
            (Intercept)
climate$lat -0.9903
There are several other useful components in zork . Note that the residuals plus the predicted values equal the original Y values. It should be pointed out that the choice of zork is arbitrary in this example. But the names of the components and the use of the dollar sign cannot be altered.


Scrutinizing the Residuals

In addition to plotting the estimated curve on top of the data in a scatter plot, it is important to plot the residuals. The residuals estimate the random component of the model. If the model actually fits the data well, the residuals should appear randomly distributed and not have any patterns. In this case the standard deviation of the residuals estimates .

If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In this case, one should not try to draw any conclusion about the estimated curve because the simple least squares procedure is not appropriate. Note, though, that the residuals always have a mean of zero due to the least squares fitting process, and therefore we often draw a reference horizontal line at y=0. An example with the climate data is as follows:

lplot(climate$lat,zork$residuals,1:50)
title("Residuals Coded by the Order of the Data Points")
yline(0)

Note that lplot in place of plot numbers the data points in the order in which they come.

Next let us try two artificial examples with ``funny'' residuals. The first is an attempt to fit a linear function to one that is really a quadratic polynomial.

seq(0,2,,30) -> x
rnorm(30,mean = 0,sd = .2) -> error
x + x**2 + error -> y
plot(x,y)
lm(y ~ x) -> out
plot (x,out$residuals)
yline(0)

The resulting residual plot will look far from random and exhibit an obvious curve.

Another case is when one or more values are unusual relative to the others. To illustrate this situation, first just use the previous x and error values to generate a data set that follows the linear model correctly.

1.0 + 3.0*x + error -> y
lm(y ~ x) -> out
plot(x,out$residuals)
yline(0)

This residual plot should look random. There will be no obvious pattern to the scatter in the values of the residuals. Now add an ``outlier:'' replace the 15th y value with something much larger, say 10.2.

10.2 -> y[15]

Repeating the lm and plot steps above gives a residual plot where the 15th value deviates from zero significantly more than the others. Obviously by our construction, it should not have been there in the first place! Type plot(x,y) to see how ``out of place'' the point looks.

Sometimes you would like to recalculate the least squares fit having omitted a data point. This process is now illustrated with the climate data.

climate$jan[c(-38,-45)] -> jan.new
climate$lat[c(-38,-45)] -> lat.new
lm(jan.new ~ lat.new) -> zork.new

Here the 38th and 45th points have been omitted. What is special about these cities? ( climate$city lists the names.) Can you give some geographic or climatological reasons for why these data points are unusual?


On Your Own

  1. The data ncsu contains the number of degrees awarded by North Carolina State University from 1894 to 1983 (ncsu$degree and ncsu$year). Fit a least squares line to predict the number of degrees as a linear function of the year. From your estimated equation find the estimate of the number of degrees at the years 1900 and 1980. Comment on the appropriateness of this model.

  2. In place of the raw number of degrees, consider the natural logarithm of the number ( log(ncsu$degree) -> log.degree). Carefully describe the functional relationship between degrees and years based on this transformation. Fit a least squares line to the log degrees as a function of the year. Comment on any patterns evident in the residuals and try to explain why they might be there. Use your fitted line to predict the number of degrees that will be awarded in the year 2000. Comment on the validity of your prediction. Suggest a different strategy that might improve your prediction (hint: consider only a subset of the data).

  3. The data set earthq includes the dominant frequency and magnitude of 148 earthquakes (taken from Earthquake Engineering and Structural Dynamics, Vol. 23, p. 583-597, 1994). Make a scatter plot and then fit a least squares line to explain the relationship of earthq$freq (the dependent variable Y) to earthq$mag (the independent variable X). Superimpose the fitted straight line over the scatter plot of the data. Then plot the residuals versus earthq$mag. Do you detect any nonlinearity from the residual plot?

  4. How much energy do you save by insulating your house? The data set insulate.dat taken from A Handbook of Small Data Sets is one person's record of weekly gas consumption (gas, in 1000 cubic feet) and outside temperature (temp, in degrees Celsius), before ( insulation=0) and after (insulation=1) insulating a house. The house thermostat was set at 20 degrees Celsius during the 26 weeks before and 30 weeks after insulating. First make a plot of the data with

    insulate.dat -> z
    lplot(z$temp,z$gas,z$insulation)

    Then create two data sets, z[1:26,]->z1, z[27:56,]->z2, and fit simple linear models of gas versus temp for each data set. Overlay the fitted lines on the above plot with the lines command. Also plot the residuals. Are there any patterns or unusual points? What conclusions can you draw about insulating a house?

  5. How does a child's vocabulary change with age? The data set vocab.dat contains the average oral vocabulary size (words = number of words) for children at different ages (age) (taken from Discovering Psychology by Weiner, 1977). Plot the data and fit a straight line of words versus age. Overlay the fitted line and plot the residuals. What conclusions can you draw? Are there any unusual patterns or data points?

  6. One view of the relation of lung cancer to amount of cigarette smoking can be seen in the data set cancer.dat taken from A Handbook of Small Data Sets by Hand, et al. (1994, p.67). The data consist of the ``smoking ratio,'' ( smoke, a standardized measure of smoking amount) and the standardized mortality ratio (SMR) for males in England and Wales in 1970-72 who were working in 25 different broad groups of jobs such as textile workers, miners, etc. Plot the data, fit a straight line, and plot the residuals. What conclusions can you draw?