M-Lab 4: Estimating a Linear Relationship

Much of the practical use of statistics involves modeling and estimating the relationship among two or more variables. We will concentrate here on simple linear regression which is the case of just two variables, Y and X. We first discuss the statistical model which relates the dependent variable Y to the independent variable X, and then introduce the curve fitting method called least squares.

A Statistical Model for a Linear Relationship

Suppose that pairs of measurements are made where the mean of one variable is a linear function of the other. For example, suppose that Y is the average minimum January (AMJ) temperature for a city in the US, and X is its latitude. These two variables have been collected at the locations of the 50 largest cities in the US and are in the MATLAB data sets climate.jan and climate.lat. Examining a scatter plot of AJM temperature versus latitude,

plot(climate.lat,climate.jan,'*')

one can see that these variables are related, and moreover the relationship appears to be linear. (Note that the MATLAB plot function uses the form plot(x,y), with the independent variable first.) A reasonable model for AMJ temperature is that on the average, it is equal to where X is the latitude.

One problem with this fairly vague description is that it does not provide enough information to estimate and . Here is a more precise statistical model. Let , ..., denote the 50 pairs of latitude and AJM temperatures. Then we will assume that

The random components are assumed to be independent random variables with zero mean ( ) and a constant variance ( ). This model explains the variation in the AJM temperatures as having two components: it is a linear function of the latitude plus a random perturbation. The combination of the two will give a scatterplot that appears roughly along a line but with some random departures from one case to another. This viewpoint can be exploited to estimate the linear relationship in an objective manner.



Least Squares Estimates

For any choice of slope and intercept ( and ), one can predict the AMJ temperature from the latitude value. The idea of least squares is to find the choice of slope and intercept that give the best fit among the observed AMJ temperatures and those predicted from the latitudes. Here is how this works in mathematical notation. Let

Then the least squares estimates of and are the values that minimize this sum of squares. For simple models and small data sets, the estimates of and have a formula that can be evaluated using a calculator. However, our interest is in using MATLAB to do the calculations so that we can look at larger data sets and eventually fit more complicated curves.



The MATLAB Function lm

Recall from M-Lab 3 that the MATLAB function lm is activated by typing lm(data set). Here are the steps to use lm for the data set climate with a model which includes as response (Y) climate.jan and climate.lat as an independent variable (X). First, we type

lm(climate)

We obtain then the following message:

Variable Names in Input Dataset

  lat
  jan
  rain
  city
  jul
  elev
  lon

Enter class or model statement or type ex for examples

The MATLAB prompt will change to lm>> (but here we suppress that notation since you don't actually type it) and we define the model to fit using the command

model jan = lat

The results of this command are

Sequential Sums of Squares ANOVA Table

Source       df            SS            MS           F         p-val

Intercept     1    45777.3282    45777.3282    638.4939    0.00000000
lat           1     4797.7835     4797.7835     66.9186    1.1773e-10
Error        48     3441.3983       71.6958

R-square  0.58231

Standard Error  8.4673

Parameter Estimates

Source       Parameter Estimate    Std. Error          t         p-val

Intercept               99.9955       8.60870    11.6157    1.5543e-15
lat                     -1.8898       0.23102    -8.1804    1.1773e-10

The above output is similar to that shown in Lab 3 but also has some additional lines with the parameter estimates along with their standard errors (estimated standard deviations). For an explanation of some of the items, click on annotated output. Thus the fitted model is

jan = 99.9955 + -1.8898*lat
Here is the MATLAB command to plot the response (Y=jan) versus the regressor (X=lat) and add the least square line:

pplot by lat lines=1

To plot response versus regressor adding the predicted values, type

pplot by lat

The predicted values are circles in the plot, and the crossed points are the observed values.

Scrutinizing the Residuals

The predicted values are actually , , where and are the least squares estimates. Another very important set of values are the residuals . Note that the residuals plus the predicted values equal the original Y values. The residuals estimate the random component of the model, the ei in the model

If the model actually fits the data well, the residuals should appear randomly distributed and not have any patterns. In this case the standard deviation of the residuals estimates . (Actually we divide by n-2 instead of n-1 in the standard deviation formula. This estimate of is called "Standard Error" in the output above.)

Usually we plot the residuals to look for

a) outliers (unusual points)

b) patterns which might indicate curvature

c) patterns which suggest that the variance of the errors ei changes with the x variable

If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In the case of outliers, we might decide to delete them from the data set. If there seems to be curvature, then a straight line fit is not appropriate. Transforming the y value (like log(y)) sometimes helps get rid of the curvature.

Note, though, that the residuals always have a mean of zero due to the the least squares fitting process, and therefore we often draw a reference horizontal line at y=0. An example with the climate data is as follows. (If you are not still in lm, then type lm(climate) followed by model model jan = lat. Then

rplot by lat

Note the two outliers in the upper right portion of the plot. We discussed those two cities in M-Lab 1. Another interesting residual plot, it is the residuals versus predicted values:

rplot by predict

This plot is often used to see if the variation about the center line is greater for larger values of the predicted Y's. If so, then a transformation like log(y) is often suggested. There is no evidence of a need for a transformation in this example.

You can place a variable named resid that contains the residuals into the main workspace by typing (please do this)

output r=resid

Similarly you can place a variable named predict that contains the predicted values into the main workspace by typing (please do this)

output p=predict

To get out of lm, just hit return. For more information about lm type help lm
once you are out of lm.

Just for fun, copy over the following commands in one piece to see how you can use the resid and predict vectors outside lm.

subplot(2,1,1)
plot(climate.lat,climate.jan,'*')
hold on
plot(climate.lat,predict)
hold off
subplot(2,1,2)
plot(climate.lat,resid,'*')


On Your Own

  1. The data ncsu contains the number of degrees awarded by North Carolina State University from 1894 to 1983 (ncsu.degree and ncsu.year). Fit a least squares line to predict the number of degrees as a linear function of the year. From your estimated equation find the estimate of the number of degrees at the years 1900 and 1980. Comment on the appropriateness of this model.

    (Hint: look at the actual degrees granted in those years versus the estimated number.)

  2. In place of the raw number of degrees, consider the natural logarithm of the number (use log(degree) within lm). Carefully describe the functional relationship between degrees and years based on this transformation. Fit a least squares line to the log degrees as a function of the year. Comment on any patterns evident in the residuals and try to explain why they might be there. Use your fitted line to predict the number of degrees that will be awarded in the year 2000. Comment on the validity of your prediction. Suggest a different strategy that might improve your prediction (hint: consider only a subset of the data).

  3. The data set earthq includes the dominant frequency and magnitude of 148 earthquakes (taken from Earthquake Engineering and Structural Dynamics, Vol. 23, p. 583-597, 1994). Fit a least squares line to explain the relationship of freq (the dependent variable Y) to mag (the independent variable X). Superimpose the fitted straight line over the scatter plot of the data (recall, pplot by mag lines=1). Then plot the residuals versus mag (rplot by mag). Do you detect any nonlinearity from either plot?

  4. How much energy do you save by insulating your house? The data set insulate taken from A Handbook of Small Data Sets is one person's record of weekly gas consumption (gas, in 1000 cubic feet) and outside temperature (temp, in degrees Celsius), before ( insulation=0) and after (insulation=1) insulating a house. The house thermostat was set at 20 degrees Celsius during the 26 weeks before and 30 weeks after insulating. First make a plot of the data with

    z=insulate
    plot(z.temp,z.gas,'*')

    Since the there appears to be two distinct groups, lets make two data sets depending on the value of z.insulation:

    z0=substr(insulate,'insulation=0')
    z1=substr(insulate,'insulation=1')

    Now let's plot them and put least squares lines through them using lsline.

    plot(z1.temp,z1.gas,'*')
    lsline
    hold on
    plot(z0.temp,z0.gas,'+')
    lsline
    hold off

    Find the least squares lines for z0 and z1 using lm and compare them. What conclusions can you draw about insulating a house?

  5. How does a child's vocabulary change with age? The data set vocab contains the average oral vocabulary size (words = number of words) for children at different ages (age) (taken from Discovering Psychology by Weiner, 1977). Plot the data and fit a straight line of words versus age. Overlay the fitted line and plot the residuals. What conclusions can you draw? Are there any unusual patterns or data points?

  6. One view of the relation of lung cancer to amount of cigarette smoking can be seen in the data set cancer taken from A Handbook of Small Data Sets by Hand, et al. (1994, p.67). The data consist of the ``smoking ratio,'' ( smoke, a standardized measure of smoking amount) and the standardized mortality ratio (SMR) for males in England and Wales in 1970-72 who were working in 25 different broad groups of jobs such as textile workers, miners, etc. Plot the data, fit a straight line, and plot the residuals. What conclusions can you draw?

  7. Temperature should have an effect on how fast one can run a race. The data set marathon lists daily temperature and the winning times at the New York marathon for both men and women for the years 1978-1998. Fit a straight line for the women's times marathon.wtime versus the temperature marathon.temp. Give the fitted equation and tell whether a straight line seems appropriate.

  8. Repeat the previous question using the men's times marathon.mtime in place of the women's times.

  9. In Spring 2000 a team measured ping times of Internet servers at various distances from Raleigh using a software program called NeoTrace. They actually measured ping times at 4 different times of the day, but since there was very little difference over time, we have averaged over the times of day and constructed the data set ping with variables dist and time. Use lm to find the least squares line of ping time versus distance. Within lm make plots with
    lm>> subplot(2,1,1)
    lm>> pplot by dist lines=1
    lm>> subplot(2,1,2)
    lm>> rplot by dist
    
    Write down the least squares equation. Do either of the two plots indicate that a straight line is inappropriate or that there are outliers or unusual patterns?

  10. Is the US population getting older? The first three lines of the data set us_age is
    >> viewdata(us_age,3)
    
    Obs    all      f      m   year
    
      1   36.4   37.6   35.0   1999
      2   36.2   37.5   34.9   1998
      3   36.1   37.4   34.7   1997
    
    It contains the average age for all Americans (all), females (f), and males (m) for the years 1990-1999. Actually, the data from the US Census Bureau are based on the 1990 census and then updated yearly. First plot the data by copying and pasting
    plot(us_age.year,us_age.all,'*',us_age.year,us_age.f,'+',...
    us_age.year,us_age.m,'o')
    axis([1989 2000 33 39])
    
    (Recall that the dots ... allow one to continue to the next line in matlab. Also, you can put multiple plots on the same graph using one plot statement.) Find separate least squares fits of average age to year for females and males. State the meaning of the slopes in this situation (e.g., what happens to the mean age if you change the year from one value to another?). Is one group lengthening its age faster than the other? To actually test this hypothesis, look at the slope and p-value for a least squares fit of the age difference between females and males to year.

  11. How fast is the US population growing? us_pop contains the US population in millions for the years 1900-1999. First plot the data and notice that the growth rate at the beginning of the century seems to be less than at the end. Then create two data sets

    z=us_pop
    z1=subset(z,'(obs#>70)')
    z2=subset(z,'(obs#<31)')

    Use lm to get least squares lines for the two data sets. Recall that after the model statement, pplot by year lines=1 will show whether a straight line is appropriate. Then explain whether the straight lines are appropriate in these cases and how the two growth rates compare. These data are from the US Census Bureau website.

  12. One of the problems at the end of M-Lab 2 used the data set draft to investigate the randomness of the 1970 Draft Lottery. In a followup study, Sommers (2003, Chance Magazine) looked up deaths by age and birthday on the Vietnam Veterans Memorial (apparently available online at thewall-usa.com). Thus, the data set has deaths by month as well as average lottery number by month for men with birthdays in the years 1944-1950 (those eligible in the 1970 draft). First, plot the deaths by average monthly lottery number, plot(draft.lottnum,draft.deaths,'*') (for fun, add the month names with text(draft.lottnum,draft.deaths,draft.month) . Then use lm to fit a straight line fit of deaths versus average monthly lottery number. Which month has the largest residual from the straight line fit? It appears that getting a low lottery number was not just unlucky but translated into more deaths as might be expected since relatively more men were drafted from the low lottery number months.