Suppose that pairs of measurements are made where the mean of one variable is a linear function of the other. For example, suppose that Y is the average minimum January (AMJ) temperature for a city in the US, and X is its latitude. These two variables have been collected at the locations of the 50 largest cities in the US and are in the MATLAB data sets climate.jan and climate.lat. Examining a scatter plot of AJM temperature versus latitude,
plot(climate.lat,climate.jan,'*')
one can see that these variables are related, and moreover
the relationship appears to be linear.
(Note that the MATLAB
plot
function uses the form plot(x,y), with the independent variable first.)
A reasonable model for AMJ
temperature is that on the average, it is equal to
where X is the latitude.
One problem with this fairly vague description is that it does
not provide enough information to estimate
and
.
Here is a more precise statistical model. Let
,
...,
denote the 50 pairs of latitude and AJM
temperatures. Then we will assume that
The random components
are assumed to be independent random variables with
zero mean (
) and a constant variance
(
). This model explains
the variation in the AJM
temperatures as having two components: it is a linear
function of the latitude plus a random perturbation. The
combination of the two will give a scatterplot that appears roughly
along a line but with some random departures from one case to
another. This viewpoint can be exploited to estimate the linear
relationship in an objective manner.
For any choice of slope and intercept (
and
),
one can predict the AMJ temperature from the latitude value. The
idea of least squares is to find the choice of slope and intercept
that give the best fit among the observed AMJ temperatures and
those predicted from the latitudes. Here is how this works in
mathematical notation. Let
Then the least squares estimates of
and
are the
values that minimize this sum of squares. For simple models and
small data sets, the estimates of
and
have a
formula that can be evaluated using a calculator.
However, our interest is in using MATLAB to do the
calculations so that we can look at larger data sets and eventually
fit more complicated curves.
Recall from M-Lab 3 that
the MATLAB function
lm is activated by typing
lm(data set).
Here are the steps to use
lm
for the data set
climate
with a model which includes as response (
lm(climate)
We obtain then the following message:
Variable Names in Input Dataset lat jan rain city jul elev lon Enter class or model statement or type ex for examples
The MATLAB prompt will change to lm>> (but here we suppress that notation since you don't actually type it) and we define the model to fit using the command
model jan = lat
The results of this command are
Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 45777.3282 45777.3282 638.4939 0.00000000 lat 1 4797.7835 4797.7835 66.9186 1.1773e-10 Error 48 3441.3983 71.6958 R-square 0.58231 Standard Error 8.4673 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept 99.9955 8.60870 11.6157 1.5543e-15 lat -1.8898 0.23102 -8.1804 1.1773e-10
The above output is similar to that shown in Lab 3 but also has some additional lines with the parameter estimates along with their standard errors (estimated standard deviations). For an explanation of some of the items, click on annotated output. Thus the fitted model is
jan = 99.9955 + -1.8898*latHere is the MATLAB command to plot the response (Y=jan) versus the regressor (X=lat) and add the least square line:
pplot by lat lines=1
To plot response versus regressor adding the predicted values, type
pplot by lat
The predicted values are circles in the plot, and the crossed points are the observed values.
The predicted values are actually
,
,
where
and
are the least squares
estimates. Another very important set of values are the
residuals
.
Note that the
residuals plus the predicted values equal the original Y values.
The residuals
estimate the random component of the model, the ei in
the model
If the model actually
fits the data well, the residuals should appear randomly
distributed and not have any patterns. In this case the standard
deviation of the residuals estimates
. (Actually we
divide by n-2 instead of n-1 in the standard deviation formula. This
estimate of
is called "Standard Error" in the output above.)
Usually we plot the residuals to look for
a) outliers (unusual points)
b) patterns which might indicate curvature
c) patterns which suggest that the variance of the errors ei changes with the x variable
If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In the case of outliers, we might decide to delete them from the data set. If there seems to be curvature, then a straight line fit is not appropriate. Transforming the y value (like log(y)) sometimes helps get rid of the curvature.
Note, though, that the residuals always have a mean of zero due to the the least squares fitting process, and therefore we often draw a reference horizontal line at y=0. An example with the climate data is as follows. (If you are not still in lm, then type lm(climate) followed by model model jan = lat. Then
rplot by lat
Note the two outliers in the upper right portion of the plot. We discussed those two cities in M-Lab 1. Another interesting residual plot, it is the residuals versus predicted values:
rplot by predict
This plot is often used to see if the variation about the center line is greater for larger values of the predicted Y's. If so, then a transformation like log(y) is often suggested. There is no evidence of a need for a transformation in this example.
You can place a variable named resid that contains the residuals into the main workspace by typing (please do this)
output r=resid
Similarly you can place a variable named predict that contains the predicted values into the main workspace by typing (please do this)
output p=predict
To get out of
lm,
just hit return.
For more information about
lm
type
help lm
once you are out of
lm.
Just for fun, copy over the following commands in one piece to see how you can use the resid and predict vectors outside lm.
subplot(2,1,1) plot(climate.lat,climate.jan,'*') hold on plot(climate.lat,predict) hold off subplot(2,1,2) plot(climate.lat,resid,'*')
(Hint: look at the actual degrees granted in those years versus the estimated number.)
z=insulate
plot(z.temp,z.gas,'*')
Since the there appears to be two distinct groups, lets make two data sets depending on the value of z.insulation:
z0=substr(insulate,'insulation=0')
z1=substr(insulate,'insulation=1')
Now let's plot them and put least squares lines through them using lsline.
plot(z1.temp,z1.gas,'*')
lsline
hold on
plot(z0.temp,z0.gas,'+')
lsline
hold off
Find the least squares lines for z0 and z1 using lm and compare them. What conclusions can you draw about insulating a house?
lm>> subplot(2,1,1) lm>> pplot by dist lines=1 lm>> subplot(2,1,2) lm>> rplot by distWrite down the least squares equation. Do either of the two plots indicate that a straight line is inappropriate or that there are outliers or unusual patterns?
>> viewdata(us_age,3) Obs all f m year 1 36.4 37.6 35.0 1999 2 36.2 37.5 34.9 1998 3 36.1 37.4 34.7 1997It contains the average age for all Americans (all), females (f), and males (m) for the years 1990-1999. Actually, the data from the US Census Bureau are based on the 1990 census and then updated yearly. First plot the data by copying and pasting
plot(us_age.year,us_age.all,'*',us_age.year,us_age.f,'+',... us_age.year,us_age.m,'o') axis([1989 2000 33 39])(Recall that the dots ... allow one to continue to the next line in matlab. Also, you can put multiple plots on the same graph using one plot statement.) Find separate least squares fits of average age to year for females and males. State the meaning of the slopes in this situation (e.g., what happens to the mean age if you change the year from one value to another?). Is one group lengthening its age faster than the other? To actually test this hypothesis, look at the slope and p-value for a least squares fit of the age difference between females and males to year.
z=us_pop
z1=subset(z,'(obs#>70)')
z2=subset(z,'(obs#<31)')
Use lm to get least squares lines for the two data sets. Recall that after the model statement, pplot by year lines=1 will show whether a straight line is appropriate. Then explain whether the straight lines are appropriate in these cases and how the two growth rates compare. These data are from the US Census Bureau website.