Experiments with factorial treatment structure often have some factors with quantitative levels, that is levels on a numerical scale like diameter or area rather than qualitative levels like ``male'' and ``female'' or ``poor,'' ``average,'' and ``high'' quality. For numerical factors the techniques introduced in Lab 4 (line fitting) are more powerful than the means comparison techniques of Lab 3 (ANOVA). Thus in this lab we want to consider data sets with at least some numerical factors and show how to take advantage of those numerical factors in the analysis.
To make it as simple as possible,
we will assume that all qualitative factors have only two levels which
have been converted to levels named 0 and 1. For example, suppose that we
have the usual numerical response variable Y, one numerical variable
and a qualitative variable
with values 0 and 1. The linear models we consider are
or
The
term in (2) is an interaction term, but simpler to
interpret here than in Lab 3.
Now when
,
(1) simplifies to
the model for a simple straight line considered in Lab 4.
When
,
(1) becomes
in other words a line with the same slope
as in (3) but with intercept
.
So fitting the first model results in two parallel lines.
Similarly, equation (2) simplifies to (3) when
,
but when
it becomes
a line with a different slope and intercept from (3). Thus fitting model (2) results in two completely different lines with different slopes and intercepts.
In the first section below we consider this kind of example. In the second section we consider an experiment with several numerical factors.
Several students studied the relationship between file size (in bytes) and transfer times (in seconds) using ftp (File Transfer Protocol) to retrieve files from several Internet locations. At each location three different files were transferred 5 times, and the transfer times were recorded. The 5 times for each file are not true experimental replications but rather are repeated measurements used to reduce variability due to network traffic. Thus we average over the 5 repetitions to obtain
viewdata(ftp)
Obs size time loc 1 1283 0.206 0 2 1255 0.848 1 3 21463 1.428 0 4 21463 8.000 1 5 367343 14.000 0 6 367037 158.000 1This reduction to a smaller data set is not necessary for the analysis, but without it one has to be careful about interpretation of some of the output from lm (especially p-values). Thus we have found it wise to use the smaller data set. Note also that the files were chosen in order to have one small file, one medium file, and one large file from each site. One could then use ANOVA methods with those qualitative levels, but here we will use the actual sizes instead.
The first thing to do is to plot the data. For example, try the following plots to decide what measurement scale to work in. (Note that you can cut and paste the following block of code directly into MATLAB. Hit return to make sure the last command is executed.)
z=ftp
subplot(2,2,1)
lplot(z.size,z.time,z.loc)
title('time vs. size')
subplot(2,2,2)
lplot(log(z.size),z.time,z.loc)
title('time vs. log(size)')
subplot(2 ,2 ,3)
lplot(log(z.size),log(z.time),z.loc)
title('log(time) vs. log(size)')
subplot(2,2,4)
lplot(sqrt(z.size),sqrt(z.time),z.loc)
title('sqrt(time) vs. sqrt(size)')
After viewing these, we decided to use a log scale in both the time and size variable. We now use lm to fit models (1) and (2) above:
First Model:
>> lm(z) Variable Names in Input Dataset size time loc Enter class or model statement or type ex for examples lm>> class loc Enter model statement lm>> model log(time)=log(size) +loc Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 11.73930 11.73930 99.3963 0.00214740 log(size) 1 22.28420 22.28420 188.6790 0.00083495 loc 1 5.19090 5.19090 43.9508 0.00699110 Error 3 0.35432 0.11811 R-square 0.98727 Standard Error 0.34367 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept -7.84850 0.636940 -12.3222 0.00115130 log(size) 0.83356 0.060636 13.7470 0.00083299 loc 1.86030 0.280600 6.6295 0.00699110
Second model:
lm>> model log(time)=log(size)+loc +loc*log(size) Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 11.73930 11.73930 215.2031 0.0046146 log(size) 1 22.28420 22.28420 408.5092 0.0024390 loc 1 5.19090 5.19090 95.1579 0.0103460 loc*log(size) 1 0.24522 0.24522 4.4953 0.1680800 Error 2 0.10910 0.05455 R-square 0.99608 Standard Error 0.23356 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept -6.97310 0.598200 -11.65680 0.0072791 log(size) 0.74586 0.058387 12.77430 0.0060723 loc 0.11669 0.844180 0.13823 0.9027200 loc*log(size) 0.17474 0.082418 2.12020 0.1680800Notice that both models have very high
lm>> subplot(2,1,1)and plot the residuals:
lm>> pplot by log(size)
lm>> subplot(2,1,2)These plots show that the fits are pretty good although the residual plots show some slight hint of the need for quadratic terms. We would need more data, though, to consider these more complicated quadratic models.
lm>> rplot by log(size)
The general equation used to calculate the capacitance of a capacitor is
where k is a property of the dielectic, A is the area of the metal, and d is the distance of separation. From this equation it would appear that only these three variables determine the capacitance.
However, there is fringing of the electric field between two metal plates with opposite electric charges. Thus a team of students proposed that the capacitance might depend weakly on the shape of the capacitor. They built five shapes of capacitors with approximately the same area for three basic area sizes. The data are as follows:
viewdata(capac)
Obs capac shape perim area discont 1 0.141 cir 10.6 9 0 2 0.174 poly 10.9 9 8 3 0.165 star 17.2 9 16 4 0.196 tri 13.7 9 3 5 0.198 squ 12.0 9 4 6 0.218 cir 14.2 16 0 7 0.327 poly 14.6 16 8 8 0.324 star 22.9 16 16 9 0.397 tri 18.2 16 3 10 0.412 squ 16.0 16 4 11 0.668 cir 17.7 25 0 12 0.571 poly 18.2 25 8 13 0.689 star 28.6 25 16 14 0.637 tri 22.8 25 3 15 0.788 squ 20.0 25 4The five shapes are circle, octagon ( poly), star, triangle, and square. A straightforward comparison of means based on shape and area was suggested in problem 4 of Lab 3. Here we want to use the numerical variables area (area ), perimeter ( perim) and the number of discontinuities ( discont) to see if we can refine the analysis.
After making some preliminary plots we tried
lm>> lm(capac)Unfortunately, there does not seem to be much effect due to the size of the perimeter or to the number of discontinuities (from looking at the high p-values .4989 and .6293). We could also try interactions among the variables, but when the variables are not individually important, it is usually the case that multiplying them with other variables does not add much to the fit. Thus we next tried the very simple model with just area as an explanatory variable. Continuing within lm,
lm>> model capac = area+perim+discont Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 2.3246000 2.3246000 434.47090 3.4287e-10 area 1 0.6293900 0.6293900 117.63400 3.2653e-07 perim 1 0.0013167 0.0013167 0.24609 0.62960000 discont 1 0.0013189 0.0013189 0.24650 0.62932000 Error 11 0.0588550 0.0053504 R-square 0.91481 Standard Error 0.073147 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept -0.1632700 0.0743580 -2.19580 0.05045600 area 0.0279300 0.0055898 4.99660 0.00040466 perim 0.0062768 0.0089768 0.69922 0.49893000 discont -0.0026379 0.0053130 -0.49649 0.62932000
lm>> model capac = area Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 2.32460 2.32460 491.4576 1.0323e-11 area 1 0.62939 0.62939 133.0632 3.3511e-08 Error 13 0.06149 0.00473 R-square 0.911 Standard Error 0.068775 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept -0.127640 0.0485560 -2.6287 0.02084000 area 0.031278 0.0027115 11.5353 3.3511e-08which results in an
lm>> clf lm>> rplot by areasuggests that a squared term might help the fit:
lm>> model capac = area +area^2 Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 2.324600 2.3246000 546.4379 2.2418e-11 area 1 0.629390 0.6293900 147.9493 4.1587e-08 area^2 1 0.010441 0.0104410 2.4543 0.14318000 Error 12 0.051049 0.0042541 R-square 0.92611 Standard Error 0.065223 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept 0.09631400 0.15018000 0.641310 0.53338 area 0.00070456 0.01968400 0.035793 0.97204 area^2 0.00089067 0.00056853 1.566600 0.14318The squared term increases the
z= magnet
mplot(z.force,z.volt,z.turns)
Now fit models motivated by this display. For example, lm(z) followed by
model force = volt+turns
or
model force = volt+turns+turns^2
or
model force = volt+turns+turns^2+volt*turns
For your final model, plot the data with the predicted values versus turns: within lm type pplot by turns. Explain what your final model says about the relationship between magnetic force and volt and turns.
mplot(z.resistance,z.gauge,z.length)
to see it displayed. Next fit a linear model
lm(z)
model resistance=gauge+length
look at the output and plot the residuals versus gauge and length with
rplot by gauge
rplot by length
Decide whether to add
quadratic terms and/or interaction terms. Another approach might be to
fit a full quadratic model (with terms
)
and drop out terms which are not important. Using either approach,
write down your final model
and predict the resistance of a wire have gauge=17 AWG and length=25
feet.
lm(insulate)
model gas = temp + insulation + temp*insulation
By fitting it in this fashion, we can look at one of the coefficients and decide whether we need separate slope estimates for the before and after parts of the data or whether a single common slope estimate will suffice for both parts of the data. Which coefficient do we need to look at to tell us this information? What is your final fitted model? Write down the final models for the before and after data by plugging in insulation=0 and insulation=1 into your final fitted model.
pplot by temp lines=1
to see the fitted curve with the data points.
model time=host+graphics+host*graphics
If host is statistically significant, then the four hosts have regression lines with different intercepts. If host*graphics is statistically significant, then the four hosts have regression lines with different slopes as well. So, find the four individual least squares lines by creating four individual data sets. Here is code for the first one:
>> t1.graphics=webhost.graphics(1:4) >> t1.time=webhost.time(1:4) >> lm(t1) lm>> model time=graphicsState any conclusions that you think are appropriate.
a) one regression line (processor size is unimportant)
b) two regression lines with different intercepts but the same slope (processor size is important but the interaction of pr01 with pixels is not)
c) two regression lines with different intercepts and different slopes (pr01, pixels, and pr01*pixels are all important).
Pick the most appropriate model and report the least squares line or
lines. Within
lm plot
the residuals from your chosen fit using
lm>> rplot by pixels
What do you think needs to be done to improve the fit?