Experiments with factorial treatment structure often have some factors with quantitative levels, that is levels on a numerical scale like diameter or area rather than qualitative levels like ``male'' and ``female'' or ``poor,'' ``average,'' and ``high'' quality. For numerical factors the techniques introduced in Lab 4 (line fitting) are more powerful than the means comparison techniques of Lab 3 (ANOVA). Thus in this lab we want to consider data sets with at least some numerical factors and show how to take advantage of those numerical factors in the analysis.
To make it as simple as possible,
we will assume that all qualitative factors have only two levels which
have been converted to levels named 0 and 1. For example, suppose that we
have the usual numerical response variable Y, one numerical variable
and a qualitative variable
with values 0 and 1. The linear models we consider are
or
The
term in (2) is an interaction term, but simpler to
interpret here than in Lab 3.
Now when
,
(1) simplifies to
the model for a simple straight line considered in Lab 4.
When
,
(1) becomes
in other words a line with the same slope
as in (3) but with intercept
.
So fitting the first model results in two parallel lines.
Similarly, equation (2) simplifies to (3) when
,
but when
it becomes
a line with a different slope and intercept from (3). Thus fitting model (2) results in two completely different lines with different slopes and intercepts.
In the first section below we consider this kind of example. In the second section we consider an experiment with several numerical factors.
Several students studied the relationship between file size (in bytes) and transfer times (in seconds) using ftp (File Transfer Protocol) to retrieve files from several Internet locations. At each location three different files were transferred 5 times, and the transfer times were recorded. The 5 times for each file are not true experimental replications but rather are repeated measurements used to reduce variability due to network traffic. Thus we average over the 5 repetitions to obtain
ftp.exp
size time loc 1 1283 0.206 0 2 1255 0.848 1 3 21463 1.428 0 4 21463 8.000 1 5 367343 14.000 0 6 367037 158.000 1This reduction to a smaller data set is not necessary for the analysis, but without it one has to be careful about interpretation of some of the output from lm (especially p-values). Thus we have found it wise to use the smaller data set. Note also that the files were chosen in order to have one small file, one medium file, and one large file from each site. One could then use ANOVA methods with those qualitative levels, but here we will use the actual sizes instead.
The first thing to do is to plot the data. For example, try a few of the following plots to decide what measurement scale to work in:
ftp.exp -> z
set.panel(2,2)
lplot(z$size,z$time,z$loc)
lplot(log(z$size),z$time,z$loc)
lplot(log(z$size),log(z$time),z$loc)
lplot(sqrt(z$size),sqrt(z$time),z$loc)
After viewing these, we decided to use a log scale in both the time and size variable. We now use lm to fit models (1) and (2) above:
lm(log(time)~
log(size)+loc,data=ftp.exp)->out1
summary(out1)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -7.8485 0.6369 -12.3222 0.0012
log(size) 0.8336 0.0606 13.7470 0.0008
loc 1.8603 0.2806 6.6295 0.0070
Residual standard error: 0.3437 on 3 degrees of freedom
Multiple R-Squared: 0.9873
lm(log(time)~log(size)+loc+loc*log(size),data=ftp.exp)->out2
summary(out2)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -6.9731 0.5982 -11.6568 0.0073
log(size) 0.7459 0.0584 12.7743 0.0061
loc 0.1167 0.8442 0.1382 0.9027
loc:log(size) 0.1747 0.0824 2.1202 0.1681
Residual standard error: 0.2336 on 2 degrees of freedom
Multiple R-Squared: 0.9961
Notice that both models have very high
set.panel(2,1)
lplot(log(z$size),log(z$time),z$loc)
title("Log(time) versus Log(size) for Two Internet Sites")
-7.8485+0.8336*log(z$size)->y0log
lines(log(z$size),y0log)
(-7.8485+1.8603)+0.8336*log(z$size)->y1log
lines(log(z$size),y1log)
and plot the residuals:
lplot(log(z$size),out1$residuals,z$loc)
yline(0)
title("Residuals from Parallel Line Fit vs. Log(Size)")
These plots show that the fits are pretty good although the residual plots show some slight hint of the need for a quadratic term at each site (but with reverse signs). We would need more data to consider these more complicated quadratic models.
To get further help on overlaying lines and curves, click on overlaying lines tutorial.
The general equation used to calculate the capacitance of a capacitor is
where k is a property of the dielectic, A is the area of the metal, and d is the distance of separation. From this equation it would appear that only these three variables determine the capacitance.
However, there is fringing of the electric field between two metal plates with opposite electric charges. Thus a team of students proposed that the capacitance might depend weakly on the shape of the capacitor. They built five shapes of capacitors with approximately the same area for three basic area sizes. The data are as follows:
capac.shape
capac shape area perim discont area.ch 1 0.141 cir 9 10.6 0 9 2 0.174 poly 9 10.9 8 9 3 0.165 star 9 17.2 16 9 4 0.196 tri 9 13.7 3 9 5 0.198 squ 9 12.0 4 9 6 0.218 cir 16 14.2 0 16 7 0.327 poly 16 14.6 8 16 8 0.324 star 16 22.9 16 16 9 0.397 tri 16 18.2 3 16 10 0.412 squ 16 16.0 4 16 11 0.668 cir 25 17.7 0 25 12 0.571 poly 25 18.2 8 25 13 0.689 star 25 28.6 16 25 14 0.637 tri 25 22.8 3 25 15 0.788 squ 25 20.0 4 25The five shapes are circle, octagon ( poly), star, triangle, and square. A straightforward comparison of means based on shape and area.ch was suggested in problem 4 of Lab 3. Here we want to use the numerical variables area ( area), perimeter( perim) and the number of discontinuities ( discont) to see if we can refine the analysis.
After making some preliminary plots we tried
lm(capac~
area+perim+discont,data=capac.shape)->capac.out1
summary(capac.out1)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -0.1633 0.0744 -2.1958 0.0505
area 0.0279 0.0056 4.9966 0.0004
perim 0.0063 0.0090 0.6992 0.4989
discont -0.0026 0.0053 -0.4965 0.6293
Residual standard error: 0.07315 on 11 degrees of freedom
Multiple R-Squared: 0.9148
Unfortunately, there does not seem to be much effect due to the size of
the perimeter or to the number of discontinuities (from looking at the
high p-values .4989 and .6293). We could also try interactions among the
variables, but when the variables are not individually important, it is
usually the case that multiplying them with other variables does not add
much to the fit.
Thus we next tried the
very simple model with just
area
as an explanatory variable:
lm(capac~area,data=capac.shape)->capac.out2
which results in an
of .911. Since this
is quite close to the
.9148 of the first run, we have another confirmation that
perim and
discont are not important. A plot of residuals
plot(capac.shape$area,capac.out2$residuals)
suggests that a squared term might help the fit:
lm(capac~area+area**2,data=capac.shape)->capac.out3
Since the output looked a bit odd, we decided to center the area variable before squaring it:
capac.shape$area-16->capac.shape$area.m
lm(capac~area.m+area.m**2,data=capac.shape)->capac.out3
summary(capac.out3)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.3356 0.0292 11.5054 0.0000
area.m 0.0292 0.0029 10.1000 0.0000
I(area.m^2) 0.0009 0.0006 1.5666 0.1432
Residual standard error: 0.06522 on 12 degrees of freedom
Multiple R-Squared: 0.9261
The squared term increases the
magnet -> z
mplot(z$force,z$volt,z$turns,both=T)
Now fit models motivated by this display. For your final model, plot the data with
lplot(z$turns,z$force,z$volt)
and overlay the fitted model: first create a grid of values for the turns variable
seq(100,300,,50) -> x
and then create a vector y1.5 for your final model at volt =1.5 and a vector y3.0 at volt=3.0 for these x values; e.g., a linear fit from lm placed in out (i.e., lm(force~turns+volt,z)->out ) would be handled by
out$coeff[1]+out$coeff[2]*x+out$coeff[3]*1.5->y1.5
out$coeff[1]+out$coeff[2]*x+out$coeff[3]*3.0->y3.0
(Recall that out$coeff[1] is the intercept estimate, out$coeff[2] is the slope estimate associated with turns, and out$coeff[3] is the slope estimate associated with volt. They are in the order given in the lm statement. Just type out$coeff to see this.) Then add the lines by
lines(x,y1.5)
lines(x,y3.0)
To get further help on overlaying lines and curves, click on overlaying lines tutorial.
lm(volume~brand+temp+quantity+shake+brand*temp
+brand*quantity+brand*shake+temp*quantity+
temp*shake+quantity*shake,data=popcorn.n) -> pop.out1
anova(pop.out1)
Throw out the terms with high p-values and refit with lm. This time use summary and plot the residuals. Consider whether other terms can be dropped out and whether log(volume) might be a better response variable.
mplot(z$resistance,z$gauge,z$length,both=T)
to see it displayed. Alternative displays could be made with
lplot(z$resistance,z$gauge,z$length)
lplot(z$resistance,z$length,z$gauge)
but mplot is good for factors with only a few values. Next fit a linear model
lm(resistance~gauge+length,z)->out1
and look at the output with summary(out1), and plot the residuals versus gauge and length with
lplot(z$gauge,out1$resid,z$length)
lplot(z$length,out1$resid,z$gauge)
Decide whether to add
quadratic terms and/or interaction terms. Another approach might be to
fit a full quadratic model (with terms
)
and drop out terms which are not important. Using either approach,
write down your final model
and predict the resistance of a wire have gauge=17 AWG and length=25
feet.
insulate.dat -> z
lm(gas ~ temp + insulation + temp*insulation,z)->out1
By fitting it in this fashion, we can look at one of the coefficients and decide whether we need separate slope estimates for the before and after parts of the data or whether a single common slope estimate will suffice for both parts of the data. Which coefficient do we need to look at to tell us this information? What is your final fitted model? Write down the final models for the before and after data by plugging in insulation=0 and insulation=1 into your final fitted model.