Lab 5: Curve Fitting in Factorial Studies

Experiments with factorial treatment structure often have some factors with quantitative levels, that is levels on a numerical scale like diameter or area rather than qualitative levels like ``male'' and ``female'' or ``poor,'' ``average,'' and ``high'' quality. For numerical factors the techniques introduced in Lab 4 (line fitting) are more powerful than the means comparison techniques of Lab 3 (ANOVA). Thus in this lab we want to consider data sets with at least some numerical factors and show how to take advantage of those numerical factors in the analysis.

To make it as simple as possible, we will assume that all qualitative factors have only two levels which have been converted to levels named 0 and 1. For example, suppose that we have the usual numerical response variable Y, one numerical variable tex2html_wrap_inline61 and a qualitative variable tex2html_wrap_inline63 with values 0 and 1. The linear models we consider are

displaymath57

or

equation27

The tex2html_wrap_inline65 term in (2) is an interaction term, but simpler to interpret here than in Lab 3.

Now when tex2html_wrap_inline67, (1) simplifies to

equation36

the model for a simple straight line considered in Lab 4. When tex2html_wrap_inline69, (1) becomes

displaymath57

in other words a line with the same slope tex2html_wrap_inline71 as in (3) but with intercept tex2html_wrap_inline73. So fitting the first model results in two parallel lines.

Similarly, equation (2) simplifies to (3) when tex2html_wrap_inline67, but when tex2html_wrap_inline69 it becomes

equation43

a line with a different slope and intercept from (3). Thus fitting model (2) results in two completely different lines with different slopes and intercepts.

In the first section below we consider this kind of example. In the second section we consider an experiment with several numerical factors.

Modeling FTP Times on the Internet

Several students studied the relationship between file size (in bytes) and transfer times (in seconds) using ftp (File Transfer Protocol) to retrieve files from several Internet locations. At each location three different files were transferred 5 times, and the transfer times were recorded. The 5 times for each file are not true experimental replications but rather are repeated measurements used to reduce variability due to network traffic. Thus we average over the 5 repetitions to obtain

ftp.exp

    size    time loc
1   1283   0.206   0
2   1255   0.848   1
3  21463   1.428   0
4  21463   8.000   1
5 367343  14.000   0
6 367037 158.000   1
This reduction to a smaller data set is not necessary for the analysis, but without it one has to be careful about interpretation of some of the output from lm (especially p-values). Thus we have found it wise to use the smaller data set. Note also that the files were chosen in order to have one small file, one medium file, and one large file from each site. One could then use ANOVA methods with those qualitative levels, but here we will use the actual sizes instead.

The first thing to do is to plot the data. For example, try a few of the following plots to decide what measurement scale to work in:

ftp.exp -> z
set.panel(2,2)
lplot(z$size,z$time,z$loc)
lplot(log(z$size),z$time,z$loc)
lplot(log(z$size),log(z$time),z$loc)
lplot(sqrt(z$size),sqrt(z$time),z$loc)

After viewing these, we decided to use a log scale in both the time and size variable. We now use lm to fit models (1) and (2) above:

lm(log(time)~ log(size)+loc,data=ftp.exp)->out1
summary(out1)

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -7.8485   0.6369   -12.3222   0.0012
  log(size)   0.8336   0.0606    13.7470   0.0008
        loc   1.8603   0.2806     6.6295   0.0070

Residual standard error: 0.3437 on 3 degrees of freedom
Multiple R-Squared: 0.9873

lm(log(time)~log(size)+loc+loc*log(size),data=ftp.exp)->out2
summary(out2)

Coefficients:
                 Value Std. Error  t value Pr(>|t|)
  (Intercept)  -6.9731   0.5982   -11.6568   0.0073
    log(size)   0.7459   0.0584    12.7743   0.0061
          loc   0.1167   0.8442     0.1382   0.9027
loc:log(size)   0.1747   0.0824     2.1202   0.1681

Residual standard error: 0.2336 on 2 degrees of freedom
Multiple R-Squared: 0.9961
Notice that both models have very high values, .9873 and 9961, but that the p-value for the interaction term loc:log(size) is not too small, 0.1681. So on the basis of the these two outputs, model (1) without the interaction seems adequate here. Recall that ``no interaction'' means parallel lines but different intercepts. To visually confirm our model we will overlay the fitted lines over the data:

set.panel(2,1)
lplot(log(z$size),log(z$time),z$loc)
title("Log(time) versus Log(size) for Two Internet Sites")
-7.8485+0.8336*log(z$size)->y0log
lines(log(z$size),y0log)
(-7.8485+1.8603)+0.8336*log(z$size)->y1log
lines(log(z$size),y1log)

and plot the residuals:

lplot(log(z$size),out1$residuals,z$loc)
yline(0)
title("Residuals from Parallel Line Fit vs. Log(Size)")

These plots show that the fits are pretty good although the residual plots show some slight hint of the need for a quadratic term at each site (but with reverse signs). We would need more data to consider these more complicated quadratic models.

To get further help on overlaying lines and curves, click on overlaying lines tutorial.


An Experiment with Capacitors

The general equation used to calculate the capacitance of a capacitor is

where k is a property of the dielectic, A is the area of the metal, and d is the distance of separation. From this equation it would appear that only these three variables determine the capacitance.

However, there is fringing of the electric field between two metal plates with opposite electric charges. Thus a team of students proposed that the capacitance might depend weakly on the shape of the capacitor. They built five shapes of capacitors with approximately the same area for three basic area sizes. The data are as follows:

capac.shape

   capac shape area perim discont area.ch
 1 0.141   cir    9  10.6       0       9
 2 0.174  poly    9  10.9       8       9
 3 0.165  star    9  17.2      16       9
 4 0.196   tri    9  13.7       3       9
 5 0.198   squ    9  12.0       4       9
 6 0.218   cir   16  14.2       0      16
 7 0.327  poly   16  14.6       8      16
 8 0.324  star   16  22.9      16      16
 9 0.397   tri   16  18.2       3      16
10 0.412   squ   16  16.0       4      16
11 0.668   cir   25  17.7       0      25
12 0.571  poly   25  18.2       8      25
13 0.689  star   25  28.6      16      25
14 0.637   tri   25  22.8       3      25
15 0.788   squ   25  20.0       4      25
The five shapes are circle, octagon ( poly), star, triangle, and square. A straightforward comparison of means based on shape and area.ch was suggested in problem 4 of Lab 3. Here we want to use the numerical variables area ( area), perimeter( perim) and the number of discontinuities ( discont) to see if we can refine the analysis.

After making some preliminary plots we tried

lm(capac~ area+perim+discont,data=capac.shape)->capac.out1
summary(capac.out1)

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.1633  0.0744    -2.1958  0.0505
       area  0.0279  0.0056     4.9966  0.0004
      perim  0.0063  0.0090     0.6992  0.4989
    discont -0.0026  0.0053    -0.4965  0.6293

Residual standard error: 0.07315 on 11 degrees of freedom
Multiple R-Squared: 0.9148
Unfortunately, there does not seem to be much effect due to the size of the perimeter or to the number of discontinuities (from looking at the high p-values .4989 and .6293). We could also try interactions among the variables, but when the variables are not individually important, it is usually the case that multiplying them with other variables does not add much to the fit. Thus we next tried the very simple model with just area as an explanatory variable:

lm(capac~area,data=capac.shape)->capac.out2

which results in an of .911. Since this is quite close to the .9148 of the first run, we have another confirmation that perim and discont are not important. A plot of residuals

plot(capac.shape$area,capac.out2$residuals)

suggests that a squared term might help the fit:

lm(capac~area+area**2,data=capac.shape)->capac.out3

Since the output looked a bit odd, we decided to center the area variable before squaring it:

capac.shape$area-16->capac.shape$area.m
lm(capac~area.m+area.m**2,data=capac.shape)->capac.out3
summary(capac.out3)

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  0.3356  0.0292    11.5054  0.0000
     area.m  0.0292  0.0029    10.1000  0.0000
I(area.m^2)  0.0009  0.0006     1.5666  0.1432

Residual standard error: 0.06522 on 12 degrees of freedom
Multiple R-Squared: 0.9261
The squared term increases the from .911 to .926 which is not much of an improvement.


On Your Own

  1. Analyze the data in magnet using volt and turns to explain the magnetic force. See Problem 3 of Lab 3 for a means analysis. Here we want to take advantage of the fact that both of the independent variables are numerical. Even so, a good display is mplot introduced in Lab 3:

    magnet -> z
    mplot(z$force,z$volt,z$turns,both=T)

    Now fit models motivated by this display. For your final model, plot the data with

    lplot(z$turns,z$force,z$volt)

    and overlay the fitted model: first create a grid of values for the turns variable

    seq(100,300,,50) -> x

    and then create a vector y1.5 for your final model at volt =1.5 and a vector y3.0 at volt=3.0 for these x values; e.g., a linear fit from lm placed in out (i.e., lm(force~turns+volt,z)->out ) would be handled by

    out$coeff[1]+out$coeff[2]*x+out$coeff[3]*1.5->y1.5
    out$coeff[1]+out$coeff[2]*x+out$coeff[3]*3.0->y3.0

    (Recall that out$coeff[1] is the intercept estimate, out$coeff[2] is the slope estimate associated with turns, and out$coeff[3] is the slope estimate associated with volt. They are in the order given in the lm statement. Just type out$coeff to see this.) Then add the lines by

    lines(x,y1.5)
    lines(x,y3.0)

    To get further help on overlaying lines and curves, click on overlaying lines tutorial.

  2. The data set popcorn.n is the data dataset popcorn with the factor levels changed to numbers so that a linear model can be easily found. Start with

    lm(volume~brand+temp+quantity+shake+brand*temp
    +brand*quantity+brand*shake+temp*quantity+
    temp*shake+quantity*shake,data=popcorn.n) -> pop.out1
    anova(pop.out1)

    Throw out the terms with high p-values and refit with lm. This time use summary and plot the residuals. Consider whether other terms can be dropped out and whether log(volume) might be a better response variable.

  3. Analyze the data set solar.cells using the numerical variables cell, light, and distance as factors with volt as the response variable. (Look back at Lab 3, problem 5, for comparison.)

  4. The data set wire.resist contains the results of an experiment to study the relationship between the resistance of a wire (Y, in ohms) and its gauge ( , in AWG units) and its length ( , in feet). Both factors are numerical and have three values each. For simplicity of typing, first put the data into z with wire.resist -> z. Now type z to print out the data and

    mplot(z$resistance,z$gauge,z$length,both=T)

    to see it displayed. Alternative displays could be made with

    lplot(z$resistance,z$gauge,z$length)
    lplot(z$resistance,z$length,z$gauge)

    but mplot is good for factors with only a few values. Next fit a linear model

    lm(resistance~gauge+length,z)->out1

    and look at the output with summary(out1), and plot the residuals versus gauge and length with

    lplot(z$gauge,out1$resid,z$length)
    lplot(z$length,out1$resid,z$gauge)

    Decide whether to add quadratic terms and/or interaction terms. Another approach might be to fit a full quadratic model (with terms ) and drop out terms which are not important. Using either approach, write down your final model and predict the resistance of a wire have gauge=17 AWG and length=25 feet.

  5. In problem 4 of Lab 4, we fit separate lines to the data in insulate.dat by first breaking the data set into two pieces. Here we want to perform a similar fit with just one lm statement:

    insulate.dat -> z
    lm(gas ~ temp + insulation + temp*insulation,z)->out1

    By fitting it in this fashion, we can look at one of the coefficients and decide whether we need separate slope estimates for the before and after parts of the data or whether a single common slope estimate will suffice for both parts of the data. Which coefficient do we need to look at to tell us this information? What is your final fitted model? Write down the final models for the before and after data by plugging in insulation=0 and insulation=1 into your final fitted model.