M-Lab 5: Curve Fitting in Factorial Studies

Experiments with factorial treatment structure often have some factors with quantitative levels, that is levels on a numerical scale like diameter or area rather than qualitative levels like ``male'' and ``female'' or ``poor,'' ``average,'' and ``high'' quality. For numerical factors the techniques introduced in Lab 4 (line fitting) are more powerful than the means comparison techniques of Lab 3 (ANOVA). Thus in this lab we want to consider data sets with at least some numerical factors and show how to take advantage of those numerical factors in the analysis.

To make it as simple as possible, we will assume that all qualitative factors have only two levels which have been converted to levels named 0 and 1. For example, suppose that we have the usual numerical response variable Y, one numerical variable tex2html_wrap_inline61 and a qualitative variable tex2html_wrap_inline63 with values 0 and 1. The linear models we consider are

displaymath57

or

equation27

The tex2html_wrap_inline65 term in (2) is an interaction term, but simpler to interpret here than in Lab 3.

Now when tex2html_wrap_inline67, (1) simplifies to

equation36

the model for a simple straight line considered in Lab 4. When tex2html_wrap_inline69, (1) becomes

displaymath57

in other words a line with the same slope tex2html_wrap_inline71 as in (3) but with intercept tex2html_wrap_inline73. So fitting the first model results in two parallel lines.

Similarly, equation (2) simplifies to (3) when tex2html_wrap_inline67, but when tex2html_wrap_inline69 it becomes

equation43

a line with a different slope and intercept from (3). Thus fitting model (2) results in two completely different lines with different slopes and intercepts.

In the first section below we consider this kind of example. In the second section we consider an experiment with several numerical factors.

Modeling FTP Times on the Internet

Several students studied the relationship between file size (in bytes) and transfer times (in seconds) using ftp (File Transfer Protocol) to retrieve files from several Internet locations. At each location three different files were transferred 5 times, and the transfer times were recorded. The 5 times for each file are not true experimental replications but rather are repeated measurements used to reduce variability due to network traffic. Thus we average over the 5 repetitions to obtain

viewdata(ftp)

Obs size    time loc
1   1283   0.206   0
2   1255   0.848   1
3  21463   1.428   0
4  21463   8.000   1
5 367343  14.000   0
6 367037 158.000   1
This reduction to a smaller data set is not necessary for the analysis, but without it one has to be careful about interpretation of some of the output from lm (especially p-values). Thus we have found it wise to use the smaller data set. Note also that the files were chosen in order to have one small file, one medium file, and one large file from each site. One could then use ANOVA methods with those qualitative levels, but here we will use the actual sizes instead.

The first thing to do is to plot the data. For example, try the following plots to decide what measurement scale to work in. (Note that you can cut and paste the following block of code directly into MATLAB. Hit return to make sure the last command is executed.)

z=ftp
subplot(2,2,1)
lplot(z.size,z.time,z.loc)
title('time vs. size')
subplot(2,2,2)
lplot(log(z.size),z.time,z.loc)
title('time vs. log(size)')
subplot(2 ,2 ,3)
lplot(log(z.size),log(z.time),z.loc)
title('log(time) vs. log(size)')
subplot(2,2,4)
lplot(sqrt(z.size),sqrt(z.time),z.loc)
title('sqrt(time) vs. sqrt(size)')

After viewing these, we decided to use a log scale in both the time and size variable. We now use lm to fit models (1) and (2) above:

First Model:


>> lm(z)

Variable Names in Input Dataset

  size
  time
  loc

Enter class or model statement or type ex for examples
lm>> class loc

Enter model statement


lm>> model log(time)=log(size) +loc

Sequential Sums of Squares ANOVA Table

Source       df          SS          MS           F         p-val

Intercept    1    11.73930    11.73930     99.3963    0.00214740
log(size)    1    22.28420    22.28420    188.6790    0.00083495
loc          1     5.19090     5.19090     43.9508    0.00699110
Error        3     0.35432     0.11811

R-square  0.98727

Standard Error  0.34367

Parameter Estimates

Source       Parameter Estimate    Std. Error           t         p-val

Intercept              -7.84850      0.636940    -12.3222    0.00115130
log(size)               0.83356      0.060636     13.7470    0.00083299
loc                     1.86030      0.280600      6.6295    0.00699110

Second model:


lm>> model log(time)=log(size)+loc +loc*log(size)

Sequential Sums of Squares ANOVA Table

Source           df          SS          MS           F        p-val

Intercept        1    11.73930    11.73930    215.2031    0.0046146
log(size)        1    22.28420    22.28420    408.5092    0.0024390
loc              1     5.19090     5.19090     95.1579    0.0103460
loc*log(size)    1     0.24522     0.24522      4.4953    0.1680800
Error            2     0.10910     0.05455

R-square  0.99608

Standard Error  0.23356

Parameter Estimates

Source           Parameter Estimate    Std. Error            t        p-val

Intercept                  -6.97310      0.598200    -11.65680    0.0072791
log(size)                   0.74586      0.058387     12.77430    0.0060723
loc                         0.11669      0.844180      0.13823    0.9027200
loc*log(size)               0.17474      0.082418      2.12020    0.1680800
Notice that both models have very high values, .9873 and 9961, but that the p-value for the interaction term loc*log(size) is not too small, 0.1681. So on the basis of the these two outputs, model (1) without the interaction seems adequate here. Recall that ``no interaction'' means parallel lines but different intercepts. To visually confirm our model, we will plot the predicted values (circles) along with original points. Continuing within lm

lm>> subplot(2,1,1)
lm>> pplot by log(size)
and plot the residuals:

lm>> subplot(2,1,2)
lm>> rplot by log(size)
These plots show that the fits are pretty good although the residual plots show some slight hint of the need for quadratic terms. We would need more data, though, to consider these more complicated quadratic models.


An Experiment with Capacitors

The general equation used to calculate the capacitance of a capacitor is

where k is a property of the dielectic, A is the area of the metal, and d is the distance of separation. From this equation it would appear that only these three variables determine the capacitance.

However, there is fringing of the electric field between two metal plates with opposite electric charges. Thus a team of students proposed that the capacitance might depend weakly on the shape of the capacitor. They built five shapes of capacitors with approximately the same area for three basic area sizes. The data are as follows:

viewdata(capac)

Obs   capac   shape   perim   area   discont

  1   0.141     cir    10.6      9         0
  2   0.174    poly    10.9      9         8
  3   0.165    star    17.2      9        16
  4   0.196     tri    13.7      9         3
  5   0.198     squ    12.0      9         4
  6   0.218     cir    14.2     16         0
  7   0.327    poly    14.6     16         8
  8   0.324    star    22.9     16        16
  9   0.397     tri    18.2     16         3
 10   0.412     squ    16.0     16         4
 11   0.668     cir    17.7     25         0
 12   0.571    poly    18.2     25         8
 13   0.689    star    28.6     25        16
 14   0.637     tri    22.8     25         3
 15   0.788     squ    20.0     25         4
The five shapes are circle, octagon ( poly), star, triangle, and square. A straightforward comparison of means based on shape and area was suggested in problem 4 of Lab 3. Here we want to use the numerical variables area (area ), perimeter ( perim) and the number of discontinuities ( discont) to see if we can refine the analysis.

After making some preliminary plots we tried


lm>> lm(capac) 
lm>> model capac = area+perim+discont Sequential Sums of Squares ANOVA Table Source df SS MS F p-val Intercept 1 2.3246000 2.3246000 434.47090 3.4287e-10 area 1 0.6293900 0.6293900 117.63400 3.2653e-07 perim 1 0.0013167 0.0013167 0.24609 0.62960000 discont 1 0.0013189 0.0013189 0.24650 0.62932000 Error 11 0.0588550 0.0053504 R-square 0.91481 Standard Error 0.073147 Parameter Estimates Source Parameter Estimate Std. Error t p-val Intercept -0.1632700 0.0743580 -2.19580 0.05045600 area 0.0279300 0.0055898 4.99660 0.00040466 perim 0.0062768 0.0089768 0.69922 0.49893000 discont -0.0026379 0.0053130 -0.49649 0.62932000
Unfortunately, there does not seem to be much effect due to the size of the perimeter or to the number of discontinuities (from looking at the high p-values .4989 and .6293). We could also try interactions among the variables, but when the variables are not individually important, it is usually the case that multiplying them with other variables does not add much to the fit. Thus we next tried the very simple model with just area as an explanatory variable. Continuing within lm,

lm>> model capac = area

Sequential Sums of Squares ANOVA Table

Source       df         SS         MS           F         p-val

Intercept     1    2.32460    2.32460    491.4576    1.0323e-11
area          1    0.62939    0.62939    133.0632    3.3511e-08
Error        13    0.06149    0.00473

R-square  0.911

Standard Error  0.068775

Parameter Estimates

Source       Parameter Estimate    Std. Error          t         p-val

Intercept             -0.127640     0.0485560    -2.6287    0.02084000
area                   0.031278     0.0027115    11.5353    3.3511e-08
which results in an of .911. Since this is quite close to the .9148 of the first run, we have another confirmation that perim and discont are not important. Continuing within lm, A plot of residuals

lm>> clf
lm>> rplot by area
suggests that a squared term might help the fit:

lm>> model capac = area +area^2

Sequential Sums of Squares ANOVA Table

Source       df          SS           MS           F         p-val

Intercept     1    2.324600    2.3246000    546.4379    2.2418e-11
area          1    0.629390    0.6293900    147.9493    4.1587e-08
area^2        1    0.010441    0.0104410      2.4543    0.14318000
Error        12    0.051049    0.0042541

R-square  0.92611

Standard Error  0.065223

Parameter Estimates

Source       Parameter Estimate    Std. Error           t      p-val

Intercept            0.09631400    0.15018000    0.641310    0.53338
area                 0.00070456    0.01968400    0.035793    0.97204
area^2               0.00089067    0.00056853    1.566600    0.14318
The squared term increases the from .911 to .926 which is not much of an improvement. (Note by the way that including the area^2 term makes the p-value for area .972 in the next to the last line. This doesn't mean that area is not important, but that when area^2 is in the model, area itself is not very important. In general, remember that when we include a term like area^2, then we always want to include area as well.)


On Your Own

  1. Analyze the data in magnet using volt and turns to explain the magnetic force. See Problem 3 of Lab 3 for a means analysis. Here we want to take advantage of the fact that both of the independent variables are numerical. Even so, a good display is mplot introduced in Lab 3:

    z= magnet
    mplot(z.force,z.volt,z.turns)

    Now fit models motivated by this display. For example, lm(z) followed by

    model force = volt+turns
    or
    model force = volt+turns+turns^2
    or
    model force = volt+turns+turns^2+volt*turns

    For your final model, plot the data with the predicted values versus turns: within lm type pplot by turns. Explain what your final model says about the relationship between magnetic force and volt and turns.

  2. Analyze the data set solar using the numerical variables cell, light, and distance as factors with volt as the response variable. (Look back at Lab 3, problem 5, for comparison.)

  3. The data set wire contains the results of an experiment to study the relationship between the resistance of a wire (Y, in ohms) and its gauge ( , in AWG units) and its length ( , in feet). Both independent variables are numerical and have three values each. For simplicity of typing, first put the data into z with z=wire. Now type z to print out the data and

    mplot(z.resistance,z.gauge,z.length)

    to see it displayed. Next fit a linear model

    lm(z)
    model resistance=gauge+length

    look at the output and plot the residuals versus gauge and length with

    rplot by gauge
    rplot by length

    Decide whether to add quadratic terms and/or interaction terms. Another approach might be to fit a full quadratic model (with terms ) and drop out terms which are not important. Using either approach, write down your final model and predict the resistance of a wire have gauge=17 AWG and length=25 feet.

  4. In problem 4 of Lab 4, we fit separate lines to the data in insulate by first breaking the data set into two pieces. Here we want to perform a similar fit with just one lm statement:

    lm(insulate)
    model gas = temp + insulation + temp*insulation

    By fitting it in this fashion, we can look at one of the coefficients and decide whether we need separate slope estimates for the before and after parts of the data or whether a single common slope estimate will suffice for both parts of the data. Which coefficient do we need to look at to tell us this information? What is your final fitted model? Write down the final models for the before and after data by plugging in insulation=0 and insulation=1 into your final fitted model.

  5. Using the marathon data set, trying fitting the men's winning times to a linear function of temperature. Within lm make an rplot by temp and see if the residuals suggest the need for a quadratic term. Then fit the quadratic model and give the estimated equation. Is this a better fit than the linear fit? In addition to p values you might look at either a residual plot (rplot by temp) or type

    pplot by temp lines=1

    to see the fitted curve with the data points.

  6. The data set golf has the scores of 195 rounds of golf made by professional golfers and some additional variables that try to explain how their score depends on certain golf skills such as putting and chipping. First read the description of the data set in golf and then try to find a good multiple regression equation relating score and these independent variables. Give your best fit and the associated R^2.

  7. Are free Webhosts all created equal? A student team decided to compare four free hosting services in the spring of 2000: go.com, angelfire.com, geocities.com, and xoom.com. They uploaded four pages: The last page wouldn't load for xoom; so they had 15 data points with load times for the response variable and the number of graphic images for a quantitative independent variable. The data are in webhost. First make an mplot and then run lm with host as a class variable and model statement

    model time=host+graphics+host*graphics

    If host is statistically significant, then the four hosts have regression lines with different intercepts. If host*graphics is statistically significant, then the four hosts have regression lines with different slopes as well. So, find the four individual least squares lines by creating four individual data sets. Here is code for the first one:

    >> t1.graphics=webhost.graphics(1:4)
    >> t1.time=webhost.time(1:4)
    >> lm(t1)
    lm>> model time=graphics
    
    State any conclusions that you think are appropriate.

  8. The data set framerate has dependent variable rate (frames per second) and independent variables pixels (number of pixels, for example 800x600 resolution has 480000 pixels) and pr01 (0=Intel Celeron 333 processor, 1= Pentium II 450 processor). Use lm to decide which of the following models is most appropriate:

    a) one regression line (processor size is unimportant)

    b) two regression lines with different intercepts but the same slope (processor size is important but the interaction of pr01 with pixels is not)

    c) two regression lines with different intercepts and different slopes (pr01, pixels, and pr01*pixels are all important).

    Pick the most appropriate model and report the least squares line or lines. Within lm plot the residuals from your chosen fit using

    lm>> rplot by pixels

    What do you think needs to be done to improve the fit?

  9. The experiment that produced the data set bread2 is explained in the M-Lab 3 exercises. Try to find an appropriate regression model between height and temp and yeast. Write down separate fitted equations for temp = 350 and temp = 475.