Take-home final exam
Half of your final exam grade will be a take-home analysis of the LA pollution data described
below. Your paper should be 5-7 pages long (double-spaced, including all figures
and tables). Your paper is due at the time of the final exam, 12/14 at 1pm. Papers will not be
accepted after 12/14. The paper should be written in complete sentences and paragraphs
in language appropriate for non-statisticians.
Since this is an exam, collaboration with other students is not allowed.
The Los Angeles pollution data was downloaded from the
NMMAPS database. The data set contains daily values (from 1/1/1987 to 12/31/2000) of the following variables:
- Date: MM/DD/YYYY
- Day of week: 1=Sunday, 2=Monday, ..., 7=Saturday
- Season: 1=Winter, 2=Spring, 3=Summer, 4=Fall
- Deaths: Number of cardiovascular deaths in LA
- Deaths_Detrended: Residuals from a generalized additive model (GAM) with outcome Deaths and a spline function of time with 10 degrees of freedom per year
- Temp: Daily average temperature
- RelHumid: Daily average relative humidity
- O2: Daily average ozone
- O2_prev_day: Previous day's O2
- O2_wk_ave: Average O2 over the past 7 days
- CO: Daily average carbon monoxide
- CO_prev_day: Previous day's CO
- CO_wk_ave: Average CO over the past 7 days
The pollution variables O2 and CO have been centered to have mean zero.
Several days have missing data; you may discard these days.
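As an illustration of discarding days with missing data, here is a minimal sketch in Python. It is only one possible approach, and you may use any software you like. It assumes the data are stored as comma-separated text with a header row that uses the variable names above, and that missing values appear as empty fields or as "NA"; both are assumptions, so check them against the actual file. The three rows in SAMPLE are made up and are not taken from the LA data, and load_complete_days is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime

# Hypothetical three-day excerpt; the values are made up for illustration.
SAMPLE = """Date,Deaths_Detrended,Temp,RelHumid,O2,CO
01/01/1987,4.2,58.0,60.5,-1.3,0.8
01/02/1987,-2.1,,55.0,0.4,NA
01/03/1987,0.7,61.5,48.0,2.0,-0.5
"""

MISSING = {"", "NA", "NaN"}  # assumed codes for missing values


def load_complete_days(handle):
    """Return one dict per day, skipping any day with a missing field."""
    days = []
    for row in csv.DictReader(handle):
        if any((value or "").strip() in MISSING for value in row.values()):
            continue  # discard the whole day, as the assignment allows
        # Dates are given as MM/DD/YYYY.
        day = {"Date": datetime.strptime(row["Date"], "%m/%d/%Y").date()}
        for name, value in row.items():
            if name != "Date":
                day[name] = float(value)
        days.append(day)
    return days


days = load_complete_days(io.StringIO(SAMPLE))
print(len(days), "complete days kept")
```

With the real file, the in-memory SAMPLE text would be replaced by an open file handle. It is worth reporting in your paper how many days were discarded.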
Your objective is to determine if ambient air pollution levels are associated with cardiovascular
mortality. You should use "Deaths_Detrended" (possibly transformed) as the outcome in a multiple
linear regression analysis. You should include the following sections:
- Introduction: Describe the scientific problem, the data, and your
objectives. Outline the remainder of the paper.
- Variable selection: Determine the subset of predictors you will use to model
mortality. You may use any of the variables in the data set (other than
"Deaths", of course) as predictors. You may also include polynomial
terms, interactions, and nonparametric curves. The inclusion/exclusion of
variables should be justified with statistical evidence, statistical
algorithms, or scientific reasoning.
- Residual analysis: Include plots and diagnostics to
justify the validity of your final model.
- Results: Present the results of your analysis. Which confounders are
significantly associated with mortality? Are any of the pollution
variables associated with mortality? How well does your model fit the
data? Are your results sensitive to outliers or collinearity?
- Conclusions: Summarize your findings. Discuss any
limitations of the study design or data. Are we missing any important
confounders?
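
The quantities asked about in the Residual analysis and Results sections (coefficient estimates, model fit, collinearity, and influential days) can be computed along the following lines. This is a minimal sketch in Python with NumPy, not a required method or a model answer. The data are simulated stand-ins generated inside the code, not the LA data, and the three predictors shown are an arbitrary choice made only for illustration. The helper name ols is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins for three predictors and the outcome (NOT the LA data).
temp = rng.normal(65.0, 8.0, n)
co = rng.normal(0.0, 1.0, n)
co_prev_day = 0.8 * co + rng.normal(0.0, 0.6, n)  # deliberately correlated with co
deaths_detrended = 0.05 * (temp - 65.0) + 0.5 * co + rng.normal(0.0, 2.0, n)

names = ["Intercept", "Temp", "CO", "CO_prev_day"]
X = np.column_stack([np.ones(n), temp, co, co_prev_day])
y = deaths_detrended
p = X.shape[1]


def ols(X, y):
    """Least-squares fit; returns coefficients, residuals, and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2


beta, resid, r2 = ols(X, y)

# Standard errors and t statistics from the usual OLS formulas.
sigma2 = (resid @ resid) / (n - p)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se

# Variance inflation factor: regress each predictor on the others.
vif = {}
for j in range(1, p):
    _, _, r2_j = ols(np.delete(X, j, axis=1), X[:, j])
    vif[names[j]] = 1.0 / (1.0 - r2_j)

# Leverage (diagonal of the hat matrix) and Cook's distance for each day.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q ** 2, axis=1)
cooks_d = resid ** 2 / (p * sigma2) * leverage / (1.0 - leverage) ** 2

for name, b, s, t in zip(names, beta, se, t_stats):
    print(f"{name:12s} estimate={b:8.3f}  se={s:6.3f}  t={t:6.2f}")
print("R-squared:", round(r2, 3))
print("VIF:", {k: round(v, 2) for k, v in vif.items()})
print("Largest Cook's distance:", round(cooks_d.max(), 4))
```

In this simulation CO and CO_prev_day are constructed to be correlated, so their variance inflation factors come out larger than the one for Temp. Refitting the model without the days that have the largest Cook's distance is one simple way to check whether the results are sensitive to outliers.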