Analysis of the LA pollution data.

The Los Angeles pollution data were downloaded from the NMMAPS database. The data set contains daily values from 1/1/1987 to 12/31/2000. The pollution variables (O3, SO2, NO2, and CO) have been centered to have mean zero. Several days have missing data; you may discard them.
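
Discarding the days with missing data could be done with a short data step. This is only a sketch: it assumes the data set is named la, as in the code in this document, and la_complete is a hypothetical name for the cleaned copy.

```sas
* Sketch only - la_complete is a hypothetical name for the cleaned copy;
data la_complete;
set la;
* cmiss counts missing values across both numeric and character variables;
if cmiss(of _all_) = 0;
run;
```

To use the cleaned copy, pass data=la_complete to the procedures that follow. This drops a day if any variable is missing, including variables a given model does not use, so a stricter or looser rule may be preferable depending on the analysis.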

Our objective is to determine if daily ambient air pollution levels are associated with cardiovascular mortality. To do this, we must first account for confounders such as day of the week, seasonal trends, temperature, and humidity. We will account for these confounders using a generalized additive model (GAM).

First we account for long-term trends (e.g., a flu outbreak) using a GAM with a single predictor, date, smoothed with 100 degrees of freedom (roughly 7 per year over the 14 years of data).

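* Smooth Deaths over Date with a 100-df smoothing spline;
proc gam data=la;
model Deaths = spline(Date,df=100);
* The all keyword writes the predicted values (P_ variables) and the other available statistics to predictedval;
output out=predictedval all;
run;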

title "Raw data vs Predicted values";
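* plot draws the observed deaths on the left axis and plot2 overlays the fitted values on the right axis;
proc gplot data=predictedval;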
plot Deaths*Date;
plot2 P_Deaths*Date;
run;


Next we fit a fuller model. It includes dummy variables for season and day of the week (declared in the class statement and entered as linear terms through param(), as in ordinary regression) as well as nonparametric curves for the long-term trend and temperature.

proc gam data=la;
class Day_of_week season;
model Deaths = param(Day_of_week season) spline(Date,df=100) spline(Temp,df=10);
output out=predictedval all;
run;
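
In symbols, the model fit above can be sketched as follows. This assumes proc gam's default Gaussian distribution with identity link, since no dist= option is given. The notation is introduced here for illustration only: D and S are the day-of-week and season dummy variables, and t indexes days.

```latex
\[
\text{Deaths}_t = \beta_0 + \sum_{j} \gamma_j D_{jt} + \sum_{k} \delta_k S_{kt}
  + f_1(\text{Date}_t) + f_2(\text{Temp}_t) + \varepsilon_t,
\qquad
f_i(x) = \beta_i x + s_i(x), \quad i = 1, 2.
\]
```

Each smooth term is split into a linear part and a nonparametric part. The output variables P_Date and P_Temp hold only the nonparametric parts, which is why the linear parts are added back in the data step that follows.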

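* Add the linear component (x*beta) to the nonparametric component (s(x)) to get the entire fitted curve (x*beta+s(x));
* The slopes below are the linear estimates for Date and Temp from the proc gam fit above and must be updated if the model or data change;
data predictedval;
set predictedval;
fitted_Date = P_Date-0.00379*Date;
fitted_Temp = P_Temp+0.19566*Temp;
run;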
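
The slopes -0.00379 and 0.19566 used above are typed in by hand. One way to look them up is to save the parameter estimates table when fitting the model. This is a sketch under assumptions: the ODS table name ParameterEstimates is taken from the proc gam documentation and can be checked with ods trace on, and gam_pe is a hypothetical data set name.

```sas
* Sketch - gam_pe is a hypothetical data set name;
ods output ParameterEstimates=gam_pe;
proc gam data=la;
class Day_of_week season;
model Deaths = param(Day_of_week season) spline(Date,df=100) spline(Temp,df=10);
output out=predictedval all;
run;

* The rows for the linear parts of the Date and Temp splines hold the slopes;
proc print data=gam_pe;
run;
```

Reading the slopes from the saved table rather than retyping them keeps the fitted curves in step with the model if it is refit.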

title "Raw data vs Predicted values";
proc gplot data=predictedval;
plot Deaths*Date;
plot2 P_Deaths*Date;
run;

title "Estimated smooth function of Temperature";
proc gplot data=predictedval;
plot fitted_Temp*Temp;
run;

title "Estimated smooth function of Date";
proc gplot data=predictedval;
plot fitted_Date*Date;
run;