Conceptually, we think of a sample of data being a
representative subset of a much larger population of measurements.
For example, the S data set precip.may contains average
precipitation measurements for May in Raleigh from the past 45
years. We will assume that these measurements are representative
of a much larger record of May precipitation. It is this hypothetical
larger
record that we would identify as the population. Based on this setup,
we have the following relationship between populations and samples:
The most important concept based on this table is that the features
of the sample estimate the features of the population. In
particular,
for the precipitation sample is an
estimate of
, the long term average of May precipitation in
Raleigh. Also note that in other labs we have used the connection
between sample histograms and density functions to estimate the
shape of density functions for different types of random variables.
So
is an estimate of
.
But how accurate
is it? When the sample size is large, one can rely on the
central limit theorem to use the error bound
The correct interpretation of this error bound is as follows: on average, intervals of the form
contain
, the population mean, 95% of the time.
(The value 1.96 is obtained from the normal probability distribution and
sets the confidence level at 95%).
Based on the May precipitation data one can compute the 95%
confidence interval of (3.37,4.30). The main goal of this lab is to
investigate whether this statistical measure of accuracy is actually sound.
In order to evaluate how well confidence intervals work, it
will be necessary to apply the method to many data sets. Also, in
order to judge the validity of the interval, we need to know the
value of the population mean. Given these requirements, it should
be obvious that we can not just use the precip.may data
set since this
is just one sample from the population and more importantly, we
don't know
.
The way around this problem is to choose a
probability density function that is similar to the histogram of
the observed data. From the probability density function, one can
simulate as many samples as needed. Also, since the theoretical
distribution is now known, one can work out the precise value for
.
Based on the data precip.may, one could use a normal
distribution with a mean of 3.84 and a standard deviation of
1.6 to approximate the
distribution of the observed data. Below is an example of
simulating a sample from this distribution and calculating the 95%
confidence interval.
rnorm(45,mean = 3.84,sd = 1.6) -> sample
mean(sample) -> m
sd(sample) -> s
m - 1.96*s/sqrt(45) -> ci.lower
m + 1.96*s/sqrt(45) -> ci.upper
You might repeat these steps several times and see whether the interval [ci.lower,ci.upper] actually contains the population mean 3.84. (By the way, if you are not sure why m does not come out to be exactly 3.84, you have missed a basic point and need to reread Section 9.2.)
If you just evaluated a few 95% confidence intervals, it
should not come as a surprise that all of them cover
.
In
order to determine how close the frequency coverage is to 95%, it
is necessary to simulate many data sets. Obviously, we need some
automatic way of generating many samples. Typing the five commands
above several thousand times may get a little boring!
Instead we have made a special S function
simulate.samples to do the repetitions:
simulate.samples(1000,45,DIST=rnorm,mean=3.84,sd=1.6) -> samples
samples$m - 1.96*samples$s/sqrt(45) -> ci.lower
samples$m + 1.96*samples$s/sqrt(45) -> ci.upper
The result is the same as the first example except that now we have confidence intervals from 1000 independent samples in the data sets ci.lower and ci.upper. Note that besides generating the 1000 data sets of 45 observations, simulate.samples also computes the means and standard deviations. These are returned as components with names m and s. The actual data that was simulated is returned in a component named dat. In this case it will be a matrix with 1000 rows and 45 columns. The DIST argument should be the name of the . function that generates random numbers from a particular distribution. So, for example, if you used the command
simulate.samples(100,13,DIST=rgam,alpha=5.76,beta=.667) -> zork
zork$dat would be a matrix of 100 rows and 13 columns,
where each number is randomly generated from a gamma distribution
(see Lab 6) with
and
.
zork$m would contain the means of these 100 samples,
and zork$s would have the standard deviations.
What should one do with the 1000 confidence intervals from
simulated samples? The first step is to see if they really do have
a 95% coverage rate. One way to check this is to have S count the number of
times
is less than the lower limit and the number of times
that
is greater than the upper limit. Adding these together
gives the number of intervals that fail to contain
.
For example:
count(3.84 < ci.lower) -> bad.lower
count(ci.upper < 3.84) -> bad.upper
(bad.lower + bad.upper) /1000 -> bad.frac
bad.frac
Of course we expect that bad.frac should be close to .05.
Also, at a finer level one wants the bad cases to be symmetric.
That is bad.lower and bad.upper should be about equal
in number. Here is another way of evaluating the intervals: plot
the lower versus the upper limits and then draw horizontal and
vertical lines at
.
plot(ci.lower,ci.upper,xlab="Lower End of CI",ylab="Upper End of CI")
title("Simulating 1000 Samples of Size 13 from a Gamma Dist.",cex=.8)
xline(3.84)
yline(3.84)
The top plot above implements this code. The lines divide the plot into four regions. The upper left hand region contains the ``good'' intervals, the upper right hand region contains the bad.lower intervals, and the lower left hand region contains the bad.upper intervals.
Another aspect to consider is the average width of the intervals or
the distribution of the widths. Even if the intervals contain
95% of the time, if they are very wide, the method would not
be useful. To analyze this feature one could use
ci.upper - ci.lower -> width
to get the widths and then either stats(width) or hplot(width) to summarize the distribution of the widths.
Examples of questions to investigate: