Lab 7: Building Confidence in Confidence Intervals

Confidence intervals are one of the most important ways that statisticians quantify the error in an estimate. This lab will allow you to analyze the confidence interval method for the mean of several different distributions. For some distributions and sample sizes, the confidence intervals work well. For others they do not. Your job is to try to identify some situations where the usual confidence interval is a reliable method. In carrying out this analysis, you may notice that you are using many of the ideas from previous labs.

Populations Versus Samples

Conceptually, we think of a sample of data being a representative subset of a much larger population of measurements. For example, the S data set precip.may contains average precipitation measurements for May in Raleigh from the past 45 years. We will assume that these measurements are representative of a much larger record of May precipitation. It is this hypothetical larger record that we would identify as the population. Based on this setup, we have the following relationship between populations and samples:

The most important concept based on this table is that the features of the sample estimate the features of the population. In particular, for the precipitation sample is an estimate of , the long term average of May precipitation in Raleigh. Also note that in other labs we have used the connection between sample histograms and density functions to estimate the shape of density functions for different types of random variables.



Large Sample Confidence Intervals

So is an estimate of . But how accurate is it? When the sample size is large, one can rely on the central limit theorem to use the error bound

The correct interpretation of this error bound is as follows: on average, intervals of the form

contain , the population mean, 95% of the time. (The value 1.96 is obtained from the normal probability distribution and sets the confidence level at 95%). Based on the May precipitation data one can compute the 95% confidence interval of (3.37,4.30). The main goal of this lab is to investigate whether this statistical measure of accuracy is actually sound.



Simulating Data Sets

In order to evaluate how well confidence intervals work, it will be necessary to apply the method to many data sets. Also, in order to judge the validity of the interval, we need to know the value of the population mean. Given these requirements, it should be obvious that we can not just use the precip.may data set since this is just one sample from the population and more importantly, we don't know .

The way around this problem is to choose a probability density function that is similar to the histogram of the observed data. From the probability density function, one can simulate as many samples as needed. Also, since the theoretical distribution is now known, one can work out the precise value for . Based on the data precip.may, one could use a normal distribution with a mean of 3.84 and a standard deviation of 1.6 to approximate the distribution of the observed data. Below is an example of simulating a sample from this distribution and calculating the 95% confidence interval.

rnorm(45,mean = 3.84,sd = 1.6) -> sample
mean(sample) -> m
sd(sample) -> s
m - 1.96*s/sqrt(45) -> ci.lower
m + 1.96*s/sqrt(45) -> ci.upper

You might repeat these steps several times and see whether the interval [ci.lower,ci.upper] actually contains the population mean 3.84. (By the way, if you are not sure why m does not come out to be exactly 3.84, you have missed a basic point and need to reread Section 9.2.)

If you just evaluated a few 95% confidence intervals, it should not come as a surprise that all of them cover . In order to determine how close the frequency coverage is to 95%, it is necessary to simulate many data sets. Obviously, we need some automatic way of generating many samples. Typing the five commands above several thousand times may get a little boring! Instead we have made a special S function simulate.samples to do the repetitions:

simulate.samples(1000,45,DIST=rnorm,mean=3.84,sd=1.6) -> samples
samples$m - 1.96*samples$s/sqrt(45) -> ci.lower
samples$m + 1.96*samples$s/sqrt(45) -> ci.upper

The result is the same as the first example except that now we have confidence intervals from 1000 independent samples in the data sets ci.lower and ci.upper. Note that besides generating the 1000 data sets of 45 observations, simulate.samples also computes the means and standard deviations. These are returned as components with names m and s. The actual data that was simulated is returned in a component named dat. In this case it will be a matrix with 1000 rows and 45 columns. The DIST argument should be the name of the . function that generates random numbers from a particular distribution. So, for example, if you used the command

simulate.samples(100,13,DIST=rgam,alpha=5.76,beta=.667) -> zork

zork$dat would be a matrix of 100 rows and 13 columns, where each number is randomly generated from a gamma distribution (see Lab 6) with and . zork$m would contain the means of these 100 samples, and zork$s would have the standard deviations.


Evaluating the Coverage of Confidence Intervals

What should one do with the 1000 confidence intervals from simulated samples? The first step is to see if they really do have a 95% coverage rate. One way to check this is to have S count the number of times is less than the lower limit and the number of times that is greater than the upper limit. Adding these together gives the number of intervals that fail to contain . For example:

count(3.84 < ci.lower) -> bad.lower
count(ci.upper < 3.84) -> bad.upper
(bad.lower + bad.upper) /1000 -> bad.frac
bad.frac

Of course we expect that bad.frac should be close to .05. Also, at a finer level one wants the bad cases to be symmetric. That is bad.lower and bad.upper should be about equal in number. Here is another way of evaluating the intervals: plot the lower versus the upper limits and then draw horizontal and vertical lines at .

plot(ci.lower,ci.upper,xlab="Lower End of CI",ylab="Upper End of CI")
title("Simulating 1000 Samples of Size 13 from a Gamma Dist.",cex=.8)
xline(3.84)
yline(3.84)

The top plot above implements this code. The lines divide the plot into four regions. The upper left hand region contains the ``good'' intervals, the upper right hand region contains the bad.lower intervals, and the lower left hand region contains the bad.upper intervals.

Another aspect to consider is the average width of the intervals or the distribution of the widths. Even if the intervals contain 95% of the time, if they are very wide, the method would not be useful. To analyze this feature one could use

ci.upper - ci.lower -> width

to get the widths and then either stats(width) or hplot(width) to summarize the distribution of the widths.



On Your Own

  1. Consider a population that is normally distributed with and . How reliable is the ``large'' sample 90% confidence interval when applied to samples of only 4 observations? How reliable is this procedure when a sample has 12 observations? To check the reliability, generate 1000 confidence intervals as in Sec. 7.3 and count the ``bad fraction.'' A reliable 90% confidence interval should have a bad fraction around .10.

  2. Consider a population that follows an exponential distribution with . What is in this case? (If you do not know , just generate a large number and take the mean: rexp(1000,beta=4)->z, mean(z), and guess what must be. mean(z) is not exactly but rather an estimate based on 1000 random variables). How reliable is the large sample confidence interval when the number of observations in a sample is 12? How reliable is it when the size is increased to 30?

  3. Construct the large sample 95% confidence interval for the Raleigh August precipitation data using the results from stats(precip.aug). Carefully interpret this interval in plain English. Also, based on the results of this lab, comment on the validity of this statistical method for this particular data set (a histogram of precip.aug might help after looking back at the solutions of problems 1 and 2 above). Important note: you should only use the output from stats(precip.aug) to construct the interval. Neither simulate.samples nor rnorm should not be used in this problem.

  4. Formulate a question that can be investigated by estimating the mean of a population.(Below are some examples. Try to pick a topic that you are actually interested in.) Carefully collect a random sample of size at least n=10 pertaining to your population and describe the distribution of your observed data (stem and leaf plot, stats, etc.). Construct a confidence interval for the population mean and carefully interpret this interval in words. To read the data into S you can type it in using the read.data function. Be sure to list the data in your report and to explain how is was collected. Give enough detail so that someone would be able to reproduce your sampling plan.

    Examples of questions to investigate: