Lab 6: Densities of Random Variables

Random variables are numerical measurements used to describe the results of an experiment or physical system. There are two very useful functions used to specify probabilities for a random variable. These are the density function f(x) and the distribution function F(x). The goal of this lab is to introduce these functions and show how some common density functions might be used to describe data.

A powerful feature of the S statistical package is that one can easily calculate and plot the density and distribution functions for many distributions and also simulate random samples from these distributions. By the end of this lab you should feel comfortable using these functions.

Before we start, here is some terminology: every random variable has a unique distribution. Loosely speaking, the distribution of a random variable X refers to all probabilities of the form tex2html_wrap_inline48, where A is an arbitrary set of possible values of X. These probabilities may be calculated from either its density function or its distribution function. For continuous random variables X and sets of the form A=[a,b], we have

displaymath38

since

displaymath39

Off the Shelf Distributions in S

The functions in S that work with distributions have the form xabbr where x is any of the letters d,p,r (d standing for density, p for probability which is really the distribution function, and r for random), and abbr is the abbreviation for the name of the random variable. To get the idea of this system, let's go through some functions for the exponential distribution:

seq(0,5,,200) -> x
First put 200 equally spaced points in the range 0 to 5.

dexp(x) -> yd
Evaluate the standard exponential density function, , at the points in the data set x. Put the results in the data set yd.

pexp(x) -> yp
Evaluate the standard exponential distribution function, , at the values in the data set x.

Let's now compare these functions:

set.panel(2,1)
Set the plot window to have 2 plots, one above the other.

plot(x,yd)
Plot the exponential density function.

title("Exponential Density Function")
Add title. Compare your output with the plot above.

plot(x,yp)
Plot the exponential distribution function.

title("Exponential Distribution Function")

The exponential density and distribution functions are related by the fact that the second is just one minus the first. This is a very special case. To see a different distribution, go through the above sequence with norm (for the normal distribution) replacing exp:

seq(-3,3,,200) -> xn
First get 200 equally spaced points in the range -3 to 3 (why do we need a different range?)

dnorm(xn) -> ydn

plot(xn,ydn)

plot(xn,pnorm(xn))
Note the shortcut here compared to the previous two commands.
The standard exponential and standard normal distributions that we have plotted are just members of the exponential and normal families of distributions. The general exponential density is

where the parameter specifies the mean (and the scale) of the distribution. The default for dexp is , so that above we actually plotted the standard exponential with mean=1. Now let's compare the standard exponential with one having mean=4:

dexp(x,beta=4) -> yd4
Note that we have to spell out as beta. plot(x,yd) First replot the standard exponential density function.

plot(x,yd4) How do the two densities differ?

The normal family of distributions actually has two parameters, and the density is

where is the mean and is the standard deviation. Try plotting a few of these:

set.panel(3,1)
seq(-5,5,,200) -> xn
plot(xn,dnorm(xn,mean=0,sd=1))
plot(xn,dnorm(xn,mean=2,sd=1))
plot(xn,dnorm(xn,mean=-2,sd=1))

Now do some similar plots with the sd argument changing instead of the mean.

Below is a of list of some distributions available in S along with their abbreviations and the names of the parameters. Remember that when the parameters are not specified, you get the standard distribution for the family. The standard values for the parameters are indicated in the table with equal signs. Sometimes there is not a default choice. For the binomial you must specify N, and for Student's t and the you must specify df. . will let you know pretty quickly if you have left something out that is needed. Try typing demo.gamma() to see the variety of shapes you can get from the gamma family of distributions.


Matching a Density to Data

When a large number of random samples are generated from a particular distribution, the histogram (suitably scaled ) will tend to look like the density function. For example, generate 500 uniformly distributed random variables and look at their histogram:

runif(500) -> look.unif
hplot(look.unif)

Just as a reference you can add the true density function on top of this:

seq(0,1,,50) -> x
dunif(x) -> y
lines(x,y)

To get a better feeling for the type of random variation to expect in such a plot, try the following command:

lab4.demo(50,nclass=6)

Try it again with 50 changed to 200 or to some other sample size.

A more important application is to try to match a density to a histogram when the true answer is not known. For example, the data set precip.feb is the mean February precipitation for Raleigh for the past 45 years. Looking at the histogram for these data suggests a bell-curve shape; so we might try to match this with a normal density. One hint for working with the normal is that the density will always peak at the mean value and the density will be essentially zero

outside the interval mean 3 sd. (What are reasonable ranges for plotting the exponential density?) Based on experimenting with different normal densities, we found that the one with mean = 3.5 and sd = 1.5 seems to work pretty well. See the above plot.



More about Making Histograms

There are several way of scaling the y axis of a histogram. The class hplot command gives a density function scale for the left axis but includes the actual number of data points in an interval on the right axis. The left axis is important so that we can overlay a theoretical density to see how well things match up. The right axis is the more traditional scale for a histogram, especially when creating a histogram by hand. We feel it is useful to get both scales in one plot.

It is our experience that the hplot function chooses too few intervals (bars) when the data set is large. To specify a different number of intervals, include the argument nclass=M where M is the number of intervals you would like. For example,

hplot(look.unif,nclass=10)

Remember that the histogram is not a very sharp estimate of the density function. You should make some allowances for the staircase shape and the variability in heights due to the finite sample size (recall the plots which come from lab4.demo).



On Your Own

In working out solutions to the following problem please limit your answers to three pages. Any plots or other computer output should be neatly organized in an appendix.

  1. The S data set lab4.dat was created by simulating random numbers from one of the S functions. Try to find a density function (and the corresponding parameter values) that best match the distribution of these data. Hint: make a histogram of the data, guess a density family, and try overlaying particular members of that family. Here is an example:

    hplot(lab4.dat,col=0)
    seq(0,5,,50) -> x
    dnorm(x,mean=.4,sd=.2) -> y
    lines(x,y)

  2. The data sets precip.may and precip.aug are the monthly average precipitation recordings for Raleigh from 1948 to 1992. Try to find the best densities that describe the distributions of each of these months. Hint: we often start with normal densities, but for the August data it might be better to try out a few other densities such as the Weibull or gamma.

  3. Sometimes the histograms for a data set are sensitive to the choice of intervals. To see how much a histogram can change depending on the intervals used, type hplot(time.ftp,nclass=5), then replot with nclass=6, etc., through nclass=12. Put them all on the same page by first typing set.panel(4,2). What family of density functions would you try first if you wanted to match up a smooth density to the histogram? The data in time.ftp are 40 ftp times for a file of 343285 bytes which was repeatedly obtained from a site in California. (ftp is a standard protocol for obtaining files on the Internet.)

  4. When sample sizes are small, histograms may not give a clear picture of the type of density that the data belongs to. Since normal distributions are by far the most important family of distributions, data analysts usually try to first see if a normal distribution is appropriate. If not, they can then try other densities. A graphical technique to decide about approximate normality is the Q-Q plot. Basically it is a plot of the data's percentiles (quantiles) against the theoretical percentiles of a normal distribution. If the data are approximately normal, then the plot will be approximately linear. To see such a plot, type qqnorm(time.ftp) followed by qqline(time.ftp) which adds a straight line. To ``calibrate'' your eye for such plots, generate the following three data sets and make Q-Q normal plots.

    rnorm(50) -> x1
    rexp(50) -> x2
    rt(50,3) -> x3

    The exponential density (rexp) is aymmetric (skewed to the right) and the t distribution with three degrees of freedom (rt(50,3)) is symmetric but has large outlying values much more frequently than a normal distribution.