Random variables are numerical measurements used to describe the results of an experiment or physical system. There are two very useful functions used to specify probabilities for a random variable. These are the density function f(x) and the distribution function F(x). The goal of this lab is to introduce these functions and show how some common density functions might be used to describe data.
A powerful feature of the S statistical package is that one can easily calculate and plot the density and distribution functions for many distributions and also simulate random samples from these distributions. By the end of this lab you should feel comfortable using these functions.
Before we start, here is some terminology: every
random variable has a unique distribution. Loosely speaking, the
distribution of a random variable X refers to all probabilities of the form
,
where A is an arbitrary set of possible values of X.
These probabilities may be calculated from
either its density function or its distribution function. For continuous
random variables X and sets of the form A=[a,b], we have
since
The functions in S that work with distributions have the form xabbr where x is any of the letters d,p,r (d standing for density, p for probability which is really the distribution function, and r for random), and abbr is the abbreviation for the name of the random variable. To get the idea of this system, let's go through some functions for the exponential distribution:
Let's now compare these functions:
The exponential density and distribution functions are related by the fact that the second is just one minus the first. This is a very special case. To see a different distribution, go through the above sequence with norm (for the normal distribution) replacing exp:
where the parameter
specifies the mean (and the scale)
of the distribution.
The default for dexp is
, so that above we actually
plotted the standard exponential with mean=1. Now let's compare
the standard exponential with one having mean=4:
plot(x,yd4) How do the two densities differ?
where
is the mean and
is the standard deviation. Try
plotting a few of these:
set.panel(3,1)
seq(-5,5,,200) -> xn
plot(xn,dnorm(xn,mean=0,sd=1))
plot(xn,dnorm(xn,mean=2,sd=1))
plot(xn,dnorm(xn,mean=-2,sd=1))
Now do some similar plots with the sd argument changing instead of the mean.
Below is a of list of some distributions available in S along with their
abbreviations and the names of the parameters. Remember that when the
parameters are not specified, you get the standard distribution
for the family. The standard values for the parameters are indicated in
the table with equal signs. Sometimes there is not a default choice.
For the binomial you must specify N,
and for Student's t and the
you must specify df.
. will let you know pretty quickly
if you have left something out that is needed. Try typing demo.gamma() to see
the variety of shapes you can get from the gamma family of distributions.
When a large number of random samples are generated from a particular distribution, the histogram (suitably scaled ) will tend to look like the density function. For example, generate 500 uniformly distributed random variables and look at their histogram:
runif(500) -> look.unif
hplot(look.unif)
Just as a reference you can add the true density function on top of this:
seq(0,1,,50) -> x
dunif(x) -> y
lines(x,y)
To get a better feeling for the type of random variation to expect in such a plot, try the following command:
lab4.demo(50,nclass=6)
Try it again with 50 changed to 200 or to some other sample size.
A more important application is to try to match a density to a histogram when the true answer is not known. For example, the data set precip.feb is the mean February precipitation for Raleigh for the past 45 years. Looking at the histogram for these data suggests a bell-curve shape; so we might try to match this with a normal density. One hint for working with the normal is that the density will always peak at the mean value and the density will be essentially zero
outside the interval mean
3 sd. (What are reasonable
ranges for plotting the exponential density?) Based on experimenting
with different normal densities, we found that the one with mean =
3.5 and sd = 1.5 seems to work pretty well. See the above plot.
There are several way of scaling the y axis of a histogram. The class hplot command gives a density function scale for the left axis but includes the actual number of data points in an interval on the right axis. The left axis is important so that we can overlay a theoretical density to see how well things match up. The right axis is the more traditional scale for a histogram, especially when creating a histogram by hand. We feel it is useful to get both scales in one plot.
It is our experience that the hplot function chooses too few intervals (bars) when the data set is large. To specify a different number of intervals, include the argument nclass=M where M is the number of intervals you would like. For example,
hplot(look.unif,nclass=10)
Remember that the histogram is not a very sharp estimate of the density function. You should make some allowances for the staircase shape and the variability in heights due to the finite sample size (recall the plots which come from lab4.demo).
In working out solutions to the following problem please limit your answers to three pages. Any plots or other computer output should be neatly organized in an appendix.
hplot(lab4.dat,col=0)
seq(0,5,,50) -> x
dnorm(x,mean=.4,sd=.2) -> y
lines(x,y)
rnorm(50) -> x1
rexp(50) -> x2
rt(50,3) -> x3
The exponential density (rexp) is aymmetric (skewed to the right) and the t distribution with three degrees of freedom (rt(50,3)) is symmetric but has large outlying values much more frequently than a normal distribution.