M-Lab 6: Densities of Random Variables

Random variables are numerical measurements used to describe the results of an experiment or physical system. There are two very useful functions used to specify probabilities for a random variable. These are the probability density function f(x) (also called a probability mass function for discrete random variables) and the cumulative distribution function F(x) (also called the distribution function). The goal of this lab is to introduce these functions and show how some common density functions might be used to describe data.

A powerful feature of the MATLAB statistical package is that one can easily calculate and plot the density and distribution functions for many distributions and also simulate random samples from these distributions. By the end of this lab you should feel comfortable using these functions.

Before we start, here is some terminology: every random variable has a unique distribution. Loosely speaking, the distribution of a random variable X refers to all probabilities of the form tex2html_wrap_inline48, where A is an arbitrary set of possible values of X. These probabilities may be calculated from either its density function or its distribution function. For continuous random variables X and sets of the form A=[a,b], we have

displaymath38

since

displaymath39

Off the Shelf Distributions in MATLAB

MATLAB has a wide variety of probability density functions for continuous random variables and probability mass functions for discrete random variables. See examples for a quick overview of how to call some specific functions in MATLAB.

Now, let's look at the exponential distribution. First put 201 equally spaced points in the range 0 to 5 into x:

x=0:.025:5;

and evaluate the standard exponential probability density function , at the points in the data set x. Put the results in the data set yd:

yd=exppdf(x);

Notice the syntax: "exp" for "exponential" and "pdf" for "probability density function." Next let's evaluate the standard exponential cumulative distribution function , at x and put the results in yp (notice that cdf replaces pdf):

yp=expcdf(x);

Let's now compare these functions by plotting them

subplot(2,1,1)
plot(x,yd)
title('Standard Exponential Density Function')
subplot(2,1,2)
plot(x,yp)
title('Standard Exponential Distribution Function')

The exponential density and distribution functions are related by the fact that the second is just one minus the first. This is a very special case. To see a different distribution, go through the above sequence with norm (for the normal distribution) replacing exp. First get 201 equally spaced points in the range -3 to 3 (why do we need a different range?)

xn= -3:.03:3;
ydn= normpdf(xn);
subplot(2,1,1)
plot(xn,ydn)
title('Standard Normal Density Function')
subplot(2,1,2)
plot(xn,normcdf(xn))
title('Standard Normal Distribution Function')

The standard exponential and standard normal distributions that we have plotted are just members of the exponential and normal families of distributions. The general exponential density is

where the parameter specifies the mean (and the scale) of the distribution. The default for exppdf is ; so that above we actually plotted the standard exponential with mean=1. Now let's compare the standard exponential with one having mean=4:

yd4=exppdf(x,4);

Replot the standard exponential density function and then the exponential with mean=4:

subplot(2,1,1)
plot(x,yd)
subplot(2,1,2)
plot(x,yd4)

How do the two densities differ?

The normal family of distributions actually has two parameters, and the density is

where is the mean and is the standard deviation. Try plotting a few of these:

First get 201 equally spaced points in the range -5 to 5

xn =-5:.05:5;

and plot the standard normal:

subplot(3,1,1)
plot(xn,normpdf(xn,0,1))

Next with a mean of 2 and -2:

subplot(3,1,2)
plot(xn,normpdf(xn,2,1))
subplot(3,1,3)
plot(xn,normpdf(xn,-2,1))

Now do some similar plots with the sigma argument changing instead of the mean.

Below is a of list of some distributions available in MATLAB along with their abbreviations and the names of the parameters. Remember that when the parameters are not specified, you get the standard distribution for the family. The standard values for the parameters are indicated in the table with equal signs. Sometimes there is not a default choice. For instance, for the binomial you must specify N, and for Student's t and the you must specify df.

    Parameters  
Distribution M Function (with default values) Example Usage
binomial bino N, p (no default) binopdf(x,20,.7)
chisquared chi2 df (no default) chi2pdf(x,4)
exponential exp beta=1 exppdf(x,3)
gamma gam alpha=1 (no default for beta) gampdf(x,2,3)
normal norm mean=0, sd=1 normpdf(x,2,5)
Poisson poiss lambda (no default) poisspdf(x,3)
Student's t t df (no default) tpdf(x,7)
uniform unif min=0,max=1 unifpdf(x,5,20)
Weibull weib alpha,beta (no default) weibpdf(x,4,2)


Probability Functions for Discrete Random Variables

The binomial and Poisson are the most important discrete random variables. Looking in the table above, we see that binopdf and poisspdf are the MATLAB functions for the probability mass functions of these random variables. To illustrate, let us first plot the probability mass function of a binomial random variable with success probability .2:

x=0:10;
y=binopdf(x,10,.2);
clf
stem(x,y)
axis ([-1 11 0 .35])

We have plotted this function with the stem function to emphasize that it only takes positive values at the points 0,1,, ..., n. The axis command just makes the plot look a little better.

The cumulative distribution function is useful for computing probabilities that a binomial random variable lies in a range. For example, suppose we needed to compute the probability that a binomial random variable X with success probability .2 lies between 2 and 8. Actually, that is

Pr(2 < X < 8) = Pr(2 < X tex2html_wrap_inline25 7) = Pr( X tex2html_wrap_inline25 7) - Pr( X tex2html_wrap_inline25 2)

In MATLAB we just type

>> binocdf(7,10,.2)-binocdf(2,10,.2)

ans =

    0.3221
The graph of the cumulative distribution function is not very interesting. Nevertheless, a plot is given by

x=0:10;
y=binocdf(x,10,.2);
stairs(x,y)
axis ([-1 11 0 1.1])

Generating Random Data

It is useful to generate random variables from a specific distribution. In MATLAB, we only need to add "rnd" (for random) to any of the distribution names in the above table to generate data from that distribution. For example, to generate a sample of 6 independent random variables from the normal distribution with mean 3 and standard deviation 2, we would type:

normrnd(3,2,[1 6])

yielding

ans =

    5.1900   -0.7480    3.8564    4.7913    4.4619    4.1557
If one prefers a column vector, then type

normrnd(3,2,[6 1])

Note that each time you type these commands you get a different data set. It is possible to set the random number generator so that you can get the same random variables each time. (For the normal distibution, the command randn('state',j) sets the seed to j. For the other distributions, use rand('state',j).)

Now let's demonstrate that the random number generator really works. We first generate 500 normal random variables and make a histogram:

randn('state',255)
z=normrnd(3,2,[500 1]);
clf
hplot(z)

The function hplot just makes a histogram that has clear bars, and also the area in the bars sums to 1 like a probability density function. Now let's overlay a normal density function on top of the histogram:

x=-5:.1:10;
y=normpdf(x,3,2);
hold on
plot(x,y)

Let's try the same thing with a Weibull distribution:

rand('state',210)
z=weibrnd(1,2,[500 1]);
clf
hplot(z)
x=0:.1:3;
y=weibpdf(x,1,2);
hold on
plot(x,y)

Notice that the Weibull density is skewed to the right (tail to the right).

Matching a Density to Data

In the last section we showed how to generate data from a distribution and then overlay a density function over the histogram of the data. A more important application is to try to match a density to a histogram when the true answer is not known. For example, the data set precip.feb is the mean February precipitation for Raleigh for the past 45 years. Looking at the histogram for these data suggests a bell-curve shape; so we might try to match this with a normal density. One hint for working with the normal is that the density will always peak at the mean value and the density will be essentially zero outside the interval given by mean ± 3 standard deviations.

So first lets make our histogram:

hplot(precip.feb)

Since the middle is around 3.5 and the range is aound 6, let's try overlaying a normal density with mean 3.5 and standard deviation 2 (since 6/3=2).

x=0:.1:7;
y=normpdf(x,3.5,2);
hold on
plot(x,y)

The plotted density is not peaked enough. So let's a standard deviation of 1 instead:

y=normpdf(x,3.5,1);
plot(x,y)

That standard deviation seems too small. Let's try 1.5:

y=normpdf(x,3.5,1.5);
plot(x,y)

This density seems to fit pretty well. You might have noticed that these values are close to the sample mean and standard deviation for this data set. In general, there are better ways to fit a density to sample data than this trial and error method with plotting that we have just carried out. However, this plotting method should convey the essential idea of fitting a model density to sample data.


NOTE

The parameter that we used in class to define the exponential distribution is lambda (the rate), MATLAB uses beta as the parameter, where beta is the mean of the exponential distribution (the mean time).
The relationship between lambda and beta is beta = 1/lambda.
To estimate the beta parameter (population mean) use the sample mean of your observations.

On Your Own

In working out solutions to the following problem please limit your answers to three pages. Any plots or other computer output should be neatly organized in an appendix.

  1. The MATLAB data set lab4dat was created by simulating random numbers from one of the MATLAB functions. Find a density function from the table in the body of the lab (and the corresponding parameter values) that best match the distribution of these data.
    Hint: make a histogram of the data, guess a density family, and try overlaying particular members of that family. Here is an example:

    clf
    hplot(lab4dat)
    x= 0:.05:2;
    y= normpdf(x,.4,.2);
    hold on
    plot(x,y)

    (Hint: The normal density family is not the right family. Also note that we use hplot instead of hist because hplot is in the right scale for overlaying densities.)

  2. The data sets precip.may and precip.aug are the monthly average precipitation recordings for Raleigh from 1948 to 1992. Try to find the best densities that describe the distributions of each of these months.
    Hint: we often start with normal densities, but for the August data it might be better to try out a few other densities such as the Weibull or gamma. Try different values of the parameters till you find a reasonable fit, always taking into account the meaning of the parameters that define each distribution.

  3. Sometimes the histograms for a data set are sensitive to the choice of intervals. To see how much a histogram can change depending on the intervals used, type hist(ftptime,5), then replot with hist(ftptime,6), etc., through hist(ftptime,12). What family of density functions would you try first if you wanted to match up a smooth density to the histogram? The data in ftptime are 40 ftp times for a file of 343285 bytes which was repeatedly obtained from a site in California. (ftp is a standard protocol for obtaining files on the Internet.)

  4. When sample sizes are small, histograms may not give a clear picture of the type of density that the data belongs to. Since normal distributions are by far the most important family of distributions, data analysts usually try to first see if a normal distribution is appropriate. If not, they can then try other densities. A graphical technique to decide about approximate normality is the Q-Q plot. Basically it is a plot of the data's percentiles (quantiles) against the theoretical percentiles of a normal distribution. If the data are approximately normal, then the plot will be approximately linear. To see such a plot, type normplot(ftptime). To ``calibrate'' your eye for such plots, generate the following three data sets (with 50 observations each) and make Q-Q normal plots.

    x1=normrnd(0,1,[50 1]);
    x2=exprnd(1,[50 1]);
    x3=trnd(3,[50 1]);

    Interpret your Q-Q normal plots.

    Hint:The exponential density (exprnd) is aymmetric (skewed to the right) and the t distribution with three degrees of freedom (trnd(3,[50 1])) is symmetric but has large outlying values much more frequently than a standard (mu=0, sd=1) normal distribution (normrnd).