Random variables are numerical measurements used to describe the results of an experiment or physical system. There are two very useful functions used to specify probabilities for a random variable. These are the probability density function f(x) (also called a probability mass function for discrete random variables) and the cumulative distribution function F(x) (also called the distribution function). The goal of this lab is to introduce these functions and show how some common density functions might be used to describe data.
A powerful feature of the MATLAB statistical package is that one can easily calculate and plot the density and distribution functions for many distributions and also simulate random samples from these distributions. By the end of this lab you should feel comfortable using these functions.
Before we start, here is some terminology: every
random variable has a unique distribution. Loosely speaking, the
distribution of a random variable X refers to all probabilities of the form
,
where A is an arbitrary set of possible values of X.
These probabilities may be calculated from
either its density function or its distribution function. For continuous
random variables X and sets of the form A=[a,b], we have
since
MATLAB has a wide variety of probability density functions for continuous random variables and probability mass functions for discrete random variables. See examples for a quick overview of how to call some specific functions in MATLAB.
Now, let's look at the exponential distribution. First put 201 equally spaced points in the range 0 to 5 into x:
x=0:.025:5;
and evaluate the standard exponential probability density function
,
at the points in the data set
x.
Put the results in the data set
yd:
yd=exppdf(x);
Notice the syntax: "exp" for "exponential" and "pdf" for "probability
density function."
Next let's evaluate the standard exponential cumulative distribution
function
,
at x
and put the results in
yp
(notice that cdf replaces pdf):
yp=expcdf(x);
Let's now compare these functions by plotting them
subplot(2,1,1)
plot(x,yd)
title('Standard Exponential Density Function')
subplot(2,1,2)
plot(x,yp)
title('Standard Exponential Distribution Function')
The exponential density and distribution functions are related by the fact that the second is just one minus the first. This is a very special case. To see a different distribution, go through the above sequence with norm (for the normal distribution) replacing exp. First get 201 equally spaced points in the range -3 to 3 (why do we need a different range?)
xn= -3:.03:3;
ydn= normpdf(xn);
subplot(2,1,1)
plot(xn,ydn)
title('Standard Normal Density Function')
subplot(2,1,2)
plot(xn,normcdf(xn))
title('Standard Normal Distribution Function')
The standard exponential and standard normal distributions that we have plotted are just members of the exponential and normal families of distributions. The general exponential density is
where the parameter
specifies the mean (and the scale) of the distribution.
The default for exppdf is
; so that above we actually
plotted the standard exponential with mean=1. Now let's compare
the standard exponential with one having mean=4:
yd4=exppdf(x,4);
Replot the standard exponential density function and then the exponential with mean=4:
subplot(2,1,1)
plot(x,yd)
subplot(2,1,2)
plot(x,yd4)
How do the two densities differ?
The normal family of distributions actually has two parameters, and the density is
where
is the mean and
is the standard deviation. Try
plotting a few of these:
First get 201 equally spaced points in the range -5 to 5
xn =-5:.05:5;
and plot the standard normal:
subplot(3,1,1)
plot(xn,normpdf(xn,0,1))
Next with a mean of 2 and -2:
subplot(3,1,2)
plot(xn,normpdf(xn,2,1))
subplot(3,1,3)
plot(xn,normpdf(xn,-2,1))
Now do some similar plots with the sigma argument changing instead of the mean.
Below is a of list of some distributions available in MATLAB
along with their
abbreviations and the names of the parameters. Remember that when the
parameters are not specified, you get the standard distribution
for the family. The standard values for the parameters are indicated in
the table with equal signs. Sometimes there is not a default choice.
For instance, for the binomial you must specify
N,
and for Student's t and the
you must specify df.
| Parameters | |||
| Distribution | M Function | (with default values) | Example Usage |
| binomial | bino | N, p (no default) | binopdf(x,20,.7) |
| chisquared | chi2 | df (no default) | chi2pdf(x,4) |
| exponential | exp | beta=1 | exppdf(x,3) |
| gamma | gam | alpha=1 (no default for beta) | gampdf(x,2,3) |
| normal | norm | mean=0, sd=1 | normpdf(x,2,5) |
| Poisson | poiss | lambda (no default) | poisspdf(x,3) |
| Student's t | t | df (no default) | tpdf(x,7) |
| uniform | unif | min=0,max=1 | unifpdf(x,5,20) |
| Weibull | weib | alpha,beta (no default) | weibpdf(x,4,2) |
The binomial and Poisson are the most important discrete random variables. Looking in the table above, we see that binopdf and poisspdf are the MATLAB functions for the probability mass functions of these random variables. To illustrate, let us first plot the probability mass function of a binomial random variable with success probability .2:
x=0:10;
y=binopdf(x,10,.2);
clf
stem(x,y)
axis ([-1 11 0 .35])
We have plotted this function with the stem function to emphasize that it only takes positive values at the points 0,1,, ..., n. The axis command just makes the plot look a little better.
The cumulative distribution function is useful for computing probabilities that a binomial random variable lies in a range. For example, suppose we needed to compute the probability that a binomial random variable X with success probability .2 lies between 2 and 8. Actually, that is
Pr(2 < X < 8) = Pr(2 < X
7)
= Pr( X
7)
- Pr( X
2)
In MATLAB we just type
>> binocdf(7,10,.2)-binocdf(2,10,.2)
ans =
0.3221
The graph of the cumulative distribution function is not very interesting.
Nevertheless, a plot is given by
x=0:10;
y=binocdf(x,10,.2);
stairs(x,y)
axis ([-1 11 0 1.1])
normrnd(3,2,[1 6])
yielding
ans =
5.1900 -0.7480 3.8564 4.7913 4.4619 4.1557
If one prefers a column vector, then type
normrnd(3,2,[6 1])
Note that each time you type these commands you get a different data set. It is possible to set the random number generator so that you can get the same random variables each time. (For the normal distibution, the command randn('state',j) sets the seed to j. For the other distributions, use rand('state',j).)
Now let's demonstrate that the random number generator really works. We first generate 500 normal random variables and make a histogram:
randn('state',255)
z=normrnd(3,2,[500 1]);
clf
hplot(z)
The function hplot just makes a histogram that has clear bars, and also the area in the bars sums to 1 like a probability density function. Now let's overlay a normal density function on top of the histogram:
x=-5:.1:10;
y=normpdf(x,3,2);
hold on
plot(x,y)
Let's try the same thing with a Weibull distribution:
rand('state',210)
z=weibrnd(1,2,[500 1]);
clf
hplot(z)
x=0:.1:3;
y=weibpdf(x,1,2);
hold on
plot(x,y)
Notice that the Weibull density is skewed to the right (tail to the right).
In the last section we showed how to generate data from a distribution and then overlay a density function over the histogram of the data. A more important application is to try to match a density to a histogram when the true answer is not known. For example, the data set precip.feb is the mean February precipitation for Raleigh for the past 45 years. Looking at the histogram for these data suggests a bell-curve shape; so we might try to match this with a normal density. One hint for working with the normal is that the density will always peak at the mean value and the density will be essentially zero outside the interval given by mean ± 3 standard deviations.
So first lets make our histogram:
hplot(precip.feb)
Since the middle is around 3.5 and the range is aound 6, let's try overlaying a normal density with mean 3.5 and standard deviation 2 (since 6/3=2).
x=0:.1:7;
y=normpdf(x,3.5,2);
hold on
plot(x,y)
The plotted density is not peaked enough. So let's a standard deviation of 1 instead:
y=normpdf(x,3.5,1);
plot(x,y)
That standard deviation seems too small. Let's try 1.5:
y=normpdf(x,3.5,1.5);
plot(x,y)
This density seems to fit pretty well. You might have noticed
that these values are close
to the sample mean and standard deviation for this data set. In general,
there are better ways to fit a density to sample data than this trial and
error method with plotting that we have just carried out.
However, this plotting method should convey the essential idea of fitting a
model density to sample data.
In working out solutions to the following problem please limit your answers to three pages. Any plots or other computer output should be neatly organized in an appendix.
clf
hplot(lab4dat)
x= 0:.05:2;
y= normpdf(x,.4,.2);
hold on
plot(x,y)
(Hint: The normal density family is not the right family. Also note that we use hplot instead of hist because hplot is in the right scale for overlaying densities.)
x1=normrnd(0,1,[50 1]);
x2=exprnd(1,[50 1]);
x3=trnd(3,[50 1]);
Interpret your Q-Q normal plots.
Hint:The exponential density (exprnd) is aymmetric (skewed to the right) and the t distribution with three degrees of freedom (trnd(3,[50 1])) is symmetric but has large outlying values much more frequently than a standard (mu=0, sd=1) normal distribution (normrnd).