This first lab will give some practice using MATLAB to manipulate data sets, calculate statistics, and make plots. To get set up, go to Getting Started. To exit MATLAB, just type quit.
The last setup command was to load the ST370 data sets:
load st370data
This command tells MATLAB to look for a file named st370data.mat
in one of the
directories in its path, and if found, bring its contents into memory.
(By the way, if you type
path,
the last directory in the displayed list
contains st370data.mat.)
This particular .mat file just makes available the variables
containing the data you are going to be using in M-Lab. (MATLAB uses the
name "variables" for a variety of data structures. We prefer not to use
that name in general because "variables" has a more specific meaning in
statistics.)
To learn more about MATLAB, just type helpdesk. This will bring up a Netscape window with a variety of tutorials. At minimum you should click through "Getting Started."
To find out which data sets are defined at any given time in MATLAB (either before or after you add the ST370 data sets), you can use the who command, or for more detail, whos. So type who and whos to see what they produce. For example
>> whos
Name Size Bytes Class
actuator 1x1 7512 struct array
airplane 1x1 688 struct array
ans 1x50 400 double array
cancer 1x1 648 struct array
capac 1x1 3070 struct array
cavendish 29x1 232 double array
climate 1x1 8700 struct array
college 1x1 4354 struct array
. . . . .
. . . . .
. . . . .
Notice that most of the objects listed
have the class "struct array." These are structure arrays: collections of
named components, each of which is a vector or matrix of numbers or characters.
They are different from regular MATLAB matrices.
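As a rough sketch of the difference, the lines below build a tiny structure array by hand. The name mini and its two components are made up for illustration (the values are copied from the first three cities of the climate data); it is not one of the ST370 data sets.

```matlab
% Hypothetical miniature structure array; 'mini' is NOT part of st370data.
mini.city = {'albuquerque'; 'atlanta'; 'austin'};  % cell array of strings
mini.rain = [8.12; 48.61; 31.50];                  % rainfall in inches

mini                       % typing the name only lists the components
mini.rain                  % typing a component prints its values
mini.city{2}               % curly braces pull one string out of the cell
names = fieldnames(mini)   % the component names as a cell array
```

A regular matrix could not hold the city names and the rainfall numbers side by side, which is why the lab data sets are stored this way.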
To look at a regular vector or matrix type data set,
just type its name. For example, type
cavendish
because it is a regular vector of length 29 (a matrix of dimension
29x1). Next try typing the structure array
climate :
>> climate
climate =
lat: [50x1 double]
jan: [50x1 double]
rain: [50x1 double]
city: {50x1 cell}
jul: [50x1 double]
elev: [50x1 double]
lon: [50x1 double]
This data set contains some climate and geographical information on the 50
largest US cities. But since it is a "structure array" instead of a matrix, typing
its name only lists what is in it. If you want to look at the whole
structure array at once, then use
viewdata(climate) or
viewdata(climate,n) for the first n rows:
>> viewdata(climate,5)
Obs  lat      jan   rain   city         jul   elev  lon
1    35.0833  22.3   8.12  albuquerque  92.8  5300  -106.6500
2    33.7500  32.6  48.61  atlanta      87.9  1034   -84.3833
3    30.2833  38.8  31.50  austin       95.4   570   -97.7500
4    39.2833  24.3  43.39  baltimore    87.1   155   -76.6167
5    42.1600  22.8  43.81  boston       81.8    10   -70.6002
Notice that the component climate.city holds character strings instead of numbers. The component climate.rain is rainfall in inches for these cities. You can print it on the screen by typing
z=climate.rain
One of our most useful functions is stats:
>> stats(climate.rain)
N          50.0000
Mean       31.5970
Std. Dev.  13.6564
Q1         19.2900
Median     32.5550
Q3         40.4300
Min         7.1100
Max        59.7400
Range      52.6300
We could have obtained those same values by typing individually length(z) (recall that z=climate.rain), mean(z), std(z), quantile(z,.25), quantile(z,.50), quantile(z,.75), min(z), max(z), and max(z)-min(z). It's usually simpler to just use stats(z) unless you need to save the output from one of these functions as another variable.
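Here is a small sketch of saving individual results as variables. To keep it self-contained, z holds only the rainfall of the first five cities from the viewdata listing; in the lab you would use z=climate.rain instead. The names on the left-hand sides are our own choices.

```matlab
% Rainfall (inches) for the first five cities only; an illustrative subset.
z = [8.12; 48.61; 31.50; 43.39; 43.81];

n      = length(z);         % number of observations
zbar   = mean(z);           % sample mean
s      = std(z);            % sample standard deviation (divides by n-1)
zmed   = median(z);         % middle value of the sorted data
zrange = max(z) - min(z);   % range
```

Each result can now be reused in later calculations, for example (z - zbar)/s.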
Before going further let us take a very small data set and calculate by hand
the mean, standard deviation (Std. Dev. from
stats),
and the median. First create the data set
test
as follows:
test = [9 4 6 2 15]
Just for fun, use the single quote to look at the vector's transpose:
test'
Thus
test
is a row vector and
test'
is a column vector.
The mean can be calculated by summing the terms in the vector and dividing
the sum by the number of items in the vector:
(test(1) + test(2) + test(3) + test(4) + test(5)) / 5
To find the median, it helps to look at a sorted list of data points:
sort(test)
The median of course is the middle value 6 in the sorted list.
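The same two calculations can be written without typing each element. This is only a sketch; the variable names n, m, sorted, and mid are ours, and the indexing trick for the median works only when the number of data points is odd.

```matlab
test = [9 4 6 2 15];

n = length(test);             % number of data points
m = sum(test) / n;            % the mean, same as mean(test)

sorted = sort(test);          % data in increasing order
mid = sorted((n + 1) / 2);    % middle element (valid only for odd n)
```

You can check both results against mean(test) and median(test).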
We can find the standard deviation by summing the squared differences
between each data point and the mean, dividing by one less than the
number of data points, and taking the square root:
sqrt(((test - mean(test)) * (test - mean(test))') / (5 - 1))
Here we took advantage of the way MATLAB handles subtraction when one operand
is a vector and the other is a scalar (the scalar is subtracted from each
member of the vector; this will come in handy later), and the fact that a
row vector multiplied by its transpose is equal to the sum of the squares of
its elements. We could also have squared each element of the vector of
deviations from the mean by using the .^ operator, which would have allowed us to
calculate the standard deviation this way:
sqrt(sum((test - mean(test)) .^ 2) / (5 - 1))
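To see that the two formulas agree with each other and with the built-in std, here is a short sketch. The names d, s1, s2, and s3 are ours.

```matlab
test = [9 4 6 2 15];
n = length(test);

d  = test - mean(test);           % deviations: scalar subtracted from each element
s1 = sqrt((d * d') / (n - 1));    % row vector times its transpose = sum of squares
s2 = sqrt(sum(d .^ 2) / (n - 1)); % element-by-element squaring, then sum
s3 = std(test);                   % built-in sample standard deviation

[s1 s2 s3]                        % display the three values side by side
```

All three entries should be identical apart from rounding in the last digits.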
Now, if we subtract the mean and divide by the standard deviation, we
should get a standardized data set with mean 0 and standard deviation 1:
normalrain = (climate.rain - mean(climate.rain))/std(climate.rain)
std(normalrain)
mean(normalrain)
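The same check can be tried on the small test vector, where it does not depend on the ST370 data being loaded. This is a sketch; the name standardized is ours.

```matlab
test = [9 4 6 2 15];

standardized = (test - mean(test)) / std(test);

mean(standardized)   % should be 0, up to rounding error
std(standardized)    % should be 1, up to rounding error
```

Because of floating-point rounding, the mean may print as a tiny number such as 1e-17 rather than exactly 0.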
A simple plot similar to the stem-and-leaf plot is a histogram, which we produce
using the
hist function:
hist(climate.rain)
The default is to divide the data into 10 groups. To change the default, just
pass the desired number of groups to hist() as the second argument:
hist(climate.rain, 15)
Note that this plot erases the old one!
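If you want the numbers behind a histogram, hist can be called with output arguments, in which case it returns the counts and the bin centers instead of drawing. A sketch using only the first five rainfall values from the viewdata listing (with the lab data loaded you would pass climate.rain):

```matlab
% Illustrative subset: rainfall for the first five cities.
z = [8.12 48.61 31.50 43.39 43.81];

[counts, centers] = hist(z, 4);   % 4 equal-width groups between min(z) and max(z)

counts     % how many data points fall in each group
centers    % the midpoint of each group
```

The counts always add up to the number of data points, which is a quick way to check that nothing was dropped.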
If we want multiple plots, each in its own window, we can use the
figure
command to create a new plot window and use it instead:
hist(climate.jan,6)
figure
hist(climate.jul,6)
Alternatively, you can put the plots on the same page
with the subplot
command.
subplot(m,n,p) divides the current figure into m rows
of n columns of plots, and then the next command will plot the pth one,
counting from left to right. For example,
subplot(2,1,1)
hist(climate.jan,6)
subplot(2,1,2)
hist(climate.jul,6)
It looks better if you resize the figure with your mouse to be taller than
it is wide. And if you print, click page setup and activate
"Match Figure Screen Size" to make the figure
fill the whole screen. Alternatively, you can type
set(gcf,'PaperPositionMode','auto');
to have it print the size that you see. (Also, when you print, you will not see the individual bars of the histogram because they will print as black--one solution is to type colormap cool before printing.)
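To see how the position index p in subplot(m,n,p) is counted, the following sketch fills a 2-by-2 grid and labels each panel with its own p. The data are just the small test vector; any plot would do.

```matlab
test = [9 4 6 2 15];

figure
for p = 1:4
    subplot(2, 2, p)             % 2 rows, 2 columns, select panel p
    plot(test, '*')              % same data in every panel
    title(sprintf('p = %d', p))  % label the panel with its index
end
```

Panels 1 and 2 appear in the top row and panels 3 and 4 in the bottom row.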
Now, let us plot the minimum January temperature against latitude:
figure
plot(climate.lat,climate.jan)
Having lines connecting the data points does not make sense here; so let us
clear the window and try again, this time using only '*' to mark data points:
clf
plot(climate.lat,climate.jan,'*')
If you prefer some space around the edges, you can change the limits
of the x and y axes with
axis([15 55 -3 75])
We can also add labels to the axes and a title:
xlabel('Latitude')
ylabel('Min. Jan. Temperature')
title('50 Largest US Cities')
What are the two cities that do not seem to follow the linear trend?
Put the names of the cities on their plot points with
text(climate.lat,climate.jan,climate.city)
For further help with plot, type
help plot
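The plotting steps above can be collected into one self-contained sketch. It uses only the five cities shown in the viewdata listing, typed in by hand, so the axis limits and the title are our own choices for this small subset.

```matlab
% First five cities from the viewdata(climate,5) listing.
lat  = [35.0833 33.7500 30.2833 39.2833 42.1600];
jan  = [22.3 32.6 38.8 24.3 22.8];
city = {'albuquerque', 'atlanta', 'austin', 'baltimore', 'boston'};

figure
plot(lat, jan, '*')              % points only, no connecting lines
axis([25 47 15 45])              % leave some space around the points
xlabel('Latitude')
ylabel('Min. Jan. Temperature')
title('First 5 Cities in climate')
text(lat, jan, city)             % put each city name at its point
```

With the full data set you would replace lat, jan, and city by climate.lat, climate.jan, and climate.city.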
If you have a short data set, the easiest way to enter data is
to type it in directly, as we did with the test data set earlier.
Often, though, we will want to create a structure array that has
names for the data columns and that also allows character values. For
this we have created two functions,
read and
readfile,
which are described under Reading in Your Own Data below.
Several other useful functions are
struct2file
(creates a text file from a struct),
save
(saves data for future use, to be recalled using
load), and
diary
(saves all the screen text from a MATLAB session). Just use the
help
function to learn about these when needed.
help mlab
will display the most important functions used in M-Lab. Or you can always
have a Netscape window open to the appendix listing
of the functions.
Don't forget, to exit MATLAB, type
quit.
Reading in Your Own Data
test = [9 4 6 2 15];
This produces a simple row vector. Typing in a short matrix is similar.
Here is the example from the HelpDesk's Getting Started:
A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1]
A =
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1
If you had A stored as a file in the directory from which you started
MATLAB, you could also just type
load A
This creates a matrix A in MATLAB just like the above.
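One way such a file might come to exist is by saving the matrix in an earlier session. The sketch below saves A, clears it from memory, and loads it back; it writes a file named A (A.mat in MATLAB) in the current directory.

```matlab
A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1];

save A A     % store the variable A in a file named A (A.mat in MATLAB)
clear A      % remove A from memory
load A       % bring it back from the file
A            % display it to confirm
```

Typing who between clear A and load A shows that A really is gone until it is loaded again.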
Consider a data set
185 72 a
176 69 a
192 74 b
146 66 b
where the columns are weight, height, and a character variable group.
Suppose that this data set is stored in the directory from which you started
MATLAB under the name data1. Then, to get this into a structure array
named B, just type
B=readfile
You will then be prompted to type in the name of the file; in this
case we typed
data1
You will then be prompted to type in the names of the variables; in this
case we typed
weight height group
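If readfile is not available, a similar structure array can be built with MATLAB's own textscan. This is only a sketch of an alternative, not how readfile itself works; it first writes the four data lines to a file named data1 so that the example is self-contained.

```matlab
% Write the example data to a text file named data1 (for illustration only).
fid = fopen('data1', 'w');
fprintf(fid, '185 72 a\n176 69 a\n192 74 b\n146 66 b\n');
fclose(fid);

% Read it back: two numeric columns and one column of strings.
fid = fopen('data1', 'r');
cols = textscan(fid, '%f %f %s');
fclose(fid);

% Assemble a structure array with named components.
B.weight = cols{1};
B.height = cols{2};
B.group  = cols{3};   % cell array of strings
B
```

The format string '%f %f %s' must match the columns of the file: one %f for each numeric column and %s for the character column.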
Sometimes we don't have the data stored in a file. The function
read
allows one to just paste data directly into the MATLAB window. This
is especially useful when doing homework assignments from WebAssign.
(Important note: if you are using MATLAB version 6.5, then
read
does not work and you will always have to use
readfile
after creating a text file of the data.)
To create a structure array named B, just type
B=read
You will then be prompted to type in the names of the variables; as above
we typed
weight height group
Finally you will be prompted to
cut and paste the data directly
into the MATLAB command window. To make sure you understand,
do this from the above listing and hit return. Type
viewdata(B)
to see the result.
On Your Own
Present the answers to these questions in a neatly handwritten or typed report. Attach any graphs that are relevant. Be sure that your graphs are labeled either by hand or using the xlabel, ylabel, and title options. Please limit the length of the report to 2 pages.
sum(x)
ans =
17
Use sum to find the mean of the rainfall data set, climate.rain. If you name the mean mrain, find the standard deviation by first forming the sum of squared deviations from the mean:
z1 = (climate.rain-mrain)'*(climate.rain-mrain) .
Then divide by 50-1=49 and take the square root with sqrt. (You can always check your results using std(climate.rain).)
Plot golf.score versus golf.putts with
plot(golf.putts,golf.score,'*')
and golf.score versus golf.irons. (If you previously used a subplot statement, then type clf to clear away the multiple plots.) The narrower the "football" in the plot, the stronger the relationship between the variables. Which of the two variables, golf.putts or golf.irons, is more strongly related to golf.score? Remember that you can put both plots on the same page by using the subplot command before each plot (see problem 2 above, for example). The two plots look better if you also stretch the plot window to be taller.
William.McKinley 58
Theodore.Roosevelt 60
William.Taft 72
Woodrow.Wilson 67
Warren.Harding 57
Calvin.Coolidge 60
Herbert.Hoover 90
Franklin.Roosevelt 63
Harry.Truman 88
Dwight.Eisenhower 78
John.Kennedy 46
Lyndon.Johnson 64
Richard.Nixon 81
(a) First get these numbers into a structure named pres by typing pres=read, then entering the variable names name age when prompted, and then cutting and pasting the above data into the command window. You need to hit return once or twice after pasting the data. Now type viewdata(pres) to see what you've created. Paste this last result into your report.
(b) Make a histogram (bar chart of frequencies) using the hist function. This is a small data set for histograms, so perhaps type hist(pres.age,6) to have fewer bins than the default of 10. Is the shape of the histogram approximately symmetric or skewed to the right (longer tail on right than on left)? (This is hard to see and not very trustworthy for such a small data set, but if you make a stem-and-leaf plot by hand it may be clearer.)
(c) Does this data accurately represent the life expectancies of presidents? (Hint: think about the cause of death and also about the ex-presidents not in this group: Carter, Reagan, Bush.)
(Recall the default for plot is to connect the points.)
Now do you see any patterns? One optional plot might be of interest:
plot(z.snow(1:29),z.snow(2:30),'*')
axis([-4 22 -4 22])
(The axis command just gives some space from the axes.)
This plots snowfall for one year versus that of the previous year. What
does an upward trend to the right mean here?
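If the indexing in that plot command is unclear, this sketch shows how the two index ranges pair each value with the one that follows it. The vector s is a made-up short series, not the snow data, and the names thisyear and nextyear are ours.

```matlab
s = [9 4 6 2 15];          % hypothetical yearly values, NOT the snow data

thisyear = s(1:end-1);     % years 1 through n-1
nextyear = s(2:end);       % years 2 through n

[thisyear' nextyear']      % each row pairs one year with the following year
```

In the plot command above, 1:29 and 2:30 play the same roles as 1:end-1 and 2:end do here.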