M-Lab 1 : Getting Used to MATLAB--Describing Data

This first lab will give some practice using MATLAB to manipulate data sets, calculate statistics, and make plots. To get set up, go to Getting Started. To exit MATLAB, just type quit.

The last setup command was to load the ST370 data sets:

load st370data

This command tells MATLAB to look for a file named st370data.mat in one of the directories in its path, and if found, bring its contents into memory. (By the way, if you type path, the last directory in the displayed list contains st370data.mat.) This particular .mat file just makes available the variables containing the data you are going to be using in M-Lab. (MATLAB uses the name "variables" for a variety of data structures. We prefer not to use that name in general because "variables" has a more specific meaning in statistics.)

To learn more about MATLAB, just type helpdesk. This will bring up a Netscape window with a variety of tutorials. At minimum you should click through "Getting Started."

To find out which data sets are defined at any given time in MATLAB (either before or after you add the ST370 data sets), you can use the who command, or for more detail, whos. Type who and whos to see what they produce. For example:

>> whos
  Name            Size         Bytes  Class

  actuator        1x1           7512  struct array
  airplane        1x1            688  struct array
  ans             1x50           400  double array
  cancer          1x1            648  struct array
  capac           1x1           3070  struct array
  cavendish      29x1            232  double array
  climate         1x1           8700  struct array
  college         1x1           4354  struct array
     .             .              .      .     .
     .             .              .      .     .
     .             .              .      .     .
Notice that most of the objects listed have the class "struct array." A structure array bundles several named vectors or matrices of numbers or characters into a single object, so it behaves differently from a regular MATLAB matrix. To look at a regular vector or matrix data set, just type its name. For example, type cavendish, which is a regular vector of length 29 (a matrix of dimension 29x1). Next try typing the name of the structure array climate:
>> climate

climate =

     lat: [50x1 double]
     jan: [50x1 double]
    rain: [50x1 double]
    city: {50x1 cell}
     jul: [50x1 double]
    elev: [50x1 double]
     lon: [50x1 double]
This data set contains some climate and geographical information on the 50 largest US cities. But since it is a "structure array" instead of a matrix, typing its name only lists what is in it. If you want to look at the whole structure array at once, use viewdata(climate), or viewdata(climate,n) for the first n rows:
>> viewdata(climate,5)

Obs       lat    jan    rain          city    jul   elev         lon

  1   35.0833   22.3    8.12   albuquerque   92.8   5300   -106.6500
  2   33.7500   32.6   48.61       atlanta   87.9   1034    -84.3833
  3   30.2833   38.8   31.50        austin   95.4    570    -97.7500
  4   39.2833   24.3   43.39     baltimore   87.1    155    -76.6167
  5   42.1600   22.8   43.81        boston   81.8     10    -70.6002
Notice that the component climate.city holds character strings instead of numbers (it is the {50x1 cell} entry in the listing above). The component climate.rain is rainfall in inches for these cities. You can print it on the screen by typing

climate.rain

If you do not like typing so many characters, it is easy to put climate.rain into another variable with a simpler name:

z = climate.rain
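To see how a structure array like climate is put together, here is a minimal sketch that builds a small one by hand from the first five rows of the viewdata listing above. The name mini and the choice of three fields are only for illustration; this is not how st370data.mat itself was created.

```matlab
% Build a small structure array from the first five rows of the listing.
% "mini" is a hypothetical name used only for this sketch.
mini.city = {'albuquerque'; 'atlanta'; 'austin'; 'baltimore'; 'boston'};
mini.rain = [8.12; 48.61; 31.50; 43.39; 43.81];   % rainfall in inches
mini.jan  = [22.3; 32.6; 38.8; 24.3; 22.8];       % January temperature

r = mini.rain;      % copy one component into a shorter name
mini.city{2}        % curly braces pick one string out of the cell array
fieldnames(mini)    % lists the component names: city, rain, jan
```

Numeric components such as mini.rain are ordinary column vectors, so everything that works on a regular vector also works on them.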

By now you may be getting tired of the verbose output that MATLAB gives you with every instruction. What is actually going on is that MATLAB prints the result of each statement when it finishes; a result that is not assigned to a variable is stored in a variable named ans. To suppress the printing, just add a semicolon after any command you give MATLAB:

y = climate.rain;

Now type y to confirm that y is the same as z, and that z is the same as climate.rain. For the rest of this discussion you can use z instead of climate.rain if you prefer.
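A small sketch of how printing, ans, and the semicolon interact. MATLAB stores a result in ans only when the statement does not assign it to a variable. The names w and v are arbitrary.

```matlab
3 + 4        % not assigned: stored in ans and printed
w = 3 + 4    % assigned: stored in w and printed; ans is left unchanged
v = 3 + 4;   % semicolon: stored in v, nothing is printed
```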

One of our most useful functions is stats:

>> stats(climate.rain)

N          50.0000
Mean       31.5970
Std. Dev.  13.6564

Q1         19.2900
Median     32.5550
Q3         40.4300

Min         7.1100
Max        59.7400
Range      52.6300
We could have obtained those same values by typing individually length(z) (recall that z=climate.rain), mean(z), std(z), quantile(z,.25), quantile(z,.50), quantile(z,.75), min(z), max(z), and max(z)-min(z). It's usually simpler to just use stats(z) unless you need to save the output from one of these functions as another variable.

Before going further let us take a very small data set and calculate by hand the mean, standard deviation (Std. Dev. from stats), and the median. First create the data set test as follows:

test = [9 4 6 2 15]

Just for fun, use the single quote to look at the vector's transpose:

test'

Thus test is a row vector and test' is a column vector. The mean can be calculated by summing the terms in the vector and dividing the sum by the number of items in the vector:

(test(1) + test(2) + test(3) + test(4) + test(5)) / 5

To find the median, it helps to look at a sorted list of data points:

sort(test)

The median of course is the middle value, 6, in the sorted list. We can find the standard deviation by taking the sum of the squares of each data point's difference from the mean, dividing by one less than the number of data points, and taking the square root:

sqrt(((test - mean(test)) * (test - mean(test))') / (5 - 1))

Here we took advantage of the way MATLAB handles subtraction when one operand is a vector and the other is a scalar (the scalar is subtracted from each member of the vector; this will come in handy later), and of the fact that a row vector multiplied by its transpose is equal to the sum of the squares of its elements. We could also have squared each element of the vector of deviations from the mean by using the .^ operator, which would have allowed us to calculate the standard deviation this way:

sqrt(sum((test - mean(test)) .^ 2) / (5 - 1))
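The hand calculations above can be checked against MATLAB's built-in functions. The following is a minimal sketch using the same test vector; the names n, dev, sorted, s_inner, and s_elem are only illustrative.

```matlab
test = [9 4 6 2 15];
n = length(test);

m = sum(test) / n              % mean by hand: 7.2
sorted = sort(test);
med = sorted((n + 1) / 2)      % middle value of the sorted list (n is odd): 6

dev = test - m;                          % deviations from the mean
s_inner = sqrt((dev * dev') / (n - 1));  % row vector times its transpose
s_elem  = sqrt(sum(dev .^ 2) / (n - 1)); % elementwise squares, then sum

% Compare with the built-in functions, allowing for rounding error
abs(m - mean(test)) < 1e-12
med == median(test)
abs(s_inner - std(test)) < 1e-12
abs(s_elem - std(test)) < 1e-12
```

Both versions of the standard deviation come out to about 5.0695, the square root of 102.8/4. Note that picking the single middle value only works when the number of data points is odd; with an even number, the median is the average of the two middle values.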

Now, if we subtract the mean and divide by the standard deviation, we should get a standardized data set, that is, one with mean 0 and standard deviation 1:

normalrain = (climate.rain - mean(climate.rain))/std(climate.rain)
std(normalrain)
mean(normalrain)
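The same check can be run without loading the ST370 data. This sketch uses only the first five rainfall values from the viewdata listing; rain5 and zs are hypothetical names. Because of floating-point rounding, the mean may print as a tiny number such as 1e-16 rather than exactly 0.

```matlab
rain5 = [8.12; 48.61; 31.50; 43.39; 43.81];   % first five rainfall values
zs = (rain5 - mean(rain5)) / std(rain5);      % subtract mean, divide by std

mean(zs)   % zero, up to rounding error
std(zs)    % one, up to rounding error
```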




Plotting Data

A simple plot similar to the stem-and-leaf plot is a histogram, which we produce using the hist function:

hist(climate.rain)

The default is to divide the data into 10 groups. To change the default, just pass the desired number of groups to hist() as the second argument:

hist(climate.rain, 15)

Note that this plot erases the old one! If we want multiple plots, each in its own window, we can use the figure command to open a new plot window before drawing the next plot:

hist(climate.jan,6)
figure
hist(climate.jul,6)


Alternatively, you can put the plots on the same page with the subplot command. subplot(m,n,p) divides the current figure into m rows and n columns of plots, and the next plotting command then draws in the pth one, counting across each row from left to right, starting with the top row. For example,

subplot(2,1,1)
hist(climate.jan,6)
subplot(2,1,2)
hist(climate.jul,6)
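To see how subplot numbers its panels, here is a small sketch that fills a 2-by-2 grid and writes the value of p in each panel's title. The data vector is the test vector from earlier, used here only as filler.

```matlab
test = [9 4 6 2 15];     % any small data set will do
figure
for p = 1:4
    subplot(2, 2, p)                 % select panel p in a 2-by-2 grid
    hist(test, 3)                    % draw something in that panel
    title(sprintf('p = %d', p))      % label the panel with its index
end
```

Panels 1 and 2 appear in the top row and panels 3 and 4 in the bottom row.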


It looks better if you resize the figure with your mouse to be taller than it is wide. And if you print, click page setup and activate "Match Figure Screen Size" to make the figure fill the whole screen. Alternatively, you can type

set(gcf,'PaperPositionMode','auto');

to have it print the size that you see. (Also, when you print, you will not see the individual bars of the histogram because they will print as black--one solution is to type colormap cool before printing.)

Now let us plot the minimum January temperature against latitude:

figure
plot(climate.lat,climate.jan)


Having lines connecting the data points does not make sense here; so let us clear the window and try again, this time using only '*' to mark data points:

clf
plot(climate.lat,climate.jan,'*')


If you prefer some space around the edges, you can change the limits of the x and y axes with

axis([15 55 -3 75])

We can also add labels to the axes and a title:

xlabel('Latitude')
ylabel('Min. Jan. Temperature')
title('50 Largest US Cities')


What are the two cities that do not seem to follow the linear trend? Put the names of the cities on their plot points with

text(climate.lat,climate.jan,climate.city)

For further help with plot, type help plot.

Reading in Your Own Data

If you have a short data set, the easiest way to enter it is as we did with the test data set earlier:

test = [9 4 6 2 15];

This produces a simple row vector. Typing in a short matrix is similar. Here is the example from the HelpDesk's Getting Started:

A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1]

A =

    16     3     2    13
     5    10    11     8
     9     6     7    12
     4    15    14     1
If you had A stored as a file in the directory from which you started MATLAB, you could also just type load A. This creates a matrix A in MATLAB just like the above.
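One way to try this out is to write the matrix to a plain-text file and read it back. This is only a sketch: it assumes the stored file is a plain-text table of numbers, which the text above does not specify, and the file name A_demo.txt and the variable A2 are hypothetical.

```matlab
A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1];

fname = fullfile(tempdir, 'A_demo.txt');   % hypothetical file in the temp directory
save(fname, 'A', '-ascii');                % write A as a text table of numbers
A2 = load(fname);                          % read the table back into a matrix

isequal(A, A2)                             % 1 (true): the matrix survived the round trip
delete(fname)                              % clean up the demo file
```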

Often, though, we will want to create a structure array that has names for the data columns and that also allows character values. For this we have created two functions, read and readfile. Consider a data set

     185     72     a
     176     69     a
     192     74     b
     146     66     b
where the columns are weight, height, and a character variable group. Suppose that this data set is stored in the directory from which you started MATLAB under the name data1. Then, to get this into a structure array named B, just type

B=readfile

You will then be prompted to type in the name of the file; in this case we typed

data1

You will then be prompted to type in the names of the variables; in this case we typed

weight height group

Sometimes we don't have the data stored in a file. The function read allows one to just paste data directly into the MATLAB window. This is especially useful when doing homework assignments from WebAssign.
(Important note: if you are using MATLAB version 6.5, then read does not work and you will always have to use readfile after creating a text file of the data.)

To create a structure array named B, just type

B=read

You will then be prompted to type in the names of the variables; as above we typed

weight height group

Finally you will be prompted to cut and paste the data directly into the MATLAB command window. To make sure you understand, do this from the above listing and hit return. Type viewdata(B) to see the result.

Several other useful functions are struct2file (creates a text file from a struct), save (saves data sets you have created so that they can be recalled later using load), and diary (saves all the screen text from a MATLAB session). Just use the help function to learn about these when needed. help mlab will display the most important functions used in M-Lab. Or you can always have a Netscape window open to the appendix listing of the functions.

Don't forget, to exit MATLAB, type quit.


On Your Own

Present the answers to these questions in a neatly handwritten or typed report. Attach any graphs that are relevant. Be sure that your graphs are labeled either by hand or using the xlabel, ylabel, and title options. Please limit the length of the report to 2 pages.

  1. The MATLAB function sum takes the values in a data set and returns the sum. For example, if x has values 3,6,8, then we get:

    sum(x)

    ans =
    
        17
    

    Use sum to find the mean of the rainfall data set, climate.rain. If you name the mean mrain, find the standard deviation by first forming the sum of squared deviations from the mean:

    z1 = (climate.rain-mrain)'*(climate.rain-mrain)

    Then divide by 50-1=49 and take the square root with sqrt. (You can always check your results using std(climate.rain).)

  2. Using climate.rain, create a new data set that has the rainfall measurements converted to millimeters (one inch is 25.4 millimeters). Then use stats on climate.rain and on the transformed data and compare the mean, median, and standard deviation of the two data sets. Is there a pattern? Describe the relationship between the mean, median, and standard deviation of the two data sets.

  3. Using histograms, compare the distribution of the minimum January temperatures (climate.jan) with the maximum July temperatures (climate.jul). In particular, which set of temperatures has more variability? To facilitate the comparison, force hist to use the same x axis for each histogram:

    subplot(2,1,1)
    hist(climate.jan)
    axis([0 120 -inf inf])
    subplot(2,1,2)
    hist(climate.jul)
    axis([0 120 -inf inf])


    (For better printing don't forget to type colormap cool, lengthen the window with your mouse, click page setup under print and activate "Match Figure Screen Size.") Finally, compute some basic statistics using stats. How do some of these statistics confirm your visual impression?

  4. Make a graph that plots the January temperatures versus July temperatures. Comment on any relationships that may appear between these two variables. Identify any cities that seem to have "unusual" climates relative to the others.

  5. The data set golf has the scores of 195 rounds of golf made by professional golfers and some additional variables that try to explain how their score depends on certain golf skills such as putting and chipping and iron play. What is more important for determining a good (= low) golf score, putting or iron play? Plot golf.score versus golf.putts, i.e.,

    plot(golf.putts,golf.score,'*')

    and golf.score versus golf.irons. (If you previously used a subplot statement, then type clf to clear away the multiple plots.) The narrower the "football" in the plot, the stronger the relationship between the variables. Which of the two variables, golf.putts or golf.irons, is more strongly related to golf.score? Remember that you can put both plots on the same page by using the subplot command before each plot (see problem 3 above, for example). The two plots look better if you also stretch the plot window to be taller.

  6. The ages of US presidents who have died since 1900 are as follows:
    William.McKinley   58
    Theodore.Roosevelt 60
    William.Taft       72
    Woodrow.Wilson     67
    Warren.Harding     57
    Calvin.Coolidge    60
    Herbert.Hoover     90
    Franklin.Roosevelt 63
    Harry.Truman       88
    Dwight.Eisenhower  78
    John.Kennedy       46
    Lyndon.Johnson     64
    Richard.Nixon      81
    
    (a) First get these numbers into a structure named pres by typing pres=read, then name age, and then cutting and pasting the above data into the command window. You need to hit return once or twice after pasting the data. Now type viewdata(pres) to see what you've created. Paste this last result into your report.
    (Important: If you are using version 6.5, then you cannot use pres=read. Instead, you first need to paste the data into notepad, save it as a file, say pres.txt, and then use pres=readfile. You will then be prompted for the file location of pres.txt, and then for the names.)

    (b) Make a histogram (bar chart of frequencies) using the hist function. This is a small data set for histograms; so perhaps type hist(pres.age,6) to have fewer bins than the default of 10. Is the shape of the histogram approximately symmetric or skewed to the right (longer tail on right than on left)? (This is hard to see and not very trustworthy for such a small data set, but if you make a stem-and-leaf plot by hand it may be clearer.)

    (c) Does this data accurately represent the life expectancies of presidents? (Hint: think about the cause of death and also about the ex-presidents not in this group: Carter, Reagan, Bush.)

  7. The data set cavendish contains measurements on the mean density of the earth relative to that of water. First get some descriptive statistics using the stats function. The command plot(cavendish,'*') will give a plot of points in the order in which they were produced. From the description of the data set you will find that the experimental apparatus was changed after the 6th point was taken. Create a new data set cav23 with only points 7 through 29 as follows:

    cav23 = cavendish(7:29);

    Do the mean and median change after deleting the first 6 points? Based on these data, what would you report as the best answer for the mean relative density of the earth? What is the answer based on current knowledge of the earth's interior? (Look in a description of the data set.)

  8. The data set raleigh.snow contains the annual snowfall totals (in inches) for Raleigh from the 1962-63 season through the 1991-92 season. First type z = raleigh.snow and viewdata(z) to view the data. Then use the stats function to summarize the data. Next, plot the snowfall amounts against the years the measurements were taken:

    plot(z.year,z.snow,'*')

    Do you see any patterns over time? Re-plot the data using lines to connect the points:

    plot(z.year,z.snow)

    (Recall the default for plot is to connect the points.) Now do you see any patterns? One optional plot might be of interest:

    plot(z.snow(1:29),z.snow(2:30),'*')
    axis([-4 22 -4 22])


    (The axis command just gives some space from the axes.) This plots snowfall for one year versus that of the previous year. What does an upward trend to the right mean here?