Lab 1: Getting Used to S--Describing Data

This first lab will give some practice using S to manipulate data sets, calculate statistics, and make plots. As preparation for this lab, it assumed that you are familiar with the short Introduction to S and that you know how to log on and get set up: at the UNIX prompt (%) type

add st370_info

followed by

S

Then from within Splus type attach.slab() and setup.slab().

Viewing Data

Type

ls.class()

to see a listing of the class data sets. One of them is climate. Type just its name

climate

to list it out. This data set contains some climate and geographical information on the 50 largest US cities and is in the form of a data frame. If you just wanted to know the names of the different pieces of this list, you could use

names(climate)

The component climate$rain is rainfall in inches for these cities. Note that you could print it on the screen by typing

climate$rain

If you don't like typing so many characters, it is simpler to put climate$rain into another data set with a simpler name:

climate$rain -> z

Now type z to confirm that z is the same as climate$rain. For the rest of this discussion you can use z instead of climate$rain.

One type of histogram (bar chart of frequencies) is a stem and leaf plot. The command stem(climate$rain) prints out some summary information about these data and produces a stem and leaf plot (just a histogram on its side):

stem(climate$rain)


N = 50   Median = 32.555
Quartiles = 19.71, 40.1

Decimal point is 1 place to the right of the colon

   0 : 7889
   1 : 01245588
   2 : 036999
   3 : 0011112344567789999
   4 : 013344589
   5 : 138
   6 : 0

The values 7889 in the ``0'' row stand for the four data values 7.11, 7.82, 8.12, 9.32 in climate$rain. So the stem and leaf plot has first rounded the data values to the nearest integer. The first digit tells the row (7.11 is 07, so ``0'' row) and the second digit is listed on that row to the right of the ``:''.

To get some other statistics for this data set, try stats(climate$rain) (or stats(z)). The statistics in this case have been written to the screen. But if you wanted to save the results from this command to a data set, say, zork, you would type:

stats(climate$rain)-> zork To see zork just type it:

zork


                     [,1]
               N 50.00000
            mean 31.59700
        Std.Dev. 13.65638
             min  7.11000
              Q1 18.03000
          median 32.55500
              Q3 39.12000
             max 59.74000
  missing values  0.00000

The data set zork has 9 values and you can refer to them individually using subscripts. For example, zork[2] is the second value and is the sample mean. Why would it make sense that stats(climate) will return a 9x7 matrix? The missing value code, NA, is returned for the component climate$city because this is a list of city names and is not numerical.

Before going further let's take a very small data set and calculate by hand the mean, standard deviation (Std.Dev. from stats), and the median. First create the data set test as follows:

c(9,4,6,2,15) -> test

The formulas for the mean and standard deviation are as follows:

displaymath149

displaymath150

For the median you must order the data set first from lowest to highest: 2,4,6,9,15. The median is then the middle value 6. If there is an even number of values (say we add 25 to the data set to get n=6 values), then the median is the average of the middle two, (6+9)/2=7.5. The other statistics in stats are also simple: min is the smallest value and max is the largest value. Q1 and Q3 are the 25th and 75th percentiles, which don't mean much in a small data set. In a larger data set, Q1 is the value such that 25% of the data values fall below Q1, and Q3 is the value such that 75% of the data values lie below Q3.

Manipulating Data

The most important thing to know about computations in S is that they are applied to all the values in a data set at once. So for example to take the natural logarithm of the rainfall data, one would just type log(climate$rain). Or to convert from inches to centimeters: climate$rain*2.54. The following example creates a standardized data set by subtracting off the sample mean and dividing by the standard deviation.

stats(climate$rain) -> zork
(climate$rain - zork[2])/zork[3] -> rain.standard

If you have standardized the data correctly, stats(rain.standard) should have a mean of zero and a standard deviation of one.



Plotting Data

All the plotting done in S will appear in the plot window on your screen. The S plot commands will either erase the window and draw a new plot or add to an existing plot. The S function hplot will plot a histogram of a data set. Recall that a histogram is a bar chart which gives a visual display of the frequency distribution of the data.

You might compare the distribution of the rainfall in the climate data set with the average minimum January temperature ( climate$jan) by making a histogram of each one.

hplot(climate$rain)
hplot(climate$jan)

As you type these commands, you will notice that the second plot erases the first! One solution is to divide the plot window into smaller pieces so that several plots can be drawn.

The S function set.panel(m,n) will lay out a matrix of plots with m rows and n columns. For this example we just need 2 plots, so use

set.panel(2,1)

and try plotting the histograms again. The set.panel command is also a great way to consolidate several plots to one page or to reduce their size.

What is the relationship between the minimum January temperature and latitude? To answer this question one might like to plot the January temperatures versus latitude for each of the 50 US cities. (To make typing simpler here, you might put climate into z with climate -> z, and use z in the following whereever you see climate.)

plot(climate$lat,climate$jan)

figure76

This gives a basic scatterplot of these data. Note that it is in the form plot(x,y) where x is plotted on the horizontal axis and y is on the vertical axis. Here is a more elaborate plot that adds labels (see if you can duplicate the figure pictured above):

plot(climate$lat,climate$jan,xlab="Latitude",ylab="Min. Jan. Temperature")
title("50 Largest US Cities")

Note that the title command does not erase the plot but adds something to it. A more complicated plot is to add the city names at each of the plotted points. Recall that the city names are in the data set climate$city. So all we need to do is to put these strings of text in the right places on the previous plot.

text(climate$lat,climate$jan,climate$city)

Note that this is another plot function but it just adds something to an existing plot, putting labels at the x and y locations specified by climate$lat and climate$jan.

As a finale, to show you the power of the graphics in S, see if you can figure out what the following graph will look like:

usa()
points(climate$lon,climate$lat)
text(climate$lon,climate$lat,climate$city)
title("Locations of the 50 Largest US Cities")

Now try it out!

Reading in Your Own Data

If you have a short data set, the easiest way to enter data is to type it while in S. The command read.data() -> look will prompt you for data and put the results into a new data set look. If the data is in a UNIX file, read.data("filename") -> look will give the same result.

On Your Own

Present the answers to these questions in a neatly handwritten or typed report. Attach any graphs that are relevant. Be sure that your graphs are labeled either by hand or using the xlab, ylab, and title options. Please limit the length of the report to 2 pages.

  1. The S function sum takes the values in a data set and returns the sum. For example, if x has values 3,6,8, then we get:

    sum(x)

    
    [1] 17
    

    Use this function to find the mean of the rainfall data set, climate$rain. If you name the mean mrain, next form the squared deviations from the mean by (climate$rain-mrain)**2->z1. Then using the functions sum and sqrt on z1, find the standard deviation of the data set. (You can always check your results using stats(climate$rain).)

  2. Using climate$rain, create a new data set that has the rainfall measurements converted to millimeters. Then type stats(climate$rain) and stats of the transformed data and compare the mean, median, and standard deviation of the two data sets. Is there a pattern?

  3. Using histograms, compare the distribution of the minimum January temperatures (climate$jan) with the maximum July temperatures (climate$jul). In particular, which set of temperatures has more variability? To facilitate the comparison, force hplot to use the same X axis and the same number of bars for each histogram:

    hplot(climate$jan,limits=c(0,120),nclass=20)

    followed by the same command with climate$jul replacing climate$jan. Also it helps to put the histograms on the same plot by first using set.panel(2,1). Finally, type stats(climate$jan) and stats(climate$jul).
    How do some of the statistics from using stats confirm your visual impression?

  4. Make a graph that plots the January temperatures versus July temperatures. Comment on any relationships that may appear between these two variables. Identify any cities that seem to have ``unusual'' climates relative to the others.

  5. The data set college$inc contains the per capita income of 15 US states. Briefly analyze this data set and describe several key features.

  6. The ages of US presidents who have died since 1900 are as follows:

    NameAge
    William McKinley58
    Theodore Roosevelt60
    William Taft72
    Woodrow Wilson67
    Warren Harding57
    Calvin Coolidge60
    Herbert Hoover90
    Franklin Roosevelt63
    Harry Truman88
    Dwight Eisenhower78
    John Kennedy46
    Lyndon Johnson64


    First get these numbers into a data set by typing

    read.data() -> presidents

    and then entering the numbers. Hit return an extra time at the end to close the data entry process.

    Alternatively, to learn how to use read.table,

    If you type z, you will see that z is a data frame with the ages in z$V2. To be a little spiffier try typing
    read.table("pres.data",row.names=NULL,col.names=c("name","age"))->z
    The names of the presidents are in z$name and the ages are in z$age.

    Now do the following, being sure to use the correct name for the dataset you created, either presidents, z$V2, or z$age.

    1. Make a stem and leaf plot using the stem function. Order the data from smallest to largest (by hand or use sort(presidents)) and explain how the median and the quartiles were actually calculated for this data set. (These are printed output of stem. The median is explained in the lab, but you are to figure out how the quartiles were calculated.)
    2. Type bplot(presidents) and try to figure out what is plotted by looking at the stem output from a). A bplot (usually called a boxplot) is a useful way to represent data, and we will explore it more fully in Lab 2.
    3. Make a histogram (bar chart of frequencies) using the hplot function. Is the shape of the histogram approximately symmetric or skewed to the right (longer tail on right than on left)?
    4. Does this data accurately represent the life expectancies of presidents? (Hint: think about the cause of death and also about the ex-presidents not in this group: Nixon, Carter, Reagan, Bush.)

  7. The data set cavendish contains measurements on the mean density of the earth relative to that of water. First type ex(cavendish) to read about the experiment which produced the data. Then get some descriptive statistics using the stats or stem function. plot(cavendish) will give a plot of points in the order in which they were produced. From the description of the data set you will find that the experimental apparatus was changed after the 6th point was taken. Create a new data set cav.23 with only points 7 through 29 as follows

    cavendish[7:29] -> cav.23

    Do the mean and median change after deleting the first 6 points? Based on these data, what would you report as the best answer for the mean relative density of the earth? What is the answer based on current knowledge of the earth's interior? (look in ex(cavendish))

  8. The data set raleigh.snow contains the annual snowfall totals (in inches) for Raleigh from the 1962-63 season through the 1992-92 season. First type raleigh.snow -> z and z to view the data. Then type stats(z$snow) to summarize the data. Next type set.panel(2,1) and plot(z$year,z$snow). Do you see any patterns over time? Type lines(z$year,z$snow) to connect the points. Now do you see any patterns? (One final optional plot might be of interest:

    plot(z$snow[1:29],z$snow[2:30])

    plots one year versus the previous year. What does an upward trend to the right mean here?