This first lab will give some practice using S to manipulate data
sets, calculate statistics, and make plots. As preparation for
this lab, it assumed that you are familiar with the short Introduction
to S and that you know how to log on and get set up:
at the UNIX prompt
(%)
type
add st370_info
followed by
S
Then from within Splus type
attach.slab()
and setup.slab().
Type
ls.class()
to see a listing of the class data sets.
One of them is
climate.
Type just its name
climate
to list it out.
This data set contains some climate and geographical information on
the 50 largest US cities and is in the form of a data frame. If you just
wanted to know the names of the different pieces of this list, you
could use
names(climate)
The component climate$rain
is rainfall in inches for these cities. Note that you
could print it on the screen by typing
climate$rain
If you don't like typing so many characters, it is simpler to put
climate$rain into
another data set with a simpler name:
climate$rain ->
z
Now type z to
confirm that z is
the same as climate$rain. For the rest of
this discussion you can use z instead of climate$rain.
One type of histogram (bar chart of frequencies) is a stem and leaf plot.
The command stem(climate$rain) prints out some summary
information about these data and produces a stem and leaf plot (just a
histogram on its side):
stem(climate$rain)
N = 50 Median = 32.555 Quartiles = 19.71, 40.1 Decimal point is 1 place to the right of the colon 0 : 7889 1 : 01245588 2 : 036999 3 : 0011112344567789999 4 : 013344589 5 : 138 6 : 0The values 7889 in the ``0'' row stand for the four data values 7.11, 7.82, 8.12, 9.32 in climate$rain. So the stem and leaf plot has first rounded the data values to the nearest integer. The first digit tells the row (7.11 is 07, so ``0'' row) and the second digit is listed on that row to the right of the ``:''.
To get some other statistics for this data set, try stats(climate$rain) (or stats(z)).
The statistics in this case have been written
to the screen. But if you wanted to save the results from this
command to a data set, say,
zork,
you would type:
stats(climate$rain)-> zork
To see zork just type it:
zork
[,1]
N 50.00000
mean 31.59700
Std.Dev. 13.65638
min 7.11000
Q1 18.03000
median 32.55500
Q3 39.12000
max 59.74000
missing values 0.00000
The data set zork has 9 values and you can refer to them
individually using subscripts. For example, zork[2] is the
second value and is the sample mean. Why would it make sense that
stats(climate) will return a 9x7 matrix? The missing
value code, NA, is returned for the component climate$city
because this is a list of city names and is not numerical.
Before going further let's take a very small data set and calculate by hand the mean, standard deviation (Std.Dev. from stats), and the median. First create the data set test as follows:
c(9,4,6,2,15) -> test
The formulas for the mean and standard deviation are as follows:
For the median you must order the data set first from lowest to highest: 2,4,6,9,15. The median is then the middle value 6. If there is an even number of values (say we add 25 to the data set to get n=6 values), then the median is the average of the middle two, (6+9)/2=7.5. The other statistics in stats are also simple: min is the smallest value and max is the largest value. Q1 and Q3 are the 25th and 75th percentiles, which don't mean much in a small data set. In a larger data set, Q1 is the value such that 25% of the data values fall below Q1, and Q3 is the value such that 75% of the data values lie below Q3.
The most important thing to know about computations in S is that they are applied to all the values in a data set at once. So for example to take the natural logarithm of the rainfall data, one would just type log(climate$rain). Or to convert from inches to centimeters: climate$rain*2.54. The following example creates a standardized data set by subtracting off the sample mean and dividing by the standard deviation.
stats(climate$rain) -> zork
(climate$rain - zork[2])/zork[3] -> rain.standard
If you have standardized the data correctly, stats(rain.standard) should have a mean of zero and a standard deviation of one.
All the plotting done in S will appear in the plot window on your screen. The S plot commands will either erase the window and draw a new plot or add to an existing plot. The S function hplot will plot a histogram of a data set. Recall that a histogram is a bar chart which gives a visual display of the frequency distribution of the data.
You might compare the distribution of the rainfall in the climate data set with the average minimum January temperature ( climate$jan) by making a histogram of each one.
hplot(climate$rain)
hplot(climate$jan)
As you type these commands, you will notice that the second plot erases the first! One solution is to divide the plot window into smaller pieces so that several plots can be drawn.
The S function set.panel(m,n) will lay out a matrix of plots
with m rows and n columns. For this example we just need 2 plots,
so use
set.panel(2,1)
and try plotting the histograms again.
The set.panel command is also a great way to consolidate
several plots to one page or to reduce their size.
What is the relationship between the minimum January temperature and latitude? To answer this question one might like to plot the January temperatures versus latitude for each of the 50 US cities. (To make typing simpler here, you might put climate into z with climate -> z, and use z in the following whereever you see climate.)
plot(climate$lat,climate$jan)
This gives a basic scatterplot of these data. Note that it is in the form plot(x,y) where x is plotted on the horizontal axis and y is on the vertical axis. Here is a more elaborate plot that adds labels (see if you can duplicate the figure pictured above):
plot(climate$lat,climate$jan,xlab="Latitude",ylab="Min.
Jan. Temperature")
title("50 Largest US Cities")
Note that the title command does not erase the plot but adds something to it. A more complicated plot is to add the city names at each of the plotted points. Recall that the city names are in the data set climate$city. So all we need to do is to put these strings of text in the right places on the previous plot.
text(climate$lat,climate$jan,climate$city)
Note that this is another plot function but it just adds something to an existing plot, putting labels at the x and y locations specified by climate$lat and climate$jan.
As a finale, to show you the power of the graphics in S, see if you can figure out what the following graph will look like:
usa()
points(climate$lon,climate$lat)
text(climate$lon,climate$lat,climate$city)
title("Locations of the 50 Largest US Cities")
Now try it out!
If you have a short data set, the easiest way to enter data is to
type it while in S. The command
read.data() -> look
will prompt you for data and put the results into a new data set
look.
If the data is in a UNIX file,
read.data("filename") -> look
will give the same result.
On Your Own
Present the answers to these questions in a neatly handwritten or typed report. Attach any graphs that are relevant. Be sure that your graphs are labeled either by hand or using the xlab, ylab, and title options. Please limit the length of the report to 2 pages.
sum(x)
[1] 17
Use this function to find the mean of the rainfall data set, climate$rain. If you name the mean mrain, next form the squared deviations from the mean by (climate$rain-mrain)**2->z1. Then using the functions sum and sqrt on z1, find the standard deviation of the data set. (You can always check your results using stats(climate$rain).)
hplot(climate$jan,limits=c(0,120),nclass=20)
followed by the same command with climate$jul replacing
climate$jan.
Also it helps to put the histograms on the same plot
by first using set.panel(2,1).
Finally, type stats(climate$jan) and
stats(climate$jul).
How do some of the statistics from using stats confirm your
visual impression?
| Name | Age |
|---|---|
| William McKinley | 58 |
| Theodore Roosevelt | 60 |
| William Taft | 72 |
| Woodrow Wilson | 67 |
| Warren Harding | 57 |
| Calvin Coolidge | 60 |
| Herbert Hoover | 90 |
| Franklin Roosevelt | 63 |
| Harry Truman | 88 |
| Dwight Eisenhower | 78 |
| John Kennedy | 46 |
| Lyndon Johnson | 64 |
read.data() -> presidents
and then entering the numbers. Hit return an extra time at the end to close the data entry process.
Alternatively, to learn how to use read.table,
If you type
z,
you will see that
z
is a data frame with the ages in
z$V2.
To be a little spiffier try typing
read.table("pres.data",row.names=NULL,col.names=c("name","age"))->z
The names of the presidents are in
z$name
and the ages are in
z$age.
Now do the following, being sure to use the correct name for the dataset you created, either presidents, z$V2, or z$age.
cavendish[7:29] -> cav.23
Do the mean and median change after deleting the first 6 points? Based on these data, what would you report as the best answer for the mean relative density of the earth? What is the answer based on current knowledge of the earth's interior? (look in ex(cavendish))
plot(z$snow[1:29],z$snow[2:30])
plots one year versus the previous year. What does an upward trend to the right mean here?