What is S?

S (or Splus, its commercial name) is a statistical program for graphics and data analysis that runs under the UNIX operating system. This introduction gives you some basic information about how it works. We suggest that you skim through it quickly and refer back to it as needed. The labs that follow serve both as tutorials in the use of S and as learning tools for the statistical analysis of data.

Two important hints as you get started:

  1. Whenever you enter a command you must hit the return key in order to get the attention of the system.
  2. UNIX and S are case sensitive, so capital letters are distinguished from lower case letters. For example, if the correct name of a data set is zork, typing in ZORK or Zork will not work.

Running S

At the UNIX prompt (we will assume that it is just a %) type in

% S

It takes a few seconds for the S program to be loaded. Once it is set up, S will print something like:


S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 5.1 Release 1 for Sun SPARC, SunOS 5.5 : 1999
Working data will be in .Data
>

To quit S type

q()

Setting up S for lab work

Once S is started, it will give you a prompt that is a greater than sign (>). You will need to set up the S session so that the special class data sets and functions are available. This involves typing two commands. (At this point, don't worry about what these mean.)

attach.slab()
setup.slab()

After the second command, S will try to open up a plot window, give some tips about using S, and check how much file space you have used up. Once everything is done, you will get the class prompt Slab > to remind you that you have access to the S-Lab functions and data sets. However, we will only use > as the prompt indicator in these notes (this is standard in most S documentation).



Waiting for Godot? Where is the off switch?

Throughout this guide it will be assumed that you know to press the return (or enter) key after you have typed in the full command. If you are waiting for S to calculate or to graph something, it is often a good idea to hit the return key just to make sure it is not waiting for this return keystroke. Usually typing an extra return will not cause any problems, and it is a good way to be sure S is still listening to you. If you hit a return by mistake or try to type beyond the end of a line, S usually knows that you are not finished. It will ask for more by giving you a plus sign (+) as the prompt.

If at any stage you want S to ignore what you have typed, hold down the control key and press the C key (i.e. type "control C"). This will halt anything unpleasant that might be happening and bring you back to the normal Slab prompt.



S commands (general)

S commands are always followed by open and closed parentheses:

command(thing1, thing2, ...)

Sometimes the command needs some numbers or the name of a data set from you, and that information is given inside the parentheses. For example,


 sqrt(2)                  We typed this and hit return.

[1] 1.414214 The computer returns the result.

Note that in the above example we did not save the result of taking the square root, so it was just printed to the screen. To save the result of a command into a data set, use the operator -> (a minus sign followed by a greater than sign, with no space in between). So if you are issuing commands and saving the results to data sets, the general format looks like:

command(thing1, thing2, ...) -> dataset

If you don't mind thinking backwards, you can type the output data set first, the operator <- second, and then the S command. S aficionados tend to do it this way. For example:


sqrt(x) -> data4       Puts the square root of x in data4.

data4 <- sqrt(x) Exactly the same as the previous command.

The S language is object-oriented. This means that the results of applying a command to a data set may depend on the type of data. You should keep this in mind if you ever get strange results from using a familiar command. This is especially true of the plot command as it will make different types of graphs depending on what you give it.


S data sets (general)

Everything that is not a command is a data set. A data set can have any type of name, but be sure that it does not start with a numeral or contain an underscore ( _ ). For example, 3dat and dat_3 are not allowed, but dat.3, dat.four.3, and dat.3.four are fine.

If you ever want to list a data set on the screen just type its name. For example, suppose that the numbers 10,2,3,5,8 have been put in a data set called test1. To list test1 just type it:

test1


[1] 10  2  3  5  8
It is helpful to think of test1 as a column vector with five rows (but to save space S prints it in a row). To refer to the fourth member of this vector we just use subscripts:

test1[4]

[1] 5
Suppose that we also have a second data set test2

test2

[1]  9  4  6  7 12
Here is a way to combine test1 and test2 into a matrix:

cbind(test1,test2)->test3
test3

     test1 test2
[1,]    10     9
[2,]     2     4
[3,]     3     6
[4,]     5     7
[5,]     8    12
cbind stands for "column bind" and creates a matrix data set. You can refer to elements in test3 using two subscripts. For example, the number in the fourth row and second column (7) is test3[4,2]. You can refer to the fourth row by typing test3[4,].
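The subscripting rules above can be tried on a small example. The following is only a sketch, written in plain S syntax (it should also run unchanged in R): it rebuilds test1, test2, and test3 from the values listed above and pulls out one element, one row, and one column. The column form test3[,2] is not mentioned in the text; it is included here as the natural counterpart of the row form.

```r
# Rebuild the two example vectors and bind them as columns.
c(10, 2, 3, 5, 8) -> test1
c(9, 4, 6, 7, 12) -> test2
cbind(test1, test2) -> test3

test1[4]      # fourth member of the vector: 5
test3[4, 2]   # fourth row, second column: 7
test3[4, ]    # the whole fourth row: 5 and 7
test3[, 2]    # the whole second column, i.e. the values of test2
```

Leaving a subscript empty means "everything in that direction", which is why test3[4,] gives a full row.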

Data sets that have several pieces are called lists, and many of the data sets that are used in the lab will be in this form. For example, climate is a class data set that has various climate information for the 50 largest US cities. The individual components in the list are specified by a dollar sign ($) followed by the name of the component. For example climate$rain refers to the 50 precipitation values for the cities. climate$elev refers to the 50 elevations for the cities. Note that the component climate$city contains the cities' names and thus is not a numerical data set. You can refer to the third city by typing

climate$city[3]

[1] "austin"
The climate data set is actually a special type of list called a data frame, which has characteristics of a list but is like a matrix as well. More will be said about data frames in a later section.

Reading in data

Suppose that for some sports team you have the heights and weights of the players in two data sets: height and weight. These two data sets may be just sitting in the class directory or you might have actually typed them in using the read.data() function. Let's actually do this.


read.data() -> height We typed this and hit return.


1: 72   The computer returned 1: and we typed 72 and return.


2: 69   The computer returned 2: and we typed 69 and return.


3: 74   The computer returned 3: and we typed 74 and return.


4: 66   The computer returned 4: and we typed 66 and return.


5:      The computer returned 5: and we hit return.
Similarly we can create weight:

read.data() -> weight

1: 185
2: 176
3: 192
4: 146
5:
Another way to create the same data set is with the combine function c():

c(185,176,192,146) -> weight

Data can also be read in from a file. The only difference is that the UNIX file name needs to be specified. For example, if the weights were in a file called w.data, then use read.data('w.data')-> weight. Note the quote marks around the file name. If you are familiar with a text editor, reading from a file is often an easier way to create larger data sets.



Creating a data frame

In the example given above the body measurements have been made on the same players. Because of this natural pairing it is useful to bundle the two data sets together in a table or matrix format. The columns should be heights and weights, and the rows should represent the values for each individual. In S this sort of format is called a data frame and is very useful for analyzing several variables at once. Here is how to make a data frame called team that combines the previously entered height and weight data sets.

data.frame(weight,height) -> team

To view this new data set just type its name:

team

   weight height
1    185     72
2    176     69
3    192     74
4    146     66
Many of the data sets used in these labs will be data frames, and the first and second labs will give you some practice in using this format. Most of the time a data frame in S acts like a matrix. The main difference is that a data frame can have columns of character (text) information. For example, list out the data set climate$city, and you will see that city is a text variable.
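Because a data frame behaves both like a list and like a matrix, its pieces can be reached in either style. The lines below are a minimal sketch in plain S syntax (they should also run in R), using the height and weight values entered earlier; the comments give the values one would expect from those numbers.

```r
# Re-create the team data frame from the values entered earlier.
c(72, 69, 74, 66) -> height
c(185, 176, 192, 146) -> weight
data.frame(weight, height) -> team

team$height      # one column by name, list style
team[2, ]        # the second player, matrix style
team[3, 1]       # third row, first column (weight): 192
names(team)      # the column names: "weight" "height"
```

The dollar sign form is usually the most readable when you want a single variable; the two-subscript form is handy when you want particular rows.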

One useful function is names. It will just tell you the names of the columns without listing out the whole data set. For example,

names(climate)

[1] "lat"  "jan"  "rain" "city" "jul"  "elev" "lon"
You may also read in data from a file and create a data frame all in the same step. Suppose that in the directory from which you started S you have a file called p.data containing
    185     72     a
    176     69     a
    192     74     b
    146     66     b
where "a" and "b" might refer to two different groups of people. Then

read.table("p.data",col.names=c("weight","height","group")) -> team2

will create a data frame with two columns of numbers and one column of characters:

team2

  weight height group
1    185     72     a
2    176     69     a
3    192     74     b
4    146     66     b



Creating a list

For the team data set a table format was appropriate because the rows of the table would also make sense. There are many examples of data sets where the information does not fit together nicely as a table. A list is a data set type that can handle arbitrary collections of data. The data set drill.bit.list is a simple example of the results for two independent experiments. Six drill bits of one brand were tested and eight of another.

drill.bit.list

$besley:
[1] 346 375 442 249 280 428

$cleveland:
[1]  63 124 262  92 192 122 134 128
Assume that the data values for the Besley and Cleveland bits were read in separately and stored in two data sets, say b.dat and c.dat. Here is how to create drill.bit.list:

list() -> drill.bit.list
b.dat -> drill.bit.list$besley
c.dat -> drill.bit.list$cleveland

You are not limited to a certain number of components for a list and can add more components as you work. Although lists are useful for holding data sets, they are also important for organizing the results of a statistical analysis or a complicated plot, and with a little practice they are very easy to work with. Lists are an excellent way to organize your work on a specific homework problem under a single name.
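As a sketch of how a list grows as you work, the lines below rebuild drill.bit.list from the values in the listing above and then attach one more component. They are written in plain S syntax and should also run in R. The component name n.tested is hypothetical, chosen only for this illustration.

```r
# Values copied from the drill.bit.list listing above.
c(346, 375, 442, 249, 280, 428) -> b.dat
c(63, 124, 262, 92, 192, 122, 134, 128) -> c.dat

list() -> drill.bit.list
b.dat -> drill.bit.list$besley
c.dat -> drill.bit.list$cleveland

names(drill.bit.list)              # the component names so far
length(drill.bit.list$cleveland)   # number of Cleveland values: 8

# A new component can be added at any time. "n.tested" is a
# hypothetical name used only for this example.
c(length(b.dat), length(c.dat)) -> drill.bit.list$n.tested
drill.bit.list$n.tested            # 6 and 8
```

Note that the components do not need to have the same length, which is exactly what a matrix or data frame would not allow.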



What is needed for a command

Many commands in S do not need any information or data sets to work. For example, attach.slab and q (quit) do not require an argument. Other commands, such as sqrt, would not make sense without specifying a data set or a number. Finally, there are commands that will take different amounts of information depending on what is needed. For example, the plot function can take two data sets, say x and y, and produce a scatterplot of these values (plot(x,y)). Specifying only one data set, as in plot(zork), will result in the data values being plotted against equally spaced x values. Clearly the plot function has been designed to make some default choices based on what it is given. One merit of S is that the default choices are usually ones a human might want. Also, it is easy to override the default choices when they are inappropriate. For example, the plot function defaults to using the data set names for the X and Y axis labels. To change these, you just need to know the names of the two parameters (use the help or args commands) and then supply your choices. The name of the label used for the Y axis in the plot function is ylab, so

plot(year,zork,ylab='Number of bats found in Cox hall')

will indicate a different label for the Y axis.

Another way of specifying additional arguments to a command is just by the order in which S is expecting them. In this style, omitted arguments are skipped over using commas. See the remarks about the seq command in the following section on generating data for an example of this syntax. This second method, based on order, can save typing but is harder for beginners. However, there are some exceptions for common arguments. It is easy to remember that the plot function's first two arguments need to be the x and y data sets. They could be given out of order by referring to their names: plot(y=zork, x=time), although this seems silly compared to just plot(time, zork)!
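The two styles can be compared side by side with seq. This is a sketch under one assumption: that the fourth argument of seq is named length, as the S documentation lists it (check with args(seq) on your system; R also accepts this spelling). All three calls below ask for 100 equally spaced points from -1 to 2.

```r
# Three equivalent calls: 100 equally spaced points from -1 to 2.
seq(-1, 2, , 100) -> g1                      # by position, third argument skipped
seq(-1, 2, length = 100) -> g2               # fourth argument given by name
seq(length = 100, to = 2, from = -1) -> g3   # all by name, in any order

length(g1)          # 100
max(abs(g1 - g2))   # 0, the two grids are identical
```

Naming an argument costs a little typing but means you never have to count commas.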

Help on help

In the current version 5.1 of S the help command brings up a Netscape window that is very slow, so at the moment we don't recommend its use. If you do want to use it, the best way to access it is to type help.start().

An alternative to help is the function ex (ex for example), which gives examples associated with a command rather than the full help output. Also, ex(plot) will print directly to the screen rather than in a help window.



Some basic S commands

Given below are some basic S commands, a brief explanation, and some examples of their use. As part of the labs, you will learn more about S, but these commands represent a core of what you will need to know. With these basic tools one can do an amazing amount of graphics and data analysis.



Housekeeping and IO

help.start()
brings up Netscape help window

ls()
lists the data sets in your own directory

ls.class()
lists the data sets in the common class directory

q()
quits S and returns to UNIX

c(thing1, thing2, ...)
combines several values or data sets into one (separate these by commas when you type the command)
For example, create a data set of 3 numbers called test:
c(2,3,5) -> test

read.data()
interactively reads in data that you type in. To stop the input, just hit return without typing a number. This is the same function as scan; see Section 0.7 for an example.
For example, read a data set by typing it in from the keyboard and call it garbage:
read.data() -> garbage

read.data("filename")
reads data from filename
For example, if an ascii file data1 is in the UNIX subdirectory where Splus was started, then read.data("data1") -> zork puts the data in zork. Note that it reads each row and puts the data into one row vector. Thus if data1 is

72  185
69  176
74  192
66  146
then zork is the vector

72,185,69,176,74,192,66,146
To read in data that are characters, set the option text to true (T):

read.data(text=T) -> players.names

Here the command will prompt you the same way, but what is typed in will be interpreted as text instead of numbers (even if you type in numeric characters). Of course you can also read character data from a file:

read.data("roster",text=T) -> players.names



write.data(data4,"filename")
writes the S dataset data4 to a file called filename in the UNIX directory from which S was started.



rm(dataset)
removes a dataset from your personal directory
For example, remove the data set garbage from your personal directory:
rm(garbage)
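The read.data("filename") entry above notes that a file with several columns comes back as one long vector. One possible way to recover the rows and columns, sketched here, is the standard S function matrix. It is not one of the class functions described in this guide, and the vector is typed in directly so that no file is needed. The lines are plain S syntax and should also run in R.

```r
# The vector that reading the four-line file data1 would produce.
c(72, 185, 69, 176, 74, 192, 66, 146) -> zork

# Fill a two-column matrix row by row, matching the layout of the file.
matrix(zork, ncol = 2, byrow = T) -> zork.mat

zork.mat[, 1]   # first column of the file: 72 69 74 66
zork.mat[, 2]   # second column of the file: 185 176 192 146
```

The byrow=T argument matters: without it the matrix is filled column by column and the two variables end up scrambled.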

Arithmetic

x*y
multiplies x by y

x/y
divides x by y

x**y
raises x to the y power

x+y, x-y
adds and subtracts x and y.

cos(x), sin(x), sqrt(x), log(x), exp(x), x**2 etc.
applies the function to all the values in x

Examples of related operations:

Subtract 4.2 from the numbers in test and over-write this data set with the new results:

test-4.2 -> test

Square the numbers in the data set test and save the results to the new data set test.squared (note the use of a period to make the name readable):

test**2 -> test.squared
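The arithmetic operators work on a whole data set at once, one element at a time. A short sketch in plain S syntax (it should also run in R), starting again from the three numbers 2, 3, 5 so that the results are easy to check by hand:

```r
c(2, 3, 5) -> test

test * 2            # 4 6 10, every element is doubled
test + c(1, 1, 1)   # 3 4 6, element-by-element addition
test**2 -> test.squared
test.squared        # 4 9 25

# Commands can be nested: the square root undoes the squaring.
sqrt(test.squared)  # 2 3 5
```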



Statistics and Graphics

stats(dataset)
calculates several useful statistics for a dataset

For example, get the mean and standard deviation of the dataset test:

stats(test)

stem(dataset)
prints out on the screen a histogram-like summary of the dataset (stem and leaf plot)

For example, make a stem and leaf plot of the simulated data in random.numbers.

stem(random.numbers)

plot(dataset1,dataset2)
plots dataset1 versus dataset2 in the plot window

lplot(dataset1,dataset2)
plots dataset1 versus dataset2 in the plot window when either dataset1 or dataset2 is a character variable

lplot(d1,d2,d3)
plots d1 versus d2 using the values in d3 as labels

lines(dataset1,dataset2)
like plot but adds lines to the current plot

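For example, plot the sine function at several points. The ylim argument sets the Y axis range to 0 to 1 so that the cosine curve added afterwards, whose values lie between about 0.83 and 1, falls inside the plot:

c(0,.1,.2,.3,.4,.5,.6) -> x

sin(x) -> y

plot(x,y,ylim=c(0,1))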

Then add a cosine curve to the plot

cos(x) -> y2

lines(x,y2)

Another way to add the cosine is to combine the two steps

lines(x,cos(x))

points(dataset1,dataset2)

like plot but adds points to the current plot
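The three commands plot, lines, and points are often used together. The sketch below, in plain S syntax (it should also run in R), draws sine and cosine over one full period on a finer grid and then marks a few points. The arguments type and lty are standard graphical arguments that this guide has not introduced; they are used here only to make the two curves easy to tell apart.

```r
# A finer grid over one full period gives smooth curves.
seq(0, 2 * pi, , 50) -> x

plot(x, sin(x), type = "l", ylab = "sin and cos")  # sine as a solid line
lines(x, cos(x), lty = 2)                          # cosine as a dashed line

# Mark seven points near the start of the sine curve.
c(0, .1, .2, .3, .4, .5, .6) -> x.few
points(x.few, sin(x.few))
```

The first command sets up the axes, so anything added later with lines or points is only visible if it falls inside the range chosen by plot.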

Generating Data Sets

seq(a,b,delta)
generates a sequence of equally spaced points from a to b that are delta apart. (There will be roughly (b-a)/delta of these.)

For example, generate a grid of equally spaced points in the range -1 to 2 with a spacing of .01:

seq(-1,2,.01) -> x.grid

seq(a,b,,n)
generates a sequence of n equally spaced points from a to b.
For example, generate 100 equally spaced points in the range -1 to 2:

seq(-1,2,,100) -> x.grid

(note the double commas)

1:12 -> x
creates a vector x with the first 12 integers

runif(n)
generates n random numbers between 0 and 1

For example, generate 100 random numbers between 0 and 1 and put them in random.numbers:
runif(100) -> random.numbers
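A quick way to check what these generating commands produce is to count the values with length. The sketch below is plain S syntax and should also run in R. The last part, which rescales the random numbers to the range 5 to 15, is an illustration added here and uses only the arithmetic described earlier.

```r
# Sequence by spacing: both end points are included,
# so there are 3/.01 + 1 = 301 values.
seq(-1, 2, .01) -> x.grid
length(x.grid)    # 301

# Sequence by count: exactly the number asked for.
seq(-1, 2, , 100) -> x.grid
length(x.grid)    # 100

# Random numbers between 5 and 15: stretch and shift the 0 to 1 values.
5 + 10 * runif(100) -> random.numbers
min(random.numbers)   # never below 5
max(random.numbers)   # never above 15
```

Because the numbers are random, min and max will print different values each time the commands are run.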