R is an open source
programming language and software environment for statistical computing
and graphics. The R language is widely used among statisticians for
developing statistical software and data analysis. R was created by
Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand, and is now developed by the R Development Core Team.
To install R on your computer, download it from the Comprehensive R
Archive Network (http://cran.r-project.org). One option that you may
want to explore is RStudio (http://rstudio.org), which is a very nice
front-end to R and works on all major platforms.
The R system is divided into two conceptual parts: the base R system that you download from CRAN, and everything else (add-on packages).
R objects can have attributes: names or dimnames, dimensions (e.g. matrices, arrays), class, length, or any other user-defined attributes/metadata. The attributes of an object can be accessed using the attributes() function.
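As a small sketch of how attributes look in practice, the snippet below inspects a matrix and attaches a user-defined attribute. The attribute name "note" and its value are made up for illustration.

```r
# A matrix is a vector with a "dim" attribute
m <- matrix(1:6, nrow = 2)
attributes(m)            # a list containing $dim (2 3)

# Attach a hypothetical user-defined attribute called "note"
attr(m, "note") <- "measurements from batch A"
names(attributes(m))     # "dim" "note"
```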
On Windows, go to Start > Programs > R > R. This will open the R
prompt, which we will use to test basic statements. When you enter an
expression at the R prompt and press Enter, R evaluates that expression
and displays the result (if there is one).
As you can see, the usual rules of operator
precedence are applied here. Notice the odd “[1]” that accompanies
each returned value. In R, any number that you enter in the console is
interpreted as a vector. A vector is an ordered collection of numbers.
The “[1]” means that the index of the first item displayed in the row
is 1. In each of these cases, there is also only one element in the
vector.
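The expressions below are a minimal illustration of precedence and of the “[1]” prefix; the particular numbers are arbitrary.

```r
# Multiplication binds tighter than addition
a <- 1 + 2 * 3        # 7
b <- (1 + 2) * 3      # 9

print(a)              # prints: [1] 7
print(b)              # prints: [1] 9

# A single number is a vector with exactly one element
length(a)             # 1
```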
The <- symbol is the assignment
operator. When a complete expression is entered at the prompt, it is
evaluated and the result of the evaluated expression is returned. The
result may or may not be auto-printed. The [1] indicates the index of
the element in the vector. The # character starts a comment; anything
to the right of it is ignored.
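A short sketch of assignment, auto-printing, and comments follows. The variable names x and msg are only examples.

```r
x <- 5          # assignment: nothing is printed
x               # typing the name at the prompt auto-prints: [1] 5
print(x)        # explicit printing gives the same output

msg <- "hello"  # a character vector of length 1
# This whole line is a comment and is ignored by R
```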
You can also assign an object on the
left to a variable on the right using the -> operator.
The = operator assigns the value of the
right-hand side to the variable on the left-hand side, while == tests
two values for equality.
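For example, with arbitrary variable names y and z:

```r
10 -> y       # rightward assignment: y is now 10
z = 10        # = also assigns at the top level
same <- (y == z)
same          # [1] TRUE
```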
The : operator can be
used to create a vector holding a sequence of integers.
The numbers in the brackets on the
left-hand side of the results indicate the index of the first element
shown in each row.
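A quick sketch with an arbitrary range is shown below. How many elements fit on each printed row depends on the width of your console, so the bracketed indices you see may differ.

```r
s <- 10:40
s           # long vectors wrap; each row starts with the index of its first element
class(s)    # "integer"
length(s)   # 31
```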
The c() function can be used to create
vectors of objects.
You can also use the vector() function to
specify the vector type and length; it will create a vector of that
length filled with the type's default value (for example, 0 for numeric).
Variables can be used when creating
vectors; their values replace their names.
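The following sketch pulls these three ideas together. All variable names here are illustrative.

```r
nums  <- c(0.5, 0.6)           # numeric
flags <- c(TRUE, FALSE)        # logical
words <- c("a", "b", "c")      # character

# vector() takes a type and a length
empty <- vector("numeric", length = 10)   # ten zeros

# Variables are replaced by their values inside c()
p <- 2
q <- 3
v <- c(p, q, 10)               # 2 3 10
```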
What about mixing different kinds of objects in a vector? Implicit
coercion occurs so that every element in the vector is of the same class
(the least common denominator).
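A few illustrative cases of implicit coercion:

```r
m1 <- c(1.7, "a")     # number + string  -> character: "1.7" "a"
m2 <- c(TRUE, 2)      # logical + number -> numeric:   1 2
m3 <- c("a", TRUE)    # string + logical -> character: "a" "TRUE"

class(m1)   # "character"
class(m2)   # "numeric"
class(m3)   # "character"
```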
You can refer to a specific member
by its location,
For your reference: [ ] always returns an object of the same class as the original and can be used to select more than one element (retrieving a single matrix element is an exception to this rule; it returns a vector).
or to several members using a range of locations or a logical
expression,
or to specific members by passing their
indices as an integer vector (they will be retrieved in the order of
reference, not by their order in the original vector).
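The indexing styles described above can be sketched on a small example vector (the values are arbitrary):

```r
v <- c(10, 20, 30, 40, 50)

one      <- v[2]            # by location: 20
several  <- v[2:4]          # by a range of locations: 20 30 40
filtered <- v[v > 25]       # by logical expression: 30 40 50
picked   <- v[c(5, 1, 3)]   # by index vector, in the order given: 50 10 30
```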
When you perform an operation on two
vectors, R matches the elements of the two vectors pairwise and
returns a vector. This is called a vectorized operation.
If the two vectors aren’t the same
length, R repeats (recycles) the shorter one as many times as needed.
Note that R issues a warning if the length of the longer vector isn’t a
multiple of the length of the shorter one.
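Here is a minimal sketch of a vectorized operation and of recycling, using arbitrary values:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

s1 <- a + b              # pairwise: 11 22 33 44
s2 <- a + c(100, 200)    # shorter vector recycled twice: 101 202 103 204

# Length 4 is not a multiple of length 3, so R recycles partially
# and emits a warning; the result is 2 4 6 5
s3 <- a + c(1, 2, 3)
```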
Objects can be explicitly coerced
from one class to another using the as.* functions, where available,
such as as.numeric(), as.logical() and as.character().
Nonsensical coercion results in NAs.
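For instance, starting from a small integer vector:

```r
x <- 0:3

n  <- as.numeric(x)     # 0 1 2 3 (now of class "numeric")
l  <- as.logical(x)     # FALSE TRUE TRUE TRUE
ch <- as.character(x)   # "0" "1" "2" "3"

# Letters cannot be turned into numbers: the result is NA NA, with a warning
bad <- as.numeric(c("a", "b"))
```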
An array is a multidimensional
vector. Vectors and arrays are stored the same way internally, but an
array may be displayed differently and accessed differently. An array
object is just a vector that’s associated with a dimension attribute.
The dimension attribute itself is an integer vector with one entry per
dimension (for a two-dimensional array it has length 2: nrow, ncol).
Items can be referenced by their indices.
You can refer to part of the array
by specifying separate indices for each dimension, separated by
commas:
To get all values in one dimension,
simply omit the indices for that dimension:
Arrays can also have three (or more) dimensions.
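The array operations above can be sketched as follows; the sizes and contents are arbitrary. Values are filled in column by column.

```r
arr <- array(1:12, dim = c(3, 4))   # 3 rows, 4 columns
dim(arr)                            # 3 4

cell  <- arr[2, 3]     # row 2, column 3: 8
part  <- arr[1:2, 1]   # rows 1-2 of column 1: 1 2
col2  <- arr[, 2]      # every row of column 2: 4 5 6
row3  <- arr[3, ]      # every column of row 3: 3 6 9 12

# A three-dimensional array: 2 rows, 3 columns, 4 layers
cube <- array(1:24, dim = c(2, 3, 4))
deep <- cube[2, 1, 3]  # 14
```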
A matrix is just a two-dimensional
array.
Matrices are constructed column-wise, so entries can be thought of as
starting in the upper left corner and running down the
columns.
Because of this, a matrix can be created directly from a vector
by adding a dimension attribute to it.
Matrices can also be created by column-binding with cbind() or row-binding with rbind().
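A small sketch of the three ways of building a matrix mentioned above, with arbitrary values:

```r
# matrix() fills column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m[1, ]                 # first row: 1 3 5

# Turn a plain vector into a matrix by setting its dim attribute
v <- 1:10
dim(v) <- c(2, 5)      # now a 2 x 5 matrix

# Bind vectors together as columns or as rows
x <- 1:3
y <- 10:12
cb <- cbind(x, y)      # 3 rows, 2 columns
rb <- rbind(x, y)      # 2 rows, 3 columns
```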
Lists are a special type of vector that can contain elements of different data types (notice that a list prints differently from an ordinary vector).
For your reference: [[ ]] is used to extract elements of a list or a data frame. It can only extract a single element, and the returned object doesn’t have to be a list or data frame. By default it doesn’t do partial name matching; the name passed has to be exact.
You can name each element in a list.
Items in a list may be referred to by either location or name. $ is used
to retrieve elements by name. It also supports partial name matching
(passing part of the name, not all of it).
A list can even contain other lists
(we will refer to previous list e):
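The list operations above can be sketched like this. The list is called e to match the text, but its contents are invented for illustration.

```r
# A named list mixing character, numeric and logical elements
e <- list(thing = "hat", size = 8.25, ok = TRUE)

first   <- e[[1]]         # by location: "hat"
by_name <- e[["size"]]    # by exact name: 8.25
dollar  <- e$thing        # by name with $: "hat"
partial <- e$th           # partial matching with $: "hat"

# A list that contains another list
nested <- list(id = 1, inner = e)
inner_size <- nested$inner$size   # 8.25
```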
Factors are used to represent
categorical data. Factors can be unordered or ordered. A factor is like
an integer vector where each integer has a label: you create it from a
vector of any type, and R stores it internally as integers. The
following example creates a factor from a vector of strings. When it
prints, it shows the values along with a Levels line
that lists the categories of the factor elements.
Factors are treated specially by
modeling functions like lm() and glm() which we will discuss later.
Using factors with labels is better than using integers because factors
are self-describing; having a variable that has values “Male” and
“Female” is better than a variable that has values 1 and 2.
We can call the table() function on
factor c and it will give us the frequency of each level
(category).
We can also call the unclass() function
on the factor to strip out the factor class and show us how it is
stored underneath.
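A minimal sketch of creating and inspecting a factor follows. It is named f here rather than c (to avoid shadowing the c() function), and the yes/no values are made up.

```r
f <- factor(c("yes", "yes", "no", "yes", "no"))

levels(f)        # "no" "yes" (alphabetical by default)
counts <- table(f)
counts           # frequency of each level: no = 2, yes = 3

codes <- unclass(f)
codes            # integer codes 2 2 1 2 1, plus a "levels" attribute
```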
The order of the levels can be set
using the levels argument to factor(). This can be important in linear
modeling because the first level in the factor is used as the baseline
level. If you don’t assign levels explicitly, they
are ordered alphabetically (that’s why “no” came before “yes” in
the previous example).
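As a sketch, the levels argument below makes “yes” the first (baseline) level. The data values are illustrative.

```r
answers <- c("yes", "yes", "no", "yes", "no")

default_order  <- factor(answers)
explicit_order <- factor(answers, levels = c("yes", "no"))

levels(default_order)    # "no" "yes"
levels(explicit_order)   # "yes" "no"
```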
Missing values are denoted by NA, or by NaN for the result of an
undefined mathematical operation. is.na() and is.nan() test whether the
elements of an object are NA or NaN, respectively. NA values have a
class too, so there are integer NA, character NA, etc. A NaN value is
also NA, but the converse is not true.
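The difference between NA and NaN can be checked with a small example:

```r
y <- c(1, NaN, NA)

na_flags  <- is.na(y)    # FALSE TRUE TRUE  (NaN counts as NA)
nan_flags <- is.nan(y)   # FALSE TRUE FALSE (NA is not NaN)

# Typed missing values
class(NA_integer_)       # "integer"
class(NA_character_)     # "character"
```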
A data frame is a list that contains
multiple named vectors that are the same length. A data frame is a lot
like a spreadsheet or a database table. Each vector represents a column
in the table. Unlike matrices, and much like database tables, data
frames can store different types of objects in each column. Data frames
have a special attribute called row.names which represents rows’ names,
which could be useful for annotating data. Data frames are usually
created by calling read.table() or read.csv() (which we will discuss
later when we come to reading data) or data.frame().
Here we create a data frame with two
columns, foo and bar: foo is an integer sequence and bar is a vector of
TRUEs and FALSEs. nrow() returns the number of rows and ncol() returns
the number of columns. Since we didn’t specify row names, we got 1,2,3,4
automatically (they are printed on the left of each row).
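A data frame matching this description can be built as follows. The exact TRUE/FALSE values in bar are an assumption.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
df              # prints the table with row names 1..4 on the left

nrow(df)        # 4
ncol(df)        # 2
```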
You can refer to columns by
name
You can retrieve a specific cell in
the data frame by specifying the column name and an expression that
filters the rows of that column. If you want to get the blood pressure
of patient Mike:
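Here is a sketch using a hypothetical patients data frame. The column names name and bp and all of the values are invented for illustration.

```r
# Hypothetical data: patient names and blood pressure readings
patients <- data.frame(
  name = c("Mike", "Sara", "Omar"),
  bp   = c(120, 115, 130),
  stringsAsFactors = FALSE
)

all_names <- patients$name      # refer to a column by name
all_bp    <- patients[["bp"]]   # same idea with [[ ]]

# Filter the bp column by a condition on the name column
mike_bp <- patients$bp[patients$name == "Mike"]   # 120
```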
Data frames can be converted to a
matrix by calling data.matrix(), which coerces the data so that it is
all of the same type; for example, it converts TRUEs and FALSEs to 1s
and 0s.
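For example, using a small data frame with an integer column and a logical column (the values are illustrative):

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

m <- data.matrix(df)
m               # bar is now 1 1 0 0
is.matrix(m)    # TRUE
```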
R objects can have names, which is
very useful for writing readable code and self-describing
objects.
Lists can also have names.
Matrices can also have
names (set with dimnames()).
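The snippet below sketches names on a vector, a list and a matrix. All of the names used are arbitrary.

```r
# Named vector
x <- 1:3
names(x) <- c("foo", "bar", "norf")
x["bar"]                       # element named "bar": 2

# Named list
lst <- list(a = 1, b = 2, c = 3)
names(lst)                     # "a" "b" "c"

# Matrix with row and column names
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m["a", "d"]                    # row "a", column "d": 3
```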
In this post we introduced R and its data types and data structures. In future posts we will go deeper into R.