R Session 3



3.1. Data Aggregation

R provides various functions to aid in aggregating data. For simple tabulation the 'table' function is used. For more complex tasks there are two major groups of functions: those that work effectively with arrays and lists ('apply', 'sweep', 'sapply', 'mapply', 'lapply') and those that oriented toward data frames ('aggregate' and 'by'). These two tools overlap and the output from one function can be converted to the output from the other.



3.1.1. 'table'

The arguments for this function can be either individual vectors, or a list or a data frame consisted of vectors.

Create a dataset "pets":

pets=c('dog','cat','duck','chick','duck','cat', 'dog'); pets

Convert it into table:

tt<-table(pets);tt

tt['duck']

tt['dog']

Alternatively, the output from 'table' can be converted to a data frame.

<!> A data frame is a list of variables of the same length with unique row names, given class "data.frame".

as.data.frame(tt)

If you need to display margins of the table (the sum of each row and/or column)use the function 'addmargins'.

We use "infert" dataset:

infert #returns a whole table

head(infert) #returns several first lines of the table

To extract from the table a data "Education vs. Parity":

a<-table(infert$education, infert$parity); a

To add a row of margins:

a1<-addmargins(a,1); a1

To add a column of margins

a2<-addmargins(a,2); a2

To add both:

a3<-addmargins(a,c(1,2)); a3



3.1.2. Road Map for Aggregation

If you have an aggregation problem there are three main considerations:

1. How are the groups that devide the data are defined?

2. What is the nature of a data to be operated on?

3. What is the desired end result?

The most effective solution of your problem is based on these issues.

-Groups defined as list elements.

In this case 'sapply' or 'lapply' functions should be used. The difference is that 'lapply' always returns a list, while 'sapply' may return a vector or array.

-Groups defined by rows or columns of a matrix.

In this case 'apply' function is the best choice. It will return a results as a vector or array. But if you operate with the rows or columns of different dimentions it will return the results as a list.

-Groups based on one or more grouping variables.

A whole array of choice is available if you operate on subsets of data based on the value of grouping variable.

If you work with single vector and produce a single scalar as a result the 'aggregate' function is the good choice. It always returns a data frame, that is very useful if you going to create a plot or fit a statistical model to the aggregated data.

If you work with single vector but produce a vector, 'tapply' is an option. Unlike the previous function it returns a result in a vector or array. Another option is to use 'reshape' package.



3.1.3. Mapping Function to a Vector or a List.

The important use of 'sapply' relates to data frames. when treated as a list, each column of a data frame retains its mode and class.

As an example you use the build in data frame "ChickWeight", and wish to learn about nature of each column.

If use the 'class' function it will return information about whole data frame, not individual columns:

ChickWeight

head(ChickWeight)

class(ChickWeight)

To get the information for each variable (each column), use 'sapply':

sapply(ChickWeight,class)



3.1.4. Mapping a function to Matrix or Array.

If you have your data as an array you use the 'apply' function. It will operate on each dimension of the data. This function requires 3 arguments: the array where nto perform the operation, an index telling which dimension to operate on, and the function to use. We will use this function to calculate the mean and standard deviation for the data from build-in data frame "state.x77":

To open data frame:

b<-state.x77; b

head(b)

To calculate the mean and standard deviation of each column

b1<-function(x)c(n=sum(!is.na(x)), mean=mean(x), sd=sd(x))

x<-apply(b,2,b1); x

t(x) #change the position of columns and rows



3.1.5. Mapping a function based on Groups.

To calculate scalar data summaries of 1 or more columns of a data frame or matrix, the 'aggregate' function can be used. Open a build-in dataset "iris", containing the values of 4 variables measured on a variety of samples from 3 species of irises.

iris

head(iris)

To find the means of all 4 variables by species use 'aggregate' function:

aggregate(iris[-5], iris[5], mean) #5 is the number of column "Species". [-5] means we take data w/out this column to calculate mean.



To calculate the mean of one variable depended on 2 other parameters you also can use 'aggregate' funtion:

head(ChickWeight)

dim(ChickWeight) #there were 50 chickens and the weight was measured on 12 different days (column "Time")

c1<-aggregate(ChickWeight$weight, ChickWeight[c('Time', 'Diet')], mean)

head(c1)

dim(c1) # the aggregated data has less rows because the mean "Weight" was calculated for all 50 Chickens on 12 different days.

To find the maximum "Sepal.Length" for different "species" of "iris" we also can use function 'tapply':

c<-tapply(iris$Sepal.Length, iris$Species, max);c



3.2. Reshaping Data

Sometimes the data may be necessary to modify to suit a function of interest. The focus of this chapter is working and manipulating with data frames.



3.2.1. Modifying Data Frame Variables

As an example consider that in the "ChickWeight" dataset we need to create one more column representing the log of the "weight". We can use different methods to do so.

head(ChickWeight)

First way:

ChickWeight$Logweight<-log(ChickWeight$weight);head(ChickWeight)

Second way:

with(ChickWeight, log(weight)); head(ChickWeight)

Third way: use the function 'transform'

ChickWeight1<-transform(ChickWeight, Logweight=log(weight)); head(ChickWeight1)

***The system version of this data set is unaltered, but the version in the current workspace will have the new column in this table.



3.2.2. Recoding variables

As an example we have a build-in "iris" dataframe. Consider that we need to create a new variable "bigsepal" which is TRUE when "Sepal.Length" parameter is greater than 6, and FALSE otherwise.

head(iris)

bigsepal<-iris$Sepal.Length >6; bigsepal

We receive a list of logical variables which can be manipulated to create categorical variables with more than 2 levels.

To create a categorical variable "sepalgroup" based on Sepal.Length, which would be equal to 1 for lengths less or equal to 5, 2 for lengths between 5 and 7, and 3 for lengths greater than or equal to 7.

sepalgroup<- 1+(iris$Sepal.Length >=5)+ (iris$Sepal.Length >=7); sepalgroup

The same result could be achieved using the function 'cut':

sepalgroup2<- cut(iris$Sepal.Length, c(0,5,7,10), include.lowest=TRUE, right=FALSE); sepalgroup2



3.2.3. The 'recode' function

You can recode your variables using 'recode' function from the 'car' package.

Using this function you can manipulate on single values, multiple values (for example c(1,5)=5); ranges of values (for example 1:7).

install.packages('car')

library(car)

f<-c(1:10); f

newgroup<-recode (f, 'c(1,5)=1; c(2,4)=2; else=3'); newgroup



3.2.4. Reshaping Data Frames

Sometimes you'll need to fork with the dataframe in the "long" format, sometimes in the "wide" format.The 'reshape' function converts dtatsets between these 2 forms. Suppose we have some subject measured at certain time and received 2 responsable variables.

Create a data set:

g<- data.frame(subj=rep(1:4, rep(3,4)), time=rep(1:3), x=rnorm(12), y=rnorm(12)); g

wideg<-reshape(g, idvar='subj', v.names=c('x','y'), timevar='time', direction='wide'); wideg

Notice that the names "x.1", "y.1", etc. were formed by joining together the variable names of "v.names" and "timevar".

To reshape a table back use this function one more time:

g1<-reshape(wideg); g1



3.2.5. The 'reshape' package

The 'reshape' package "melts" a dataset (through the 'melt' function) into a dataframe. This melting operation is essentially a "wide-to-long" reshaping of the data.

Install the package:

install.packages('reshape')

library(reshape)

Using the "usp" dataframe we can easily convert the melted form to the long form:

usp<-data.frame(type=rownames(USPersonalExpenditure), USPersonalExpenditure, row.names=NULL); usp

musp<-melt(usp); head(musp)

To complete this table we need to remove "X", rename "variable" column into "time", and rename "value" into "expend".

musp$variable=as.numeric(sub('X', '', musp$variable))

names(musp)[2:3]=c('time', 'expend')

head(musp)

Notice, that "variable" is a factor and 'sub' function converts it to a character before operating on it; if you use it directly you'll pass it to 'as.character' before processing.

The "long-to-wide" convertion provides using function 'cast'. We use the same data set as in section 3.2.4.:

g<- data.frame(subj=rep(1:4, rep(3,4)), time=rep(1:3), x=rnorm(12), y=rnorm(12))

mg <- melt(g); mg

cast(subj ~ variable + time, data=mg)

To separate each time into separate list element, the vertical bar (|) is used:

cast(subj ~ variable|time, data=mg)

The 'cast' function operates on the created dataset instead of original one. And to create a wide data frame from stimulated data, but include only "x"-values, you could use:

cast(subj ~ variable+time, subset=variable=='x', data=mg)



3.2.6. Combining Data Frames

To combine 2 dataframes by rows you could use the function 'rbind', and by columns - 'cbind'. The 'rbind' require the dataframes to have the same number of columns, and for using 'cbind' the dataframe must have the same nmber of rows <!!!!!!!>

'rbind' function demands that the names of the objects combined agree, 'cbind’ doesn't do this checking. Therefore to combine dataframes based on the values of common variables, the merge function should be used.

Suppose you have 2 dataframes with the different length:

x<-data.frame(a=c(1,2,4,5,6), x=c(9,12,14,21,8))

y<-data.frame(a=c(1,3,4,6), y=c(8,14,19,2))

merge(x,y)

Notice, that only rows with values of "a" in both data frames are represented in the output.

To see them all you need to modify this, using the arguments: all=TRUE, all.x=TRUE, all.y=TRUE.

merge(x,y,all=TRUE)

merge(x,y,all.x=TRUE)

merge(x,y,all.y=TRUE)

Note that missing values (NA) are inserted in the places where data was missing from one of the dataframes.