### Title: ECOL592 Introduction to R Lecture 3
### Date created: 20140105
### Last updated: 20140202
###
###
### Author: Michael Koontz
###
###
### Intention: Script file for ECOL592 Intro to R Lecture 3. Data summary and subsetting
###
###
### Basic summaries
###
###

# Let's use the built-in dataset 'CO2'
head(CO2)

# When we access data from a data frame, we can then pass it to functions as arguments
# To calculate the mean 'uptake', access that column and pass it to the mean() function as an argument
mean(CO2$uptake)

# Another handy function. Returns the '5 number summary' (minimum, 1st quartile, median, 3rd quartile, and maximum) as well as the mean
summary(CO2$uptake)

# What does summary(CO2$uptake)[4] return?
# What does names(summary(CO2$uptake)) return?

# Subsetting
# Suppose you only want part of your data set
# You can subset it in a few different ways

# First, we can use the bracket notation to subset
# Before, we saw that CO2[4, ] returns ONLY the 4th row, but ALL of the columns
# Now we put a conditional statement in place of the 4
CO2[CO2$Treatment=="nonchilled", ] # access ALL of the columns (blank after the comma), but only
                                   # the rows whose 'Treatment' column is equal to "nonchilled"

# Anatomy
CO2$Treatment=="nonchilled" # Returns a logical vector with TRUE for each row that has "nonchilled" in the Treatment column and FALSE otherwise

# What is wrong with:
CO2[CO2$type=="Mississippi", ]

# How would you calculate the mean of the Mississippi plants' uptake?
# How would you access all columns from the rows representing plants of type 'Mississippi' AND treatment 'nonchilled'?
# How would you access the uptake of plants of type 'Quebec' that experienced concentration of CO2 greater than or equal to 350?
# How would you change the uptake of chilled plants from Mississippi experiencing less than 175 CO2 concentration to NAs?

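# One possible approach to the subsetting questions above (a sketch, not the only answer)
# Conditions are combined with '&' so that a row must satisfy all of them
# 'CO2.copy' and 'miss.chilled.low' are names made up for this example; the copy
# keeps the built-in CO2 unchanged when values are replaced with NA
mean(CO2[CO2$Type=="Mississippi", "uptake"])
CO2[CO2$Type=="Mississippi" & CO2$Treatment=="nonchilled", ]
CO2[CO2$Type=="Quebec" & CO2$conc>=350, "uptake"]

CO2.copy <- CO2
miss.chilled.low <- CO2.copy$Type=="Mississippi" & CO2.copy$Treatment=="chilled" & CO2.copy$conc<175
CO2.copy[miss.chilled.low, "uptake"] <- NA
CO2.copy[miss.chilled.low, ] # display only the rows that were changed
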
### Finding NAs
# You may find yourself looking for the NAs in your data set
# Use subsetting and the is.na() function to do it quickly
# is.na() takes a vector of any length and returns a logical vector (TRUE or FALSE) with TRUE at the indices where the vector passed as an argument contains an NA
NA.containing.vector <- c(45, 63, NA, 23, 12, 1, 1, 0)
is.na(NA.containing.vector)

# So when it comes time to pick out where the NAs are in the vector, use the vector name and bracket notation
# This example is a bit trivial, because of course displaying the contents of the vector that are NA is going to show NA
NA.containing.vector[is.na(NA.containing.vector)]

# What about the NOT NAs? Use '!'
NA.containing.vector[!is.na(NA.containing.vector)]
# The NA is gone!

# This can be super useful in a data frame subset:
CO2[CO2$Type=="Quebec" & CO2$conc<250, "uptake"] <- NA
CO2
CO2[is.na(CO2$uptake), ]
# or
CO2.no.NA <- CO2[!is.na(CO2$uptake), ]
# Notice the difference in size
dim(CO2)
dim(CO2.no.NA)

####################
## Handy Function ##
####################

# subset() is pretty straightforward, but not as flexible as using bracket notation
# Since subset() is a function and returns a data frame, you'd have to access particular columns of the returned data frame
test <- subset(CO2, subset=Treatment=="nonchilled")
test$uptake

###
###
### Manipulations of data frames
###
###

# We already saw how to add columns by using the '$' operator

# cbind() combines columns; must have the same number of rows; this is very different from c()!!
cbind(CO2[,1:2], CO2$conc, CO2$uptake)

# rbind() combines rows; columns must match in number and name, and should hold the same data types
double.CO2 <- rbind(CO2, CO2)
# What is the difference in size between these two data frames?

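# A short sketch of the points above: how cbind() differs from c(), and one way to compare sizes
# 'stacked' and 'side.by.side' are names made up for this example
# double.CO2 is rebuilt here only so that this example runs on its own
stacked <- c(CO2$conc, CO2$uptake)          # c() joins the two vectors end to end into one long vector
side.by.side <- cbind(CO2$conc, CO2$uptake) # cbind() places them next to each other as columns of a matrix
length(stacked)
dim(side.by.side)

double.CO2 <- rbind(CO2, CO2)
dim(CO2)
dim(double.CO2) # rbind() stacks rows, so only the row count changes
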
###
###
### Advanced summary of data
###
###

###########################
## Super useful function ##
###########################

# aggregate()
# summarizes a data frame using a function of your choosing and by variables of your choosing
# Uses the 'formula' syntax
aggregate(uptake ~ Treatment, data=CO2, FUN=mean)
aggregate(uptake ~ Type, data=CO2, FUN=max)
aggregate(uptake ~ Treatment + Type, data=CO2, FUN=mean)

# Can do more than 1 function if you set the parameters up with FUN=function(x) first and then use c() to list the functions
aggregate(uptake ~ Type, data=CO2, FUN=function(x) c(min(x), max(x)))

# Anatomy
# Returns a data frame with number of rows equal to the number of unique combinations of what you have on the right side of the ~
# and the number of columns equal to the number of variables you included on the right side of the ~ PLUS ONE
# The last column is tricky because it is actually a single matrix
# We saw from the last aggregate call using two different functions that the result of each is in separate columns,
# but the overall data frame still only has 2 columns. The second column is a matrix with 2 columns. Confusing!

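# A quick way to see the matrix column described above (a sketch; 'agg' is a name made up for this example)
agg <- aggregate(uptake ~ Type, data=CO2, FUN=function(x) c(min(x), max(x)))
dim(agg)            # size of the data frame itself
is.matrix(agg$uptake) # checks whether the 'uptake' column is a matrix
dim(agg$uptake)     # size of that matrix: one row per Type, one column per function
agg$uptake[, 2]     # the 2nd column of the matrix holds the max() results
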
# What I typically do if I want to use these results for later calculations is to coerce the whole thing to a single data frame
CO2.min.max <- aggregate(uptake ~ Type + Treatment, data=CO2, FUN=function(x) c(min(x), max(x)))
str(CO2.min.max)

# Housekeeping
# First convert that last column (the matrix) to a data frame
place.holder <- as.data.frame(CO2.min.max[,3])

# Use cbind() to combine the first two 'information' columns of the data frame returned from aggregate
# Note how I use the ':' shortcut to make a vector
CO2.min.max <- cbind(CO2.min.max[,1:2], place.holder)

# Use the names function to rename those last two columns to something useful
# Note that the names of the data frame that were returned by aggregate carry over, so I only access the 3rd and 4th element in the vector that names() returns and rename just those
names(CO2.min.max)[3:4] <- c("min", "max")

# Now it is all one big, happily labelled data frame
str(CO2.min.max)

############### End lecture 3