We’ve touched on this subject already, but a more detailed look at the types – “classes” – of objects R uses is worthwhile.

# Useful Class Functions

There are several functions that you will seldom – if ever – use in a script, but that can come in very handy in the console when writing and debugging your code. To illustrate, we’re going to use part of the web_data data that gets used throughout this site. This data looks like this:

date channelGrouping deviceCategory sessions
2016-01-01 (Other) desktop 19
2016-01-01 (Other) mobile 112
2016-01-01 (Other) tablet 24
2016-01-01 Direct desktop 133
2016-01-01 Direct mobile 345
2016-01-01 Direct tablet 126

## str()

The str() function provides the “structure” of an object. In addition to providing the class of the entire object and, as applicable, each component of the object, it includes some of the actual data within the object. As such, it can really be your go-to function, once you’re used to reading the output.

str(web_data)
## 'data.frame':    5732 obs. of  4 variables:
##  $date : chr "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ... ##$ channelGrouping: chr  "(Other)" "(Other)" "(Other)" "Direct" ...
##  $deviceCategory : chr "desktop" "mobile" "tablet" "desktop" ... ##$ sessions       : int  19 112 24 133 345 126 307 3266 1025 17 ...

What this output is saying is that web_data is a “data frame,” and then it provides the class for each column of data within the data frame, as well as a few sample values.

## class()

The class() function provides just the class of the specified object. Let’s see what that looks like for web_data:

class(web_data)
## [1] "data.frame"

Notice how this is just a subset of what was returned by str() – it doesn’t get into the weeds of the underlying data within the overall object. In this case, it just tells you that, overall, web_data is a data frame.

Without worrying too much about the syntax (we’ll get to that very shortly), note that we can check the class of something inside the object. Let’s check the class for the sessions column in the data:

class(web_data$sessions) ## [1] "integer" Can you find this same information in the str() output above? You may be asking, “Why would I ever use class()?” The answer is really that, as you get more complex objects (data), the str() function can start to return a lot of information. If you’re really interested in determining the class of something deep inside a complex object, class() can be more concise and preferable. This topic is inherently circular, in that we’ve introduced different classes without actually explaining what they are. That’s what we’ll do now (but we’ll use class() function to inspect sample data as we go, which is why we had to cover it first!). # Atomic Classes There are many different types of classes, and you can make your own, so this is not a definitive list, but it will cover the major ones: ## Numeric The numeric class is simply a number like 1 2 3. Sometimes this will show up as integer (no decimals), and sometimes this will show up as double (has decimals). R will pick which one is appropriate, but you can force one or the other by using the as.integer() and as.double() functions (with the authors of this site cannot recall ever using, but, the fact that their assumptions that these functions existed – and what they were named – bore out as true was gratifying). The metrics from your web analytics platform will almost always arrive as numeric class objects. ## Character The character class is just what it says – a text-based string like hello or mobile or Paid Search. Of course, you can have numerical digits stored in a character class object…but you generally do not want that. (If you’re an Excel junkie, thing of this as being one of those cases where you wind up needing to do a “convert text to numbers” operation). ## Logical This is the Boolean class: TRUE or FALSE. This may seem like it’s more of a corner case class, but it really isn’t – there are any number of operations in R which, under the hood, are actually generating a bunch of TRUE/FALSE flags. So, it’s good to know that these are a special class unto themselves – distinct from a character-class object storing the strings "TRUE" and "FALSE". ## Date Date-class objects are objects that store a (wait for it!) date. Things can actually get a little tricky here, since you cannot tell whether a value is a Date-class object or a character-class object simply by looking at the data. Yet, they are fundamentally different things (and are generally pretty easy to convert from one class to the other – and back, if that’s your thing). # Get the current date and assign it to an object called a_date a_date <- Sys.Date() # Display the result a_date ## [1] "2017-10-04" # Check its class class(a_date) ## [1] "Date" So far, so good. But, let’s set the date as a character class instead – simply be creating it as a string: # Create an object called a_character and assign a "date" to it a_character <- "2017-04-17" # Display the result a_character ## [1] "2017-04-17" Uh-oh! Compare that result to a_date above. Do you see any difference? There is none (apparently)! But, yet, there is! Let’s check the class of a_character: # Check the class class(a_character) ## [1] "character" Things like finding date ranges or weekdays will work on a Date object, but not on a character. And, depending on the package and the function, you may need to pass in “dates” as either character class objects or as Date class objects. ## Factor Factors are really nothing more than categorical variables… except factors are both brilliant and can be frustrating. The reason? Well, they look the same as character when printed but, they act quite differently. Let’s take a look. Again, we’re not going to get into the details of the wherefore and the why just yet, but factors will almost certainly bite you in the tush at some point. Probably multiple times. At that point, it will become old hat for you to remember to include stringsAsFactors = FALSE in functions where that matters. But we’re getting wayyyyy ahead of ourselves. Let’s define two objects (variables) – one as a factor, and one as a string (character): # Create a_factor as a factor a_factor <- factor("hello", levels = c("hello","goodbye")) # Take a look at what we just created. Notice the "Levels:" get listed. That's # curious, is it not? a_factor ## [1] hello ## Levels: hello goodbye # And, let's check the class of the object. No surprises! class(a_factor) ## [1] "factor" Now, let’s define our string object: # Create a_string. Since we're assigning a string value to it and not telling R # anything special, it's going to go ahead and create it as a character class. a_string <- "hello" # Take a look at what we just created. a_string ## [1] "hello" # And, check its class. See! Character! class(a_string) ## [1] "character" As an example, see what happens when we try to combine a string and a factor using the c() function: c(a_factor, a_string) ## [1] "1" "hello" Huh? Weird! Whats going on? Well, since a_factor is a factor, it is actually represented as a number out of the choice of levels it could possibly be (c("hello","goodbye")). When it is added to the character the factor is coerced into a character via as.numeric, and then into a character as.character. The upshot of this all is to be very careful in making sure your variables are the class you expect them to be! A classic mistake is to use data.frame() or read.csv() to make a data.frame from your data, but to not set the stringsAsFactors = FALSE argument (I told you we’d get to this!), which, if not used will default to using factors instead. <whew> Still with us? Good! So far, all we’ve covered are the atomic classes. Things get more fun (and more powerful) when we start digging into multi-classes! # Multi-classes These are objects in R that work with combinations of the classes above. ## Vector A vector is a combination of the atomic elements above. You can only combine elements that are all of the same type in a vector, and you create a vector using the c() function. a_vector <- c("a","b","c","d") a_vector ## [1] "a" "b" "c" "d" class(a_vector) ## [1] "character" str(a_vector) ## chr [1:4] "a" "b" "c" "d" The class of the vector is the same as the element! This hints at a powerful feature of R: vectorisation. The atomic elements above are actually a vector of length 1, which means that anything you can do to one element can also be done to the entire vector of the same class all at once! An example of this: # Sum individual elements sum(1,2,3,3,4,5,6) ## [1] 24 # Sum a vector a_vector <- c(1,2,3,3,4,5,6) sum(a_vector) ## [1] 24 Some useful shortcuts with vectors are below: # Make a sequence 1:10 ## [1] 1 2 3 4 5 6 7 8 9 10 # The lowercase letters letters ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" ## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z" # The uppercase letters LETTERS ## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" ## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z" ## Data Frame The most common and useful R class is the data frame. Remember our web_data object from earlier? Let’s take one more look at that to confirm that it’s a data frame: class(web_data) ## [1] "data.frame" str(web_data) ## 'data.frame': 5732 obs. of 4 variables: ##$ date           : chr  "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
##  $channelGrouping: chr "(Other)" "(Other)" "(Other)" "Direct" ... ##$ deviceCategory : chr  "desktop" "mobile" "tablet" "desktop" ...
##  $sessions : int 19 112 24 133 345 126 307 3266 1025 17 ... Data frames are most often used to represent tabular data, and many R functions are designed to operate on data frames. Data frames can be manually created using the data.frame() function: # Names before the =, values after it. my_data_frame <- data.frame(numbers = 1:5, letters = c("a","b","c","d","e"), logic = c(TRUE, FALSE, FALSE, TRUE, TRUE)) class(my_data_frame) ## [1] "data.frame" str(my_data_frame) ## 'data.frame': 5 obs. of 3 variables: ##$ numbers: int  1 2 3 4 5
##  $letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 ##$ logic  : logi  TRUE FALSE FALSE TRUE TRUE

Each column can only be one class, but the class of the columns can be different types.

And, uh-oh, did you see that we wound up with our characters being turned into factors? We can avoid this by including the stringsAsFactors = FALSE argument:

# Names before the =, values after it.
my_data_frame <- data.frame(numbers = 1:5,
letters = c("a","b","c","d","e"),
logic = c(TRUE, FALSE, FALSE, TRUE, TRUE),
stringsAsFactors = FALSE)
class(my_data_frame)
## [1] "data.frame"
str(my_data_frame)
## 'data.frame':    5 obs. of  3 variables:
##  $numbers: int 1 2 3 4 5 ##$ letters: chr  "a" "b" "c" "d" ...
##  $logic : logi TRUE FALSE FALSE TRUE TRUE You can access the individual columns of a data frame using $ notation:

# The column of numbers
my_data_frame$numbers ## [1] 1 2 3 4 5 class(my_data_frame$numbers)
## [1] "integer"

Data frames are a special case of the next multi-class – the list. Data frames, at there core, are just lists where all of the columns are equal length.

## List

A list is like a data frame, but it can carry variable lengths of objects. And, list elements can be anything, including data frames or even other lists! They can get really, really confusing. But, they also can be very handy.

my_list <- list(letters_data = letters,
numbers_data = 1:5,
all_data = my_data_frame,
nested = list(LETTERS))
class(my_list)
## [1] "list"
str(my_list)
## List of 4
##  $letters_data: chr [1:26] "a" "b" "c" "d" ... ##$ numbers_data: int [1:5] 1 2 3 4 5
##  $all_data :'data.frame': 5 obs. of 3 variables: ## ..$ numbers: int [1:5] 1 2 3 4 5
##   ..$letters: chr [1:5] "a" "b" "c" "d" ... ## ..$ logic  : logi [1:5] TRUE FALSE FALSE TRUE TRUE
##  $nested :List of 1 ## ..$ : chr [1:26] "A" "B" "C" "D" ...

Just like data frames (because, at their core, data frames are actually lists) you can access individual elements in the list using the $ symbol: extract <- my_list$all_data
class(extract)
## [1] "data.frame"
str(extract)
## 'data.frame':    5 obs. of  3 variables:
##  $numbers: int 1 2 3 4 5 ##$ letters: chr  "a" "b" "c" "d" ...
##  \$ logic  : logi  TRUE FALSE FALSE TRUE TRUE

# Coercing

If you find an R object is the wrong class for the function you need, what can you do? This is where coercian comes into play.

All the classes have an as.this function, which, when you pass an R object in, will try to change it to what you need. It will usually throw an error if this is impossible (which is much better than failing silently!).

Some coercing functions as shown below:

# Quotes indicate characters
as.character(-1:3)
## [1] "-1" "0"  "1"  "2"  "3"
# 0 is FALSE, everything else is TRUE
as.logical(-1:3)
## [1]  TRUE FALSE  TRUE  TRUE  TRUE
# Character to date
as.Date("2015-01-02")
## [1] "2015-01-02"
# If your dates are in format other than YYYY-MM-DD, then you need to include
# the format argument
as.Date("20150102", format = "%Y%m%d")
## [1] "2015-01-02"
as.Date("12-24-2016", format = "%m-%d-%Y")
## [1] "2016-12-24"
# To change factors to numeric, be careful to go via as.character first
numeric_factor <- factor(1, levels = 5:1)
numeric_factor
## [1] 1
## Levels: 5 4 3 2 1
# This gives the result as 5, as that's the first factor
wrong_factor <- as.numeric(numeric_factor)
wrong_factor
## [1] 5
# But, if we use as.character, then we get what's expected/desired.
right_factor <- as.numeric(as.character(numeric_factor))
right_factor
## [1] 1

# In Summary

As you start to work with R, you will find that you are working with a range of classes, and, when it comes to multi-classes, you will find yourself wanting everything to be a vector or a data frame… until you hit a use case where, all of the sudden, you need more flexibility in the structure (including data nested within other data), at which point you will find yourself in list world. Lists can be maddening…until they’re not. Do you remember when you were first learning how to use pivot tables in Excel? It’s the same thing: kinda’ confusing at first, but kinda’ genius once you’re comfortable with them!