Now that we’ve covered classes, it’s time to get into the weeds a bit on how to actually access the data that gets stored in different objects with different classes.

For now, we’re going to use “Base R” syntax for this. Very soon, we’ll cover using dplyr when working with data frames, which does similar things in an (arguably) easier to understand syntax.

Ultimately, you will wind up using both, so we’ll start with the basics.

Subsetting Using [ ]

Most R objects can have their individual elements accessed via their numeric position (or “index”), which use a square brackets ([ ]) notation.

Vector Subsetting

Let’s start by looking at a simple vector of all of the lowercase letters and how different elements in the vector can be accessed with bracket notation:

# Create a vector that contains the letters `a` through `z`. 
a_vector <- letters

# Retrieve the first letter in the vector.
a_vector[1]
## [1] "a"
# Retrieve the fifth letter in the vector.
a_vector[5]
## [1] "e"
# Retrieve the first THROUGH fifth letters using a `:` inside the brackets.
a_vector[1:5]
## [1] "a" "b" "c" "d" "e"
# Retrieve the first AND fifth letters by putting a vector inside the brackets.
a_vector[c(1,5)]
## [1] "a" "e"

Data Frame Subsetting

With a vector, there is only one dimension, so we only need one value inside the brackets. In a data frame, though, we have both rows and columns. So, we need to specify both dimensions, which we do by extending the [ ] notation to include a comma: [ , ].

  • The position before the , indicates the row(s)
  • The position after the , indicates the column(s)

Note: This is sort of like R1C1 notation in Excel…except with a comma!

Let’s explore this notation a bit with a small little data frame of web data that we’ve called web_data (this is a very small data set, but this page is long enough without us including a more meaningful number of rows):

date channelGrouping deviceCategory sessions
2016-01-01 (Other) desktop 19
2016-01-01 (Other) mobile 112
2016-01-01 (Other) tablet 24
2016-01-01 Direct desktop 133
2016-01-01 Direct mobile 345
2016-01-01 Direct tablet 126
2016-01-01 Display desktop 307
2016-01-01 Display mobile 3266

We can access various subsets of the data frame using [ ] notation:

# Retrieve the entire second row. Note how we include the comma, but 
# we just leave the row index blank to do this.
web_data[2,]
##         date channelGrouping deviceCategory sessions
## 2 2016-01-01         (Other)         mobile      112
# Retrieve the entire fourth column. Similarly, we include the comma,
# but leave the column index blank.
web_data[,4]
## [1]   19  112   24  133  345  126  307 3266

Note: While the two examples above were similarly structured, notice how the output differs. When we retrieved an entire row, the result was a data frame (with a single row). When we retrieved an entire column, since, by definition, all of the values were of the same class, R went ahead and simplified the result to be a vector.

# Retrieve the second row, first column
web_data[2, 1]
## [1] "2016-01-01"
# Retrieve the second through fourth rows, and the first and fourth colums
web_data[2:4, c(1,4)]
##         date sessions
## 2 2016-01-01      112
## 3 2016-01-01       24
## 4 2016-01-01      133

Subsetting with Names

Subsetting by numbers assumes the rows and columns are always going to be in the same order, which can be dangerous. It’s much safer to work with names, if the names are knowable:

# Retrieve the first 5 rows of the "sessions" column
web_data[1:5, "sessions"]
## [1]  19 112  24 133 345

You can aso specify multiple columns by passing in a vector of column names:

web_data[1:5, c("channelGrouping","sessions")]
##   channelGrouping sessions
## 1         (Other)       19
## 2         (Other)      112
## 3         (Other)       24
## 4          Direct      133
## 5          Direct      345

You can even reorder or repeat columns (but it will rename them to avoid clashes, which occurs behind the scenes using the make.names() function).

web_data[1:5, c("channelGrouping","deviceCategory", "sessions", "sessions")]
##   channelGrouping deviceCategory sessions sessions.1
## 1         (Other)        desktop       19         19
## 2         (Other)         mobile      112        112
## 3         (Other)         tablet       24         24
## 4          Direct        desktop      133        133
## 5          Direct         mobile      345        345

[ ] vs. [[ ]]

When you subset lists (including data frames) with [ ] it will, generally, return a list or data frame. If, instead, you want to return the column vector, then use [[ ]] which returns what’s actually in the list (or data frame) column.

This is confusing topic. It’s right up there with StringsAsFactors = FALSE. This is where the console comes in handy when you’re trying to make sure you have your syntax correct.

There’s a bit more going on with this notation – [ ] refers to the location of the data and [[ ]] refers to the values at the location… but let’s not get into that just yet. It’s much easier to work through this when we have a real example.

The $ Operator

Now that we’re all squared away with [ ] notation, let’s look at a completely different mechanism that, as it happens, can be partnered up with [ ]s (or not): the $. The $ gets used with data frames to specify columns (based on their name) and in lists to specify elements (by their name).

# Retrieve the `sessions` column
web_data$sessions
## [1]   19  112   24  133  345  126  307 3266

The $ is actually just a shortcut to subsetting via a character name:

web_data[["sessions"]]
## [1]   19  112   24  133  345  126  307 3266

You can apply [ ] subsetting to the result of $ notation:

# The first 5 elements of the `sessions` column
web_data$sessions[1:5]
## [1]  19 112  24 133 345

Conditional Logic

There are many ways to pull subsets of data frames based on specific criteria (rather than explicitly identifying rows or columns). This is actually one of the ways that dplyr really shines over Base R. But, it’s useful to get comfortable with how Base R notation works, including what’s actually happening behind the scenes.

To use conditional logic, you actually wind up using our friends TRUE and FALSE (remember our discussion of the logical class earlier?). This can be a good way to select specific rows from a data frame.

For instance, to select all rows that are over 125 in the sessions column of web_data, we can construct a logical (class) vector:

# Create a TRUE or FALSE vector for every `session` element over 125
over_125 <- web_data$sessions > 125

# Display the result
over_125
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Now, we can pass this logical vector (which has the same number of rows as our original data frame) into the row selector for web_data:

# All rows where `sessions` is over 125... and ALL columns
web_data[over_125,]
##         date channelGrouping deviceCategory sessions
## 4 2016-01-01          Direct        desktop      133
## 5 2016-01-01          Direct         mobile      345
## 6 2016-01-01          Direct         tablet      126
## 7 2016-01-01         Display        desktop      307
## 8 2016-01-01         Display         mobile     3266

Got it? If this seems a bit cumbersome…it sort of is. And, it turns out that you can shorten this all into a single line:

web_data[web_data$sessions > 125,]
##         date channelGrouping deviceCategory sessions
## 4 2016-01-01          Direct        desktop      133
## 5 2016-01-01          Direct         mobile      345
## 6 2016-01-01          Direct         tablet      126
## 7 2016-01-01         Display        desktop      307
## 8 2016-01-01         Display         mobile     3266

Notice how web_data is the overall data frame and is used as a conditional within the subsetting of that data frame.

If you only wanted certain columns, you can add that criteria “after the comma,” as discussed earlier

web_data[web_data$sessions > 125, c("channelGrouping","sessions")]
##   channelGrouping sessions
## 4          Direct      133
## 5          Direct      345
## 6          Direct      126
## 7         Display      307
## 8         Display     3266

These can start to look pretty confusing, but, once you get comfortable with the basic syntax, you will see how things break down. And, it can be useful to build up the final syntax iteratively, much as was done in the example above, lest you wind up in nested conditional hell! (If you’re a heavy Excel user, this can be like building up beastly nested formulas.)

A couple of additional notes on the conditional selections (the use of > above):

  • To set “equals to,” use a double equals sign: ==
  • To set “not equals to,” it is not “<>” like you might think: it’s !=.

Other Subsetting Methods

The which() function is one that you may come across here and there, but, in general, we would recommend not using this, since it relies on numeric subsetting and can be difficult to debug.

If you are regular expression junkie (and what self-respecting web analyst isn’t?), you can use grepl() in your row or column selections to use regEx to identify which rows/columns to return. This is just another way to build those TRUE/FALSE vectors we just went through. We won’t go into that in detail now, but be aware that regEx and R are besties.

There is also a grep() function that actually returns the matched values, but, if you’re doing a selection, you actually want to return a logical vector (TRUEs and FALSEs) for your condition as to which rows you want to match…and that is what grepl() does.

If you have loaded dplyr() then it makes sense to use its select() for columns and filter() for rows instead…but, again, we’ll get there! (grepl() plays quite nicely with dplyr, too. Thanks for asking!)

Subsetting Meets Munging

So, now you can subset at will. But, what if you actually want to perform operations on a data frame – actually change subsets of the data (for cleanup for flagging or something else)? In many cases your data will come with elements you need to change that you need to filter down to. You can then reassign those values to what you prefer.

A few other functions are useful to know for these cases:

## Will return TRUE if a value is NA (e.g. imported incorrectly)
is.na(NA)
## [1] TRUE
a_vector <- c(1,2,3,NA,4)
is.na(a_vector)
## [1] FALSE FALSE FALSE  TRUE FALSE

Munging Example

Lets take the previous web_data columns and say we want to set all the sessions values to 125 if they are over 125.

In this case we can filter to the elements we need like before, but this time also modifying the data in place using the <- assignment command:

# Set up a new object that is just a copy of `web_data`. The following
# operations could have just been done with `web_data` directly, but then
# the original data would be lost.
my_new_data <- web_data

# Subset the data just as we did above, but then assign that subset a
# specific value.
my_new_data[my_new_data$sessions > 125, "sessions"] <- 125

# Check out what the data looks like
my_new_data
##         date channelGrouping deviceCategory sessions
## 1 2016-01-01         (Other)        desktop       19
## 2 2016-01-01         (Other)         mobile      112
## 3 2016-01-01         (Other)         tablet       24
## 4 2016-01-01          Direct        desktop      125
## 5 2016-01-01          Direct         mobile      125
## 6 2016-01-01          Direct         tablet      125
## 7 2016-01-01         Display        desktop      125
## 8 2016-01-01         Display         mobile      125