The dplyr (pronounced DEE ply er) package is one of those packages that, consistently, newcomers to R do not know about and who then get confused by some (one) aspect of it…but experienced R users seldom write a script without using.

The dplyr Functions

dplyr has just a handful of functions, all of which are geared towards doing basic manipulation of data sets in a fairly straightforward manner We’re not going to go into all of the details of using these functions, as there are plenty of write-ups on that (like this one). But, we will at least provide a brief description of the functions and, at a high level, what they do:

  • filter() – used to subset the rows of a data set
  • select() – used to subset the columns of a data set
  • arrange() – used to sort the rows of a data set
  • distinct() – used to select only distinct/unique rows in a data set
  • mutate() – used to add new columns that are based on calculations on data in other columns (e.g., creating a new column with a conversion rate by dividing the orders column by the sessions column)
  • summarise() – used to perform summary calculations (mean, max, etc.) on a set of data (this is generally used in conjuntion with the group_by() function)
  • n() – used to count the number of rows in a data set (or subset)
  • sample_n() – used to return a sample from the data set

A key aspect of all of these functions is that the first argument is always the data set. There is some magic as to how this works in conjunction with the pipe (%>%), which we’re going to tackle next.

The Pipe: %>%

First off, the pipe is not something that Hadley Wickham created, nor is it actually functionality that he implemented natively within the package. However, once you load dplyr, the pipe is available to you, because dplyr loads the magrittr package, and that is where the pipe originated in R.

The pipe is, simply, a combination of three characters: %>%. When used properly, it does two things:

  1. It shortens and simplifies the code
  2. It makes the code intuitive to read

All the pipe does is provide “forward application” of an object to a function.

Huh? That doesn’t help!

Okay, let’s try again: the pipe lets you string together a series of functions – passing the result of one function directly into another function in the sequence that you want them applied and without creating temporary variables. There are two keys to this:

  • By default, the function to the right of the %>% assumes the value it is receiving is the first argument for the function, so the first argument is simply omitted in each “downstream”" function.

  • If you’re using a function where you don’t want the result of the previous function to be the first argument – you want it to be some other argument – then simply write the function as you normally would, but put a . in the position where you want the upstream function’s result to go.

The second bullet above is confusing…so just file it away. You’ll know it when you need it, and it will make perfect sense. We’re not going to need it here.

An Example to Illustrate

Let’s consider our web analytics data (web_data) that has sessions by date, device category, and channel:

##         Date  Device Channel Sessions
## 1 2016-01-01 desktop (Other)       19
## 2 2016-01-01  mobile (Other)      112
## 3 2016-01-01  tablet (Other)       24
## 4 2016-01-01 desktop  Direct      133
## 5 2016-01-01  mobile  Direct      345
## 6 2016-01-01  tablet  Direct      126

Now, suppose we want to find the average number of sessions for Display traffic when the Display traffic from mobile for the day was greater than 2 000 sessions.

Option 1: LONG Form

First, the longest way to do this (but, really, okay when you’re starting out – it works!):

# Get the subset of data that is display traffic
display_traffic <- web_data[web_data$Channel == "Display",]

# Get the subset of *that* traffic that is mobile
mobile_display <- display_traffic[display_traffic$Device == "mobile",]

# Get the subset of *that* traffic that is greater than 2 000 sessions
final_data <- mobile_display[mobile_display$Sessions > 2000,]

# Calculate the average sessions
avg <- mean(final_data$Sessions)

# Round to the nearest whole number and print the results
## [1] 3273

Option 2: Short(er)

Option 1 was an extreme. So, we may realize that we can combine a bunch of the subsetting operations into a single command:

# Subset the data in one fell swoop
final_data <- web_data[(web_data$Channel == "Display" & web_data$Device == "mobile" 
                       & web_data$Sessions > 2000),]

# Calculate, round, and print all at once
## [1] 3273

Option 3: Excel-Like

We could get this all down to a single line (although we need line breaks to control the wrapping):

round(mean(web_data[(web_data$Channel == "Display" & web_data$Device == "mobile" 
                       & web_data$Sessions > 2000),]$Sessions),0)
## [1] 3273

Does this start to remind you of something in Excel? The dreaded heavy nesting of functions!

Option Ideal: dplyr

With the pipe and some dplyr functions, we can perform these operations in a way that is both efficient and easy to read:

# We'll start to need the dplyr library

web_data %>% filter(Channel == "Display",
                   Device == "mobile",
                   Sessions > 2000) %>%
  summarise(mean(Sessions)) %>%
##   mean(Sessions)
## 1           3273

As soon as you understand how the pipe works, the code above starts to be super-readable. You could read it like this:

  1. Take the web data.
  2. Filter it to just the Display / mobile / >2000 rows.
  3. Summarise the resulting data subset by taking the mean of “Sessions.”
  4. Round the mean to the nearest integer.

Can you see how the %>% is really just a way to say, “with the result of what you just did…now do this?”

Remember: Each of the functions above actually has a first argument that is “the data the function should act on.” For instance, if I wanted to round pi to 4 digits, I would use the full round() function:

## [1] 3.1416

But, in our example above, we simply used round(0). That’s because the first argument was omitted – it was just assumed to be the value resulting from the preceding function. We could have written the piped function this way, too:

# We'll start to need the dplyr library

web_data %>% filter(.,Channel == "Display",
                   Device == "mobile",
                   Sessions > 2000) %>%
  summarise(.,mean(Sessions)) %>%
##   mean(Sessions)
## 1           3273

It returns the same result, and you would never really write it this way. But, when using pipes with functions that are not dplyr functions, sometimes, you don’t want the result of a function to be passed into the first argument. So, the . comes in handy!