ggplot2 is one of the most downloaded R packages and probably the one that brought Hadley Wickham to fame.

The “grammar of graphics” philosophy it supports not only lets you create professional-looking plots but, once you have mastered its syntax, should also encourage you to think about plots in a more structured manner.

The syntax does take time to master, though, so do check out the ggplot2 website and the ggplot2 cookbook, which will walk you through common tasks. The author still refers to these!

## The Mindset for “Building” a Plot

While there are lots of ways to head down into the weeds, the core of building a ggplot is the following:

1. Call the ggplot() function. Primarily, this just indicates what data set will be plotted (using the data= parameter). When possible and when it makes sense, it also does some light mapping of that data to core aspects of the plot (using mapping = aes()). Note that you can pass the function more data than you’re actually going to plot, and you can override aspects of both data and mapping in subsequent steps.

2. Add (we actually use the + sign) one or more geom functions, which lay out the type of visualization you want. The image lower on this page shows an extensive list of these. One way to think about it is to consider a chart where you want a line for one set of data and a bar for another set of data (a not-too-uncommon thing to do in Excel). This would use a geom_bar() geom and a geom_line() geom. The data set that will be plotted is already specified (in the data= parameter of ggplot()), but we may still need to do some additional mapping within the geom. For instance, if the data= parameter was time-series data that included users, sessions, and pageviews, the mapping = aes() within the geom_line() function would need to specify which of the metrics to plot on the y axis (e.g., geom_line(mapping = aes(y = sessions))).

3. Optionally (but often needed), add (again, with the plus sign) “theme” specifications. These are used to tweak the styles: how thick a line to draw around the plot, whether to include major and/or minor x and/or y gridlines (and what color and thickness to make them), and where to locate the legend (or whether to have a legend at all). Typically, we can start with a predefined theme and then just tweak (override) specific elements, similar to how CSS can be loaded from external files and then overridden with styles defined more closely to the specific element being formatted on a web page.
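
As a rough illustration of the three steps above, here is a minimal sketch. The traffic data frame and its numbers are made up for this example and are not the data used later on this page. It uses geom_col(), the shorthand for geom_bar(stat = "identity"), because the bar heights are already in the data.

```r
library(ggplot2)

# Hypothetical toy data: one row per day with three metrics
traffic <- data.frame(
  date      = as.Date("2016-01-01") + 0:6,
  users     = c(80, 95, 90, 120, 110, 60, 55),
  sessions  = c(100, 120, 115, 150, 140, 75, 70),
  pageviews = c(310, 400, 380, 520, 470, 220, 200)
)

# Step 1: declare the data and the mapping shared by every layer
p <- ggplot(data = traffic, mapping = aes(x = date)) +
  # Step 2: add geoms, each one mapping its own metric to y
  geom_col(mapping = aes(y = pageviews), fill = "grey80") +
  geom_line(mapping = aes(y = sessions), colour = "steelblue") +
  # Step 3: start from a predefined theme, then override single elements
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

p
```

Note that the users column is passed in but never plotted, which is fine: ggplot2 only uses the columns that are mapped.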

There are other aspects of ggplot2 that we’re not going to get into here. To scratch the surface, though, below is a brief explanation of each of the components of the ggplot2 world:

• Data source - it’s easiest to use a tidy data source in long format
• Aesthetics - this specifies which variables in your data will vary and be plotted. In many ways, “aesthetics” is a misnomer – you actually control the look and feel of plots using themes. So, if you want a plot that is “aesthetically pleasing,” you will spend more time adjusting the theme applied to the plot – it will have very little to do with the aesthetics (aes()).
• Coordinate systems - usually you’ll be in x-y, but polar and more exotic systems are possible
• Scales - How your variables map onto the coordinate system (e.g. a log scale)
• Statistics - Statistics applied to the data before plotting - most common is binning, such as for histograms, and smoothers such as trend lines
• Geoms - Geometric objects, the type of plot to produce. Line charts, bar charts, tiled plots etc.

Thinking about what you want to produce in terms of the components above will get you to your desired plot more quickly.
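
To make the components a little more concrete, the sketch below touches each of them once in a single plot. It uses mtcars, a data set that ships with R, purely as a stand-in. The particular choices (a log scale, a linear smoother, the x limits) are arbitrary and only there to show where each component goes.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                                  # data source
            mapping = aes(x = wt, y = mpg)) +               # aesthetics
  geom_point() +                                            # geom
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistic (trend line)
  scale_y_log10() +                                         # scale
  coord_cartesian(xlim = c(1, 6)) +                         # coordinate system
  theme_minimal()                                           # theme (look and feel)

p
```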

## Which geom…() to Use?

A great resource is the ggplot2 cheatsheet which groups geoms by the type of data you have. The first page of this cheatsheet is below, and you can always get to it from RStudio by selecting Help>>Cheatsheets>>Data Visualization with ggplot2.

## Example Workflow

This example requires having a web_data data frame. You can either load up some sample data by completing the I/O Exercise (which is what is shown in the details below), or, if you have access to a Google Analytics account, you can use your own data by following the steps on the Google Analytics API page.


Once you have a web_data data frame to work with, the command head(web_data) should return a table that, at least structurally, looks something like this:

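kable(head(web_data))

| X | date | channelGrouping | deviceCategory | sessions | pageviews | entrances | bounces |
|---|------|-----------------|----------------|----------|-----------|-----------|---------|
| 1 | 2016-01-01 | (Other) | desktop | 19 | 23 | 19 | 15 |
| 2 | 2016-01-01 | (Other) | mobile | 112 | 162 | 112 | 82 |
| 3 | 2016-01-01 | (Other) | tablet | 24 | 41 | 24 | 19 |
| 4 | 2016-01-01 | Direct | desktop | 133 | 423 | 133 | 61 |
| 5 | 2016-01-01 | Direct | mobile | 345 | 878 | 344 | 172 |
| 6 | 2016-01-01 | Direct | tablet | 126 | 237 | 126 | 77 |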

Now, we can get to visualizing!

### 1. Get the Data Ready and Tidy

While it’s possible to use “wide” data, it’s generally easiest to always start with tidy “long” data, so you can quickly repeat what you have learned/applied before.

## We can use the gather() function from the tidyr package to tidy up the data
library(tidyr)
library(dplyr)

## call the key column 'variable' and the value column 'value' and
## gather all variables apart from date, channelGrouping and deviceCategory
web_data_tidy <- web_data %>%
  select(-X) %>%   ## get rid of column X
  gather(variable, value, -date, -channelGrouping, -deviceCategory)
head(web_data_tidy)
##         date channelGrouping deviceCategory variable value
## 1 2016-01-01         (Other)        desktop sessions    19
## 2 2016-01-01         (Other)         mobile sessions   112
## 3 2016-01-01         (Other)         tablet sessions    24
## 4 2016-01-01          Direct        desktop sessions   133
## 5 2016-01-01          Direct         mobile sessions   345
## 6 2016-01-01          Direct         tablet sessions   126

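gather() is the opposite of spread(): it “unpivots” data. Here it turns the 7 columns left after dropping X into 5, by collapsing the four metric columns (sessions, pageviews, entrances and bounces) into a variable/value pair, so each wide row becomes four long rows.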

Example:

web_data %>% filter(date == "2016-01-01", channelGrouping == "(Other)", deviceCategory=="desktop")
##   X       date channelGrouping deviceCategory sessions pageviews entrances
## 1 1 2016-01-01         (Other)        desktop       19        23        19
##   bounces
## 1      15
web_data_tidy %>% filter(date == "2016-01-01", channelGrouping == "(Other)", deviceCategory=="desktop")
##         date channelGrouping deviceCategory  variable value
## 1 2016-01-01         (Other)        desktop  sessions    19
## 2 2016-01-01         (Other)        desktop pageviews    23
## 3 2016-01-01         (Other)        desktop entrances    19
## 4 2016-01-01         (Other)        desktop   bounces    15
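
In tidyr 1.0.0 and later, gather() still works but has been superseded by pivot_longer(). The sketch below shows what the equivalent call might look like. web_data_demo is a hypothetical stand-in built from the first three rows of the table shown earlier, so that the block runs without the real web_data.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for web_data (first three rows of the sample table)
web_data_demo <- data.frame(
  X               = 1:3,
  date            = "2016-01-01",
  channelGrouping = "(Other)",
  deviceCategory  = c("desktop", "mobile", "tablet"),
  sessions        = c(19, 112, 24),
  pageviews       = c(23, 162, 41),
  entrances       = c(19, 112, 24),
  bounces         = c(15, 82, 19)
)

web_data_demo_tidy <- web_data_demo %>%
  select(-X) %>%   # drop the row-number column
  pivot_longer(
    cols      = -c(date, channelGrouping, deviceCategory),  # pivot every metric column
    names_to  = "variable",  # old column names go here
    values_to = "value"      # old cell values go here
  )

web_data_demo_tidy
```

The rows come back in a different order than with gather() (pivot_longer() works through the input one row at a time), but the content is the same.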

### 2. Make Sure All of the Columns are the Right Class

In this case, we’re going to make the date column a Date object.

You could also choose to make factors out of your categories, as they let you set the order of colours in the legends a bit more easily.

Note: Outside of applying statistical methods, converting columns to be factors will often come into play when you want to control the order of nominal or ordinal variables. This gets a little confusing, in that there are “unordered” factors and “ordered” factors, and you do not actually need an ordered factor to control the order in a plot (!). We’re not going to dive into this here, as that’s heading down into the weeds a bit. But, make a mental note that order can be controlled when using nonmetric variables. And it’s a quick Google search to get the specifics.
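
As a small taste of what that search would turn up, here is a sketch of setting the order with a plain (unordered) factor. The device vector is made up for illustration. ggplot2 follows the order of the factor levels when it draws legends and discrete axes.

```r
# Hypothetical example values
device <- c("desktop", "mobile", "tablet", "mobile")

# Without explicit levels, the levels are sorted alphabetically
levels(factor(device))

# With explicit levels, we choose the order ourselves
device_f <- factor(device, levels = c("mobile", "desktop", "tablet"))
levels(device_f)

# This is still an unordered factor; the levels argument alone sets the order
is.ordered(device_f)
```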

str(web_data_tidy)
## 'data.frame':    22928 obs. of  5 variables:
##  $ date           : chr  "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
##  $ channelGrouping: chr  "(Other)" "(Other)" "(Other)" "Direct" ...
##  $ deviceCategory : chr  "desktop" "mobile" "tablet" "desktop" ...
##  $ variable       : chr  "sessions" "sessions" "sessions" "sessions" ...
##  $ value          : int  19 112 24 133 345 126 307 3266 1025 17 ...
web_data_tidy$date <- as.Date(web_data_tidy$date)

## we will only look at sessions
library(dplyr)
plot_data <- web_data_tidy %>% filter(variable == "sessions")

### 3. Create a Plot Object Called gg

We can use the ggplot() function to create a plot that we’re going to call gg. This call includes your data and any known aesthetics (mappings) that you want to apply to all of the plots that you will layer in. We can also go ahead and set the basic theme here. theme_minimal() is a nice, clean one to start with.

As we have made “long” tidy data, we know that our x variable will be date and our y variable will be in the value column, so we can set these as defaults in the aes() (aesthetics) call:

library(ggplot2)
## I don't know why, but I always call them gg
gg <- ggplot(data = plot_data, aes(x = date, y = value)) + theme_minimal()

### Let the Fun Begin!

Experiment with adding various elements to your gg object using +. Once you have found something you want to keep, assign it to gg and then carry on to the next feature.

Any aesthetics or statistics that you haven’t specified in the global ggplot() call need to be added in the geom itself. Note that because we passed the data in that first call, we don’t need to specify it again.

## let's make some line plots
gg + geom_line()

## hmm, too much data in there, let's colour by the channelGroupings
gg + geom_line(aes(colour = channelGrouping))

## we have desktop, mobile and tablet all in there, let's separate them out with facet
gg + geom_line(aes(colour = channelGrouping)) + facet_grid(. ~ deviceCategory)

## I prefer it one over the other
gg + geom_line(aes(colour = channelGrouping)) + facet_grid(deviceCategory ~ .)

## let's try an area plot
gg + geom_area(aes(colour = channelGrouping, group = channelGrouping)) + facet_grid(deviceCategory ~ .)

## ahh, area plots colour by scale 'fill' rather than scale 'colour' (see ?geom_area)
gg + geom_area(aes(fill = channelGrouping, group = channelGrouping)) + facet_grid(deviceCategory ~ .)

## ok, let's keep that for now
gg <- gg + geom_area(aes(group = channelGrouping, fill = channelGrouping)) + facet_grid(deviceCategory ~ .)

The point above is to show how modifications can be quickly added as you try out ideas.
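
The same layering idea extends to data: a geom can be given its own data argument, which overrides the global data for that layer only while still inheriting the global aesthetics. The sketch below assumes a made-up daily data frame and highlights its busiest day. None of these names come from the example above.

```r
library(ggplot2)

# Hypothetical toy data: ten days of a single metric
daily <- data.frame(
  date  = as.Date("2016-01-01") + 0:9,
  value = c(120, 135, 128, 160, 310, 150, 142, 138, 155, 149)
)

# The single row with the highest value
peak <- daily[which.max(daily$value), ]

p <- ggplot(daily, aes(x = date, y = value)) +
  theme_minimal() +
  geom_line() +                                      # uses the global data
  geom_point(data = peak, colour = "red", size = 3)  # uses its own data, same x/y mapping

p
```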

A little more styling, and we are done with this example:

## make the colours nicer
gg <- gg + scale_fill_brewer(palette = "Blues")
gg <- gg + ggtitle("Sessions per device category")
## rename the x and y axis
gg <- gg + xlab("Date") + ylab("Sessions")
## change the legend title
gg <- gg + guides(fill = guide_legend(title = "Channel Grouping"))
## put the legend at the bottom
gg <- gg + theme(legend.position = "bottom")
## print the final plot
gg

Disclaimer: I don’t think area plots are very clear, but they look pretty ;)
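
As an aside, the title, axis and legend-title steps above can also be written as a single labs() call, which some people find easier to scan. This is a sketch on made-up toy data rather than a replacement for the code above. The label text is copied from the example, everything else is hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: two channels over three days
toy <- data.frame(
  date            = rep(as.Date("2016-01-01") + 0:2, times = 2),
  channelGrouping = rep(c("Direct", "Organic Search"), each = 3),
  value           = c(133, 140, 128, 307, 300, 315)
)

p <- ggplot(toy, aes(x = date, y = value)) +
  theme_minimal() +
  geom_area(aes(fill = channelGrouping, group = channelGrouping)) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Sessions per device category",
    x     = "Date",
    y     = "Sessions",
    fill  = "Channel Grouping"   # naming the aesthetic sets the legend title
  ) +
  theme(legend.position = "bottom")

p
```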

## Another Example of ggplot’s Power: Faceting

gg <- ggplot(data = web_data_tidy, aes(x= date, y = value)) + theme_linedraw()
gg <- gg + geom_line(aes(color = deviceCategory))
gg <- gg + facet_grid(deviceCategory ~ channelGrouping)
gg
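
If the facet formula feels opaque, the following sketch may help. In facet_grid(), whatever sits left of the ~ becomes rows and whatever sits right becomes columns. facet_wrap() is shown as an alternative that simply flows the panels in reading order and can give each panel its own y range. The toy data frame is made up so that the block runs on its own.

```r
library(ggplot2)

# Hypothetical toy data: 3 days x 2 devices x 2 channels
toy <- expand.grid(
  date            = as.Date("2016-01-01") + 0:2,
  deviceCategory  = c("desktop", "mobile"),
  channelGrouping = c("Direct", "Organic Search"),
  stringsAsFactors = FALSE
)
toy$value <- seq_len(nrow(toy)) * 10

base <- ggplot(toy, aes(x = date, y = value)) +
  theme_linedraw() +
  geom_line()

# Rows by device, columns by channel
grid_plot <- base + facet_grid(deviceCategory ~ channelGrouping)

# One panel per combination, wrapped in reading order, each with its own y axis
wrap_plot <- base + facet_wrap(~ channelGrouping + deviceCategory, scales = "free_y")

grid_plot
wrap_plot
```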

## Let’s Do a Bar Plot

But call it by the right name: geom_col(), which plots values that are already in the data (geom_bar() is for count data, as it counts rows by default).

gg <- ggplot(data = web_data_tidy) + theme_bw()
gg <- gg + scale_fill_brewer(palette = "Blues")
gg + geom_col(aes(x = channelGrouping, fill = deviceCategory, y = value), position = "dodge")
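
When the data has many rows per bar (for example, one per day), it is often clearer to total the values first, so that each bar corresponds to exactly one row. The sketch below shows one way to do that with dplyr. The toy data frame and its numbers are made up, and summing is only an assumption about what you want the bar height to mean.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two rows per channel/device combination
toy <- data.frame(
  channelGrouping = rep(c("Direct", "Organic Search"), each = 4),
  deviceCategory  = rep(c("desktop", "mobile"), times = 4),
  value           = c(133, 345, 140, 330, 307, 3266, 300, 3100)
)

# Collapse to one row per bar before plotting
totals <- toy %>%
  group_by(channelGrouping, deviceCategory) %>%
  summarise(value = sum(value), .groups = "drop")

p <- ggplot(totals, aes(x = channelGrouping, y = value, fill = deviceCategory)) +
  theme_bw() +
  scale_fill_brewer(palette = "Blues") +
  geom_col(position = "dodge")

p
```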