This section covers an important strategy when you are cleaning and prepping your data, which is to aim to create tidy data. This is a specific approach for establishing how data should be structured before you work with it. It is a concept/approach that is embraced by the godfather of R, Hadley Wickham, and, as such, is supported by his collection of packages (the so-called “tidyverse”).

Tidy Data

There are a great many resources on tidy data that will explain it better than we can here, so we will just give it a brief summary and leave some links for further reading. But, the main point is that we embrace its philosophy, and ,as a data analyst, it is wise to aim for tidy data after you have imported it.

What Is Tidy Data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way” - Hadley Wickham

The aim of a tidy dataset is to present it in a manner that further processing can be done in an organised fashion:

  1. Each variable in the data set is placed in its own column
  2. Each observation is placed in its own row
  3. Each value is placed in its own cell

If you are a heavy user of Excel (and, specifically, Excel pivot tables), then you can think about tidy data as being data that is extremely “pivot-friendly.” Think about a time when you knew you needed a pivot table, but the raw data had dimensions both in rows and columns (e.g., Campaign in rows and Device Category in columns, with Sessions in the cells). If you found yourself doing some manual or formula-heavy transformation of the raw data just to get it to something you wanted to use to feed a pivot table, then you’ve experienced non-tidy data.

Tidy data isn’t necessarily data that is easiest to read by a human, but it is well suited for passing into further R functions, especially those within the tidyverse.

Two bits of good news about tidy data:

  • For both Google Analytics and Adobe Analytics, data generally comes out of the API already in a tidy format.
  • There are functions within the tidyverse (the tidyr package) specifically for converting data to/from a tidy format.

The Tidyverse

Tidyverse is the new name for the Hadleyverse, which is a collection of R packages created or supported by Hadley Wickham.

These packages aim to cover the whole spectrum of data analysis within R, each supporting the other in concepts and output. The Big Three packages within the tidyverse are:

  • dplyr - data manipulation of data frames
  • tidyr - tools to tidy (and untidy) data frames
  • ggplot2 - plotting tidy data

Tip: Note that the three packages above have pretty unique names. That makes them handy to include in Google searches when you are looking for details on how to perform a specific operation or troubleshoot a specific issue (e.g., “filter data by column name with dplyr”). Including one of these package names virtually guarantees that the search results will all be R-specific.

In addition, some other packages in the tidyverse that you may come across are:

  • tibble - creating more user-friendly data.frames
  • purrr - more general data manipulation tools
  • magrittr - the origin of %>%, extensivly used in the tidyverse (This isn’t actually a package developed by Wickham, but his packages have a dependency on this one; dropping, “Oh, yeah. The pipe in dplyr actually comes from the magrittr package” in casual conversation with R folk will provide a nominal degree of street cred.)
  • broom - turn statistical models into tidy data frames/tiddles

This list is just a subset of the tidyverse packages, but, the tidyverse has grown so much in popularity that you can now simply install/load the ‘tidyverse’ package, which will install all of these packages (and a few more) all at once.

Whilst you can achieve what the above packages do in base R or with other packages (data.table() is a noteable alternative to dplyr()), if you embrace the tidyverse, then you will need to become familiar with some, if not all, of the packages above.