In the Nature of Data, we explore nonmetric data (e.g., nominal, ordinal) and metric data (e.g., interval, ratio).

In Cross Tabulations with a $$\chi^2$$ Test for Independence, we test the relationship between two nominal variables, each with at least two levels.

In 1-way ANOVA, we test the relationship between one independent variable that is nominal in nature with at least two levels and one dependent variable that is interval or ratio in nature. In n-way ANOVA, we use two or more independent variables, which remain nominal in nature.

In this document, we will add another test for a relationship between variables as well as a test for prediction.

Correlation is a number that describes how strong the relationship between two variables is.

In digital analytics terms, you can use it to explore relationships between web metrics to see if an influence can be inferred, but be careful to not hastily jump to conclusions that do not account for other factors.

For instance, a high correlation between social shares and SEO position could mean:

• Social shares influence SEO position
• SEO position influences social shares
• Social shares and SEO position are influenced by a third factor (such as brand strength)
• The relationship was a chance error

It is, unfortunately, pretty common to see something like the first bullet used as the sole interpretation of a correlation, which is problematic for two reasons:

• There might be other factors in play (the other bullets in the list!)
• Correlation is not necessarily a sign of causation.

But, still, correlation can be very useful: identifying that a relationship exists can be a great place to start looking for the underlying drivers of that relationship which, ultimately, can lead to an insight that can drive an action!

## The Basics of Correlation

Correlation refers to a value, $$r$$, that provides insight into the relationship between two variables. This statistic goes by many names, including Pearson product-moment correlation, bivariate correlation, and Pearson’s r.

Mathematically, the correlation coefficient is determined by

1. Summing the products of the standardized score for each $$X_1$$ value and the standardized score for the corresponding $$X_2$$ value, and then
2. Dividing by the number of $$X_1$$, $$X_2$$ pairs.

Most statistical packages, including Excel, Minitab, R, SAS, and SPSS, perform this calculation quickly, easily, and painlessly.
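The two steps above can be sketched in R as follows. This is only a minimal illustration on simulated data: the variables x1 and x2 and their values are made up. One assumption to flag is that scale() standardizes with the sample standard deviation, so the sketch divides by the number of pairs minus one; dividing by the number of pairs itself gives the same result when the scores are standardized with population standard deviations instead.

```r
# Simulated (fake) data, purely for illustration
set.seed(42)
x1 <- rnorm(50, mean = 100, sd = 15)
x2 <- 0.5 * x1 + rnorm(50, mean = 0, sd = 10)

n <- length(x1)

# scale() standardizes with the sample standard deviation (n - 1 denominator)
z1 <- as.numeric(scale(x1))
z2 <- as.numeric(scale(x2))

# Step 1: sum the products of the paired standardized scores
sum_of_products <- sum(z1 * z2)

# Step 2: divide by n - 1 to match the sample standard deviation used by scale()
r_manual <- sum_of_products / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x1, x2)

print(c(manual = r_manual, builtin = r_builtin))
print(all.equal(r_manual, r_builtin))
```

The manual value and the value from cor() should agree up to floating-point rounding, which is a handy sanity check that the two-step description and the built-in function describe the same quantity.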

At this point, we will stop discussing the mathematical formula and instead focus on the output and the ensuing interpretation. However, we will come back to the correlation coefficient, $$r$$, when we discuss regression.

In this example, we will consider five metric variables, including:

1. Average amount spent per transaction
2. Number of transactions
3. Number of pages viewed
4. Number of unique visitors by week
5. Size of discount (e.g., 50% off, $10 off)

All of these numbers are fake and are used for only illustrative purposes. Proceed with that caveat.

Table 1: Correlation Coefficient for Five Variables

| | Average Amount Spent per Transaction | Number of Transactions | Number of Pages Viewed | Number of Users by Week | Size of Discount |
|---|---|---|---|---|---|
| Average Amount Spent per Transaction | 1.00 | | | | |
| Number of Transactions | 0.89 | 1.00 | | | |
| Number of Pages Viewed | 0.43 | 0.29 | 1.00 | | |
| Number of Users by Week | 0.67 | 0.92 | 0.75 | 1.00 | |
| Size of Discount | 0.32 | 0.71 | 0.19 | 0.93 | 1.00 |

All the correlation coefficient values shown in Table 1 are positive. As one variable (e.g., $$X_1$$) moves in a positive direction (out from the origin on the X axis), another variable (e.g., number of pages viewed, or $$X_3$$) moves in a positive direction (up from the origin on the Y axis). Conversely, a negative correlation coefficient would be interpreted as follows: as $$X_1$$ moves in a positive direction (out from the origin on the X axis), $$X_3$$ moves in a negative direction (down toward the origin on the Y axis).

Correlation coefficient values range from -1.00 to 1.00, with 0.00 interpreted as perfectly unrelated, -1.00 as perfectly negatively related, and 1.00 as perfectly positively related. Most analysts will not observe exactly 0, -1, or 1 (the world is messy!).

Table 2: Interpretation for the Remaining Values

| High Value | Low Value | Interpretation |
|---|---|---|
| 1.00 | 0.80 | Danger! Collinearity! |
| 0.79 | 0.60 | Strong relationship |
| 0.59 | 0.40 | Moderate relationship |
| 0.39 | 0.20 | Weak relationship |
| ~0.20 | ~-0.20 | Crud relationship |
| -0.20 | -0.39 | Weak relationship |
| -0.40 | -0.59 | Moderate relationship |
| -0.60 | -0.79 | Strong relationship |
| -0.80 | -1.00 | Danger! Collinearity! |

“Crud factor” comes from philosopher Paul Meehl, who noted that everything is correlated with everything to at least some degree; here, values within roughly ±0.2 are treated as crud. In our example, the number of pages viewed and size of discount (0.19) appear as a crud relationship.

Collinearity suggests that two variables are effectively measuring the same thing twice, which presents problems in regression analysis.
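As a rough aid for reading a correlation table against the bands in Table 2, the lookup can be sketched in R. The function name interpret_r and the handling of values that fall exactly on a boundary (such as 0.20 or 0.80) are assumptions made for illustration, and the input values are the fake Size of Discount correlations from Table 1.

```r
# Label correlation coefficients using the bands from Table 2.
# Bands are applied to |r|; how exact boundary values are assigned is an assumption.
interpret_r <- function(r) {
  stopifnot(all(abs(r) <= 1))
  band_labels <- c("Crud relationship", "Weak relationship",
                   "Moderate relationship", "Strong relationship",
                   "Danger! Collinearity!")
  as.character(cut(abs(r),
                   breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                   labels = band_labels,
                   right = FALSE,           # intervals are [low, high)
                   include.lowest = TRUE))  # except the last, which is [0.8, 1]
}

# Size of Discount row from Table 1 (fake, illustrative numbers)
discount_r <- c(spend = 0.32, transactions = 0.71, pages = 0.19, users = 0.93)
discount_labels <- setNames(interpret_r(discount_r), names(discount_r))
print(discount_labels)
```

A lookup like this is only a reading aid: the bands are rules of thumb, so a value just either side of a cutoff deserves the same scrutiny.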
In our example, the number of users and size of discount seem collinear. The analyst should investigate this issue later in the regression analysis.

## The Null Hypothesis ($$H_0$$)

The null hypothesis, or testable statement, is that there is no relationship between these two variables. This null hypothesis appears similar to the null hypothesis from cross tabulation with a $$\chi^2$$ test for independence. However, unlike a cross tabulation with a $$\chi^2$$ test for independence, the null hypothesis for the correlation test is stated as rho ($$\rho$$) equals 0, where rho refers to the correlation coefficient of the population.

## Other Types of Correlation

There are a number of different types of correlation:

• Point Biserial correlation refers to including a dichotomous or binary variable (e.g., yes/no, on/off) (i.e., nonmetric data) with an interval or ratio variable (i.e., metric data). The interpretation changes to: as we move from no (or off) to yes (or on), the other variable moves in a positive (or negative) direction.
• Spearman Rho correlation is used to find a relationship between two ordinal variables. For example, a web analyst wants to compare page ranks from Google with those from Bing. Spearman Rho correlation could explain how related the two ranks are.
• Phi correlation is applied to understand the relationship between two dichotomous or binary variables. Instead of a Phi correlation, the digital analyst is better off setting up a cross tabulation with a chi square test for independence.

## Performing correlation analysis in R

The base function cor() will perform correlations on a data.frame. Let’s give this a go with some data.

This exercise requires having a web_data data frame. You can either load up some sample data by completing the I/O Exercise (which is what is shown in the step-by-step instructions below), or, if you have access to a Google Analytics account, you can use your own data by following the steps on the Google Analytics API page.
Or, if you have access to an Adobe Analytics account, then you can use your own data by following the Generating web_data steps on the Adobe Analytics API page.

Once you have a web_data data frame to work with, the command head(web_data) should return a table that, at least structurally, looks something like this:

| date | channelGrouping | deviceCategory | sessions | pageviews | entrances | bounces |
|---|---|---|---|---|---|---|
| 2016-01-01 | (Other) | desktop | 19 | 23 | 19 | 15 |
| 2016-01-01 | (Other) | mobile | 112 | 162 | 112 | 82 |
| 2016-01-01 | (Other) | tablet | 24 | 41 | 24 | 19 |

## Let’s correlate!

Correlations will only work with numeric data, so we subset to just those columns and then run the base R function cor() to see a correlation table:

```r
library(knitr)  # provides kable() for printing tables

web_data_metrics <- web_data[, c("sessions", "pageviews", "entrances", "bounces")]

## see correlation between all metrics
kable(cor(web_data_metrics))
```

| | sessions | pageviews | entrances | bounces |
|---|---|---|---|---|
| sessions | 1.0000000 | 0.8384321 | 0.9999923 | 0.9411201 |
| pageviews | 0.8384321 | 1.0000000 | 0.8377078 | 0.6364753 |
| entrances | 0.9999923 | 0.8377078 | 1.0000000 | 0.9416535 |
| bounces | 0.9411201 | 0.6364753 | 0.9416535 | 1.0000000 |

The table is mirrored along the diagonal and provides the correlation coefficient (aka, “$$r$$”) between each pair of metrics that intersect in the cell. 1 means a perfect positive correlation, 0 means no correlation, and -1 means a perfect negative correlation.

Does the R that we’re working on learning today have anything to do with the correlation coefficient $$r$$? Well…no. Or, at least, only to the extent that you can use R-the-platform to calculate r-the-correlation-coefficient. Good question, though!

When working with correlations, it’s always a good idea to view an exploratory plot. A handy function for this is pairs(), which creates a scatter plot for every pairwise combination of the metrics passed in:

```r
pairs(web_data_metrics)
```

Here you can see the correlation numbers in graphical form. For instance, the high correlation of 0.9999923 between sessions and entrances results in an almost perfect straight line.
Since a session starts with an entrance, this makes perfect sense! A correlation noticeably below 1 may be a quick diagnostic that something is wrong with the tracking.

## How do web channels correlate?

One useful piece of analysis is seeing how web channels possibly interact.

### Data Prep

To get the data in the right format, the code below pivots it using the dplyr and tidyr packages:

```r
## Use tidyverse to pivot the data
library(dplyr)
library(tidyr)
library(knitr)  # provides kable()

## Get only desktop rows, and the date, channelGrouping and sessions columns
pivoted <- web_data %>%
  filter(deviceCategory == "desktop") %>%
  select(date, channelGrouping, sessions) %>%
  spread(channelGrouping, sessions)

## Get rid of any NA's and replace with 0
pivoted[is.na(pivoted)] <- 0

kable(head(pivoted))
```

| date | (Other) | Direct | Display | Email | Organic Search | Paid Search | Referral | Social | Video |
|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 | 19 | 133 | 307 | 17 | 431 | 555 | 131 | 68 | 0 |
| 2016-01-02 | 156 | 1003 | 196 | 43 | 1077 | 1060 | 226 | 158 | 3 |
| 2016-01-03 | 35 | 1470 | 235 | 29 | 696 | 489 | 179 | 66 | 90 |
| 2016-01-04 | 31 | 1794 | 321 | 70 | 1075 | 558 | 235 | 46 | 898 |
| 2016-01-05 | 27 | 1899 | 309 | 74 | 1004 | 478 | 218 | 47 | 461 |
| 2016-01-06 | 21 | 1972 | 204 | 299 | 974 | 494 | 246 | 47 | 418 |

Take a minute to examine what the pivoted data looks like. Is it tidy data? Not exactly. But, that’s good! In one sense, we’ve got separate “metrics” for each day now: the channelGrouping-sessions combinations.

### Examining the Data

We can plot and correlate all the metrics for an overview. Because we don’t want to do exactly the same thing as we did earlier (where’s the fun in that?!), let’s go ahead and round the correlation coefficients to two decimal places using the round() function.
Other than that, we’ll do exactly what we already did when we were simply correlating the metrics in our data set:

```r
## can't include the date as it's not numeric, so remove it
cor_data <- pivoted[, -1]  ## not including first column, so -1 subset

cor_table <- round(cor(cor_data), 2)
kable(cor_table)
```

| | (Other) | Direct | Display | Email | Organic Search | Paid Search | Referral | Social | Video |
|---|---|---|---|---|---|---|---|---|---|
| (Other) | 1.00 | 0.13 | 0.12 | 0.09 | 0.17 | 0.03 | -0.04 | 0.46 | 0.37 |
| Direct | 0.13 | 1.00 | 0.04 | 0.07 | 0.22 | -0.02 | 0.09 | -0.05 | 0.01 |
| Display | 0.12 | 0.04 | 1.00 | 0.01 | -0.05 | 0.05 | -0.10 | 0.19 | 0.27 |
| Email | 0.09 | 0.07 | 0.01 | 1.00 | 0.26 | 0.03 | 0.13 | 0.03 | 0.13 |
| Organic Search | 0.17 | 0.22 | -0.05 | 0.26 | 1.00 | 0.51 | 0.19 | -0.04 | 0.08 |
| Paid Search | 0.03 | -0.02 | 0.05 | 0.03 | 0.51 | 1.00 | 0.08 | -0.01 | 0.22 |
| Referral | -0.04 | 0.09 | -0.10 | 0.13 | 0.19 | 0.08 | 1.00 | -0.11 | -0.25 |
| Social | 0.46 | -0.05 | 0.19 | 0.03 | -0.04 | -0.01 | -0.11 | 1.00 | 0.50 |
| Video | 0.37 | 0.01 | 0.27 | 0.13 | 0.08 | 0.22 | -0.25 | 0.50 | 1.00 |

```r
pairs(cor_data)
```

### Analysis

Now, when we compare channels, we see much looser correlations for this dataset, which makes sense, right? Correlations under 0.3 are, as a rule-of-thumb, not worth considering, so the standouts look to be Social vs. Video and Paid Search vs. Organic Search.

Plotting those channels, we can examine the trends to see the shape of the data. Correlation has helped us zero in on possibly interesting relationships:

```r
library(ggplot2)

gg <- ggplot(data = pivoted) +
  theme_minimal() +
  ggtitle("Paid (blue) vs Organic (green) search")

## column names containing spaces need backticks inside aes()
gg <- gg + geom_line(aes(x = as.Date(date), y = `Paid Search`), col = "blue")
gg + geom_line(aes(x = as.Date(date), y = `Organic Search`), col = "green")
```

We can see here that the trends do look similar, but with a paid search peak in the first quarter. (As we look at this, we might want to consider these spikes as outliers, either simply by the visual or using a more defined method for detecting outliers; it would be quite simple to remove this data from the data set and run the analysis again…but we’re not going to go down that particular rabbit hole right now.)

```r
library(ggplot2)

gg <- ggplot(data = pivoted) +
  theme_minimal() +
  ggtitle("Social (red) vs Video (orange)")
gg <- gg + geom_line(aes(x = as.Date(date), y = Social), col = "red")
gg + geom_line(aes(x = as.Date(date), y = Video), col = "orange")
```

Here, a peak in Social late in the year looks to have coincided with a peak in Video: possibly a campaign driving video views?

## Cross correlation

The correlations above all compare the same date point, but what if you expect a lagged effect? Perhaps the video traffic drove social traffic later on due to client advocacy?

Cross correlations are useful when dealing with time-series data and can examine if a metric has an influence on itself or another after some time has passed. This can be a powerful way to find if, say, a TV or Display campaign increased SEO traffic over the course of the few weeks following the campaign.

The code below compares Social with Video using the ccf() function. The result is the correlation for different lags of days. We can see a correlation at 0 lag of around 0.5, but the correlation increases if you lag the Social trend by up to 10 days.

```r
ccf(pivoted$Social, pivoted$Video)
```

You could then conclude that Video was having a lagged effect on Social traffic, with the two series offset by up to 10 days.

But, beware! The nature of cross-correlation is that if both datasets have a similar looking spike, cross-correlation will highlight it. Careful examination of the raw data trends should be performed to verify it. In some cases a smoother line will help get rid of spikes that affect the data (e.g., do the analysis on weekly or monthly data instead of daily).

And, of course, there is no substitute for rational thought: if you find a relationship like this, can you explain it rationally? And, if so, can you conduct further analysis to validate that rationalization? (If only R had an Easy Button to do the actual thinking, too. Alas!)
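
To make the lag reading and the weekly-smoothing suggestion more concrete, here is a minimal sketch on simulated data rather than web_data. Everything in it is hypothetical: the series video and social, the 3-day delay, and the noise levels are invented so that the "right answer" is known in advance. It relies on the documented convention that the lag k value from ccf(x, y) estimates the correlation between x[t+k] and y[t].

```r
# Simulated (fake) daily series: social echoes video 3 days later
set.seed(123)
n_days   <- 210
true_lag <- 3  # hypothetical delay, chosen for illustration

video  <- rnorm(n_days, mean = 100, sd = 20)
# Social on day t is half of Video on day t - 3, plus noise
social <- 0.5 * c(rep(NA, true_lag), head(video, -true_lag)) + rnorm(n_days, sd = 3)

daily <- data.frame(video = video, social = social)
daily <- daily[complete.cases(daily), ]  # drop the first days with no lagged value

# plot = FALSE returns the numbers instead of drawing the chart
cc <- ccf(daily$social, daily$video, lag.max = 14, plot = FALSE)

# The lag with the largest correlation; with ccf(social, video), a positive lag
# means social at day t + k lines up with video at day t (video leads)
best_lag <- as.numeric(cc$lag[which.max(cc$acf)])
print(best_lag)  # 3 here, because social was built from video shifted by 3 days

# Weekly smoothing: sum the daily values into 7-day blocks before correlating
daily$week <- (seq_len(nrow(daily)) - 1) %/% 7
weekly <- aggregate(cbind(video, social) ~ week, data = daily, FUN = sum)
print(cor(weekly$video, weekly$social))
```

Checking the sign convention on a series where the delay is known, as above, is a cheap way to avoid reading a ccf() plot backwards before applying it to real channel data.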