Importing Data from a CSV File

The most universal data format is .csv (comma-separated values). This is a format that almost every program can open, and that most data-creation platforms can export data to. And, it has the advantage of being human-readable if you open it for examination with, say, Notepad.

read.csv()

See ?read.csv to get an overview of the arguments for importing .csv files. If you read carefully, you’ll see that read.csv is actually a derived function from the more general read.table function, which can read other formats as well, such as tab-separated (.tsv).

read.csv() basically hardcodes defaults suitable for .csv in the UK/US, while read.csv2() has defaults set for European formats where currency uses a comma, ,, which, to avoid confusion, means a semi-colon ; is used instead to separate values.

A helpful argument to also include is stringsAsFactors = FALSE. This is a classic R quirk. We won’t get into it here, but know that it will avoid future problems where R treats strings as “factors”, which can cause unexpected results or, in other cases, prevent functions from working at all. (Factors are handy at times. It just seems like it’s more common to want values to not be factors, and, by default, R tends to have a bias towards viewing string-based data as factors; maybe that’s all well and good in social sciences or elsewhere, but it can trip things up in our world.)

Example CSV Import

A demo file called td.csv is available from http://bit.ly/rSampleGA.

If you save it to the ./data/ folder of the project you’re working with, then you can import it into your environment using the code below:

# Set the name/location of the file. We could put this inside the read.csv function,
# but it tends to make for cleaner, more manageable code to first assign it to an object.
# In this case, we're calling that object 'filename.' Clever, right?
filename <- "./data/td.csv"

# Now, actually read the file in and put the data in a data frame called 'mydata'.
mydata <- read.csv(filename, stringsAsFactors = FALSE)

# Finally, as a quick check, view the first six rows of data in the data frame using the
# head() function.
head(mydata)
      X       date sessions bounceRate      ratio
1     1 2015-01-01        6   83.33333 0.07200000
2     2 2015-01-02       15   93.33333 0.16071429
3     3 2015-01-03        7  100.00000 0.07000000
4     4 2015-01-04        8   75.00000 0.10666667
5     5 2015-01-05       21   76.19048 0.27562500
6     6 2015-01-06       20   95.00000 0.21052632

When first bringing data in, it can be useful to do a quick check on the structure of the data. As luck would have it, there’s a function for that: str(). This function provides a quick check of the data types, as well as provides a few sample values (when possible).

str(mydata)
'data.frame':   545 obs. of  5 variables:
 $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date      : chr  "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
 $ sessions  : int  6 15 7 8 21 20 13 23 16 13 ...
 $ bounceRate: num  83.3 93.3 100 75 76.2 ...
 $ ratio     : num  0.072 0.161 0.07 0.107 0.276 ...

From looking at the structure (and/or the head), we can see that the X column is redundant and the ratio column is useless, so we can delete those by setting them to NULL:

# Get rid of the `X` and `ratio` columns.
mydata$X <- NULL
mydata$ratio <- NULL

Raw data often arrives with less-than-ideal column names, and the names() function can be used to get or set the column heading names. So, we can go ahead and use that function to put in some cleaner column names:

# Change the names
names(mydata) <- c("Date", "Visits", "BounceRate")

Now, if we run our head() function again, we can see that we’ve got a cleaner data frame.

head(mydata)
        Date Visits BounceRate
1 2015-01-01      6   83.33333
2 2015-01-02     15   93.33333
3 2015-01-03      7  100.00000
4 2015-01-04      8   75.00000
5 2015-01-05     21   76.19048
6 2015-01-06     20   95.00000

Hopefully, it’s obvious that, not only did we walk through after importing the data and do some cleanup, but we did so with code that can now be included as part of a script and reused with future data that we know will be arriving in the same format. (We wouldn’t necessarily include the str() and head() functions in that script, but we could/would include all of the operations that actually adjust the data.)