Hands-On Exploratory Data Analysis with R

Converting rectangular data into R with the readr R package

Tabular data, or flat rectangular data, comes in many different formats, including CSV and TSV. R's readr package provides an easy and flexible way to import such files into R. It also fails gracefully: when some values cannot be parsed, it reports the problems instead of aborting the whole import. You can load the readr package with the following command:

library(readr)

The simplest way to import data with the readr package is to call the read function that matches the type of file you are reading. For example, the following screenshot shows a CSV file containing data about automobiles. This data is also bundled as an example dataset with the readr package:

Use the following command to read the CSV file. readr prints the type it guessed for each column:
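As a minimal sketch of how the read function follows the file type, the snippet below writes the same small table to temporary CSV and TSV files and reads each one back with read_csv and read_tsv. The file contents and variable names are made up for illustration and are not part of the book's dataset.

```r
library(readr)

# Hypothetical sample data, written to temporary files for illustration
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("name,mpg", "Mazda RX4,21.0", "Datsun 710,22.8"), csv_path)
writeLines(c("name\tmpg", "Mazda RX4\t21.0", "Datsun 710\t22.8"), tsv_path)

# Comma-separated values are read with read_csv()
from_csv <- read_csv(csv_path, col_types = "cd")

# Tab-separated values are read with read_tsv()
from_tsv <- read_tsv(tsv_path, col_types = "cd")

# Both calls should produce the same values
identical(from_csv$mpg, from_tsv$mpg)
```

Passing col_types here only keeps the console quiet; leaving it out would make readr guess the types and print the specification.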

read_csv("mtcars.csv")
#> Parsed with column specification:
#> cols(
#> mpg = col_double(),
#> cyl = col_double(),
#> disp = col_double(),
#> hp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_double(),
#> am = col_double(),
#> gear = col_double(),
#> carb = col_double()
#> )

Here, we have a CSV data file, so we used the read_csv function and passed the file path, including the filename, as its argument.

readr returns a tibble after reading in the data, and it also prints the column specification. A tibble is a variant of the data frame that stores values in rows and columns. Here, we use readr_example() to locate a data file bundled with readr and save the resulting tibble in a variable:

cars_data <- read_csv(readr_example("mtcars.csv"))

After reading the data, readr prints the column specification it used. This console output is very useful for debugging. If you notice that a column was given the wrong type, you can copy the printed specification, edit it, and pass it to a new call. The output looks as follows:

#> Parsed with column specification:
#> cols(
#> mpg = col_double(),
#> cyl = col_double(),
#> disp = col_double(),
#> hp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_double(),
#> am = col_double(),
#> gear = col_double(),
#> carb = col_double()
#> )
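One way to reuse the printed specification is sketched below: the cols(...) block is copied into the col_types argument and a few entries are edited. Treating cyl, gear, and carb as integers is an illustrative choice made for this sketch, not something the text prescribes.

```r
library(readr)

# Start from the printed specification and change only the columns
# that should not be doubles; cyl, gear, and carb hold whole numbers
cars_data <- read_csv(
  readr_example("mtcars.csv"),
  col_types = cols(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double(),
    hp = col_double(),
    drat = col_double(),
    wt = col_double(),
    qsec = col_double(),
    vs = col_double(),
    am = col_double(),
    gear = col_integer(),
    carb = col_integer()
  )
)

# Inspect the specification that was actually applied
spec(cars_data)
```

Because the types are now stated explicitly, readr no longer needs to guess them, and a value that does not fit the declared type is reported as a parsing problem.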

The read_csv function uses the first line of the CSV file as the column names. Sometimes, however, the first few lines of a data file contain extra information, and the column names start further down. We can use the skip parameter to skip a given number of lines, as follows:

read_csv("data.csv", skip = 2)

For example, in the preceding code, we skipped the first two lines of the file and asked readr to start reading from the third line.
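The effect of skip can be seen on a small file created on the fly. This is only a sketch: the file contents, the path variable, and the column names are hypothetical.

```r
library(readr)

# Hypothetical file: two lines of notes precede the header row
path <- tempfile(fileext = ".csv")
writeLines(
  c("Exported by an example tool",
    "Units: miles per gallon",
    "model,mpg",
    "Mazda RX4,21.0",
    "Datsun 710,22.8"),
  path
)

# skip = 2 drops the two note lines, so line 3 supplies the column names
skipped <- read_csv(path, skip = 2, col_types = "cd")
names(skipped)
```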

Sometimes, the data doesn't have column names. In that case, we can pass the col_names = FALSE argument to the read_csv function, which tells it to treat the first line as data rather than as a header. readr then names the columns X1, X2, and so on:

read_csv("data.csv", col_names = FALSE)
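A short sketch of both ways to handle a header-less file follows, assuming a made-up two-column file. Besides FALSE, col_names also accepts a character vector of names, which is often more convenient than renaming the columns afterwards.

```r
library(readr)

# Hypothetical file without a header row
path <- tempfile(fileext = ".csv")
writeLines(c("Mazda RX4,21.0", "Datsun 710,22.8"), path)

# The first line is kept as data; readr generates placeholder names
no_header <- read_csv(path, col_names = FALSE, col_types = "cd")
names(no_header)

# Alternatively, supply the column names directly
named <- read_csv(path, col_names = c("model", "mpg"), col_types = "cd")
names(named)
```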

readr functions support passing in column specifications to customize how the data is read. For example, you can specify the type of each column with the col_types argument. Specifying the column types is often a good idea, because readr then no longer has to guess them, and any value that does not match the declared type is reported as a parsing problem:

cars_data <- read_csv(readr_example("mtcars.csv"), col_types = "ddddddddddd")

Here, each d in the string stands for one column, in order, and declares it as a double. The following are the column types supported by readr, along with their single-letter abbreviations:

  • col_logical() [l]: Logical values; contains only T, F, TRUE, or FALSE
  • col_integer() [i]: Integers
  • col_double() [d]: Doubles
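  • col_euro_double() [e]: Euro doubles that use , as the decimal separator (available only in early readr releases; later versions set the decimal mark through the locale argument instead)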
  • col_date() [D]: Y-m-d dates
  • col_datetime() [T]: ISO 8601 date times
  • col_character() [c]: Everything else
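The abbreviations and the col_*() functions can be used interchangeably. The sketch below reads a small, made-up file twice, once with a compact string and once with an explicit cols() specification. The file contents and column names are assumptions chosen to exercise several of the types listed above.

```r
library(readr)

# Hypothetical file with one column per type of interest
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,price,in_stock,added,label",
    "1,9.5,TRUE,2020-01-15,widget",
    "2,12.25,FALSE,2020-02-01,gadget"),
  path
)

# Compact form: one letter per column, in column order
compact <- read_csv(path, col_types = "idlDc")

# Equivalent explicit form using the col_*() functions
explicit <- read_csv(
  path,
  col_types = cols(
    id = col_integer(),
    price = col_double(),
    in_stock = col_logical(),
    added = col_date(),
    label = col_character()
  )
)

# Show the class of each column
sapply(compact, class)
```

The compact string is quicker to type, while the explicit form is easier to read and does not depend on column order.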

The read_csv function accepts a number of other options when reading data. For example, the full read_csv signature, with its default values, looks like this:

read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)

The following are the main parameters used in the preceding code:

  • file: The path to the file to read
  • col_names: Whether to use the first line of the CSV file as column names
  • col_types: The type of each column
  • locale: The locale to use when parsing, which controls region-specific settings

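The remaining parameters are secondary. They control details such as which strings count as missing values (na), the quoting character (quote), the comment marker (comment), whitespace trimming (trim_ws), and how many lines to skip or read (skip and n_max).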
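As a rough illustration of two of the secondary parameters, the following sketch reads a made-up file in which the word missing marks an absent value, and limits the import to the first three data rows. The file contents and variable names are hypothetical.

```r
library(readr)

# Hypothetical file where "missing" stands for an absent score
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,score",
    "1,10",
    "2,missing",
    "3,30",
    "4,40"),
  path
)

subset_data <- read_csv(
  path,
  na = c("", "NA", "missing"),  # also treat "missing" as NA
  n_max = 3,                    # read at most three data rows
  col_types = "id"              # integer id, double score
)
subset_data
```

Declaring the extra missing-value string lets the score column be parsed as a double instead of falling back to character.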