PIPING HOT DATA: Welcome to Piping Hot Data

Shannon Pileggi

So do I have piping hot data or am I piping hot data? Let’s break it down.

Piping Hot Data

Wow, my data is piping hot! This connotes exciting, newly released data! I can’t promise that I’ll fulfill this expectation. Let’s just say that I’ll talk about data occasionally, and on special occasions it might even be piping hot.

Piping Hot Data

Here, I am piping my elusive hot data. This is what I was really going for - an ode to the pipe in R:

%>%

The pipe operator simplifies long operations by chaining multiple functions together, passing the result of each step on to the next. Although the coding construct of the pipe has been floating around since the 1970s, the floodgates didn’t open for R until 2014. I don’t know the exact date, but I do remember the first time I saw those three characters and the sea of emotions that rained down. Confusion. Curiosity. Excitement.
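To make the mechanics concrete, here is a minimal sketch. The vector x and the choice of sum, sqrt, and round are purely illustrative assumptions, not part of any real analysis. The point is that x %>% f(y) is evaluated as f(x, y), so a nested call and a piped chain should agree.

```r
library(magrittr)  # the package that provides %>%

# Illustrative input; any numeric vector would do
x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(sqrt(sum(x)), 1)

# Piped form: read from left to right
piped <- x %>% sum() %>% sqrt() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The left-hand side always lands in the first argument of the function on the right, which is why round(1) above means "round the incoming value to 1 digit".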

In just 4 short years, the pipe and its friends in the tidyverse have revolutionized how we code in R, to the point that you may feel illiterate at conferences if you don’t have some baseline understanding, at least at first. The beauty of the pipe is that it makes R code so much more readable that, even if you have never used it, you can still get the gist of what is going on. So much so that believers are proselytizing “Teach the tidyverse to beginners”!

Let’s lay the pipelines with a quick example using the classic iris data set. To get started, load the tidyverse library and get an overview of the data.

library(tidyverse)
glimpse(iris)

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.~
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.~
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.~
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.~
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa,~

Our simple objective is to compute the mean Sepal.Length for each Species in the data set and then arrange the results in descending order. There are many ways to accomplish this without the tidyverse, but for the sake of side-by-side comparisons I’ll demo this using tidyverse functions first without and then with piping.

arrange(
  summarise(
    group_by(iris, Species), 
    mean = mean(Sepal.Length)
  ), 
  desc(mean)
)

# A tibble: 3 x 2
  Species     mean
  <fct>      <dbl>
1 virginica   6.59
2 versicolor  5.94
3 setosa      5.01

Without using pipes, we have to read our code from the inside out to understand the operations executed. Our iris data set is buried in the middle of the code, and the group_by, summarise, and arrange operations spring outward from there (reading up from iris). Now let’s try the same manipulations utilizing piping.

iris %>% 
  group_by(Species) %>% 
  summarise(mean = mean(Sepal.Length)) %>%
  arrange(desc(mean))

# A tibble: 3 x 2
  Species     mean
  <fct>      <dbl>
1 virginica   6.59
2 versicolor  5.94
3 setosa      5.01

Voilà! It’s clear from the left side of the pipe that all manipulations are done on the iris data set, and it’s clear from the right side of the pipe that the series of operations performed are group_by, summarise, and arrange. Wow, I like the way data flow through those pipes!
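For comparison, here is a rough sketch of one of the many non-tidyverse routes mentioned earlier, using only base R. The object name means is my own choice for illustration, and aggregate is just one of several base functions that could do the job.

```r
# Mean Sepal.Length per Species, using base R only
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Sort the rows in descending order of the mean
means <- means[order(-means$Sepal.Length), ]

means
```

This should give the same ordering of species as the tidyverse versions, though the result is a plain data frame rather than a tibble, and the summary column keeps the name Sepal.Length.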

While the name of this blog gives a nod to the powerful pipe, the pipe isn’t going to permeate every solution to programming challenges. So here is what to expect from Piping Hot Data:

Demo data science tools and methods.
Discover new data and R packages.
Deliberate data and technical topics.

I hope you enjoy!

Acknowledgements

Thumbnail artwork by @allison_horst.
