What’s in a name?
So do I have piping hot data or am I piping hot data? Let’s break it down.
Piping Hot Data
Wow, my data is piping hot! This connotes exciting, newly released data! I can’t promise that I’ll fulfill this expectation. Let’s just say that I’ll talk about data occasionally, and on special occasions it might even be piping hot.
Piping Hot Data
Here, I am piping my elusive hot data. This is what I was really going for - an ode to the pipe in R:
%>%
The pipe operator simplifies long operations by chaining multiple functions together, passing the result of each step on to the next. Although the coding construct of the pipe has been floating around since the 1970s, the floodgates didn’t open for R until 2014. I don’t know the exact date, but I do remember the first time I saw those three characters and the sea of emotions that rained down. Confusion. Curiosity. Excitement.
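To make the mechanics concrete, here is a minimal sketch, assuming the magrittr package (the home of %>%) is installed. The vector x and the choice of functions are made up purely for illustration. The pipe takes the value on its left and hands it to the function on its right as the first argument, so the nested and piped versions should agree.

```r
library(magrittr)  # provides the %>% operator

# Hypothetical toy input, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right; each result becomes the
# first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms run the same three steps in the same order; only the reading direction changes.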
In just four short years, the pipe and its friends in the tidyverse have revolutionized how we code in R, to the point that you may feel illiterate at conferences if you don’t have some baseline understanding - at first. The beauty of the pipe is that it streamlines the readability of R code, so that even if you have never used it, you can still get the gist of what is going on. So much so that believers are proselytizing “Teach the tidyverse to beginners”!
Let’s lay the pipelines with a quick example using the classic iris data set. To get started, load the tidyverse library and get an overview of the data.
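The code for this step is not shown here. A likely reconstruction, assuming the overview came from glimpse() (exported by dplyr, which the tidyverse attaches), would be:

```r
library(tidyverse)  # attaches dplyr, which exports glimpse() and %>%

# Print a compact, column-by-column overview of the built-in iris data
glimpse(iris)
```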
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.~
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.~
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.~
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.~
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa,~
Our simple objective is to compute the mean Sepal.Length for each Species in the data set and then arrange the results in descending order. There are many ways to accomplish this without the tidyverse, but for the sake of side-by-side comparisons I’ll demo this using tidyverse functions first without and then with piping.
arrange(
  summarise(
    group_by(iris, Species),
    mean = mean(Sepal.Length)
  ),
  desc(mean)
)
# A tibble: 3 x 2
Species mean
<fct> <dbl>
1 virginica 6.59
2 versicolor 5.94
3 setosa 5.01
Without using pipes, we have to read our code from the inside out to understand the operations executed. Our iris data set is buried in the middle of the code, and the group_by, summarise, and arrange operations spring outward from there (reading up from iris). Now let’s try the same manipulations utilizing piping.
iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length)) %>%
  arrange(desc(mean))
# A tibble: 3 x 2
Species mean
<fct> <dbl>
1 virginica 6.59
2 versicolor 5.94
3 setosa 5.01
Voilà! It’s clear from the left side of the first pipe that all manipulations are done on the iris data set, and it’s clear from the right side of each pipe that the series of operations performed are group_by, summarise, and arrange. Wow, I like the way data flow through those pipes!
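For comparison, here is a sketch of one of the many non-tidyverse routes mentioned earlier, using only base R. The object name species_means is an assumption made for this illustration, and this is just one possible approach.

```r
# Mean Sepal.Length per Species using base R only (iris ships with R)
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Rename the summary column to match the tidyverse example
names(species_means)[2] <- "mean"

# Sort rows by descending mean
species_means <- species_means[order(-species_means$mean), ]
species_means
```

The result should list the species in the same order as the tidyverse versions, though it takes a rename and an explicit reordering step to get there.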
While the name of this blog gives a nod to the powerful pipe, the pipe isn’t going to permeate every solution to programming challenges. So here is what to expect from Piping Hot Data:
I hope you enjoy!
Thumbnail artwork by @allison_horst.
For attribution, please cite this work as
Pileggi (2018, Nov. 5). PIPING HOT DATA: Welcome to Piping Hot Data. Retrieved from https://www.pipinghotdata.com/posts/2018-11-05-welcome-to-piping-hot-data/
BibTeX citation
@misc{pileggi2018welcome,
  author = {Pileggi, Shannon},
  title = {PIPING HOT DATA: Welcome to Piping Hot Data},
  url = {https://www.pipinghotdata.com/posts/2018-11-05-welcome-to-piping-hot-data/},
  year = {2018}
}