A few thoughts about pipes in R

August 25, 2023

  R tips pipe internals
  base magrittr dplyr

Kevin Cazelles

David Beauchesne

     

Tl;DR
Using pipes in R makes code cleaner. There are two good options for pipes: magrittr which brings the forward pipe %>% alongside four other pipes and the native pipe |> introduce in R 4.1.0. I have been piping for years, I started with the pipes from the magrittr package, now I use the native pipe.'

Piping in R

Back in 2014, I discovered the forward pipe for R introduced in magrittr and since that time, I never stopped piping, although my piping habits evolved over time, especially with the introduction of the native pipe in R 4.1.

When I started using pipes in R, I had some experience with the bash pipe, |, which basically passes the output of a function to the input of a second one, and so allow to do chaining operations. But using pipes in R was a major breakthrough: with a simple infix operator, lines of code involving a collection of nested function calls were suddenly turned into one readable data recipe. Let’s take an example and create a data pipeline where we apply a statistical model, model_1(), to the data set data_1 after two steps of data preprocessing, transform_1() and transform_2(). Here is the code without pipes:

1
2
3
4
5
6
7
res <- model_1(
    transform_1(
        transform_2(
            data_1
        )
    )
)

For now, let’s refer to the forward pipe as %pipe%. With pipe, one would write

1
transform_2(data_1)

as

1
data_1 %pipe% transform_2()

data_1 is the left hand side of the pipe (often abbreviated lhs) and transform_2() is its right hand side (lhs). Let’s rewrite the code using it:

1
2
3
4
var <- data_1 %pipe%
    transform_1() %pipe%
    transform_2() %pipe%
    model_1()

There are two main facts, one quickly grasps when looking at the two blocks of code:

  1. with pipes, you stop reading backwards: data_1 is at the beginning of the block, and model_1 is now at the last line, not at the first line;
  2. with pipes, it is easier to deal with parentheses, the code is more readable.

A little more subtle is that is it easier to comment out parts of the recipe. Say we need to comment out transform_2(), without pipes, we would do something like this:

1
2
3
4
5
6
7
var <- model_1(
    #transform_2(
        transform_1(
            inti_var
        )
    #)
)

whereas with pipes, the code would look something like that:

1
2
3
4
var <- ini_var %pipe%
    transform_1() %pipe%
    #transform_2() %pipe%
    model_1()

This is not a major concern here, but it does help in more complex cases. Having code easy to read and easy to manipulate is particularly relevant for the R community because we are a group of data recipe writers and our recipes may include tens of steps. It is thus no surprise that the community quickly adopted the magrittr’s pipe and tidyverse tremendously helped in popularizing the use of pipes in the community (e.g. see commit 89aaa9a8b of dplyr on April 14th, 2014 where the magrittr pipe was adopted).

magrittr pipes

A package of internals

I am assuming that most of R users are familiar with the forward pipe %>% and its placeholder . – the symbols representing the object being forwarded. I am also assuming that most of R users have used it through the meta package tidyverse, or one of the packages included, most likely dplyr. So here, instead of focusing on how to use the forward pipe, I would like to mention a few technical details as well as the other pipes magrittr includes.

magrittr is a package that brings the forward pipe %>% aklong with four other pipes: %<>%, %$%, %!>%, %T>% as well as a several functions that can be used with the pipe.

If you look at the source code, pipes are defined in pipe.R and for instance the following lines defined the forward pipe (see pipe.R L130-L137):

1
2
3
4
5
6
7
8
`%>%` <- function(lhs, rhs) {
  lhs <- substitute(lhs)
  rhs <- substitute(rhs)
  kind <- 1L
  env <- parent.frame()
  lazy <- TRUE
  .External2(magrittr_pipe)
}

the last line is a call to the primitive .External2() that will call an external C function, magrittr_pipe(), that is an R internal structure (a SEXP, see Rinternals.h). If you go to the folder source you will find the lines in pipe.c that define magrittr_pipe() :

1
2
3
4
SEXP magrittr_pipe(SEXP call, SEXP op, SEXP args, SEXP rho) {
  args = CDR(args);
  [...]
}

Hence, when you load magrittr, you are using new internal functions including the forward pipe along with four others pipes:

The four additional pipes

Let’s load magrittr and let me give an example for these pipes. The assignment pipe allows you to assign the value while piping. Here is a example where you would use two steps in your data pipeline to modify the dataset CO2.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
library(magrittr)
library(dplyr)
library(datasets) # CO2 is included in this package
CO2b <- CO2 %>%
    filter(conc > 200)
dim(CO2b)
#R>  [1] 60  5
CO2b <- CO2b %>%
    mutate(conc2 = 2 * conc)
head(CO2b)
#R>    Plant   Type  Treatment conc uptake conc2
#R>  1   Qn1 Quebec nonchilled  250   34.8   500
#R>  2   Qn1 Quebec nonchilled  350   37.2   700
#R>  3   Qn1 Quebec nonchilled  500   35.3  1000
#R>  4   Qn1 Quebec nonchilled  675   39.2  1350
#R>  5   Qn1 Quebec nonchilled 1000   39.7  2000
#R>  6   Qn2 Quebec nonchilled  250   37.1   500

With the assignment pipe you can use CO2b %<>% instead of CO2b <- CO2b

1
2
3
4
5
6
7
CO2c <- CO2 %>%
    filter(conc > 200)
dim(CO2c)
#R>  [1] 60  5
CO2c %<>% mutate(conc2 = 2 * conc)
identical(CO2b, CO2c) # check whether we are creating the same object
#R>  [1] TRUE

With the Tee pipe, you can call a function without including the output in the pipeline while retaining the side effect of the function. This van be very handy for print and plot functions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
CO2d <- CO2 %>%
    filter(conc > 200) %>%
    mutate(conc2 = 2 * conc) %T>%
    print(max = 50)
#R>    Plant   Type  Treatment conc uptake conc2
#R>  1   Qn1 Quebec nonchilled  250   34.8   500
#R>  2   Qn1 Quebec nonchilled  350   37.2   700
#R>  3   Qn1 Quebec nonchilled  500   35.3  1000
#R>  4   Qn1 Quebec nonchilled  675   39.2  1350
#R>  5   Qn1 Quebec nonchilled 1000   39.7  2000
#R>  6   Qn2 Quebec nonchilled  250   37.1   500
#R>  7   Qn2 Quebec nonchilled  350   41.8   700
#R>  8   Qn2 Quebec nonchilled  500   40.6  1000
#R>   [ reached 'max' / getOption("max.print") -- omitted 52 rows ]
identical(CO2d, CO2c)
#R>  [1] TRUE

The exposition pipe exposes the names of the lhs of the pipe so they can be called from the rhs of the pipe.

1
2
3
4
5
6
7
8
CO2 %>%
    head() %>%
    print(conc) # does not work
#R>  Error: object 'conc' not found
CO2 %>%
    head() %$%
    print(conc)
#R>  [1]  95 175 250 350 500 675

Finally, the eager pipe overcomes the lazy evaluation behaviour in R which is beyond the scope of this post, but if you are curious, have a look at this great blog post by Collin Fay. So here is an example where two functions that prompt a message on evaluation (pay attention to the order of the messages):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
trans1 <- function(x) {
    cat("------- 1. trans1\n")
    x$nrow <- nrow(x)
    x
}
trans2 <- function(x) {
    cat("------- 2. trans2\n")
    head(x, 4)
}
# with the forward pipe (functions are evaluated lazyly)
CO2e <- CO2 %>%
    trans1() %>%
    trans2()
#R>  ------- 2. trans2
#R>  ------- 1. trans1
# with the eager pipe (functions are evaluated eagerly)
CO2f <- CO2 %!>%
    trans1() %!>%
    trans2()
#R>  ------- 1. trans1
#R>  ------- 2. trans2
# final outputs are the same
CO2f
#R>    Plant   Type  Treatment conc uptake nrow
#R>  1   Qn1 Quebec nonchilled   95   16.0   84
#R>  2   Qn1 Quebec nonchilled  175   30.4   84
#R>  3   Qn1 Quebec nonchilled  250   34.8   84
#R>  4   Qn1 Quebec nonchilled  350   37.2   84
identical(CO2e, CO2f)
#R>  [1] TRUE

The native pipe

R 4.1 introduced the native pipe |>. In the NEWS file, section CHANGES IN R 4.1.0, the pipe was announced with the following message :

R now provides a simple native forward pipe syntax |>. The simple form of the forward pipe inserts the left-hand side as the first argument in the right-hand side call. The pipe implementation as a syntax transformation was motivated by suggestions from Jim Hester and Lionel Henry.

Several blog posts have explained how to use it (e.g. this blog post on Towards data science). I was curious about the changes that come with such new feature, so I did a quick search in the source code (using the mirror available on GitHub at https://github.com/wch/r-source):

1
2
3
4
git log --oneline -S 'native pipe'

263d6bcf0b Use => syntax to pass pipe lhs to non-first-argument on rhs.
a1425adea5 Added native pipe and function shorthand syntax.

I then checked the files modified in commit a1425adea5:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
git show --name-only a1425adea5 
commit a1425adea54bcc98eef86081522b5dbb3e149cdc
Author: luke <luke@00db46b3-68df-0310-9c12-caf00c1e9a41>
Date:   Thu Dec 3 22:59:31 2020 +0000
 
    Added native pipe and function shorthand syntax.
    
    git-svn-id: https://svn.r-project.org/R/trunk@79553 00db46b3-68df-0310-9c12-caf00c1e9a41
 
doc/NEWS.Rd
src/include/Rinternals.h
src/library/base/man/function.Rd
src/library/base/man/pipeOp.Rd
src/main/gram.c
src/main/gram.y
src/main/names.c

Commit a1425adea5 introduced the pipe and 7 files were modified to do so. Note that git show one can quickly check all the changes:

1
git show a1425adea54bcc98eef86081522b5dbb3e149cdc

and with -- path one may focus on a specific file:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
git show a1425adea54bcc98eef86081522b5dbb3e149cdc -- src/include/Rinternals.h

commit a1425adea54bcc98eef86081522b5dbb3e149cdc
Author: luke <luke@00db46b3-68df-0310-9c12-caf00c1e9a41>
Date:   Thu Dec 3 22:59:31 2020 +0000
 
    Added native pipe and function shorthand syntax.
    
    
    git-svn-id: https://svn.r-project.org/R/trunk@79553 00db46b3-68df-0310-9c12-caf00c1e9a41
 
diff --git a/src/include/Rinternals.h b/src/include/Rinternals.h
index 143cf24ab7..0a5e446b0e 100644
--- a/src/include/Rinternals.h
+++ b/src/include/Rinternals.h
@@ -1034,6 +1034,7 @@ LibExtern SEXP    R_DotsSymbol;       /* "..." */
 LibExtern SEXP R_DoubleColonSymbol;// "::"
 LibExtern SEXP R_DropSymbol;       /* "drop" */
 LibExtern SEXP R_EvalSymbol;       /* "eval" */
+LibExtern SEXP R_FunctionSymbol;   /* "function" */
 LibExtern SEXP R_LastvalueSymbol;  /* ".Last.value" */
 LibExtern SEXP R_LevelsSymbol;     /* "levels" */
 LibExtern SEXP R_ModeSymbol;       /* "mode" */

Investigating further, I found that the symbol |> is declared in names.c and that xxpipe() is basically the definition of the pipe (see gram.y).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
static SEXP    xxpipe(SEXP, SEXP);
/* [...] */

static SEXP xxpipe(SEXP lhs, SEXP rhs) {
    SEXP ans;
    if (GenerateCode) {
       /* allow for rhs lambda expressions */
       if (TYPEOF(rhs) == LANGSXP && CAR(rhs) == R_FunctionSymbol)
           return lang2(rhs, lhs);
                   
       if (TYPEOF(rhs) != LANGSXP)
           error(_("The pipe operator requires a function call "
                   "or an anonymous function expression as RHS"));

        SEXP fun = CAR(rhs);
        SEXP args = CDR(rhs);


       /* rule out syntactically special functions */
       /* the IS_SPECIAL_SYMBOL bit is set in names.c */
       if (TYPEOF(fun) == SYMSXP && IS_SPECIAL_SYMBOL(fun))
           error("function '%s' not supported in RHS call of a pipe",
                 CHAR(PRINTNAME(fun)));
       
       PRESERVE_SV(ans = lcons(fun, lcons(lhs, args)));
    }
    else {
       PRESERVE_SV(ans = R_NilValue);
    }
    RELEASE_SV(lhs);
    RELEASE_SV(rhs);
    return ans;
}

And this is the code that makes the pipe working.

1
2
3
4
5
CO2g <- CO2 |>
    filter(conc > 200) |>
    mutate(conc2 = 2 * conc)
identical(CO2g, CO2b)
#R>  [1] TRUE

Though I don’t see any advantages of doing so, it is possible to combine the two pipes.

1
2
3
4
5
CO2h <- CO2 |>
    filter(conc > 200) %>%
    mutate(conc2 = 2 * conc)
identical(CO2h, CO2b)
#R>  [1] TRUE

Even if the code of xxpipe() has changed in R 4.3.2 and 4.3.3, we can still demonstrate how to hit the two errors captured in the code (though the error messages are slightly different):

1
2
CO2 |> head
#R>  Error: The pipe operator requires a function call as RHS (<text>:1:8)
1
2
CO2 |> `+`()
#R>  Error: function '+' not supported in RHS call of a pipe (<text>:1:8)

One of the main reason for the changes in the xxpipe() code is the recent introduction of the placeholder, _, in R 4.2.0 (see CHANGES IN R 4.2.0 in the NEWS file):

In a forward pipe |> expression it is now possible to use a named argument with the placeholder _ in the rhs call to specify where the lhs is to be inserted. The placeholder can only appear once on the rhs.

Placeholder that was recently updated (see CHANGES IN R 4.3.0 NEWS file):

As an experimental feature the placeholder _ can now also be used in the rhs of a forward pipe |> expression as the first argument in an extraction call, such as _$coef. More generally, it can be used as the head of a chain of extractions, such as _$coef[[2]].

So here is an example of what can be done in R>4.3.0.

1
2
3
4
CO2 |>
    _$conc |>
    sum()
#R>  [1] 36540

When coding in the console I have been using more and more frequently the operator ->, that really comes in handy. You may be aware that R has multiple assignment operators, for historical reasons as mentioned by (Chambers, 2016) (page 73 in a footnote):

The specific choice of <- dates back to the first version of S. We chose it in order to emphasize that the left and right operands are entirely different: a target on the left and a computed value on the right. Later versions of S and R accept the = operator, but for exposition the original choice remains clearer.

R users mainly use <- and sometimes = (frequently used by users that have experience with other programming languages), but -> feels especially appropriate when piping as it concludes well the data recipe.

1
2
3
4
5
CO2 |>
    _$conc |>
    sum() -> tot_conc
tot_conc
#R>  [1] 36540

Final remarks

In this post I wrote down a few thoughts about the native pipe and the magrittr pipes. I did not attempt to compare the two. The main differences between the two pipes %>% and |> are summarized in one of Hadley Wickham recent blog post. Of course there is only one native pipe and 5 magrittr pipes. But the most significative difference regards the placeholders, . coming with more features than _. That said, given the latest experimental feature of _ in R 4.3.0, this may soon no longer hold true.

The tidyverse community still recommends the use of the magrittr forward pipe, and even provides a styling guide for it: https://style.tidyverse.org/pipes.html. I now code almost exclusively with the native pipe (following the same guidelines). I apply it everywhere and I have a bunch of shortcuts for function I use the most frequently in the console, e.g. to do |> names() or |> class(). I prefer |> mostly because it is native, meaning that I don’t need to load any package to use it. There are two additional minor pros: it is only two characters and Julia – which I use frequently – uses the same symbol. I even use it in packages and as long I not using the placeholder, the package only requires a version >= R 4.3.1 which is not bad (the oldrel is currently R 4.2.3).

References

Chambers JM. 2016. Extending R. Boca Raton London New York: CRC Press.

Display information relative to the R session used to render this post.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
sessionInfo()
#R>  R version 4.3.2 (2023-10-31)
#R>  Platform: x86_64-pc-linux-gnu (64-bit)
#R>  Running under: Ubuntu 22.04.3 LTS
#R>  
#R>  Matrix products: default
#R>  BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#R>  LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#R>  
#R>  locale:
#R>   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
#R>   [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
#R>   [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#R>  
#R>  time zone: UTC
#R>  tzcode source: system (glibc)
#R>  
#R>  attached base packages:
#R>  [1] stats     graphics  grDevices utils     datasets  methods   base     
#R>  
#R>  other attached packages:
#R>  [1] dplyr_1.1.4       magrittr_2.0.3    inSilecoRef_0.1.1
#R>  
#R>  loaded via a namespace (and not attached):
#R>   [1] sass_0.4.8        utf8_1.2.4        generics_0.1.3    xml2_1.3.6        blogdown_1.19    
#R>   [6] stringi_1.8.3     httpcode_0.3.0    digest_0.6.34     evaluate_0.23     bookdown_0.37    
#R>  [11] fastmap_1.1.1     plyr_1.8.9        jsonlite_1.8.8    backports_1.4.1   crul_1.4.0       
#R>  [16] promises_1.2.1    fansi_1.0.6       jquerylib_0.1.4   bibtex_0.5.1      cli_3.6.2        
#R>  [21] shiny_1.8.0       rlang_1.1.3       ellipsis_0.3.2    cachem_1.0.8      yaml_2.3.8       
#R>  [26] tools_4.3.2       httpuv_1.6.14     DT_0.31           rcrossref_1.2.0   curl_5.2.0       
#R>  [31] vctrs_0.6.5       R6_2.5.1          mime_0.12         lifecycle_1.0.4   stringr_1.5.1    
#R>  [36] fs_1.6.3          htmlwidgets_1.6.4 miniUI_0.1.1.1    pkgconfig_2.0.3   bslib_0.6.1      
#R>  [41] pillar_1.9.0      later_1.3.2       glue_1.7.0        Rcpp_1.0.12       systemfonts_1.0.5
#R>  [46] xfun_0.42         tibble_3.2.1      tidyselect_1.2.0  knitr_1.45        xtable_1.8-4     
#R>  [51] htmltools_0.5.7   svglite_2.1.3     rmarkdown_2.25    compiler_4.3.2