Creating a data frame is fairly simple, but creating a large empty data frame whose columns have different classes takes several lines of code. A few days ago, I decided to write a function to simplify this operation, and I came to realize that such a function would also be very useful for row binding data frames whose column names only partially match. How so? This post is meant to answer that question!
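To make "several lines of code" concrete, here is a minimal base R sketch of an empty (all-NA) data frame with typed columns. The column names (id, name, score) and the number of rows are invented for this illustration.

```r
# Base R sketch: an all-NA data frame with one typed column per class.
# Column names and n are hypothetical, chosen only for this example.
n <- 3
df_base <- data.frame(
  id = rep(NA_integer_, n),      # integer column
  name = rep(NA_character_, n),  # character column
  score = rep(NA_real_, n),      # numeric (double) column
  stringsAsFactors = FALSE
)
sapply(df_base, class)
```

Every new column needs its own typed NA, which quickly becomes tedious for wide data frames.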
inSilecoMisc
First of all, the functions I am using in this post are available in inSilecoMisc, an R package in which we gathered the miscellaneous functions we wrote and deem worth sharing on GitHub. So the first step to reproduce the examples below is to install inSilecoMisc, which is straightforward with the [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package:
library(devtools)
install_github("inSileco/inSilecoMisc")
Then, load it:
library(inSilecoMisc)
In this post, I’ll exemplify how to use dfTemplate() and dfTemplateMatch(), but if you are interested in other functions in the package, check out the tour vignette.
Generating empty data frames efficiently
Let’s start with dfTemplate(), which creates a data frame with a specific number of columns.
df1 <- dfTemplate(cols = 2)
df1
#R> Var1 Var2
#R> 1 NA NA
class(df1)
#R> [1] "data.frame"
By default, the data frame created has only 1 row and the columns are filled with NA. This can readily be changed using the arguments nrows and fill.
df2 <- dfTemplate(2, nrows = 4, fill = 0)
df2
#R> Var1 Var2
#R> 1 0 0
#R> 2 0 0
#R> 3 0 0
#R> 4 0 0
df3 <- dfTemplate(cols = 2, nrows = 3, fill = "")
df3
#R> Var1 Var2
#R> 1
#R> 2
#R> 3
Column classes are determined by fill:
class(df1[, 1])
class(df2[, 1])
class(df3[, 1])
#R> [1] "logical"
#R> [1] "numeric"
#R> [1] "character"
And col_classes is used to change these classes:
df4 <- dfTemplate(cols = 2, col_classes = "character")
class(df4[, 1])
class(df4[, 2])
#R> [1] "character"
#R> [1] "character"
The arguments fill and col_classes can be vectors that specify the content and class of every column:
df5 <- dfTemplate(2, 5, col_classes = c("character", "numeric"), fill = c("", 5))
df5
class(df5[, 1])
class(df5[, 2])
#R> Var1 Var2
#R> 1 5
#R> 2 5
#R> 3 5
#R> 4 5
#R> 5 5
#R> [1] "character"
#R> [1] "numeric"
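A side note on the call above: in base R, c("", 5) is a character vector, because an atomic vector holds a single type, so the 5 is stored as "5". I assume (without having checked the internals of dfTemplate()) that col_classes is what turns the second column back into a number. The coercion itself is plain base R:

```r
# An atomic vector has one type: mixing "" and 5 yields a character vector.
fill_values <- c("", 5)
class(fill_values)
#R> [1] "character"
fill_values[2]
#R> [1] "5"

# Converting the second element back gives a number again.
value <- as.numeric(fill_values[2])
class(value)
#R> [1] "numeric"
```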
Another useful feature of dfTemplate() is that the column names of the data frame to be created can be passed as the first argument (cols):
df5 <- dfTemplate(c("category", "value"))
So, now you are able to create custom data frames with a set of column names!
nms <- LETTERS[1:10]
df6 <- dfTemplate(nms, 10, fill = tolower(nms))
df6
#R> A B C D E F G H I J
#R> 1 a b c d e f g h i j
#R> 2 a b c d e f g h i j
#R> 3 a b c d e f g h i j
#R> 4 a b c d e f g h i j
#R> 5 a b c d e f g h i j
#R> 6 a b c d e f g h i j
#R> 7 a b c d e f g h i j
#R> 8 a b c d e f g h i j
#R> 9 a b c d e f g h i j
#R> 10 a b c d e f g h i j
How to flexibly rbind a list of data frames
Sometimes we need to rbind data frames that do not have all the columns the final data frame must contain. In such cases, we first need to append the missing columns because otherwise the default rbind function won’t work. Another solution is to use a package with a function that does so. For instance, rbind.fill() from the plyr package performs such a flexible rbind. Also, the data.table package includes an rbind() method for data.table objects that handles this situation (see this answer). In this last section, I would like to show how to rbind data frames flexibly with dfTemplateMatch(), which is written in base R.
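To illustrate the problem without any package, here is a small base R sketch. The data frames a and b and the helper pad_columns() are made up for this example; pad_columns() is not the implementation of dfTemplateMatch(), only a rough idea of what "appending the missing columns" means.

```r
# Two toy data frames whose column names only partially match.
a <- data.frame(A = c("a", "a"), B = c("b", "b"), stringsAsFactors = FALSE)
b <- data.frame(B = "b", C = "c", stringsAsFactors = FALSE)

# The default rbind() stops with an error because the names differ.
err <- tryCatch(rbind(a, b), error = function(e) e)
inherits(err, "error")

# Hypothetical helper: add every missing column (filled with NA),
# then return the columns in the order given by nms.
pad_columns <- function(x, nms) {
  for (nm in setdiff(nms, names(x))) x[[nm]] <- NA
  x[, nms, drop = FALSE]
}

all_names <- union(names(a), names(b))
out <- rbind(pad_columns(a, all_names), pad_columns(b, all_names))
out
```

Once both data frames share the same set of column names, rbind() works as usual.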
Let me first introduce dfTemplateMatch(). This function takes a data frame as its first argument (x), and its second argument (y) can be another data frame or a vector of character strings. Based on x and y, dfTemplateMatch() creates a data frame that has the same number of rows as x and adds columns for all names found in y that are not found in x. Before calling dfTemplateMatch(), I create two data frames:
df7 <- df6[1:5, 1:4]
df7
#R> A B C D
#R> 1 a b c d
#R> 2 a b c d
#R> 3 a b c d
#R> 4 a b c d
#R> 5 a b c d
df8 <- df6[4:6]
df8
#R> D E F
#R> 1 d e f
#R> 2 d e f
#R> 3 d e f
#R> 4 d e f
#R> 5 d e f
#R> 6 d e f
#R> 7 d e f
#R> 8 d e f
#R> 9 d e f
#R> 10 d e f
Now I use dfTemplateMatch() to create a third data frame based on the two others:
dfTemplateMatch(df7, df8)
#R> A B C D E F
#R> 1 a b c d NA NA
#R> 2 a b c d NA NA
#R> 3 a b c d NA NA
#R> 4 a b c d NA NA
#R> 5 a b c d NA NA
As expected, the output has 5 rows, like df7, and the columns that are in df8 but not in df7 have been appended to df7. The arguments fill and col_classes can be used to customize the added columns.
dfTemplateMatch(df7, df8, fill = 1, col_classes = "numeric")
#R> A B C D E F
#R> 1 a b c d 1 1
#R> 2 a b c d 1 1
#R> 3 a b c d 1 1
#R> 4 a b c d 1 1
#R> 5 a b c d 1 1
There is also an argument, yonly, that allows the user to keep only the names of y (when yonly = TRUE).
dfTemplateMatch(df7, df8, yonly = TRUE, fill = 1, col_classes = "numeric")
#R> D E F
#R> 1 d 1 1
#R> 2 d 1 1
#R> 3 d 1 1
#R> 4 d 1 1
#R> 5 d 1 1
Now let me show you how to rbind() a specific subset of columns from a list of data frames that may or may not have these columns. Let me start by creating a list of data frames.
lsdf <- apply(
  replicate(5, sample(nms, 5)),
  2,
  function(x) dfTemplate(x, nrows = 5, fill = tolower(x))
)
lsdf
#R> [[1]]
#R> H D B C F
#R> 1 h d b c f
#R> 2 h d b c f
#R> 3 h d b c f
#R> 4 h d b c f
#R> 5 h d b c f
#R>
#R> [[2]]
#R> I F A E B
#R> 1 i f a e b
#R> 2 i f a e b
#R> 3 i f a e b
#R> 4 i f a e b
#R> 5 i f a e b
#R>
#R> [[3]]
#R> A F G B H
#R> 1 a f g b h
#R> 2 a f g b h
#R> 3 a f g b h
#R> 4 a f g b h
#R> 5 a f g b h
#R>
#R> [[4]]
#R> D I B H J
#R> 1 d i b h j
#R> 2 d i b h j
#R> 3 d i b h j
#R> 4 d i b h j
#R> 5 d i b h j
#R>
#R> [[5]]
#R> C F H G E
#R> 1 c f h g e
#R> 2 c f h g e
#R> 3 c f h g e
#R> 4 c f h g e
#R> 5 c f h g e
So the goal here is to create a data frame that contains only the first five columns, i.e. A, B, C, D and E. The remaining columns must be discarded and, when a selected column is missing, it must be added (filled with NA). To do so, I simply need to call dfTemplateMatch():
lsdf2 <- lapply(lsdf, dfTemplateMatch, LETTERS[1:5], yonly = TRUE)
lsdf2
#R> [[1]]
#R> D B C A E
#R> 1 d b c NA NA
#R> 2 d b c NA NA
#R> 3 d b c NA NA
#R> 4 d b c NA NA
#R> 5 d b c NA NA
#R>
#R> [[2]]
#R> A E B C D
#R> 1 a e b NA NA
#R> 2 a e b NA NA
#R> 3 a e b NA NA
#R> 4 a e b NA NA
#R> 5 a e b NA NA
#R>
#R> [[3]]
#R> A B C D E
#R> 1 a b NA NA NA
#R> 2 a b NA NA NA
#R> 3 a b NA NA NA
#R> 4 a b NA NA NA
#R> 5 a b NA NA NA
#R>
#R> [[4]]
#R> D B A C E
#R> 1 d b NA NA NA
#R> 2 d b NA NA NA
#R> 3 d b NA NA NA
#R> 4 d b NA NA NA
#R> 5 d b NA NA NA
#R>
#R> [[5]]
#R> C E A B D
#R> 1 c e NA NA NA
#R> 2 c e NA NA NA
#R> 3 c e NA NA NA
#R> 4 c e NA NA NA
#R> 5 c e NA NA NA
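The lapply() call above works because lapply() forwards any extra arguments, positional or named, to the function it applies. Here is a tiny base R sketch of that mechanism; scale_by() is a made-up helper used only for this illustration.

```r
# Hypothetical helper with one positional and one named extra argument.
scale_by <- function(x, factor, offset = 0) x * factor + offset

values <- list(1:3, 4:6)

# Extra arguments after the function are passed on to every call...
res1 <- lapply(values, scale_by, 10, offset = 1)
# ...which is equivalent to writing the anonymous function explicitly.
res2 <- lapply(values, function(x) scale_by(x, 10, offset = 1))

identical(res1, res2)
#R> [1] TRUE
```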
And now I can seamlessly rbind() the list lsdf2!
do.call(rbind, lsdf2)
#R> D B C A E
#R> 1 d b c <NA> <NA>
#R> 2 d b c <NA> <NA>
#R> 3 d b c <NA> <NA>
#R> 4 d b c <NA> <NA>
#R> 5 d b c <NA> <NA>
#R> 6 <NA> b <NA> a e
#R> 7 <NA> b <NA> a e
#R> 8 <NA> b <NA> a e
#R> 9 <NA> b <NA> a e
#R> 10 <NA> b <NA> a e
#R> 11 <NA> b <NA> a <NA>
#R> 12 <NA> b <NA> a <NA>
#R> 13 <NA> b <NA> a <NA>
#R> 14 <NA> b <NA> a <NA>
#R> 15 <NA> b <NA> a <NA>
#R> 16 d b <NA> <NA> <NA>
#R> 17 d b <NA> <NA> <NA>
#R> 18 d b <NA> <NA> <NA>
#R> 19 d b <NA> <NA> <NA>
#R> 20 d b <NA> <NA> <NA>
#R> 21 <NA> <NA> c <NA> e
#R> 22 <NA> <NA> c <NA> e
#R> 23 <NA> <NA> c <NA> e
#R> 24 <NA> <NA> c <NA> e
#R> 25 <NA> <NA> c <NA> e
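Note that the columns of the result come out as D, B, C, A, E: rbind() on data frames matches columns by name and keeps the column order of the first data frame. Here is a small base R sketch with two toy data frames (x and y are made up for this illustration):

```r
# Same column names, different column order.
x <- data.frame(D = "d", B = "b", A = NA_character_, stringsAsFactors = FALSE)
y <- data.frame(A = "a", B = "b", D = NA_character_, stringsAsFactors = FALSE)

# Columns are matched by name; the order follows the first data frame.
combined <- rbind(x, y)
names(combined)
#R> [1] "D" "B" "A"

# Reorder the columns explicitly if a given order is needed.
combined <- combined[, c("A", "B", "D")]
combined
```

Applied to the result above, do.call(rbind, lsdf2)[, LETTERS[1:5]] should return the columns in alphabetical order.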
Voilà! This is what I call a flexible rbind! I hope you’ll find this helpful! 💥
Information about the R session used to render this post:
sessionInfo()
#R> R version 4.4.1 (2024-06-14)
#R> Platform: x86_64-pc-linux-gnu
#R> Running under: Ubuntu 22.04.5 LTS
#R>
#R> Matrix products: default
#R> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#R> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#R>
#R> locale:
#R> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
#R> [5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
#R> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#R>
#R> time zone: UTC
#R> tzcode source: system (glibc)
#R>
#R> attached base packages:
#R> [1] stats graphics grDevices utils datasets methods base
#R>
#R> other attached packages:
#R> [1] inSilecoMisc_0.7.0.9000 inSilecoRef_0.1.1
#R>
#R> loaded via a namespace (and not attached):
#R> [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 blogdown_1.19
#R> [6] stringi_1.8.4 httpcode_0.3.0 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.0
#R> [11] bookdown_0.40 fastmap_1.2.0 plyr_1.8.9 jsonlite_1.8.9 backports_1.5.0
#R> [16] crul_1.5.0 promises_1.3.0 fansi_1.0.6 jquerylib_0.1.4 bibtex_0.5.1
#R> [21] cli_3.6.3 shiny_1.9.1 crayon_1.5.3 rlang_1.1.4 cachem_1.1.0
#R> [26] yaml_2.3.10 tools_4.4.1 dplyr_1.1.4 httpuv_1.6.15 DT_0.33
#R> [31] rcrossref_1.2.0 curl_5.2.3 vctrs_0.6.5 R6_2.5.1 mime_0.12
#R> [36] lifecycle_1.0.4 stringr_1.5.1 fs_1.6.4 htmlwidgets_1.6.4 miniUI_0.1.1.1
#R> [41] pkgconfig_2.0.3 bslib_0.8.0 pillar_1.9.0 later_1.3.2 glue_1.8.0
#R> [46] Rcpp_1.0.13 xfun_0.48 tibble_3.2.1 tidyselect_1.2.1 knitr_1.48
#R> [51] xtable_1.8-4 htmltools_0.5.8.1 rmarkdown_2.28 compiler_4.4.1