inSilecoMisc 0.4.0 (part 1/2)

April 14, 2020

  R package development inSilecoMisc
  remotes inSilecoMisc

Kevin Cazelles

Steve Vissault

     

inSilecoMisc

inSilecoMisc is an R 📦 I have been maintaining for four years now. It was originally designed as a convenient way to share handy functions. Instead of stacking them in my .Rprofile, I created a package and made it available on GitHub. inSilecoMisc is therefore a set of miscellaneous functions, just like other R packages (e.g. Hmisc), but as I frequently change its API, it is meant to stay in the experimental stage! I like having inSilecoMisc on GitHub because it gives me a large degree of freedom to experiment with new ideas and new functions, while the code remains available and easy to install! Despite the “API instability”, the overall quality of the package is constantly improving (at least, I hope so 😄). I have even included some of the more mature functions in the research compendia of recent manuscripts [1].

Over the last month, I made substantial changes throughout the package. I rewrote some functions, improved the documentation of others and added new features. I also changed the way the pkgdown website is deployed by Travis. In what follows, I introduce several functions included in version 0.4.0. Note that as inSilecoMisc is on GitHub, an easy way to install it is provided by the remotes package:

install.packages("remotes") # if not already installed
remotes::install_github("inSileco/inSilecoMisc")

Once installed, let’s load it!

library("inSilecoMisc")
packageVersion("inSilecoMisc")
#R>  [1] '0.7.0.9000'

adjustStrings

As an ecologist, I frequently work with multiple datasets, and I often need to name/rename a bunch of files. One function I often use to do so is sprintf():

sprintf("file_%02d", 1:10)
#R>   [1] "file_01" "file_02" "file_03" "file_04" "file_05" "file_06" "file_07" "file_08" "file_09"
#R>  [10] "file_10"
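To make the renaming use case concrete, here is a minimal base R sketch. The directory and file names are made up for illustration, and everything happens in a temporary folder so that no real file is touched:

```r
# Work in a throwaway directory (hypothetical names, for illustration only)
tmp <- file.path(tempdir(), "rename_demo")
unlink(tmp, recursive = TRUE)            # start from a clean folder
dir.create(tmp)

# Create three empty files with unpadded names
old <- file.path(tmp, paste0("sample", 1:3, ".csv"))
invisible(file.create(old))

# Build zero-padded names with sprintf() and rename the files
new <- file.path(tmp, sprintf("file_%02d.csv", seq_along(old)))
invisible(file.rename(old, new))

renamed <- list.files(tmp)
renamed
```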

I originally designed adjustStrings() to pad strings with zeros in a similar fashion:

adjustStrings(1:10, n = 2)
#R>   [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"

But I quickly wanted a more flexible function, so I added several parameters! In the latest version of inSilecoMisc, adjustStrings() has 5 arguments:

  1. x: the input character vector to be adjusted;
  2. n: the number of characters to be added or used to determine the length of output strings;
  3. extra: the character(s) to be added (0 is the default value);
  4. align: the string alignment (“right”, “left” or “center”);
  5. add: whether n should be the constraint or a number of characters to be added (a constraint by default).

By default, adjustStrings() uses n as a constraint on the length of the output strings. So if I use n = 4 instead of n = 2 in the first example, all elements of the output vector will have 4 characters:

adjustStrings(1:10, n = 4)
#R>   [1] "0001" "0002" "0003" "0004" "0005" "0006" "0007" "0008" "0009" "0010"

And I can change the value of extra to specify the padding character(s) to be used:

adjustStrings(1:10, n = 4, extra = 1)
#R>   [1] "1111" "1112" "1113" "1114" "1115" "1116" "1117" "1118" "1119" "1110"
adjustStrings(1:10, n = 4, extra = "a")
#R>   [1] "aaa1" "aaa2" "aaa3" "aaa4" "aaa5" "aaa6" "aaa7" "aaa8" "aaa9" "aa10"
adjustStrings(1:10, n = 4, extra = "-")
#R>   [1] "---1" "---2" "---3" "---4" "---5" "---6" "---7" "---8" "---9" "--10"
adjustStrings(1:10, n = 4, extra = "ab")
#R>   [1] "aba1" "aba2" "aba3" "aba4" "aba5" "aba6" "aba7" "aba8" "aba9" "ab10"

With align, I can choose where extra characters are added:

adjustStrings(1:10, n = 4, extra = "-", align = "right") # default
#R>   [1] "---1" "---2" "---3" "---4" "---5" "---6" "---7" "---8" "---9" "--10"
adjustStrings(1:10, n = 4, extra = "-", align = "left")
#R>   [1] "1---" "2---" "3---" "4---" "5---" "6---" "7---" "8---" "9---" "10--"
adjustStrings(1:10, n = 4, extra = "-", align = "center")
#R>   [1] "--1-" "--2-" "--3-" "--4-" "--5-" "--6-" "--7-" "--8-" "--9-" "-10-"

Also, if add = TRUE, then exactly n extra characters are added:

adjustStrings(1:10, n = 4, extra = "-", align = "right", add = TRUE)
#R>   [1] "----1"  "----2"  "----3"  "----4"  "----5"  "----6"  "----7"  "----8"  "----9"  "----10"
adjustStrings(1:10, n = 4, extra = "-", align = "left", add = TRUE)
#R>   [1] "1----"  "2----"  "3----"  "4----"  "5----"  "6----"  "7----"  "8----"  "9----"  "10----"
adjustStrings(1:10, n = 4, extra = "-", align = "center", add = TRUE)
#R>   [1] "--1--"  "--2--"  "--3--"  "--4--"  "--5--"  "--6--"  "--7--"  "--8--"  "--9--"  "--10--"
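For comparison, when extra is a single character, the add = TRUE behavior can be approximated in base R with strrep() and paste0(). This is only a sketch of the idea, not how adjustStrings() is implemented:

```r
pad <- strrep("-", 4)                 # four extra characters
left_padded  <- paste0(pad, 1:10)     # extra characters before the string
right_padded <- paste0(1:10, pad)     # extra characters after the string

half <- strrep("-", 2)                # split the four characters evenly
centered <- paste0(half, 1:10, half)

left_padded[c(1, 10)]
right_padded[c(1, 10)]
centered[c(1, 10)]
```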

Note that in this last case, the lengths of the output character strings may differ! One last remark about how adjustStrings() works when add = FALSE: for a given string, there are three scenarios:

  1. the string to be adjusted has more characters than n; in this case, the string is simply cut off:
adjustStrings("ABCD", n = 2, extra = "efgh")
#R>  [1] "AB"
  2. the string has fewer characters than n, but the number of characters needed for the adjustment is smaller than the number of characters in extra; in this case, extra is cut off:
adjustStrings("ABCD", n = 6, extra = "efgh")
#R>  [1] "efABCD"
  3. finally, when extra is too short to adjust the string according to n, extra is repeated:
adjustStrings("ABCD", n = 14, extra = "efgh")
#R>  [1] "efghefghefABCD"
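
The three scenarios can be mimicked with a few lines of base R. The helper below, pad_to_width(), is a hypothetical name used for illustration; it only covers the default right alignment and is not the code used in inSilecoMisc:

```r
# Hypothetical helper (not part of inSilecoMisc): pad on the left so that
# every output string has exactly n characters
pad_to_width <- function(x, n, extra = "0") {
  x <- as.character(x)
  # individual characters of extra, recycled as needed below
  chars <- strsplit(as.character(extra), "")[[1]]
  vapply(x, function(s) {
    k <- nchar(s)
    if (k >= n) {
      # scenario 1: the string is too long, keep its first n characters
      return(substr(s, 1, n))
    }
    # scenarios 2 and 3: rep_len() truncates or repeats the characters of
    # extra so that exactly n - k of them are used
    paste0(paste(rep_len(chars, n - k), collapse = ""), s)
  }, character(1), USE.NAMES = FALSE)
}

pad_to_width("ABCD", n = 2, extra = "efgh")
pad_to_width("ABCD", n = 6, extra = "efgh")
pad_to_width("ABCD", n = 14, extra = "efgh")
```

Using rep_len() handles the last two scenarios at once, since it both shortens and recycles its input depending on the requested length.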

These combinations give flexibility to the user. One application of adjustStrings() comes up when running a long script. In such a situation, it is convenient to print recognizable section headers that ease navigation through the long output!

report <- function(title, symbol = "#") {
  cat(adjustStrings(title, 70, symbol, align = "center"), "\n")
}
report(" part 1 ")
report(" Solving ODE ", "~")
report(" Last but not least! ", "=")
#R>  ############################### part 1 ############################### 
#R>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Solving ODE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
#R>  ========================= Last but not least! ========================

keepWords

keepWords() allows the user to select words based on their position in character strings. Before using it, I would like to introduce loremIpsum(), which generates placeholder text. Note that several functions that generate placeholder text are available elsewhere. For instance, stringi has stri_rand_lipsum() and UsingR includes lorem(). By default, loremIpsum() returns two paragraphs of text:

cat(loremIpsum())
#R>  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
#R>    tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
#R>    quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
#R>    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
#R>    cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
#R>    proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
#R>  
#R>    Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius,
#R>    turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis
#R>    sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus
#R>    et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut
#R>    ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt
#R>    sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum.
#R>    Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget,
#R>    consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl
#R>    adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque
#R>    nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis,
#R>    laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu,
#R>    feugiat in, orci. In hac habitasse platea dictumst.
#R>  

but it also allows the user to set the number of words of the character string returned:

loremIpsum(18)
#R>  [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod\n tempor incididunt ut labore et dolore magna"

Note the escape characters (\n)! Now let’s focus on keepWords(). Assuming I want to extract the second word of both loremIpsum(18) and “be or not to be”, all I have to do is:

keepWords(c(loremIpsum(18), "be or not to be"), 2)
#R>  [1] "ipsum" "or"

and if I am interested in extracting specific word positions, I can pass a numerical vector:

keepWords(c(loremIpsum(18), "be or not to be"), c(1:4, 12:16))
#R>  [1] "Lorem ipsum dolor sit tempor incididunt ut labore et"
#R>  [2] "be or not to NA NA NA NA NA"

As you may have noted, NAs are added when the selected position does not exist. This behavior can be useful but also annoying! Fortunately, the na.rm argument allows the user to remove NAs:

keepWords(c(loremIpsum(18), "be or not to be"), c(1:4, 12:16), na.rm = TRUE)
#R>  [1] "Lorem ipsum dolor sit tempor incididunt ut labore et"
#R>  [2] "be or not to"
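
To give an idea of what happens behind the scenes, here is a rough base R sketch of position-based word selection. pick_words() is a hypothetical helper written for this illustration (it always drops missing positions, as with na.rm = TRUE), and the regular expression used to split words is an assumption, not the package's default:

```r
# Hypothetical helper (not the package's implementation)
pick_words <- function(x, pos, collapse = " ") {
  # split on runs of non-alphanumeric characters, which also removes
  # punctuation and line breaks
  words <- strsplit(x, "[^[:alnum:]]+")
  vapply(words, function(w) {
    w <- w[nzchar(w)]                   # drop empty strings, if any
    keep <- pos[pos <= length(w)]       # ignore positions that do not exist
    paste(w[keep], collapse = collapse)
  }, character(1))
}

pick_words(c("Lorem ipsum dolor sit amet, consectetur", "be or not to be"), 2)
pick_words("be or not to be", c(1:4, 12:16))
```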

Also, collapse allows the user to change the character used to separate words:

keepWords(loremIpsum(18), c(1:6, 14:18), collapse = "-")
#R>  [1] "Lorem-ipsum-dolor-sit-amet-consectetur-ut-labore-et-dolore-magna"

and if collapse = NULL, then a list is returned that includes one vector of selected words per input string:

keepWords(c(loremIpsum(18), "be or not to be"), c(2:3), collapse = NULL)
#R>  [[1]]
#R>  [1] "ipsum" "dolor"
#R>  
#R>  [[2]]
#R>  [1] "or"  "not"

Note that all punctuation characters are removed. This can be changed by tweaking the split_words argument!

There are two other functions that work similarly: keepLetters() and keepInitials(). The former selects letters instead of words.

keepLetters(loremIpsum(18), c(1:6, 14:18))
#R>  [1] "Loremiorsit"
keepLetters(loremIpsum(18), c(1:6, 14:18), collapse = "-")
#R>  [1] "L-o-r-e-m-i-o-r-s-i-t"
keepLetters(loremIpsum(18), c(1:6, 14:18), collapse = NULL)
#R>  [[1]]
#R>   [1] "L" "o" "r" "e" "m" "i" "o" "r" "s" "i" "t"

while the latter extracts initials:

keepInitials("National Basketball Association")
#R>  [1] "NBA"
keepInitials("National Basketball Association", "-")
#R>  [1] "N"

Note that if the input character vector has a mixture of lower and upper case, so will the output:

keepInitials("National basketball association")
#R>  [1] "Nba"

If this annoys you, the base functions toupper() and tolower() come in handy!

keepInitials(tolower("National basketball association"))
#R>  [1] "nba"
keepInitials(toupper("National basketball association"))
#R>  [1] "NBA"
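
For readers who prefer to stay in base R, extracting initials can be sketched as follows. initials() is a hypothetical helper that splits on whitespace only, so it is a simplified stand-in rather than a copy of keepInitials():

```r
# Hypothetical helper: first letter of every whitespace-separated word
initials <- function(x) {
  vapply(strsplit(x, "[[:space:]]+"), function(w) {
    w <- w[nzchar(w)]                       # drop empty strings, if any
    paste(substr(w, 1, 1), collapse = "")   # glue the first letters together
  }, character(1))
}

initials("National Basketball Association")
toupper(initials("National basketball association"))
```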

Concluding remarks

In this post, I focused on inSilecoMisc’s functions that manipulate character strings (all of them call str_split() at some point). If you are interested in learning more about string manipulation, you should check out the ebook “Handling Strings with R” by Gaston Sanchez. There are also various blog posts on this topic (for instance on http://uc-r.github.io or on data-flair) and, obviously, the documentation of packages that do such manipulations (e.g. stringi).

This concludes the first part of this post. Additional features of the inSilecoMisc package will be introduced in the second part 🎉!

Display information relative to the R session used to render this post.
sessionInfo()
#R>  R version 4.4.2 (2024-10-31)
#R>  Platform: x86_64-pc-linux-gnu
#R>  Running under: Ubuntu 22.04.5 LTS
#R>  
#R>  Matrix products: default
#R>  BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#R>  LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#R>  
#R>  locale:
#R>   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
#R>   [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
#R>   [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#R>  
#R>  time zone: UTC
#R>  tzcode source: system (glibc)
#R>  
#R>  attached base packages:
#R>  [1] stats     graphics  grDevices utils     datasets  methods   base     
#R>  
#R>  other attached packages:
#R>  [1] inSilecoMisc_0.7.0.9000 inSilecoRef_0.1.1      
#R>  
#R>  loaded via a namespace (and not attached):
#R>   [1] sass_0.4.9        generics_0.1.3    xml2_1.3.6        blogdown_1.19     stringi_1.8.4    
#R>   [6] httpcode_0.3.0    digest_0.6.37     magrittr_2.0.3    evaluate_1.0.1    bookdown_0.41    
#R>  [11] fastmap_1.2.0     plyr_1.8.9        jsonlite_1.8.9    backports_1.5.0   crul_1.5.0       
#R>  [16] promises_1.3.2    bibtex_0.5.1      jquerylib_0.1.4   cli_3.6.3         shiny_1.10.0     
#R>  [21] crayon_1.5.3      rlang_1.1.4       cachem_1.1.0      yaml_2.3.10       tools_4.4.2      
#R>  [26] dplyr_1.1.4       httpuv_1.6.15     DT_0.33           rcrossref_1.2.0   curl_6.0.1       
#R>  [31] vctrs_0.6.5       R6_2.5.1          mime_0.12         lifecycle_1.0.4   stringr_1.5.1    
#R>  [36] fs_1.6.5          htmlwidgets_1.6.4 miniUI_0.1.1.1    pkgconfig_2.0.3   pillar_1.10.0    
#R>  [41] bslib_0.8.0       later_1.4.1       glue_1.8.0        Rcpp_1.0.13-1     xfun_0.49        
#R>  [46] tibble_3.2.1      tidyselect_1.2.1  knitr_1.49        xtable_1.8-4      htmltools_0.5.8.1
#R>  [51] rmarkdown_2.29    compiler_4.4.2

  1. see the following repositories: HomogenFishOntario and coocc_not_inter