Introduction
This article shows how to get started with the command
package. It focuses on the main function in the package,
cmd_assign().
cmd_assign() is used to specify the inputs and outputs
for a step in a data analysis workflow. It can be called in two
ways:
- interactively, or
- inside a script that is run from the command line.
The second case is the important one. However, the first is simpler, so we look at that first.
Calling cmd_assign() interactively
Running
cmd_assign(obj1 = "orange",
           obj2 = 1)
has the same effect as running
obj1 <- "orange"
obj2 <- 1
Both snippets add two objects to the global environment with names
"obj1" and "obj2" and with values
"orange" and 1.
Objects added to the global environment by cmd_assign()
can have the following classes:
- character, eg "Hello world"
- integer, eg 3L
- numeric, eg 3.141593
- date, eg as.Date("2015-11-03")
- date-time using POSIXct, eg as.POSIXct("2015-11-03 14:23:03")
- date-time using POSIXlt, eg as.POSIXlt("2015-11-03 14:23:03")
- NULL
The objects must have length 1, except for NULL, which
has length 0.
The typical reason for using cmd_assign() interactively
is to develop code that will eventually sit in a script that is run from
the command line.
The most common way to run R scripts from the command line is with
Rscript, so we take a detour to look at that.
Rscript
Rscript is an application for running R scripts from the
command line. For more detail on the command line, see, for instance,
episodes 1–3 of The Unix
Shell. For an introduction to Rscript, see, for
instance, Command-Line
Programs.
The simplest usage of Rscript is a command such as
Rscript myfile.R
This launches a new R session, runs whatever code is in
myfile.R, and ends the session. When the session ends, all
objects created during the session disappear, unless they were saved to
disk.
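Because nothing survives the end of the session, a script run with Rscript normally finishes by writing its results to a file. Here is a minimal sketch using the base R functions saveRDS() and readRDS(). The object and the temporary file path are purely illustrative.

```r
# Illustrative result; in a real script this would come from the analysis
result <- list(n_iteration = 10, estimate = 0.5)

# Write the object to disk so that it outlives the R session
path <- tempfile(fileext = ".rds")
saveRDS(result, file = path)

# A later session, or a later script, can read the object back in
restored <- readRDS(path)
identical(result, restored)
```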
Rscript accepts additional arguments, which are placed
after the name of the script being run, as in
Rscript myfile.R --n_iteration=10 output.rds
In this example, --n_iteration=10 is a named argument,
and output.rds is an unnamed argument. Named arguments have
the format
--<name>=<value>
Note that there must not be a space between the name, the
= sign, and the value. The following command would
therefore be invalid:
Rscript myfile.R --n_iteration = 10 output.rds
Named arguments can also have a single dash and single letter, as in
Rscript myfile.R -n=10 output.rds
When Rscript is called with additional arguments,
Rscript supplies the names and values for these arguments
to the R session. The names and values can be accessed from within the
session using base R function commandArgs().
Working with commandArgs() is, however, tricky.
cmd_assign() is an alternative to
commandArgs(), designed specifically for data analysis
workflows.
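To see why working with commandArgs() directly is tricky, here is a rough base R sketch of the parsing that a script would have to do by hand. The helper parse_args() is hypothetical and is not how cmd_assign() is implemented. Its input stands in for the character vector that commandArgs(trailingOnly = TRUE) would return.

```r
# Hypothetical helper: split command line arguments into named and unnamed.
# 'args' stands in for the result of commandArgs(trailingOnly = TRUE).
parse_args <- function(args) {
  # Named arguments start with one or two dashes and contain '='
  is_named <- grepl("^--?[.A-Za-z][A-Za-z0-9_.]*=", args)
  # Drop the leading dashes, then split at the first '=' only
  stripped <- sub("^--?", "", args[is_named])
  nms <- sub("=.*$", "", stripped)
  vals <- sub("^[^=]*=", "", stripped)
  list(named = setNames(as.list(vals), nms),
       unnamed = args[!is_named])
}

parsed <- parse_args(c("--n_iteration=10", "output.rds"))
parsed$named$n_iteration   # still a character string, not a number
parsed$unnamed
```

Even this sketch leaves the values as strings and does no checking, which is the work that cmd_assign() takes over.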
One alternative to Rscript is the package littler.
littler does not, however, run on Windows, and requires
some command line magic before it will run on macOS.
cmd_assign() works as normal with littler.
Calling cmd_assign() inside a script
We will work through an example where we call
cmd_assign() inside a script that is being run from the
command line.
Our current working directory contains two files:
.
├── airmiles.csv
└── fig_smoothed.R
The file airmiles.csv holds data on annual passenger
numbers:
year,passengers
1937,412
1938,480
1939,683
1940,1052
1941,1385
1942,1418
1943,1634
1944,2178
1945,3362
1946,5948
1947,6109
1948,5981
1949,6753
1950,8003
The file fig_smoothed.R contains the following code:
## Specify packages, inputs, and outputs ------------------
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(command)
})
cmd_assign(.airmiles = "data/airmiles.csv",
           n_knot = 10,
           .out = "fig_smoothed.png")
## Read in data -------------------------------------------
airmiles <- read.csv(.airmiles)
## Analyse ------------------------------------------------
smoothed <- airmiles |>
  mutate(smoothed = fitted(smooth.spline(x = passengers,
                                         nknots = n_knot)))
p <- ggplot(smoothed, aes(x = year)) +
  geom_line(aes(y = smoothed)) +
  geom_point(aes(y = passengers)) +
  ggtitle(paste("Smoothed using", n_knot, "knots"))
## Save results -------------------------------------------
png(file = .out, width = 200, height = 200)
plot(p)
dev.off()
This code
- reads in data from a file whose name is specified by .airmiles,
- smooths the passenger data, with the amount of smoothing controlled by n_knot,
- plots the smoothed and unsmoothed values, and
- writes the plot out to a file whose name is specified by .out.
We use Rscript to run the code from the command
line.
✔ Assigned object `.airmiles` with value "airmiles.csv" and class "character".
✔ Assigned object `n_knot` with value 8 and class "numeric".
✔ Assigned object `.out` with value "fig_smoothed_8.png" and class "character".
null device
1
Rscript started a new R session, ran the code using the
command line arguments we passed in, and ended the session. The messages
about assigning objects were generated by cmd_assign().
The call to cmd_assign() in fig_smoothed.R
created objects .airmiles, n_knot, and
.out inside the R session. The values for these objects
were taken from the command line, and not from the original
call to cmd_assign(). Hence, n_knot equaled
8 rather than 10, and .out
equaled "fig_smoothed_8.png" rather than
"fig_smoothed.png".
Our working directory now looks like this:
.
├── airmiles.csv
├── fig_smoothed.R
└── fig_smoothed_8.png
We have a new file called "fig_smoothed_8.png".
What cmd_assign() does when called in a script
When cmd_assign() is called in a script that is being
run from the command line, it does three things:
- Match command line arguments against arguments in cmd_assign().
- Coerce each value supplied at the command line to have the same class as the corresponding value in the call to cmd_assign().
- Assign the resulting values to the global environment.
Suppose we have a script called model.R containing the
following call to cmd_assign():
cmd_assign(.data = "data/dataset.csv",
           n_iter = 5,
           use_log = TRUE,
           .out = "out/model.rds")
We run model.R from the command line using
Rscript model.R \
  data/dataset2.csv \
  out/model2.rds \
  --n_iter=10 \
  --use_log=TRUE
(The backslashes \ here mean “treat the next line of
text as a continuation of the current line”.)
When cmd_assign() is first called, it holds the
following values:
| Argument | Value from call | Value from command line |
|---|---|---|
| .data | "data/dataset.csv" | <none> |
| n_iter | 5 | <none> |
| use_log | TRUE | <none> |
| .out | "out/model.rds" | <none> |
In the match step, cmd_assign() finds
the values that were passed in from the command line. First it matches
named arguments from the command line with named arguments from the
call, yielding
| Argument | Value from call | Value from command line |
|---|---|---|
| .data | "data/dataset.csv" | <none> |
| n_iter | 5 | "10" |
| use_log | TRUE | "TRUE" |
| .out | "out/model.rds" | <none> |
Then it matches unnamed arguments from the command line with unused
arguments from the call. The matching of unnamed arguments is based on
the order in which the unnamed arguments were supplied to the command
line. In our example, the value "data/dataset2.csv" was
supplied before "out/model2.rds", so
"data/dataset2.csv" comes before
"out/model2.rds" in the matched results.
| Argument | Value from call | Value from command line |
|---|---|---|
| .data | "data/dataset.csv" | "data/dataset2.csv" |
| n_iter | 5 | "10" |
| use_log | TRUE | "TRUE" |
| .out | "out/model.rds" | "out/model2.rds" |
The values supplied at the command line all start out as text
strings. In the coerce step, cmd_assign()
converts these values to have the same class as the corresponding values in
the call to cmd_assign(). In our example, this means coercing
"10" to numeric and coercing
"TRUE" to logical.
| Argument | Value from call | Value from command line |
|---|---|---|
| .data | "data/dataset.csv" | "data/dataset2.csv" |
| n_iter | 5 | 10 |
| use_log | TRUE | TRUE |
| .out | "out/model.rds" | "out/model2.rds" |
Finally, in the assign step,
cmd_assign() puts the values in the global environment.
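The three steps can be sketched together in base R. The functions match_args() and coerce_to() below are hypothetical illustrations, not the package's implementation. The coercion handles only a few classes, and a fresh environment stands in for the global environment.

```r
# Match: named values by name, then unnamed values to unused arguments, in order
match_args <- function(call_args, named, unnamed) {
  stopifnot(all(names(named) %in% names(call_args)))
  matched <- setNames(vector("list", length(call_args)), names(call_args))
  for (nm in names(named))
    matched[[nm]] <- named[[nm]]
  unused <- setdiff(names(call_args), names(named))
  stopifnot(length(unused) == length(unnamed))
  for (i in seq_along(unused))
    matched[[unused[[i]]]] <- unnamed[[i]]
  matched
}

# Coerce: give a command line string the class of the value in the call
coerce_to <- function(value, template) {
  if (is.character(template)) return(value)
  if (is.logical(template)) return(as.logical(value))
  if (is.integer(template)) return(as.integer(value))
  if (is.numeric(template)) return(as.numeric(value))
  stop("class not handled in this sketch: ", class(template)[[1]])
}

call_args <- list(.data = "data/dataset.csv", n_iter = 5,
                  use_log = TRUE, .out = "out/model.rds")
matched <- match_args(call_args,
                      named = c(n_iter = "10", use_log = "TRUE"),
                      unnamed = c("data/dataset2.csv", "out/model2.rds"))
coerced <- Map(coerce_to, matched, call_args)

# Assign: put each coerced value into the environment under its name
env <- new.env()
for (nm in names(coerced))
  assign(nm, coerced[[nm]], envir = env)

str(mget(names(call_args), envir = env))
```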
The number of arguments passed through the command line must exactly
match the number of arguments specified in the call to
cmd_assign(). Values specified in the call to
cmd_assign() do not act as defaults. For instance,
in our example, cmd_assign() would not let us omit
use_log, and the following would be invalid:
Rscript model.R \
  data/dataset2.csv \
  out/model2.rds \
  --n_iter=10
The rationale for not having defaults is that specifying all the arguments makes a workflow clearer and safer.
Argument names starting with a dot
In all the examples so far, objects holding paths or filenames have conformed to a particular naming convention: they have all had names that start with a dot. For instance:
cmd_assign(.data = "data/dataset.csv",   # '.data'
           n_iter = 5,
           use_log = TRUE,
           .out = "out/model.rds")       # '.out'
The aim is to distinguish between values and references
to values. n_iter and use_log in the
example above hold values that are directly used in the analysis.
.data and .out, in contrast, specify the place
where the values used in the analysis are stored.
To access the values referred to by the “dot” variables, we use a
function such as readRDS() or read_csv(), as
in
data <- read_csv(.data)
Function cmd_assign() itself makes no distinction
between names that start with dots and names that do not. However,
functions extract_shell() and shell(), discussed
in Creating
a Shell Script, and extract_make() and
makefile(), discussed in Creating
a Makefile, do make a distinction. Moreover, the convention is
useful anyway, to separate values from references.
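A tool that treats the two kinds of argument differently only needs to look at the first character of each name. Here is a minimal base R sketch, which is illustrative only and not the package's internal code.

```r
# Names as they might appear in a call to cmd_assign()
arg_names <- c(".data", "n_iter", "use_log", ".out")

# Names starting with a dot are references to files
is_reference <- startsWith(arg_names, ".")

arg_names[is_reference]    # ".data" ".out"
arg_names[!is_reference]   # "n_iter" "use_log"
```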
Giving code files and outputs the same base name
Another feature of the examples so far is that R files have the same
base name as the files that they generate. File
fig_smoothed.R, for instance, generates
fig_smoothed.png, fig_smoothed_5.png,
fig_smoothed_8.png, and fig_smoothed_10.png, and
file model.R generates model.rds and
model2.rds. We rely on file extensions (eg .R
vs .png, or .R vs .rds) to
distinguish code from outputs, and we use suffixes (eg _5,
_8 and _10) to distinguish different versions
of the same output.
The same-base-names convention is common in programming, and is a good way to signal the relationship between code and outputs.
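The convention is easy to apply mechanically. As a small sketch, the hypothetical helper output_name() below (not part of the command package) builds an output filename from a script's base name, an extension, and an optional suffix.

```r
# Hypothetical helper: build an output filename from a script's base name
output_name <- function(script, ext, suffix = NULL) {
  base <- tools::file_path_sans_ext(basename(script))
  if (!is.null(suffix))
    base <- paste(base, suffix, sep = "_")
  paste0(base, ".", ext)
}

output_name("fig_smoothed.R", ext = "png")               # "fig_smoothed.png"
output_name("fig_smoothed.R", ext = "png", suffix = 8)   # "fig_smoothed_8.png"
```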
Putting calls to Rscript in a shell script
We can control a data analysis workflow by writing a shell script
with calls to Rscript.
We illustrate with a simple example. We need two more files. The
first file, called report.qmd, creates a report with two
figures:
---
title: "Smoothed Passenger Miles"
format: html
---
We smooth our data using a spline with 5 knots:
```{r}
#| label: fig_smoothed_5
#| echo: false
knitr::include_graphics("fig_smoothed_5.png")
```
And smooth it again using a spline with 10 knots
```{r}
#| label: fig_smoothed_10
#| echo: false
knitr::include_graphics("fig_smoothed_10.png")
```
The second file, called workflow.sh, is a shell script
that runs the whole workflow. It contains the following code:
Rscript fig_smoothed.R airmiles.csv 5 fig_smoothed_5.png
Rscript fig_smoothed.R airmiles.csv 10 fig_smoothed_10.png
quarto render report.qmd
(The third command in the code above calls the quarto
application directly, rather than through the R function quarto_render.)
We run workflow.sh.
✔ Assigned object `.airmiles` with value "airmiles.csv" and class "character".
✔ Assigned object `n_knot` with value 5 and class "numeric".
✔ Assigned object `.out` with value "fig_smoothed_5.png" and class "character".
null device
1
✔ Assigned object `.airmiles` with value "airmiles.csv" and class "character".
✔ Assigned object `n_knot` with value 10 and class "numeric".
✔ Assigned object `.out` with value "fig_smoothed_10.png" and class "character".
null device
1
processing file: report.qmd
1/5
2/5 [fig_smoothed_5]
3/5
4/5 [fig_smoothed_10]
5/5
output file: report.knit.md
pandoc
to: html
output-file: report.html
standalone: true
section-divs: true
html-math-method: mathjax
wrap: none
default-image-extension: png
variables: {}
metadata
document-css: false
link-citations: true
date-format: long
lang: en
title: Smoothed Passenger Miles
Output created: report.html
Our working directory now contains the two graphs and the report
(plus a directory, called report_files, created by
quarto).
.
├── airmiles.csv
├── fig_smoothed.R
├── fig_smoothed_10.png
├── fig_smoothed_5.png
├── fig_smoothed_8.png
├── report.html
├── report.qmd
├── report_files
└── workflow.sh
Putting calls to Rscript in a Makefile
An even better way to organize a data analysis workflow is to put the
Rscript commands in a Makefile. For an introduction to
Makefiles, see Project
Management with Make.
Here is the Makefile equivalent of the shell script above:
.PHONY: all
all: report.html

fig_smoothed_5.png: fig_smoothed.R airmiles.csv
	Rscript $^ $@ --n_knot=5

fig_smoothed_10.png: fig_smoothed.R airmiles.csv
	Rscript $^ $@ --n_knot=10

report.html: report.qmd fig_smoothed_5.png fig_smoothed_10.png
	quarto render $<
The line all: report.html says that our ultimate aim is
to create a file called report.html. The remaining lines
describe the steps to get there, including the output (eg
fig_smoothed_5.png), the inputs (eg
fig_smoothed.R and airmiles.csv), and the
action needed to produce the output from the inputs (eg
Rscript $^ $@ --n_knot=5). Modular
Workflows for Data Analysis contains more information on Makefiles,
including an explanation of the mysterious $^ and
$@ symbols.
Makefiles are run using the make application.
make is present on all Linux and macOS systems, but may
need installing on Windows.
Makefiles and make take time to master. A Makefile is,
however, an excellent way to document dependencies within a workflow.
make can also be used to update a workflow safely and
efficiently, because, when it updates, make keeps track of
dependencies.
Observe what happens, for instance, if we delete
fig_smoothed_5.png and then run make.
Rscript fig_smoothed.R airmiles.csv fig_smoothed_5.png --n_knot=5
✔ Assigned object `.airmiles` with value "airmiles.csv" and class "character".
✔ Assigned object `n_knot` with value 5 and class "numeric".
✔ Assigned object `.out` with value "fig_smoothed_5.png" and class "character".
null device
1
quarto render report.qmd
processing file: report.qmd
1/5
2/5 [fig_smoothed_5]
3/5
4/5 [fig_smoothed_10]
5/5
output file: report.knit.md
pandoc
to: html
output-file: report.html
standalone: true
section-divs: true
html-math-method: mathjax
wrap: none
default-image-extension: png
variables: {}
metadata
document-css: false
link-citations: true
date-format: long
lang: en
title: Smoothed Passenger Miles
Output created: report.html
make runs the code for creating
fig_smoothed_5.png, and then, because it knows that
report.html depends on fig_smoothed_5.png, it
runs the code for report.html too. It does not, however, run the code for
fig_smoothed_10.png, because it knows that
fig_smoothed_10.png does not depend on
fig_smoothed_5.png.
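The rule that make applies to each target can be sketched in R. This is a simplification: the helper needs_rebuild() is hypothetical, and real make also follows chains of dependencies, handles phony targets, and more. The core idea is that a target is rebuilt if it is missing or if any of its prerequisites has a later modification time.

```r
# Hypothetical helper: should 'target' be rebuilt from 'prerequisites'?
needs_rebuild <- function(target, prerequisites) {
  if (!file.exists(target))
    return(TRUE)  # a missing target is always rebuilt
  # otherwise rebuild only if some prerequisite is newer than the target
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Try it out in a temporary directory, using empty stand-in files
dir <- tempfile()
dir.create(dir)
script <- file.path(dir, "fig_smoothed.R")
csv <- file.path(dir, "airmiles.csv")
fig <- file.path(dir, "fig_smoothed_5.png")
file.create(script, csv)

needs_rebuild(fig, c(script, csv))  # TRUE, because fig does not exist yet
```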
Alternatives to cmd_assign()
Function commandArgs() in package R.utils is a
user-friendly version of the base R function. Packages providing tools
for parsing command line arguments include argparse, docopt, getopt, and optparse.
command is, however, the only package which focuses
specifically on processing inputs and outputs within data analysis
workflows.
The command package is designed for workflows where the
basic building block is the file. An alternative to this ‘file-oriented’
approach is a ‘function-oriented’ approach. Tools for implementing a
function-oriented approach are provided by package targets.