How to Use Docker for Study Reproducibility with R Markdown

Docker is a software product that allows for the efficient building, packaging, and deployment of applications. It uses containers, which are isolated environments that bundle software and its dependencies. A container can run an application on any other computer with the same software, dependencies, and settings that were present on the original machine, without affecting the host system. Docker differs from a virtual machine in that a container does not require its own guest operating system.

In research pipelines, introducing Docker allows any process in a project (e.g., data cleaning, statistical analyses, write-ups) to be containerized and shipped to collaborators, so that differences in operating systems (OS) or software versions do not affect the results. This benefits not only collaboration but also the overall reproducibility of research. Reproducibility is the ability to obtain identical results from the same statistical analyses on the same data, regardless of the researcher or machine used (see Peikert & Brandmaier, 2021, for further discussion). Docker takes a snapshot of the code, dependencies, and other components used in a project; that snapshot can be sent to other machines, which can then reproduce the same results.

The goal of this article is to provide a tutorial for researchers on how to use Docker in a research project with R and R Markdown. In order to use Docker, we write something called a Dockerfile. The Dockerfile is a text file that contains all of the commands used to create an image. An image is the instruction set for creating a container; it is the snapshot of what the container will be. The container is where the application runs without affecting the rest of the system; it is completely isolated. In sum, a Dockerfile is used to build an image, which is the template for a container. There are many premade images available for use with Docker; the largest repository of these is Docker Hub.

Using Docker with R and R Markdown

To demonstrate how to integrate Docker with R and R Markdown, we are going to use an .Rmd file that conducts a statistical analysis based on data in an .RData file and generates a write-up of those results. The .Rmd file only uses one R package for simplicity’s sake. To extend this example to larger projects that use multiple packages, loading all necessary packages in a single code chunk of the .Rmd file will make the process of using Docker more convenient, as you can easily identify all packages (and their versions) to include in a Docker image.
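
As a rough sketch of how package names could be collected from such a loading chunk, the shell snippet below scans a file for library() and require() calls. The sample file contents, the list_packages function name, and the second package (ggplot2) are assumptions made for illustration and are not part of the example project; the pattern only handles unquoted or double-quoted package names.

```shell
# Hypothetical stand-in for the package-loading chunk of an .Rmd file.
rmd_file="$(mktemp)"
printf '%s\n' \
  'library(lavaan)' \
  'library("ggplot2")' \
  'library(lavaan)' > "$rmd_file"

# list_packages FILE: print each package named in a library() or require()
# call once, sorted alphabetically. Handles unquoted or double-quoted names.
list_packages() {
  grep -oE '(library|require)\("?[A-Za-z][A-Za-z0-9.]*' "$1" |
    sed -E 's/^(library|require)\("?//' |
    sort -u
}

list_packages "$rmd_file"
```

Each name printed this way could then be passed to packageVersion() in R to find the version to record for the Docker image.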

Download and Install Docker Desktop

Before beginning this tutorial, make sure to download and install Docker Desktop. We recommend that you close all applications and save all files before installing Docker.

Testing Docker Desktop

In order to run anything, Docker Desktop must be open and running on your machine. To test whether it’s working, open Docker Desktop. Then, in Bash (Linux or Windows Subsystem for Linux), Command Prompt (Windows), or zsh (Mac via Terminal), run:


docker run hello-world

If Docker is installed correctly, you should see:


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
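
If the test does not produce this message, it can help to check whether the docker command is missing or whether Docker Desktop is simply not running. The following is a minimal sketch of such a check; the function name docker_status and its three status labels are invented for this illustration, and it assumes that docker info exits with a nonzero status when it cannot reach the Docker daemon.

```shell
# docker_status: report whether the docker command exists and whether the
# Docker daemon (started by Docker Desktop) can be reached.
docker_status() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "not-installed"              # docker is not on the PATH
  elif ! docker info >/dev/null 2>&1; then
    echo "installed-but-not-running"  # assumed: nonzero exit without a daemon
  else
    echo "running"
  fi
}

docker_status
```

A result of installed-but-not-running would suggest opening Docker Desktop and trying the hello-world test again.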

Using Docker

We are going to go through a simple analysis and write-up example using the HolzingerSwineford1939 dataset (Holzinger & Swineford, 1939) available in the R package lavaan (Rosseel, 2012). For the remainder of this article, CLI (command-line interface) will refer generically to Bash (Linux or Windows Subsystem for Linux), Command Prompt (Windows), or zsh (Mac via Terminal).

  1. On your Desktop, create a folder called "Docker_Project." Throughout the process of using Docker, do not include any spaces in the names of files or folders. Download and unzip the folder containing the files needed for this example.
  2. Save the "Paper.Rmd" file to the "Docker_Project" folder.
  3. Save the "HolzingerSwineford1939.RData" file to the "Docker_Project" folder.
  4. Open Docker Desktop (we will be running Docker commands in a CLI; however, the commands will only run if Docker is open and running).
  5. Open a CLI.
  6. Set the current directory to the folder containing our project. We do this using the cd command. To use cd, simply type cd followed by the path to your Docker_Project folder, such as: cd Desktop/Docker_Project.
  7. Next, for Windows (Command Prompt), run: type nul > Dockerfile; for Mac or Linux, run: touch Dockerfile.
    • This will create an empty text file called "Dockerfile" in your current directory.
    • In the Docker_Project folder there should be three files now: "Dockerfile," "HolzingerSwineford1939.RData," and "Paper.Rmd."
  8. In the Dockerfile we will define the following instructions:
    • FROM
      • Defines the parent image, which is the foundation of the image you're building (in this example, we will use a premade image from Docker Hub).
      • This must be the first line in the Dockerfile.
    • RUN
      • Specifies commands that will be executed to build the application.
    • ENV
      • Sets the values of environment variables (we will use it here to tell R Markdown where to look for pandoc).
    • COPY
      • Copies files to the filesystem of the container that you construct.
    • WORKDIR
      • Sets the working directory for Docker instructions (e.g., RUN commands).
    • For more information on the instructions that can be used in a Dockerfile, see the Dockerfile reference or Best practices for writing Dockerfiles in the Docker documentation.
  9. Open up the Dockerfile using your preferred text editor; right now, it's blank. First, we need to decide which parent image to use. Since we used R for this project, we're going to use the rocker parent image. To figure out what version of R to specify in the FROM line, run the following in R:
    
    version$major
    
    
    
    [1] "4"
    
    
    
    version$minor
    
    
    
    [1] "2.2"
    
    
    • This means we'll set the FROM argument to: FROM rocker/r-ver:4.2.2.
  10. Next, we need to put in the Dockerfile all necessary information to run our project by setting the RUN instruction. We will have more than one in this file.
    • Since we're using an .Rmd file that knits to a .pdf, we need to indicate in the Dockerfile all the software and packages required to knit (e.g., R Markdown, pandoc, texlive, and latex).
      • We're going to use the apt-get update command to refresh the list of available system packages from their sources. To install R Markdown, pandoc, texlive, and latex, we will use the following:
      
      RUN apt-get update \
        && apt-get install -y --no-install-recommends \
          wget \
          graphviz \
          texlive-latex-extra \
          lmodern \
          perl && \
          /rocker_scripts/install_pandoc.sh && \
          install2.r rmarkdown
      
      
    • We also want to make sure to install appropriate versions of the R packages used in our analysis. To do this, we'll use the remotes package in R.
      • The loadingPackages code chunk in "Paper.Rmd" has all the R packages used in the analysis. We can grab those package names and use the packageVersion() function to see which version of each was used:
        
        packageVersion("lavaan")
        
        
        
        [1] '0.6.14'
        
        
      • Whatever version of lavaan is available on the machine is the version to specify in the Dockerfile. In this example, the version is 0.6.14. To ensure that Docker installs a specific version of an R package, we'll use install_version() from the remotes package (Csárdi et al., 2021). Therefore, we must have the Dockerfile install both remotes and lavaan in RUN lines:
        
        RUN Rscript -e "install.packages('remotes')"
        RUN Rscript -e "remotes::install_version('lavaan', '0.6.14')"
        
        
    • Note that we must use single quotes around package and version names here to account for the fact that the entire expression is wrapped in double quotes.
  11. We set the ENV line so that R Markdown knows where to look for pandoc inside the container; we will set it to:
    
    ENV RSTUDIO_PANDOC=/usr/lib/rstudio/bin/pandoc
    
    
  12. We next use COPY to copy local files into the container's filesystem, and we specify WORKDIR to indicate the working directory for commands in the Dockerfile. Here, we'll set the target location for files in the container to /home/user/.
    • COPY takes two arguments: The first is "."; this indicates that files should be copied from the current working directory. The second argument is the location in the container's filesystem that we want files to be copied to.
      
      COPY . /home/user
      
      
    • WORKDIR only needs one argument: The location we want to set as the working directory for instructions in the Dockerfile.
      
      WORKDIR /home/user/
      
      
  13. Since our goal is to knit the .Rmd file into a .pdf, we are going to add one final RUN line to tell Docker to render the "Paper.Rmd" file.
    
    RUN Rscript -e "rmarkdown::render('Paper.Rmd')"
    
    
  14. This is the final Dockerfile:
    
    FROM rocker/r-ver:4.2.2
    
    RUN apt-get update \
      && apt-get install -y --no-install-recommends \
        wget \
        graphviz \
        texlive-latex-extra \
        lmodern \
        perl && \
        /rocker_scripts/install_pandoc.sh && \
        install2.r rmarkdown
    
    RUN Rscript -e "install.packages('remotes')"
    
    RUN Rscript -e "remotes::install_version('lavaan', '0.6.14')"
    
    ENV RSTUDIO_PANDOC=/usr/lib/rstudio/bin/pandoc
    
    COPY . /home/user
    WORKDIR /home/user/
    
    RUN Rscript -e "rmarkdown::render('Paper.Rmd')"
    
    
  15. With the completed Dockerfile in the Docker_Project folder, we can build the image. To do so, run the following in your CLI:
    
    docker build . -t analysis
    
    
    • docker calls Docker.
    • build builds the image.
    • . indicates that the image should be built from the current directory.
    • -t indicates that the following text is the name of the image we are creating, which we are calling analysis.
  16. Next, run in your CLI:
    
    docker run analysis
    
    
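    • This creates and starts a container from the analysis image. The instructions in the Dockerfile, including the rendering of "Paper.Rmd," were already executed when the image was built in the previous step, so the rendered "Paper.pdf" is stored in the directory we specified with WORKDIR.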
    • Note that this may take a while to run, especially if it is the first time running it.
  17. Next, run the following to examine the list of containers you've constructed:
    
    docker ps -a
    # See a cheatsheet of Docker commands here:
    # https://docs.docker.com/get-started/docker_cheatsheet.pdf
    
    
    • All containers, including stopped ones, will be listed, and in the NAMES column, you'll see the name of the container we just constructed. We can copy the .pdf from this container and save it into our Docker_Project folder. (The name of the container in the NAMES column is simply a unique identifier for each container.)
  18. To save the .pdf to the Docker_Project folder, we can use Docker's cp command as follows:
    
    docker cp CONTAINER_NAME:/home/user/Paper.pdf TARGET_DIRECTORY
    # For example, with a container named "excited_eagle" and a
    # target directory of "/Users/jane/Desktop/Docker_Project":
    docker cp excited_eagle:/home/user/Paper.pdf /Users/jane/Desktop/Docker_Project
    
    


Now we have rendered a .pdf from R Markdown using Docker and saved it to our local project directory.
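
Steps 15 through 18 could also be collected into a short script. The sketch below is one possible arrangement rather than part of the workflow described above: it uses docker create with the --name option so that the container has a known name (paper_container, chosen here for illustration) instead of one looked up with docker ps -a, and it removes the container after copying. The run helper, the DRY_RUN and DEST_DIR variables, and the extract_pdf function are likewise assumptions of this sketch. By default it only prints the commands it would run.

```shell
image_name="analysis"             # image name used in step 15
container_name="paper_container"  # hypothetical container name
dest_dir="${DEST_DIR:-.}"         # where to save Paper.pdf (default: current directory)

# run CMD...: print the command when DRY_RUN is 1 (the default);
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build the image, create a named container from it, copy the rendered
# .pdf out of the container, and then remove the container.
extract_pdf() {
  run docker build . -t "$image_name" &&
    run docker create --name "$container_name" "$image_name" &&
    run docker cp "$container_name:/home/user/Paper.pdf" "$dest_dir" &&
    run docker rm "$container_name"
}

extract_pdf
```

To execute the commands rather than print them, the script could be run from the Docker_Project folder with DRY_RUN=0 set in the environment.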


References

  • Csárdi, G., Hester, J., Wickham, H., Chang, W., Morgan, M., & Tenenbaum, D. (2021). remotes: R package installation from remote repositories, including 'GitHub' (Version 2.4.2) [Computer software]. https://CRAN.R-project.org/package=remotes
  • Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48, xi + 91.
  • Nüst, D., Sochat, V., Marwick, B., Eglen, S. J., Head, T., Hirst, T., & Evans, B. D. (2020). Ten simple rules for writing Dockerfiles for reproducible data science. PLoS Computational Biology, 16(11), e1008316. https://doi.org/10.1371/journal.pcbi.1008316
  • Peikert, A., & Brandmaier, A. M. (2021). A reproducible data analysis workflow with R Markdown, Git, Make, and Docker. Quantitative and Computational Methods in Behavioral Sciences, 2021, e3763. https://doi.org/10.5964/qcmb.3763
  • Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36. https://doi.org/10.18637/jss.v048.i02

Laura Jamison
StatLab Associate
University of Virginia Library
May 12, 2023


For questions or clarifications regarding this article, contact statlab@virginia.edu.
