Spark's Work

Creating and Using Reproducible Environments in Julia, Python and R

Having a reproducible environment is extremely important in data science. A reproducible environment ensures that multiple collaborators, or even the same person at different points in time, will always get the same results when running the same code.

I have had very different experiences with different languages when it comes to reproducibility. Overall, Julia probably offers the best experience, followed by Python and then R. Although I am by no means an expert in any of these three languages, my experience should be representative of someone with moderate programming skills who just wants to analyze data and get their work done.

Julia

For Julia, reproducibility of packages can be easily accomplished by the Project.toml and Manifest.toml files in each Julia project directory (see here).

Moreover, the DrWatson.jl package makes it extremely easy to manage scientific projects, especially when there are multiple simulation studies to run. Its official GitHub repo and documentation can be found here.

I have successfully used it to demonstrate the usage of LRMoE.jl to my colleagues. In fact, anyone can easily get familiar with our package by following the instructions here.

Overall, I have no complaints about Julia when it comes to creating and using reproducible environments for data science.

Python

For Python, I have decided to settle for virtual environments, which function basically the same way as the Julia .toml files.

The venv module (here) for virtual environments is included in the Python standard library and doesn't require additional installation. Other packages such as virtualenv (see here) may also be useful, but I will just focus on venv.

In each project folder, I can use the following command to create a new virtual environment.

python3 -m venv /path/to/new/virtual/environment
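The same step can also be scripted from Python, since the standard library exposes venv.create. The following is only a minimal sketch: the helper name create_env is my own, and with_pip=False is an assumption chosen to keep creation fast and offline (the command above bootstraps pip by default).

```python
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment at env_dir and return its path.

    create_env is a hypothetical helper name, not part of the venv module.
    """
    env_path = Path(env_dir)
    # with_pip=False skips bootstrapping pip; pass with_pip=True to get
    # an environment closer to what `python3 -m venv` produces.
    venv.create(env_path, with_pip=False)
    return env_path
```

Every environment created this way contains a pyvenv.cfg file at its root, which is a quick way to confirm that a folder really is a virtual environment.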

Afterwards, use the following to activate the virtual environment. This is basically a dedicated environment to install all package dependencies needed for the project.

source /path/to/new/virtual/environment/bin/activate

Now, everything executed in Python will be contained in this environment. In particular, pip install will put the installed packages in the virtual environment without messing up the system package library.
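To double-check that a script really runs inside the activated environment, one option is to compare sys.prefix with sys.base_prefix. This is a small sketch that assumes the environment was created with venv; the function name in_virtualenv is my own.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("packages will be installed under:", sys.prefix)
```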

In addition, pip can also export all packages currently installed into one single file.

pip freeze > requirements.txt
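To get a feel for what ends up in requirements.txt, a rough approximation of the output can be built with importlib.metadata from the standard library. This is only a sketch and not a replacement for pip freeze, which treats editable installs and some other cases differently; the function name frozen_requirements is my own.

```python
from importlib.metadata import distributions


def frozen_requirements():
    """Return sorted 'name==version' pins for the installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(frozen_requirements()))
```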

Whenever there is a need to recreate an environment with the same set of packages, just tell pip to install all packages listed in the requirements file (ideally in another virtual environment).

pip install -r requirements.txt
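After installing, it can be reassuring to verify that the installed versions actually match the pinned ones. The sketch below assumes the requirements file contains only simple name==version lines, as written by pip freeze for ordinary packages; anything else (editable installs, URLs, environment markers) is skipped or not handled. The function name check_pins is my own.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(lines):
    """Return {name: (wanted, installed)} for every pin that does not match.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for raw in lines:
        line = raw.strip()
        # Ignore blank lines, comments, and anything that is not a simple pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    with open("requirements.txt") as handle:
        for name, (wanted, installed) in check_pins(handle).items():
            print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every simple pin in the file matches what is installed in the current environment.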

Overall, I think it is relatively easy to manage a reproducible virtual environment in Python once you have done it a couple of times.

R

Now comes the dreaded R language. I have so often run into the situation where a package has been installed successfully but cannot later be found by library(somepackage) in the R script.

The issue with reproducibility in R seemingly comes down to how R handles and manages packages (which, from the handful of random Internet posts I have read, is an absolute mess). An excellent blog post (here) gives a good summary of how R works in this regard.

There are some R packages available for package management such as pkgsnap (see here). However, here I will just summarize the steps for creating a separate package library for each project, which typically serves my needs.

First, create a separate folder for installing all packages for a particular project, e.g. R-pkg-for-proj.

Then, add this directory to the R library location. After this step, R will automatically search this directory for whatever package is used in R scripts.

.libPaths("R-pkg-for-proj")

However, I do not like this approach as it may interfere with the system package library and may not be reproducible. Instead, one can opt to explicitly specify the library location when installing and using the package.

install.packages("ggplot2", lib="R-pkg-for-proj")
library("ggplot2", lib.loc="R-pkg-for-proj")

As usual, R is inconsistent with its function argument names (lib when installing, but lib.loc when loading). After this step, one may consider using pkgsnap to take a snapshot of all packages installed in this folder, which can then be shared for reproducibility.

Overall, I have not had very positive experiences with package management in R, which is one of the many reasons why the R language would not be my first choice for data analysis.