Having a reproducible environment is extremely important in data science. A reproducible environment ensures that multiple collaborators, or even the same person at different points in time, will always get the same results when running the same code.
I have had very different experiences with different languages when it comes to reproducibility. Overall, Julia probably offers the best experience, followed by Python and then R. Although I am in no way an expert in any of these three languages, my experience certainly speaks for someone with moderate programming skills who just wants to analyze data and get their work done.
For Julia, reproducibility of packages can be easily accomplished by the Project.toml and Manifest.toml files in each Julia project directory (see here).
Moreover, the DrWatson.jl package makes it extremely easy to manage scientific projects, especially when there are multiple simulation studies to run. Its official GitHub repo and documentation can be found here.
I have successfully used it to demonstrate the usage of LRMoE.jl to my colleagues. In fact, anyone can easily get familiar with our package by following the instructions here.
Overall, I have no complaints about Julia when it comes to creating and using reproducible environments for data science.
For Python, I have decided to settle for virtual environments, which function basically the same way as the Julia .toml files.
The venv package (here) for virtual environments is included in the Python standard library and doesn't require additional installation. Other packages such as virtualenv (see here) may also be useful, but I will just focus on venv.
In each project folder, I can use the following command to create a new virtual environment.
python3 -m venv /path/to/new/virtual/environment
Afterwards, use the following to activate the virtual environment. This is basically a dedicated environment to install all package dependencies needed for the project.
source /path/to/new/virtual/environment/bin/activate
Now, everything executed in Python will be contained in this environment. In particular, pip install will put the installed packages in the virtual environment without messing up the system package library.
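As a quick sanity check that the activation actually worked, one can ask Python itself whether it is running inside a virtual environment. The following is only a minimal sketch, assuming the environment was created with venv (older versions of virtualenv behave differently); the function name in_virtualenv is my own and not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the current interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory,
    while sys.base_prefix still points at the base Python installation.
    Outside a venv, the two are identical.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("environment prefix:", sys.prefix)
```

If this reports that no virtual environment is active, pip install would most likely write to the system or user package library instead.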
In addition, pip can also export all packages currently installed into one single file.
pip freeze > requirements.txt
Whenever there is a need to recreate an environment with the same set of packages, just tell pip to install all packages listed in the requirements file (ideally also in another virtual environment).
pip install -r requirements.txt
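To confirm that a recreated environment really matches the requirements file, the pinned versions can be compared against what is installed. This is a rough sketch rather than a complete tool: it assumes Python 3.8 or later (for importlib.metadata) and only understands plain name==version lines as written by pip freeze, silently skipping anything else (comments, editable installs, URL requirements). The helper names parse_pins and find_mismatches are hypothetical.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines; skip blanks, comments, and other forms."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for every package that differs.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Typical usage would be to read requirements.txt, pass its contents to parse_pins, and then call find_mismatches on the result; an empty dictionary means every pinned package is installed at the pinned version.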
Overall, I think it is relatively easy to manage a reproducible virtual environment in Python once you have done it a couple of times.
Now comes the dreaded R language. I have so often run into cases where a package was installed successfully but could not later be located by library(somepackage) in the R script.
The issue with reproducibility in R seemingly comes down to how R handles and manages packages (which, from the handful of random Internet posts I have read, is absolutely a mess). An excellent blog post (here) gives a good summary of how R works in this regard.
There are some R packages available for package management such as pkgsnap (see here). However, here I will just summarize the steps for creating a separate package library for each project, which typically serves my needs.
First, create a separate folder for installing all packages for a particular project, e.g. R-pkg-for-proj.
Then, add this directory to the R library location. After this step, R will automatically search this directory for whatever package is used in R scripts.
.libPaths("R-pkg-for-proj")
However, I do not like this approach as it may interfere with the system package library and may not be reproducible. Instead, one can opt to explicitly specify the library location when installing and using the package.
install.packages("ggplot2", lib="R-pkg-for-proj")
library("ggplot2", lib.loc="R-pkg-for-proj")
As usual, R is inconsistent with its function argument names (lib when installing, but lib.loc when loading). After this step, one may consider using pkgsnap to take a snapshot of all packages installed in this folder, which can then be shared for reproducibility.
Overall, I have not had very positive experiences with package management in R, which is one of the many reasons why the R language would not be my first choice for data analysis.