Spark's Work

Running R on Google Cloud Platform

  1. Introduction
  2. Creating a Virtual Machine
  3. Connecting to the VM via Secure Shell (SSH) using Bitvise
  4. Installing and configuring R on the VM
  5. Running an R script in the background
  6. Concluding remarks

Introduction

Recently, I have been learning how to use the Google Cloud Platform (GCP) for moderate-scale computing problems. It took me about an hour to set up an R environment on GCP (not for the first time) for running some code for a research paper. Yes, I know we have access to computing resources offered by the university, but sometimes everyone is using them, or sometimes I just need more than what is allocated to each individual student.

Anyway, I will jot down some notes on the steps to get an environment ready for running R on GCP. Naturally, Google is a good friend for a tech rookie such as myself. In addition, without the help of LeXtudio, it would probably have taken me much longer to figure everything out alone.

In this note, I will briefly go through the key steps listed in the outline above.

Creating a Virtual Machine

I will skip the initial steps of setting up a Google Cloud account. New users get $300 of free credit, valid for one year. While using some cloud resources requires credit card information, the card will not be charged until the free credit is exhausted.

Afterwards, the official documentation here introduces how to set up a Linux virtual machine on Google Cloud.

(Side note: The very first time I used Google Cloud back in 2018, I actually set up virtual machines with a full Windows system. The cost turned out to be just too high - if the virtual machine is for computing purposes only, then a graphical user interface (GUI) is not necessary. A Linux system with a command line suffices and is much more cost-efficient.)

For my case, I chose a virtual machine of machine type e2-standard-16, located in the zone us-central1-f, with 16 vCPUs and 64 GB of memory - good enough for general purposes. As my goal is to do some parallel computing, I chose 16 vCPUs, which cost around $0.669 per hour - not high, considering I can just run the machine for 10 hours or so and shut it down afterwards.
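For reference, the same machine can also be created from the gcloud command-line tool instead of the web console. This is only a sketch - it assumes the Google Cloud SDK is installed and a project is already configured, and the instance name "r-compute" and Debian image are my own hypothetical choices.

```shell
# Sketch: create the VM from the gcloud CLI (instance name "r-compute"
# and the Debian image are hypothetical choices, not from the console setup).
gcloud compute instances create r-compute \
    --zone=us-central1-f \
    --machine-type=e2-standard-16 \
    --image-family=debian-11 \
    --image-project=debian-cloud
```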

Some details to notice:

Connecting to the VM via Secure Shell (SSH) using Bitvise

Now that the virtual machine is created on the cloud, the next step is to establish a connection via which to tell the machine to do stuff - this is accomplished by typing commands in a secure shell (SSH). In addition, I will also need to transfer files (e.g. R scripts) between my local machine and the cloud - this is done via SFTP, i.e. the SSH File Transfer Protocol.

Previously, I used PuTTY for command lines and FileZilla for file transfer. It turns out that Bitvise can do both tasks, so I will use this as an example.

In order to establish a secure connection between my local machine and the cloud, there has to be some authentication procedure for logging in.

For my university's server, I need to do the following (quite easy and straightforward):

For the cloud machine I set up by myself, the procedure of connection is somewhat reversed:

(Side note: The mechanism behind the scenes is roughly as follows (as explained by Lex to a computer-network layman like me). The key pair is generated on my local machine: the public key is uploaded to the cloud machine, while the private key never leaves my local machine. When my local machine requests a connection, the cloud machine uses the public key to encrypt a challenge; only my machine, holding the matching private key, can decrypt it. If the challenge is answered correctly, the connection is established.)
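Generating the key pair itself is a one-liner. Below is a minimal sketch using the standard ssh-keygen tool (Bitvise has its own key manager, but it produces the same kind of pair); the file name gcp_key and the comment "myusername" are hypothetical - Google Cloud reads the VM username from the comment when the public key is pasted into the console.

```shell
# Sketch: generate an RSA key pair for the cloud connection.
# "gcp_key" and the comment "myusername" are hypothetical placeholders;
# -N "" sets an empty passphrase for simplicity.
ssh-keygen -t rsa -b 4096 -f gcp_key -N "" -C "myusername"

# gcp_key      -> private key, stays on the local machine
# gcp_key.pub  -> public key, pasted into the VM's SSH-keys metadata on GCP
cat gcp_key.pub
```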

Installing and configuring R on the VM

After connecting to the machine, I will need to install R. I referred to here. There are three commands to run in the SSH terminal.

The first command updates the system package index - it basically checks what updates are available for the packages currently installed on the Linux virtual machine.

sudo apt update

The second command downloads and installs all updates for installed packages on the machine.

sudo apt -y upgrade

Finally, install R using the following command. As of this writing, the default installation is R version 3.6.3.

sudo apt -y install r-base

Since I set up the virtual machine myself, my account is the administrator by default - meaning I can install whatever I want on the machine by prefixing commands with sudo (short for "superuser do"). This is another advantage of using my own cloud machine.

Once installed, I can just start R by typing

R

and voilà, here comes the familiar R console. Since this is a vanilla installation of R, the next step is naturally to install some packages. For R users, this could not be easier. For example, after starting R, I would like to install the devtools package.

install.packages("devtools")

R tells me that the default package library is not writable. I will just accept its suggestion and create a personal library for installing packages.

Installing packages for the first time might take quite a while, and can sometimes fail. In my case, the first attempt failed because some packages could not be installed, one of which was curl. I tried to install it individually and got the following error message.

------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libcurl was not found. Try installing:
 * deb: libcurl4-openssl-dev (Debian, Ubuntu, etc)
 * rpm: libcurl-devel (Fedora, CentOS, RHEL)
 * csw: libcurl_dev (Solaris)
If libcurl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libcurl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------

After a bit of searching, I figured out that this means the package libcurl4-openssl-dev is missing from my Linux machine (the first line, deb, means Debian, which corresponds to my case). This is an easy fix - I will quit R and install the package using an apt command similar to the ones above.

q() # Leave R (answer "n" when asked whether to save the workspace)
sudo apt -y install libcurl4-openssl-dev

Of course, a more efficient way is to first check all the Linux packages needed, and then install them all at once.
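As a sketch of this batch approach: the system libraries below are the ones that popular R packages such as curl, openssl, and xml2 commonly compile against on Debian/Ubuntu. The exact list depends on which R packages you plan to use, so treat this as an illustrative starting point rather than a definitive recipe.

```shell
# Sketch: install several common build dependencies of R packages at once.
# Which -dev libraries you need depends on the R packages you intend to use.
sudo apt -y install libcurl4-openssl-dev libssl-dev libxml2-dev
```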

After installing some basic R packages, I also went back to the Google Cloud console and made a snapshot of my virtual machine - in case I mess up the system, I can just restore it to a clean state with the environment already set up. This can be done by clicking "Compute Engine" -> "Snapshots" -> "Create Snapshot".

Running an R script in the background

The previous section already shows how to run some basic R commands on the VM. However, my computation tasks are quite intensive and expected to take hours. In that case, running R scripts in the background is the way to go - I can just get them started, close the Bitvise terminal, and come back later for the results.

As introduced before, I will use Bitvise to transfer some R scripts from my local machine to the cloud. This is quite straightforward thanks to the friendly user interface of Bitvise.

Then, I will go to the SSH terminal again. First, I need to change to the directory containing the R script. The following command shows my current working directory.

pwd

Use this command to show what's in the current directory.

ls

Use the following command to change to another directory.

cd desired-directory

Once I have switched to the directory containing the R script, I will type the following command to run it in the background.

nohup R CMD BATCH somescript.R &

where nohup means "no hangup", so that the R program keeps running in the background even after I leave the SSH terminal.
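The nohup pattern can be sketched with a stand-in command, so it runs anywhere without R installed. In real use, R CMD BATCH writes the console output to somescript.Rout in the same directory, and that log can be inspected the same way as job.log below.

```shell
# Stand-in for "nohup R CMD BATCH somescript.R &": run a job in the
# background, immune to hangups, with all output captured in a log file.
nohup sh -c 'sleep 1; echo "job finished"' > job.log 2>&1 &
wait $!     # in real use, just close the terminal and come back later

# Later, check on the job (with R, inspect somescript.Rout instead):
grep "finished" job.log           # did the job reach the end?
# ps aux | grep "[R] CMD BATCH"   # (real use) is R still running?
```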

Whenever the R script requires loading some files/other scripts, it is crucial to correctly specify the working directory of R. This can be done by including the following code in the R script.

setwd("desired working directory on the virtual machine")

For my case, I used 15 cores to run some computation in parallel. The Google Cloud console shows that around 93% of CPU is in use (since I have 16 vCPUs in total). This is probably the most use I can get out of the current VM without crashing it. All I have to do is wait patiently and come back later to retrieve the results - job well done!

After finishing everything, I will need to stop the virtual machine by clicking some buttons on the Google Cloud console. Otherwise, I would keep being charged for as long as the VM is running.

Concluding remarks

I have attempted multiple times to set up an R environment on GCP, but this time was probably the least troublesome. Still, it took around an hour, most of which was spent setting up the public key for the secure connection. Therefore, I am writing down these notes in case I need to repeat the same procedure. Hopefully, it will take less time in the future, and I can avoid all the mistakes I previously made.