Recently, I have been learning how to use the Google Cloud Platform (GCP) for moderate-scale computing problems. It took me about an hour to set up (not for the first time) an R environment on GCP for running some code for a research paper. Yes, I know we have access to computing resources offered by the university, but sometimes everyone is using them, or sometimes I just need more than the amount allocated to each individual student.
Anyway, I will jot down some notes on the steps to get an environment ready for running R on GCP. Naturally, Google is a good friend for a tech rookie such as myself. In addition, without the help of LeXtudio, it would probably have taken much longer for me to figure everything out alone.
In this note, I will briefly go through the following key steps:
Creating a virtual machine (VM);
Connecting to the VM via Secure Shell (SSH) using Bitvise;
Installing and configuring R on the VM; and
Running an R script in the background.
I will skip the initial steps of setting up a Google Cloud account. You can get $300 of free credit valid for one year. While using some cloud resources requires credit card information, the card will not be charged until the free credit is exhausted.
After that, the official documentation here explains how to set up a Linux virtual machine on Google Cloud.
(Side note: The very first time I used Google Cloud back in 2018, I actually set up virtual machines with a full Windows system. The cost turned out to be just too high - if the virtual machine is for computing purposes only, then a graphical user interface (GUI) is not necessary. A Linux system with a command line suffices and is much more cost-efficient.)
For my case, I chose a virtual machine with the machine type
e2-standard-16. The machine is located in the zone
us-central1-f, with 16 vCPUs and 64 GB of memory, which is good enough for general purposes. As my goal is to do some parallel computing, I chose 16 vCPUs, which cost around $0.669 per hour - this is not high, considering I can just run the machine for 10 hours or so and shut it down afterwards.
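For reference, the same console setup can be sketched as a single gcloud CLI command. The instance name below is made up, and the Debian image family is an assumption (the post does not name the exact OS image); the machine type, zone, API scope, and HTTP/HTTPS tags mirror the choices described above.

```shell
# Hypothetical CLI equivalent of the console setup; "r-worker" and the
# Debian image are illustrative assumptions.
gcloud compute instances create r-worker \
  --machine-type=e2-standard-16 \
  --zone=us-central1-f \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --scopes=cloud-platform \
  --tags=http-server,https-server
```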
Some details to notice:
Allow full access to cloud APIs when setting up the machine (well, just in case I use cloud APIs provided by Google).
Also allow HTTP/HTTPS traffic. After attempting and failing to understand the fundamentals of computer networking, I just take these options as defaults.
Now that the virtual machine is created on the cloud, the next step is to establish a connection through which to tell the machine to do stuff - this is accomplished by typing commands in a Secure Shell (SSH). In addition, I will also need to transfer files (e.g. R scripts) between my local machine and the cloud - this will be done via SFTP, i.e. SSH File Transfer Protocol.
In order to establish a secure connection between my local machine and the cloud, there needs to be some authentication procedure for logging in.
For my university's server, I need to do the following (quite easy and straightforward):
Ask for a server address to connect to.
Once I request a connection, the system will prompt me to input login information, which is given by the system administrator.
For the cloud machine I set up by myself, the procedure of connection is somewhat reversed:
I first generate a Client Key file on my local machine. This is done in Bitvise: Click "Client key manager" -> "Generate new".
This will generate a public key, which is to be associated with the cloud machine: Click "Compute Engine" -> "Metadata" -> "SSH Keys", and add the generated key to the machine, with a customized username.
On my local Bitvise, I will input the following login information.
Server: "Host" is the External IP of my virtual machine (available on the Google Cloud console). "Port" is 22.
Username: Chosen when generating the public key.
Initial method: publickey.
Client key: The client key just generated by Bitvise.
Click "Log In". This should open up both an SSH terminal for typing commands, as well as an SFTP window for file transfer (just the good old drag-and-drop). Neat!
(Side note: The mechanism behind the scenes is roughly as follows (as explained by Lex to a computer networking layman like me). The key generated by Bitvise is actually a pair: a public key, which I register on the cloud machine, and a private key, which never leaves my local machine. When my local machine requests a connection, the cloud machine poses a challenge that only the holder of the matching private key can answer correctly. If the answer checks out, the connection is established.)
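For those not using Bitvise, the same key pair can be generated with OpenSSH's ssh-keygen; the file name and the username comment below are illustrative choices, not anything required by GCP.

```shell
# Generate a key pair locally (an alternative to Bitvise's "Generate new").
# The file name "gcp_key" and the comment "myusername" are made up.
ssh-keygen -t ed25519 -f ./gcp_key -N "" -C "myusername"
# The .pub half is what goes into Compute Engine -> Metadata -> SSH Keys;
# the private half (gcp_key) stays on the local machine.
cat ./gcp_key.pub
```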
After connecting to the machine, I will need to install R. I referred to here. There are three commands to run in the SSH terminal.
The first command updates the system package indexes - it basically checks what updates are available for the packages currently installed on the Linux virtual machine.
sudo apt update
The second command downloads and installs all updates for installed packages on the machine.
sudo apt -y upgrade
Finally, install R using the following command. As of my writing, the default installation is R version 3.6.3.
sudo apt -y install r-base
Since I set up the virtual machine myself, my account is by default the administrator - meaning that I can access and install whatever I want on the machine by starting the command with
sudo (short for "superuser do"). This is another advantage of using my own cloud machine.
Once installed, I can just start R by typing
R
and voila, here comes the familiar R console. Since this is a vanilla installation of R, the next step is naturally to install some packages. For R users, this is more than easy: after starting R, I simply call install.packages() with the packages I would like to install.
R tells me that the default directory of packages is not writable. I will just accept its recommendation and create a personal library for installing packages.
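As an aside, packages can also be installed non-interactively from the shell via Rscript, which avoids sitting through the prompts; the package name here is just an example, and the CRAN mirror is the generic cloud one.

```shell
# Install an R package without opening the interactive console.
# "curl" is an example package; any CRAN package name works the same way.
Rscript -e 'install.packages("curl", repos = "https://cloud.r-project.org")'
```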
Installing packages for the first time might take quite a while, or may sometimes fail. For example, in my case, the first attempt failed because some packages could not be installed, one of which was
curl. I tried to install it individually, and got the following error message.
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libcurl was not found. Try installing:
 * deb: libcurl4-openssl-dev (Debian, Ubuntu, etc)
 * rpm: libcurl-devel (Fedora, CentOS, RHEL)
 * csw: libcurl_dev (Solaris)
If libcurl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libcurl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
After a bit of searching, I figured out that this means the package
libcurl4-openssl-dev is missing from my Linux machine (the
deb line corresponds to Debian, which matches my case). This is easy to solve - I will quit R and install the package using a command similar to the ones above.
q() # Leave R
sudo apt -y install libcurl4-openssl-dev
Of course, a more efficient way is to first check all Linux packages needed, and then install them all at once.
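For instance, a typical batch for the system libraries that popular R packages (curl, openssl, xml2) link against could look like the following; this list is a common starting point on Debian/Ubuntu, not an exhaustive one.

```shell
# Debian/Ubuntu system libraries frequently required when compiling
# common R packages; install them all in one go.
sudo apt -y install libcurl4-openssl-dev libssl-dev libxml2-dev
```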
After installing some basic R packages, I also went back to the Google Cloud console and made a snapshot of my virtual machine - in case I mess something up with the system, I can just revert it to a clean state with the initial environment setup in place. This can be done by clicking "Compute Engine" -> "Snapshots" -> "Create Snapshot".
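Those console clicks also have a CLI equivalent; the disk and snapshot names below are made up for illustration (the boot disk usually shares the VM's name).

```shell
# Snapshot the VM's boot disk from the gcloud CLI; "r-worker" and the
# snapshot name are hypothetical placeholders.
gcloud compute disks snapshot r-worker \
  --zone=us-central1-f \
  --snapshot-names=r-worker-clean-setup
```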
The previous section already shows how to run some basic R commands on the VM. However, my computation tasks are quite intensive and expected to take hours. In this case, running R scripts in the background is the way to go - I can just get them started, close the Bitvise terminal, and come back later for the results.
As introduced before, I will use Bitvise to transfer some R scripts from my local machine to the cloud. This is quite straightforward thanks to the friendly user interface of Bitvise.
Then, I will go to the SSH terminal again. First, I will change the directory to the one containing the R script. The following command shows my current working directory.
pwd
Use this command to show what's in the current directory.
ls
Use the following command to change to another directory.
cd some/directory
Once I have switched to the directory containing the R script, I will type the following command to run it (in the background).
nohup R CMD BATCH somescript.R &
nohup stands for "no hangup", so the R program keeps running even after I leave the SSH terminal. R CMD BATCH runs the script non-interactively, the trailing & sends the job to the background, and the console output is written to somescript.Rout.
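The mechanics can be seen with a toy stand-in for a long R job; here a short shell command plays the role of somescript.R so the pieces (nohup, redirection, &) are visible without waiting hours.

```shell
# Toy stand-in for a long-running job: the command is detached from the
# terminal, its output is redirected to a log file, and & backgrounds it.
nohup sh -c 'sleep 1; echo "job finished"' > toy.log 2>&1 &
wait            # in real use, just close the terminal and come back later
cat toy.log     # the output survives the session
```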
Whenever the R script requires loading some files/other scripts, it is crucial to correctly specify the working directory of R. This can be done by including the following code in the R script.
setwd("desired working directory on the virtual machine")
For my case, I used 15 cores for running some computation in parallel. On the Google Cloud console, it shows that around 93% of CPU is in use (15 out of my 16 vCPUs). This is probably the most use I can get out of the current VM without crashing it. All I have to do is wait patiently and come back later to retrieve the results - job well done!
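A quick way to decide on the number of workers from the SSH terminal is to ask the machine itself; the "leave one core free" rule below matches the 15-of-16 choice above, though it is a rule of thumb rather than a requirement.

```shell
# Count the available vCPUs and leave one free for the system,
# as done above (15 workers on a 16-vCPU machine).
CORES=$(nproc)
echo "vCPUs available: $CORES"
echo "workers to use:  $((CORES - 1))"
```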
After finishing everything, I will need to stop the virtual machine by clicking some buttons on the Google Cloud console. Otherwise, I would keep being charged for as long as the VM is running.
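Stopping (and later restarting) the VM can also be done from the gcloud CLI; the instance name is again a made-up placeholder. Note that a stopped VM no longer incurs compute charges, though its persistent disk still does.

```shell
# Stop the instance to halt compute billing; start it again when needed.
gcloud compute instances stop r-worker --zone=us-central1-f
# gcloud compute instances start r-worker --zone=us-central1-f
```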
I have attempted to set up an R environment on GCP multiple times, but this time was probably the least troublesome. Still, it took around one hour, most of which was spent on setting up the public key for the secure connection. Therefore, I am writing down these notes in case I need to repeat the same procedure. Hopefully, it will take less time in the future, and I can avoid all the mistakes previously made.