Spark's Work

Running Code in Parallel

When running large, time-consuming machine learning tasks, such as 5-fold cross-validation or trying out different models, it is crucial to leverage parallelism, which can save hours or even days of computing time. In this short note, I will briefly describe how to use multiprocessing to complete tasks in parallel.

Before starting, I would like to note that multiprocessing and multithreading are different. I know they are different, but I am not sure exactly how. This will probably be a topic for another day.

R

My first time running something in parallel was in R with the help of two packages - foreach and doParallel, as well as reference documents here and here.

Essentially, the tasks are structured as a loop whose iterations are run in parallel. A minimal example is given as follows.

library(foreach)
library(doParallel)

cl <- makeCluster(10) # = number of worker processes to run in parallel
registerDoParallel(cl)

foreach(i = 1:100) %dopar% {
    print(i) # the task that needs to be run in parallel
}

stopCluster(cl) # stop the cluster

Within the %dopar% loop, I typically call another function that fits a machine learning model and then saves it as a data file for later analysis.

Python

There are many ways to do things in parallel with Python. Here I will focus on the multiprocessing package (recall that I had an interesting episode with its naming here).

An excellent reference for multiprocessing in Python is here. In Python, we can explicitly create and start processes. In addition, if we want the main process to wait for the child processes to finish before proceeding, we can use the join method (it is very important that join is called correctly). A minimal example is given below.

from multiprocessing import Process

def fun(i):
    return i

def main():
    proc = []

    for i in range(1, 101):
        p = Process(target=fun, args=(i, ))
        p.start()
        proc.append(p)
    
    for p in proc:
        p.join()

if __name__ == '__main__':
    main()
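To make "called correctly" a bit more concrete, here is a small sketch of one likely reading: start every process first and only then join them. If join is called right after each start, the main process blocks on that child and the tasks end up running one at a time. The function names and the 0.2-second sleep are hypothetical stand-ins for a real task.

```python
import time
from multiprocessing import Process


def fun(i):
    # Hypothetical stand-in for a slow task such as fitting a model.
    time.sleep(0.2)


def start_then_join(n_tasks):
    """Start every process first, then wait for all of them."""
    procs = [Process(target=fun, args=(i,)) for i in range(n_tasks)]
    begin = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - begin, [p.exitcode for p in procs]


def join_inside_loop(n_tasks):
    """Join right after each start: the tasks run one after another."""
    begin = time.perf_counter()
    codes = []
    for i in range(n_tasks):
        p = Process(target=fun, args=(i,))
        p.start()
        p.join()  # blocks here until this child has finished
        codes.append(p.exitcode)
    return time.perf_counter() - begin, codes


if __name__ == "__main__":
    print("start then join :", start_then_join(4))
    print("join inside loop:", join_inside_loop(4))
```

Both versions finish with exit code 0 for every child, but the second one should take roughly the sum of the individual task times, since only one child is alive at any moment.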

Results such as a fitted model can be saved from within the fun() function itself. Alternatively, we may consider collecting all results in a single list and saving them all together. In the latter case, we can use Manager in the multiprocessing package (an example is given here).
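As a rough sketch of the Manager approach: a value returned by the target function of a Process is discarded, so each child instead appends its result to a list created by the manager. The name run_all and the squared-number "result" are hypothetical placeholders for a real task such as returning a fitted model.

```python
from multiprocessing import Manager, Process


def fun(i, results):
    # Hypothetical task: the tuple stands in for a fitted model.
    results.append((i, i * i))


def run_all(n_tasks):
    with Manager() as manager:
        results = manager.list()  # proxy list shared across processes
        procs = [Process(target=fun, args=(i, results)) for i in range(n_tasks)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy into a plain list before the manager shuts down.
        # Sort because the order in which children finish is not guaranteed.
        return sorted(list(results))


if __name__ == "__main__":
    collected = run_all(5)
    print(collected)
```

Once the results are in an ordinary list, they can be saved in one go, for example with pickle.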

Julia

In Julia, multiprocessing can be accomplished with the standard package Distributed. An introductory note is given in the official documentation here. I'd like to leave it for another day because I need more time to experiment with it. :)