BSPRAMP: Partitioned Memory Parallel Programming Framework
Tarun Beri, Sorav Bansal, Subodh Kumar
We are designing a new framework for parallel programming called BSPRAMP (named after the Bulk Synchronous PRAM model, with the final `P' standing for partitioned memory). It consists of a simplified, distributed-shared-memory based programming model that lets the application developer focus mainly on task decomposition. The model is unified across many-core processors (e.g., CPUs and GPUs), multiple processors on a system, and multiple systems. We are also developing a library implementation as a proof of concept of the model, called PMLIB (Partitioned Memory parallel programming Library). The library efficiently maps tasks to multiple compute engines, performs the required communication, and schedules tasks to completion. In addition to convenience, the framework provides a race-free and deadlock-free programming environment by letting tasks own a partition of the memory, which simplifies programming significantly. We have run a number of experiments with PMLIB: array summation; three scientific applications (Matrix Multiplication, N-body Simulation and the Fast Fourier Transform); a commercial application that estimates stock prices using Monte Carlo methods; an implementation of PageRank, a popular example of MapReduce-style programming; and a zlib based compression program.
An important design decision for PMLIB is to provide a minimalistic interface for parallelizing programs, with optional controls to improve efficiency when necessary. The application need not deal with thread creation, management or destruction. Moreover, much of the GPU's complexity (such as CUDA kernel invocation and data transfer between CPU and GPU) is also handled automatically by PMLIB. The division of work into coarse-grained tasks with explicit synchronization points provides freedom from the pitfalls of fine-grained synchronization, such as race conditions and deadlocks, as well as its overheads. At present, we are working on techniques that provide fault tolerance and automatic dynamic load balancing among multiple heterogeneous processing elements in the cluster. The following sections provide more detail on the model and its implementation.
The BSPRAMP programming model is designed to incrementally parallelize sections of an existing or new sequential program. The input and output memory of the sequential code section to be parallelized is partitioned among different processing elements, each of which essentially runs code similar to the original sequential program, but on a reduced data set. The advantage of this approach is that the code can always be debugged in a sequential setup before being parallelized. After the input and output data are distributed, the corresponding processing elements use them to perform the required computations. When all the processing elements finish their computations, they enter a barrier synchronization mode where they optionally synchronize their computational results with other processing elements. The model also allows a reduction-type synchronization, where data from different processing elements are associatively reduced to a final value (e.g., data from different processing elements could be added together to produce the final result).
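The distribute/compute/synchronize cycle described above can be sketched in plain Python. This is only an illustration of the model's semantics; the names `run_task`, `subtask_fn` and `reduce_fn` are hypothetical and are not PMLIB's actual API.

```python
from functools import reduce

def run_task(data, num_subtasks, subtask_fn, reduce_fn):
    """Illustrative three-stage task: distribute, compute, synchronize."""
    # Stage 1: data distribution -- each subtask owns a disjoint partition.
    n = len(data)
    step = (n + num_subtasks - 1) // num_subtasks
    partitions = [data[i:i + step] for i in range(0, n, step)]
    # Stage 2: computation -- each subtask runs serial-style code on its
    # own partition only (in PMLIB these run in parallel).
    partials = [subtask_fn(p) for p in partitions]
    # Stage 3: synchronization -- partial results are associatively
    # reduced to a final value at the barrier.
    return reduce(reduce_fn, partials)

total = run_task(list(range(100)), 4, sum, lambda a, b: a + b)
```

Because each subtask sees only its own partition, no locking is needed anywhere in the cycle; races are ruled out by construction rather than by discipline.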
Each parallel computation in this model consists of a series of three-stage tasks, the stages being data distribution, computation and synchronization. The instance of a task that runs on each processing element in parallel is known as a subtask. The model supports running subtasks on all kinds of heterogeneous computing devices (shared memory based, distributed memory based, and independent graphics cards). The library implementation of this model, called PMLIB, uses OpenMP, MPI and CUDA underneath and allows parallelizing a program over the CPU and GPU cores of all the machines in a computer network. The heterogeneity of the underlying devices, scheduling, scalability, network communication, and thread and process management are completely hidden from the application programmer. The application just needs to provide CPU and GPU code for its specific computation; everything else is transparently managed by the library.
The following table highlights the major differences between the BSPRAMP model and the BSPRAM model. More details about BSPRAM are available in the paper "The Bulk-Synchronous Parallel Random Access Machine" by Alexandre Tiskin.
Programming Model Aspect | BSPRAM | BSPRAMP
Data Distribution | No input/output data distribution; all processes see the entire memory. | Each subtask owns and sees a partition of the input/output memory. Optionally, subtasks may own exclusive overlapping partitions (which could span the entire memory).
Data Visibility | Programs see two levels of memory organization: local exclusive and global shared. | Programs see only one level of partitioned exclusive memory.
There are certain other differences that show up in a typical implementation of the model and of the final parallel application. The following table lists those differences and the advantages of the BSPRAMP model over the BSPRAM model.
Implementation Aspect | BSPRAM | BSPRAMP
Computation Code | Serial code might need restructuring to fit the parallel context. | Subtask code is generally similar to the serial code; it just operates on a smaller data set. Most advantages of serial code, such as ease of debugging, are preserved.
Reduction | Implementations perform conflict resolution to arrive at a final value from a set. | Optimized hierarchical reduction of output partitions: first within each host in parallel, then across hosts.
Heterogeneity | A shared main-memory view prevents efficient incorporation of devices like GPU cards into the model. | A partitioned memory view allows partitions to be optimized for the memory hierarchy of the target device (e.g., global and shared GPU memories) and of the entire cluster.
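The hierarchical reduction mentioned above (within each host first, then across hosts) can be sketched as follows. The structure is illustrative; PMLIB's real implementation overlaps these steps with MPI communication.

```python
from functools import reduce

def hierarchical_reduce(per_host_partials, reduce_fn):
    """Reduce subtask outputs within each host first, then across hosts.

    per_host_partials: list of lists -- one inner list per host, holding
    the partial results produced by that host's subtasks."""
    # Intra-host reduction: each host combines its own subtasks' outputs
    # (in PMLIB every host does this step in parallel with the others).
    host_results = [reduce(reduce_fn, partials) for partials in per_host_partials]
    # Inter-host reduction: combine the per-host results across the cluster.
    return reduce(reduce_fn, host_results)

cluster = [[1, 2, 3], [4, 5], [6]]   # 3 hosts with 3, 2 and 1 subtasks
total = hierarchical_reduce(cluster, lambda a, b: a + b)
```

Reducing inside each host first keeps most of the combining traffic in fast local memory and sends only one value per host over the (comparatively slow) network.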
The table below lists some parallel programs created with PMLIB and their experimental results. Most of the experiments were performed on a cluster of four 64-bit, eight-core machines (Intel Xeon E5450 3.00 GHz CPUs) running Ubuntu Linux 8.04.2, connected by a 1 Gbps Ethernet network. Two of the hosts also have a Tesla C1060 card each. The last two columns of the table list the speedups achieved on the CPU and GPU clusters by the parallel PMLIB program over the corresponding serial implementation. Each experiment is described in more detail below.
Category | Experiment | Serial Time (s) | Parallel Time, CPU Cluster (s) | Parallel Time, GPU Cluster (s) | Speedup, CPU Cluster | Speedup, GPU Cluster
Scientific | Matrix Multiplication | 919.2 | 42.7 | 12.3 | 21.52x | 74.73x
Scientific | N-body Simulation | 2197864.0 | 7392.68 | 219.23 | 297.3x | 10025.3x
Scientific | Fast Fourier Transform | 40.7 | 8.79 | 24.77 | 4.63x | 1.64x
Commercial | Monte Carlo Stock Pricing | 5891.24 | 187.67 | 2.34 | 31.39x | 2517.6x
Data Analytic | PageRank | 1292.98 | 380.6 | 408.1 | 3.4x | 3.17x
Utility | Zlib Compression | 141.78 | 18.57 | - | 7.63x | -
Others | Array Summation | 135.76 | 8.58 | 2.52 | 15.8x | 53.87x
Matrix Multiplication
This application multiplies two square matrices of large dimensions. To parallelize the routine, we divide the rows of the first input matrix equally among subtasks. Each subtask sees the full second input matrix and computes its part of the result. We used a simple CUDA kernel for the experiment; our kernel uses only the device's (slow) global memory and ignores its (faster) shared memory. A more efficient CUDA kernel that better exploits the GPU's memory hierarchy is likely to scale even better. For the largest matrix size of 5000x5000, we obtain a speedup of about 75x. In general, the speedup increases with matrix size, as more computation is done per unit of communication for larger matrices due to the O(n^3) nature of the algorithm.
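The row-band decomposition described above can be sketched as follows. This is a minimal serial illustration of how the work is split (the function names are ours, not PMLIB's); in the real experiment each band is a subtask running on its own CPU or GPU core.

```python
def matmul(a, b):
    """Serial matrix multiply: rows of a times columns of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def parallel_matmul(a, b, num_subtasks):
    """Each subtask owns a band of rows of `a`, sees all of `b`, and
    computes the corresponding band of rows of the result."""
    n = len(a)
    step = (n + num_subtasks - 1) // num_subtasks
    bands = [a[i:i + step] for i in range(0, n, step)]  # data distribution
    partial = [matmul(band, b) for band in bands]       # per-subtask compute
    return [row for band in partial for row in band]    # concatenate results
```

Because the output bands are disjoint, the "synchronization" here is mere concatenation; no reduction is needed.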
N-body Simulation
This experiment is a simulation of a system of particles (physical bodies having mass) that exert mutual gravitational pull on each other. The motion of each particle is governed by the resultant force from all other bodies in the system. We use a brute-force O(n^2) approach to compute the gravitational forces between every pair of particles and, using PMLIB, partition the work equally among all available processing elements. We used the kernel from CUDA SDK version 2.3 for this experiment. On the GPU cluster, the parallel version of the benchmark is 326x faster than the serial version for a simulation of 50,000 bodies with 100 iterations. As with the matrix multiplication example, the advantage of using parallel processors increases with increasing computation. For a system of 100,000 bodies, the speedup (for 100 iterations) over the serial implementation is 10025.3x! This astonishing result is due to the large number of GPU threads running in parallel.
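The brute-force force computation partitions naturally: each subtask owns a slice of the output force array while reading the full position and mass arrays. A 2D sketch (our own simplified stand-in for the CUDA SDK kernel, not the experiment's actual code):

```python
import math

G = 6.674e-11  # gravitational constant, N m^2 / kg^2

def forces_on(slice_idx, pos, mass):
    """Net gravitational force on each body whose index is in slice_idx.

    A subtask owns only its output slice but reads every body's position
    and mass, mirroring the partitioned-output structure of the model."""
    out = []
    for i in slice_idx:
        fx = fy = 0.0
        for j in range(len(pos)):           # O(n) per body -> O(n^2) total
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy
            f = G * mass[i] * mass[j] / r2  # force magnitude
            r = math.sqrt(r2)
            fx += f * dx / r                # project onto x and y axes
            fy += f * dy / r
        out.append((fx, fy))
    return out
```

Since every subtask only reads the shared inputs and writes its own slice, the slices can be computed on different hosts with no synchronization until the barrier.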
Fast Fourier Transform
This experiment parallelizes a well-known discrete Fourier transform algorithm, the Fast Fourier Transform (FFT). We parallelize the computation of the 2D FFT of a matrix of randomly generated time-domain complex numbers. To parallelize, we first compute 1D FFTs over all rows of the matrix (in parallel) and then compute 1D FFTs over all columns of the resulting matrix (also in parallel). The 1D FFT computations on rows and columns are separated by a sync point (one task finishes and another starts). We use a CUDA kernel from the publicly available CUFFT library for this experiment. Due to the large amount of data transfer involved, the communication cost far exceeds the computation time. As a result, computation is fastest when done on one host rather than on the cluster. For an input matrix of dimension 2^12 x 2^15, the time taken for the 2D FFT on one host is 8.79 seconds, while on 4 hosts it is 35.67 seconds. The amount of data transferred in this process is more than 3.0 GB, which on our 1 Gbps network takes about 31.5 seconds. So the overall computation time spent by each host in the parallel setup is less than 3 seconds (this time includes the library's overheads: memory allocation, buffering, thread management and scheduling). We expect this experiment to scale better on multiple hosts if a faster network (e.g., 100 Gbps Ethernet) is available. On our experimental setup, we achieve a speedup of 4.63x by parallelizing on the 8 CPU cores of a single host.
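The rows-then-columns decomposition with a sync point in between can be shown concretely. Here a naive O(n^2) DFT stands in for the optimized CUFFT kernel, since the decomposition is identical regardless of which 1D transform is used:

```python
import cmath

def dft1d(x):
    """Naive 1D DFT (stand-in for the CUFFT kernel used in the experiment)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft2d(m):
    """2D transform as two tasks separated by a sync point."""
    rows = [dft1d(r) for r in m]           # task 1: 1D transforms over rows
    cols = list(zip(*rows))                # sync point: transpose the result
    done = [dft1d(list(c)) for c in cols]  # task 2: 1D transforms over columns
    return [list(r) for r in zip(*done)]   # transpose back to row-major order
```

The transpose between the two tasks is exactly where the all-to-all data exchange happens in the distributed version, which is why this experiment is dominated by communication.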
Monte Carlo Stock Pricing
Monte Carlo stock pricing computes the expected future price of a set of stock options by considering variations along many possible paths, each of which is randomly chosen using a normal distribution. The price of the stock is assumed to follow a stochastic differential equation on each path. The final price is the average of the expected prices over all paths, and this process is repeated for a large number of independent stock options. We parallelize by assigning different stock options to different subtasks. We used a kernel from CUDA SDK version 2.3 for our GPU implementation of the application. On our GPU cluster, we achieve a speedup of 2517x over the serial implementation for 1024 options, considering variations along 2^26 paths. Once again, the high speedup is due to the GPU's suitability for such applications.
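A minimal sketch of the per-option computation, assuming geometric Brownian motion for the stochastic differential equation (a standard choice; the experiment's exact model parameters are not given in the text):

```python
import math
import random

def expected_price(s0, mu, sigma, t, n_paths, seed=0):
    """Average terminal price over n_paths GBM samples:
    S_T = S0 * exp((mu - sigma^2/2) * t + sigma * sqrt(t) * Z), Z ~ N(0, 1).

    In the PMLIB version each subtask evaluates its own subset of options."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        total += s0 * math.exp((mu - 0.5 * sigma ** 2) * t
                               + sigma * math.sqrt(t) * z)
    return total / n_paths

# Each subtask would price its own slice of the option list:
prices = [expected_price(100.0, 0.05, 0.2, 1.0, 20000, seed=i) for i in range(4)]
```

Because the options are completely independent, there is no cross-subtask communication at all after data distribution, which is why this workload scales so well on the GPU cluster.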
PageRank
PageRank is an example of a data-analytic algorithm that computes the search rank of a webpage based on the web's hyperlink structure. The search rank of a webpage is the probability of a random surfer visiting it. The algorithm first uniformly initializes the rank of each page to a constant value, and then iteratively transfers the ranks of all web pages to their outlinks until the ranks of all pages converge (or until a maximum number of iterations is reached). PageRank is a popular application of the MapReduce paradigm. We generated random web graphs of different sizes, wrote a CUDA kernel for the PageRank algorithm, ported our serial implementation to PMLIB, and tested it on these graphs. Our CUDA implementation is deliberately simple: the application has a large memory footprint and data sharing between CUDA threads, and we again use only the GPU's global memory, avoiding race conditions among threads of a GPU subtask by means of slow global-memory atomic instructions. We implemented the iterations of the PageRank algorithm using the "callback rebounds" feature of our library, and used the reduce callback to add the contributions to a page's rank from all its inlinks (just as is done in MapReduce). We perform 100 iterations of this algorithm to arrive at the PageRank.
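The rank-transfer iteration described above can be sketched serially as follows. The damping factor 0.85 is the conventional choice (the text does not state the value used), and this simple sketch ignores dangling pages, which silently drop their rank:

```python
def pagerank(outlinks, iterations=100, d=0.85):
    """Iterative PageRank: each round, every page transfers its current
    rank to its outlinks; the per-page sum over incoming contributions is
    what the reduce callback computes in the PMLIB version.

    outlinks[i] is the list of pages that page i links to."""
    n = len(outlinks)
    rank = [1.0 / n] * n                       # uniform initialization
    for _ in range(iterations):
        incoming = [0.0] * n
        for page, targets in enumerate(outlinks):
            if targets:
                share = rank[page] / len(targets)
                for t in targets:              # transfer rank along outlinks
                    incoming[t] += share
        rank = [(1 - d) / n + d * r for r in incoming]
    return rank
```

In the parallel version the pages are partitioned among subtasks, and the `incoming` sums are exactly the values that get combined at the sync point.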
PageRank is a data-intensive application, so communication cost dominates. Our implementation's data compression feature (applied during transfer) significantly reduces this cost. As with FFT, the computation of PageRank is faster on CPU cores than on CUDA devices. However, PageRank's computation-to-communication ratio is better than FFT's, so we obtain performance improvements by parallelizing over multiple hosts. Our parallel implementation is 3.4x faster than the serial implementation for a web size of 30 million pages. The bottleneck in this experiment is always the data transfer time, and a faster network would improve the results.
Zlib Compression
In this experiment, we perform in-memory data compression using zlib. A file is read into memory, and PMLIB is then used to compress different blocks of the file on different processing elements in parallel. We have not yet written (or otherwise obtained) a CUDA kernel for this experiment, so we only report results for parallelizing over CPUs. Using the 8 CPU cores of one host, our parallel program achieves a speedup of 7.63x over the serial implementation while compressing a 1 GB text file.
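The block-parallel structure can be sketched with Python's own zlib bindings. This is a thread-based stand-in for PMLIB's subtasks (the block size and worker count below are arbitrary illustrative choices); each block is compressed and later decompressed independently:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data, block_size=64 * 1024, workers=8):
    """Compress independent blocks of `data` in parallel.

    Threads suffice for this sketch because CPython's zlib releases the
    GIL while compressing; PMLIB uses separate processing elements."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(compressed):
    """Decompress each block separately and rejoin them."""
    return b"".join(zlib.decompress(b) for b in compressed)

data = b"some repetitive text " * 10000
blocks = compress_blocks(data)
restored = decompress_blocks(blocks)
```

Compressing blocks independently sacrifices a little compression ratio (no cross-block history) in exchange for embarrassing parallelism, which is the trade this experiment makes.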
Array Summation
This experiment tests the performance of PMLIB on a linearly parallelizable problem. It repeatedly computes (25 times in our experiments) the following in-place function on each element of an integer array, and then finds the sum of all elements of the resulting array.
f(x) = floor( C * pow( sin(x), pow( cos(x), 1/3 ) ) )
Here, x is an element of the integer array and C is an integer constant.
To parallelize, each subtask applies the function f(x) in parallel to a different part of the input array and then computes its sum. Finally, all subtasks add their results to find the total sum using the reduce callback. The parallel implementation using our library achieves a speedup of around 54x over the serial implementation for an array of 60 million elements when the computation is parallelized across GPU cores. When parallelizing only on CPUs, we obtain a 17x speedup over the serial implementation. The high speedup when using GPUs is primarily due to their relatively high compute power: the GPU subtask uses thousands of simultaneous threads.
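A sketch of this benchmark's structure follows. One caveat: we take absolute values inside `pow` to keep it real-valued when sin(x) or cos(x) is negative; the formula above does not specify how the original code handles that case, so this interpretation is an assumption, as is the value of C:

```python
import math
from functools import reduce

C = 1000  # the integer constant from the text (value assumed)

def f(x):
    """Per-element function; abs() keeps pow() real-valued (assumption)."""
    return math.floor(C * math.pow(abs(math.sin(x)),
                                   math.pow(abs(math.cos(x)), 1.0 / 3.0)))

def parallel_sum(values, num_subtasks, rounds=25):
    """Each subtask applies f repeatedly to its chunk and sums it;
    the reduce step then adds the per-subtask sums."""
    n = len(values)
    step = (n + num_subtasks - 1) // num_subtasks
    chunks = [values[i:i + step] for i in range(0, n, step)]
    def subtask(chunk):
        for _ in range(rounds):
            chunk = [f(x) for x in chunk]   # the repeated in-place transform
        return sum(chunk)                   # local partial sum
    return reduce(lambda a, b: a + b, map(subtask, chunks))
```

Since f is applied element-wise and addition is associative, the result is identical for any number of subtasks, which is what makes the problem linearly parallelizable.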
We are working on enriching our model and library implementation with the following features:
Load Balancing
The processing elements in the compute cluster could be heterogeneous, each having different processing capabilities, and at times a few of them could be overloaded with other work. The conditions on various processing elements can vary dynamically, and the programming model must adjust itself to the changing global environment. Load balancing becomes especially important in the context of CUDA devices, as there can be a striking performance difference (depending on the computation) between various GPU cards, and between CPU and GPU devices.
Determine Best Launch Configuration
Generally, on an 8-core shared memory machine, having more than 8 threads gives better performance than having one thread per CPU core. In certain situations, the converse might be true (e.g., when the data being processed is too large to fit in memory and most of the time is spent thrashing). Similarly, in a distributed memory setup there is a trade-off between processing power and communication cost as the number of machines increases. A similar trade-off exists for CUDA devices.
We need to develop a cost model to effectively estimate the increase in cost of using an additional CPU thread or a heterogeneous processing device versus the potential performance gain. The cost model must also be able to estimate the best launch configuration of different CUDA kernels (organization of threads into blocks and blocks into grids).
Efficient use of Global Memory Hierarchy
We would like the model to automatically determine the best input/output memory partitions (and the extent of each partition) to assign to a particular processing element, with the goal of minimizing the overall execution time. From a cluster's perspective, the model needs to organize memory partitions optimally across heterogeneous hosts. From a host's perspective, the model should do the same across local GPU cards and CPU cores. Finally, inside a GPU card, the model must manage what is stored in its global memory and what is stored in its shared memory. A more aggressive optimization must also take the CPU's cache behavior into account. The overall goal is to distribute memory partitions optimally at all levels of the cluster, keeping the load balanced and increasing computation throughput as much as possible.