Why CrossPy?

Heterogeneous architectures, typically combining CPU- and GPU-based systems, have become ubiquitous in clusters for scientific computing and machine learning. To harness these architectures, libraries and packages have been developed in Python, the dominant programming language for scientific computing applications. NumPy/SciPy has emerged as a fundamental library for scientific computing on CPU hosts, while CuPy has been developed for GPU accelerators. Although each library works efficiently on its target architecture, the challenge of programming with a mix of them is left to the programmer.

Here is a simple example of array addition. On a CPU host, we use NumPy:

import numpy as np

a = np.random.rand(10)
b = np.random.rand(10)
c = a + b
print(c)
[0.69110498 0.67847303 0.71862113 0.85879546 0.82074359 1.63114738
 0.77992488 0.85399978 0.39673536 1.49636408]

Simple. By replacing numpy with cupy, we get a single-GPU implementation:

import cupy as cp

with cp.cuda.Device(0):
    ag = cp.random.rand(10)
    bg = cp.random.rand(10)
    cg = ag + bg
print(cg)
[0.80953309 0.34565215 1.18896198 1.08917826 1.27740323 0.88939369
 0.7485445  1.19247195 0.87741949 1.41668766]

Simple, too. But what if we want to (or have to, when the array is too large to reside on a single device) make use of multiple devices? Say, half of the computation on a CPU and half on a GPU:

dummy_large_number = 10  # stand-in for a large array size

# conceptually, one large array that we split across devices
a_origin = np.random.rand(dummy_large_number)
b_origin = np.random.rand(dummy_large_number)

# first half stays on the CPU
a_first_half = a_origin[:dummy_large_number // 2]
b_first_half = b_origin[:dummy_large_number // 2]
c_first_half = a_first_half + b_first_half

# second half is copied to the GPU and computed there
with cp.cuda.Device(0):
    a_second_half = cp.asarray(a_origin[dummy_large_number // 2:])
    b_second_half = cp.asarray(b_origin[dummy_large_number // 2:])
    c_second_half = a_second_half + b_second_half

# gather both halves back on the host
c = np.concatenate((c_first_half, cp.asnumpy(c_second_half)))
print(c)
[1.4225808  1.11218508 0.96152758 1.06182718 1.00971546 0.73793558
 0.9275752  1.06724961 0.88184797 1.36970736]
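
As a quick sanity check, the stitched result should match the plain single-device computation (continuing from the snippet above):

# check that the two-device result agrees with the CPU-only computation
np.testing.assert_allclose(c, a_origin + b_origin)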

Already cumbersome. We could similarly use two GPUs with only minimal changes to the example above, but the programming complexity does not go away.
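
For reference, here is a minimal sketch of that two-GPU variant, assuming a second GPU is visible as device 1 and continuing from a_origin and b_origin above:

# first half on GPU 0
with cp.cuda.Device(0):
    a_first_half_g = cp.asarray(a_origin[:dummy_large_number // 2])
    b_first_half_g = cp.asarray(b_origin[:dummy_large_number // 2])
    c_first_half_g = a_first_half_g + b_first_half_g

# second half on GPU 1
with cp.cuda.Device(1):
    a_second_half_g = cp.asarray(a_origin[dummy_large_number // 2:])
    b_second_half_g = cp.asarray(b_origin[dummy_large_number // 2:])
    c_second_half_g = a_second_half_g + b_second_half_g

# gather both halves back on the host
c = np.concatenate((cp.asnumpy(c_first_half_g), cp.asnumpy(c_second_half_g)))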

Now, let’s see how the same computation looks with CrossPy:

import crosspy as xp

ax = xp.array([a_first_half, a_second_half], axis=0)
bx = xp.array([b_first_half, b_second_half], axis=0)
cx = ax + bx
print(cx)
array([1.4225808 , 1.11218508, 0.96152758, 1.06182718, 1.00971546])@<CPU 0>; array([0.73793558, 0.9275752 , 1.06724961, 0.88184797, 1.36970736])@<CUDA Device 0>

As simple as the first example! CrossPy handles the complexity of cross-device manipulation and thus eliminates the burden of tedious programming. The printed cx is a CrossPy array spanning two devices, with the range [0:5] on the CPU and [5:10] on the GPU.

print(cx.shape)
print(cx.device)
(10,)
[<CPU 0>, <CUDA Device 0>]

We can also start from the original large array and let CrossPy handle the partitioning, without adding any lines of code:

from crosspy import cpu, gpu

ax = xp.array(a_origin, distribution=[cpu(0), gpu(0)], axis=0)
bx = xp.array(b_origin, distribution=[cpu(0), gpu(0)], axis=0)
cx = ax + bx
print(cx)
array([1.4225808 , 1.11218508, 0.96152758, 1.06182718, 1.00971546])@<CPU 0>; array([0.73793558, 0.9275752 , 1.06724961, 0.88184797, 1.36970736])@<CUDA Device 0>

This example shows the power of CrossPy: to migrate a single-device implementation to a multi-device version, we only need to construct the arrays with CrossPy and specify how we want them distributed.
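
Because the distribution is just a parameter, targeting a different set of devices is a one-line change. For example, here is a sketch for two GPUs, assuming a second GPU is available and that gpu(1) names it analogously to gpu(0) above:

# same construction as before, but distributed across two GPUs instead of CPU + GPU
ax = xp.array(a_origin, distribution=[gpu(0), gpu(1)], axis=0)
bx = xp.array(b_origin, distribution=[gpu(0), gpu(1)], axis=0)
cx = ax + bx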

The examples above highlight only a few of CrossPy's features; more details can be found throughout this documentation.