Why CrossPy?
Heterogeneous architectures, typically consisting of CPU and GPU-based systems, have become ubiquitous in clusters for scientific computing and machine learning. To harness the power of these architectures, libraries and packages have been developed in Python, the dominant programming language for scientific computing applications. NumPy/SciPy has emerged as a fundamental library for scientific computing on CPU hosts, while CuPy has been developed for GPU accelerators. Although each library is efficient on its target architecture, the challenge of programming with a mix of them is left to the programmer.
Here is a simple example of array addition. On a CPU host, we use NumPy:
import numpy as np
a = np.random.rand(10)
b = np.random.rand(10)
c = a + b
print(c)
[0.69110498 0.67847303 0.71862113 0.85879546 0.82074359 1.63114738
0.77992488 0.85399978 0.39673536 1.49636408]
Simple. By replacing numpy with cupy, we get a single-GPU implementation:
import cupy as cp
with cp.cuda.Device(0):
    ag = cp.random.rand(10)
    bg = cp.random.rand(10)
    cg = ag + bg
print(cg)
[0.80953309 0.34565215 1.18896198 1.08917826 1.27740323 0.88939369
0.7485445 1.19247195 0.87741949 1.41668766]
Simple, too. But what if we want to (or have to, when the array is too large to reside on a single device) make use of multiple devices? Say, half of the computation on a CPU and half on a GPU:
dummy_large_number = 10
# conceptually
a_origin = np.random.rand(dummy_large_number)
b_origin = np.random.rand(dummy_large_number)
a_first_half = a_origin[:dummy_large_number // 2]
b_first_half = b_origin[:dummy_large_number // 2]
c_first_half = a_first_half + b_first_half
with cp.cuda.Device(0):
    a_second_half = cp.asarray(a_origin[dummy_large_number // 2:])
    b_second_half = cp.asarray(b_origin[dummy_large_number // 2:])
    c_second_half = a_second_half + b_second_half
c = np.concatenate((c_first_half, cp.asnumpy(c_second_half)))
print(c)
[1.4225808 1.11218508 0.96152758 1.06182718 1.00971546 0.73793558
0.9275752 1.06724961 0.88184797 1.36970736]
This already looks cumbersome. Similarly, we could use two GPUs with minimal changes to the example above, but the programming complexity would not be reduced, as the sketch below shows.
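For concreteness, here is a rough sketch of what the two-GPU version might look like using plain CuPy (this assumes two CUDA devices are available; the variable names such as a_half_g0 are purely illustrative):

# Conceptual two-GPU version of the example above (still without CrossPy):
# each half of the input is copied to and computed on a different GPU.
half = dummy_large_number // 2
with cp.cuda.Device(0):
    a_half_g0 = cp.asarray(a_origin[:half])
    b_half_g0 = cp.asarray(b_origin[:half])
    c_half_g0 = a_half_g0 + b_half_g0
with cp.cuda.Device(1):
    a_half_g1 = cp.asarray(a_origin[half:])
    b_half_g1 = cp.asarray(b_origin[half:])
    c_half_g1 = a_half_g1 + b_half_g1
# Gather both halves back to the host to assemble the final result.
c = np.concatenate((cp.asnumpy(c_half_g0), cp.asnumpy(c_half_g1)))

The device management, explicit data transfers, and final concatenation all remain the programmer's responsibility.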
Now, let’s see how the same computation is written with CrossPy:
import crosspy as xp
ax = xp.array([a_first_half, a_second_half], axis=0)
bx = xp.array([b_first_half, b_second_half], axis=0)
cx = ax + bx
print(cx)
array([1.4225808 , 1.11218508, 0.96152758, 1.06182718, 1.00971546])@<CPU 0>; array([0.73793558, 0.9275752 , 1.06724961, 0.88184797, 1.36970736])@<CUDA Device 0>
As simple as the first example! CrossPy handles the complexity of cross-device manipulation and thus eliminates the burden of tedious programming. The printed cx is a CrossPy array across two devices, where the range [0:5] is on one and [5:10] is on the other.
print(cx.shape)
print(cx.device)
(10,)
[<CPU 0>, <CUDA Device 0>]
We can also start from the original large arrays and let CrossPy handle the partitioning, without any additional lines of code:
from crosspy import cpu, gpu
ax = xp.array(a_origin, distribution=[cpu(0), gpu(0)], axis=0)
bx = xp.array(b_origin, distribution=[cpu(0), gpu(0)], axis=0)
cx = ax + bx
print(cx)
array([1.4225808 , 1.11218508, 0.96152758, 1.06182718, 1.00971546])@<CPU 0>; array([0.73793558, 0.9275752 , 1.06724961, 0.88184797, 1.36970736])@<CUDA Device 0>
This example shows the power of CrossPy: to migrate a single-device implementation to a multi-device version, we only need to initialize the arrays with CrossPy and specify how we want them to be distributed, as sketched below.
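As a brief illustration reusing only the constructs shown above, the same element-wise expression can run unchanged whether the inputs are NumPy arrays or CrossPy arrays; only the array construction differs. The helper name add_arrays is hypothetical and used here only to make the point:

# Illustrative sketch: the computation is identical in both cases;
# only how the arrays are created changes.
def add_arrays(a, b):
    return a + b

# Single-device version: plain NumPy arrays.
c_single = add_arrays(a_origin, b_origin)

# Multi-device version: the same arrays, distributed across a CPU and a GPU.
c_multi = add_arrays(
    xp.array(a_origin, distribution=[cpu(0), gpu(0)], axis=0),
    xp.array(b_origin, distribution=[cpu(0), gpu(0)], axis=0),
)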
Here we highlight an incomplete list of CrossPy features:
More details can be found in this documentation.