The NVECTOR_CUDA Module

The NVECTOR_CUDA module is an NVECTOR implementation in the CUDA language. The module allows for SUNDIALS vector kernels to run on NVIDIA GPU devices. It is intended for users who are already familiar with CUDA and GPU programming. Building this vector module requires a CUDA compiler and, by extension, a C++ compiler. The vector content layout is as follows:

struct _N_VectorContent_Cuda
{
  sunindextype       length;
  booleantype        own_data;
  realtype*          host_data;
  realtype*          device_data;
  SUNCudaExecPolicy* stream_exec_policy;
  SUNCudaExecPolicy* reduce_exec_policy;
  void*              priv; /* 'private' data */
};

typedef struct _N_VectorContent_Cuda *N_VectorContent_Cuda;

The content members are the vector length (size), a boolean flag that signals if the vector owns the data (i.e. it is in charge of freeing the data), pointers to vector data on the host and the device, pointers to SUNCudaExecPolicy implementations that control how the CUDA kernels are launched for streaming and reduction vector kernels, and a private data structure which holds additonal members that should not be accessed directly.

When instantiated with N_VNew_Cuda, the underlying data will be allocated memory on both the host and the device. Alternatively, a user can provide host and device data arrays by using the N_VMake_Cuda constructor. To use CUDA managed memory, the constructors N_VNewManaged_Cuda and N_VMakeManaged_Cuda are provided. Details on each of these constructors are provided below.

To use the NVECTOR_CUDA module, include nvector_cuda.h and link to the library libsundials_nveccuda.lib. The extension, .lib, is typically .so for shared libraries and .a for static libraries.

NVECTOR_CUDA functions

Unlike other native SUNDIALS vector types, the NVECTOR_CUDA module does not provide macros to access its member variables. Instead, user should use the accessor functions:

realtype* N_VGetHostArrayPointer_Cuda(N_Vector v)

This function returns pointer to the vector data on the host.

realtype* N_VGetDeviceArrayPointer_Cuda(N_Vector v)

This function returns pointer to the vector data on the device.

booleantype N_VIsManagedMemory_Cuda(N_Vector v)

This function returns a boolean flag indiciating if the vector data array is in managed memory or not.

The NVECTOR_CUDA module defines implementations of all standard vector operations defined in the sections Description of the NVECTOR operations, Description of the NVECTOR fused operations, Description of the NVECTOR vector array operations, and Description of the NVECTOR local reduction operations, except for N_VSetArrayPointer, and, if using unmanaged memory, N_VGetArrayPointer. As such, this vector can only be used with SUNDIALS Fortran interfaces, and the SUNDIALS direct solvers and preconditioners when using managed memory. The NVECTOR_CUDA module provides separate functions to access data on the host and on the device for the unmanaged memory use case. It also provides methods for copying from the host to the device and vice versa. Usage examples of NVECTOR_CUDA are provided in example programs for CVODE [HSR2017].

The names of vector operations are obtained from those in the sections Description of the NVECTOR operations, Description of the NVECTOR fused operations, Description of the NVECTOR vector array operations, and Description of the NVECTOR local reduction operations by appending the suffix _Cuda (e.g. N_VDestroy_Cuda). The module NVECTOR_CUDA provides the following additional user-callable routines:

N_Vector N_VNew_Cuda(sunindextype length)

This function creates and allocates memory for a CUDA N_Vector. The vector data array is allocated on both the host and device.

N_Vector N_VNewManaged_Cuda(sunindextype vec_length)

This function creates and allocates memory for a CUDA N_Vector. The vector data array is allocated in managed memory.

N_Vector N_VNewEmpty_Cuda(sunindextype vec_length)

This function creates a new N_Vector wrapper with the pointer to the wrapped CUDA vector set to NULL. It is used by N_VNew_Cuda(), N_VMake_Cuda(), and N_VClone_Cuda() implementations.

N_Vector N_VMake_Cuda(sunindextype vec_length, realtype *h_vdata, realtype *d_vdata)

This function creates a CUDA N_Vector with user-supplied vector data arrays for the host and the device.

N_Vector N_VMakeManaged_Cuda(sunindextype vec_length, realtype *vdata)

This function creates a CUDA N_Vector with a user-supplied managed memory data array.

N_Vector N_VMakeWithManagedAllocator_Cuda(sunindextype length, void* (*allocfn)(size_t size), void (*freefn)(void* ptr))

This function creates a CUDA N_Vector with a user-supplied memory allocator. It requires the user to provide a corresponding free function as well. The memory allocated by the allocator function must behave like CUDA managed memory.

The module NVECTOR_CUDA also provides the following user-callable routines:

void N_VSetKernelExecPolicy_Cuda(N_Vector v,
SUNCudaExecPolicy* stream_exec_policy,
SUNCudaExecPolicy* reduce_exec_policy)

This function sets the execution policies which control the kernel parameters utilized when launching the streaming and reduction CUDA kernels. By default the vector is setup to use the SUNCudaThreadDirectExecPolicy and SUNCudaBlockReduceExecPolicy. Any custom execution policy for reductions must ensure that the grid dimensions (number of thread blocks) is a multiple of the CUDA warp size (32). See section The SUNCudaExecPolicy Class below for more information about the SUNCudaExecPolicy class.

Note: All vectors used in a single instance of a {sundials} solver must use the same execution policy. It is **strongly recommended* that this function is called immediately after constructing the vector, and any subsequent vector be created by cloning to ensure consistent execution policies across vectors*

void N_VSetCudaStream_Cuda(N_Vector v, cudaStream_t *stream)

DEPRECATED This function will be removed in the next major release, user should utilize the N_VSetKernelExecPolicy_Cuda function instead.

This function sets the CUDA stream that all vector kernels will be launched on. By default an NVECTOR_CUDA uses the default CUDA stream.

Note: All vectors used in a single instance of a {sundials} solver must use the same CUDA stream. It is **strongly recommended* that this function is called immediately after constructing the vector, and any subsequent vector be created by cloning to ensure consistent execution policies across vectors*

realtype* N_VCopyToDevice_Cuda(N_Vector v)

This function copies host vector data to the device.

realtype* N_VCopyFromDevice_Cuda(N_Vector v)

This function copies vector data from the device to the host.

void N_VPrint_Cuda(N_Vector v)

This function prints the content of a CUDA vector to stdout.

void N_VPrintFile_Cuda(N_Vector v, FILE *outfile)

This function prints the content of a CUDA vector to outfile.

By default all fused and vector array operations are disabled in the NVECTOR_CUDA module. The following additional user-callable routines are provided to enable or disable fused and vector array operations for a specific vector. To ensure consistency across vectors it is recommended to first create a vector with N_VNew_Cuda(), enable/disable the desired operations for that vector with the functions below, and create any additional vectors from that vector using N_VClone(). This guarantees the new vectors will have the same operations enabled/disabled as cloned vectors inherit the same enable/disable options as the vector they are cloned from while vectors created with N_VNew_Cuda() will have the default settings for the NVECTOR_CUDA module.

int N_VEnableFusedOps_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) all fused and vector array operations in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableLinearCombination_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the linear combination fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableScaleAddMulti_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the scale and add a vector to multiple vectors fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableDotProdMulti_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the multiple dot products fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableLinearSumVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the linear sum operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableScaleVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the scale operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableConstVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the const operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableWrmsNormVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the WRMS norm operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableWrmsNormMaskVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the masked WRMS norm operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableScaleAddMultiVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the scale and add a vector array to multiple vector arrays operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

int N_VEnableLinearCombinationVectorArray_Cuda(N_Vector v, booleantype tf)

This function enables (SUNTRUE) or disables (SUNFALSE) the linear combination operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

Notes

  • When there is a need to access components of an N_Vector_Cuda, v, it is recommeded to use functions N_VGetDeviceArrayPointer_Cuda() or N_VGetHostArrayPointer_Cuda(). However, when using managed memory, the function N_VGetArrayPointer() may also be used.
  • To maximize efficiency, vector operations in the NVECTOR_CUDA implementation that have more than one N_Vector argument do not check for consistent internal representations of these vectors. It is the user’s responsibility to ensure that such routines are called with N_Vector arguments that were all created with the same internal representations.

The SUNCudaExecPolicy Class

In order to provide maximum flexibility to users, the CUDA kernel execution parameters used by kernels within SUNDIALS are defined by objects of the sundials::CudaExecPolicy abstract class type (this class can be accessed in the global namespace as SUNCudaExecPolicy). Thus, users may provide custom execution policies that fit the needs of their problem. The sundials::CudaExecPolicy is defined in the header file sundials_cuda_policies.hpp, as follows:

class CudaExecPolicy
{
public:
   virtual size_t gridSize(size_t numWorkUnits = 0, size_t blockDim = 0) const = 0;
   virtual size_t blockSize(size_t numWorkUnits = 0, size_t gridDim = 0) const = 0;
   virtual cudaStream_t stream() const = 0;
   virtual CudaExecPolicy* clone() const = 0;
   virtual ~CudaExecPolicy() {}
};

To define a custom execution policy, a user simply needs to create a class that inherits from the abstract class and implements the methods. The SUNDIALS provided sundials::CudaThreadDirectExecPolicy (aka in the global namespace as SUNCudaThreadDirectExecPolicy) class is a good example of a what a custom execution policy may look like:

class CudaThreadDirectExecPolicy : public CudaExecPolicy
{
public:
   CudaThreadDirectExecPolicy(const size_t blockDim, const cudaStream_t stream = 0)
      : blockDim_(blockDim), stream_(stream)
   {}

   CudaThreadDirectExecPolicy(const CudaThreadDirectExecPolicy& ex)
      : blockDim_(ex.blockDim_), stream_(ex.stream_)
   {}

   virtual size_t gridSize(size_t numWorkUnits = 0, size_t blockDim = 0) const
   {
      return (numWorkUnits + blockSize() - 1) / blockSize();
   }

   virtual size_t blockSize(size_t numWorkUnits = 0, size_t gridDim = 0) const
   {
      return blockDim_;
   }

   virtual cudaStream_t stream() const
   {
      return stream_;
   }

   virtual CudaExecPolicy* clone() const
   {
      return static_cast<CudaExecPolicy*>(new CudaThreadDirectExecPolicy(*this));
   }

private:
   const cudaStream_t stream_;
   const size_t blockDim_;
};

In total, SUNDIALS provides 3 execution policies:

  1. SUNCudaThreadDirectExecPolicy(const size_t blockDim, const cudaStream_t stream = 0) maps each CUDA thread to a work unit. The number of threads per block (blockDim) can be set to anything. The grid size will be calculated so that there are enough threads for one thread per element. If a CUDA stream is provided, it will be used to execute the kernel.
  2. SUNCudaGridStrideExecPolicy(const size_t blockDim, const size_t gridDim, const cudaStream_t stream = 0) is for kernels that use grid stride loops. The number of threads per block (blockDim) can be set to anything. The number of blocks (gridDim) can be set to anything. If a CUDA stream is provided, it will be used to execute the kernel.
  3. SUNCudaBlockReduceExecPolicy(const size_t blockDim, const cudaStream_t stream = 0) is for kernels performing a reduction across indvidual thread blocks. The number of threads per block (blockDim) can be set to any valid multiple of the CUDA warp size. The grid size (gridDim) can be set to any value greater than 0. If it is set to 0, then the grid size will be chosen so that there is enough threads for one thread per work unit. If a CUDA stream is provided, it will be used to execute the kernel.

For example, a policy that uses 128 threads per block and a user provided stream can be created like so:

cudaStream_t stream;
cudaStreamCreate(&stream);
SUNCudaThreadDirectExecPolicy thread_direct(128, stream);

These default policy objects can be reused for multiple SUNDIALS data structures (e.g. a SUNMatrix and an N_Vector) since they do not hold any modifiable state information.