The NVECTOR_CUDA module is an NVECTOR implementation in the CUDA language. The module allows for SUNDIALS vector kernels to run on NVIDIA GPU devices. It is intended for users who are already familiar with CUDA and GPU programming. Building this vector module requires a CUDA compiler and, by extension, a C++ compiler. The vector content layout is as follows:
struct _N_VectorContent_Cuda
{
  sunindextype length;
  booleantype own_data;
  realtype* host_data;
  realtype* device_data;
  SUNCudaExecPolicy* stream_exec_policy;
  SUNCudaExecPolicy* reduce_exec_policy;
  void* priv; /* 'private' data */
};

typedef struct _N_VectorContent_Cuda *N_VectorContent_Cuda;
The content members are the vector length (size), a boolean flag that signals whether the vector owns the data (i.e. is in charge of freeing it), pointers to the vector data on the host and on the device, pointers to SUNCudaExecPolicy implementations that control how the CUDA kernels are launched for streaming and reduction vector kernels, and a private data structure which holds additional members that should not be accessed directly.
When instantiated with N_VNew_Cuda, memory for the vector data is allocated on both the host and the device. Alternatively, a user can provide host and device data arrays by using the N_VMake_Cuda constructor. To use CUDA managed memory, the constructors N_VNewManaged_Cuda and N_VMakeManaged_Cuda are provided. Details on each of these constructors are provided below.
To use the NVECTOR_CUDA module, include nvector_cuda.h and link to the library libsundials_nveccuda.lib. The extension .lib is typically .so for shared libraries and .a for static libraries.
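As a rough illustration, compiling a program against the module might look like the following (a sketch only: the installation paths, the my_program.cu file name, and any additional SUNDIALS package library, such as the CVODE library shown here, are assumptions that depend on your setup):

nvcc my_program.cu -o my_program \
    -I/path/to/sundials/include \
    -L/path/to/sundials/lib \
    -lsundials_cvode -lsundials_nveccuda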
Unlike other native SUNDIALS vector types, the NVECTOR_CUDA module does not provide macros to access its member variables. Instead, users should use the accessor functions:

N_VGetHostArrayPointer_Cuda(N_Vector v)
This function returns a pointer to the vector data on the host.

N_VGetDeviceArrayPointer_Cuda(N_Vector v)
This function returns a pointer to the vector data on the device.

N_VIsManagedMemory_Cuda(N_Vector v)
This function returns a boolean flag indicating whether the vector data array is in managed memory.
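For example, given a CUDA vector v, the accessors might be used as follows (a minimal sketch):

/* host pointer, for host-side access to the data */
realtype *h_data = N_VGetHostArrayPointer_Cuda(v);

/* device pointer, e.g. for passing to a user-written CUDA kernel */
realtype *d_data = N_VGetDeviceArrayPointer_Cuda(v);

/* flag indicating whether the data array is in managed memory */
booleantype managed = N_VIsManagedMemory_Cuda(v);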
The NVECTOR_CUDA module defines implementations of all standard vector operations defined in the sections Description of the NVECTOR operations, Description of the NVECTOR fused operations, Description of the NVECTOR vector array operations, and Description of the NVECTOR local reduction operations, except for N_VSetArrayPointer and, if using unmanaged memory, N_VGetArrayPointer. As such, this vector can only be used with the SUNDIALS Fortran interfaces, and with the SUNDIALS direct solvers and preconditioners, when using managed memory.
The NVECTOR_CUDA module provides separate functions to access data on the host
and on the device for the unmanaged memory use case. It also provides methods for
copying from the host to the device and vice versa. Usage examples of NVECTOR_CUDA
are provided in example programs for CVODE [HSR2017].
The names of vector operations are obtained from those in the sections Description of the NVECTOR operations, Description of the NVECTOR fused operations, Description of the NVECTOR vector array operations, and Description of the NVECTOR local reduction operations by appending the suffix _Cuda (e.g. N_VDestroy_Cuda). The module NVECTOR_CUDA provides the following additional user-callable routines:
N_VNew_Cuda(sunindextype length)
This function creates and allocates memory for a CUDA N_Vector. The vector data array is allocated on both the host and the device.

N_VNewManaged_Cuda(sunindextype vec_length)
This function creates and allocates memory for a CUDA N_Vector. The vector data array is allocated in managed memory.

N_VNewEmpty_Cuda(sunindextype vec_length)
This function creates a new N_Vector wrapper with the pointer to the wrapped CUDA vector set to NULL. It is used by the N_VNew_Cuda(), N_VMake_Cuda(), and N_VClone_Cuda() implementations.

N_VMake_Cuda(sunindextype vec_length, realtype *h_vdata, realtype *d_vdata)
This function creates a CUDA N_Vector with user-supplied vector data arrays for the host and the device.

N_VMakeManaged_Cuda(sunindextype vec_length, realtype *vdata)
This function creates a CUDA N_Vector with a user-supplied managed memory data array.

N_VMakeWithManagedAllocator_Cuda(sunindextype length, void* (*allocfn)(size_t size), void (*freefn)(void* ptr))
This function creates a CUDA N_Vector with a user-supplied memory allocator. It requires the user to provide a corresponding free function as well. The memory allocated by the allocator function must behave like CUDA managed memory.
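As a brief sketch of how these constructors might be used (the wrapper function, the error handling, and the cleanup order are illustrative, and the header shown above is assumed to be on the include path):

#include <stdlib.h>
#include <cuda_runtime.h>
#include <nvector_cuda.h>

void construct_examples(sunindextype length)
{
  /* allocate vector data on both the host and the device */
  N_Vector u = N_VNew_Cuda(length);

  /* wrap user-supplied host and device arrays */
  realtype *h_data = (realtype*) malloc(length * sizeof(realtype));
  realtype *d_data;
  cudaMalloc((void**) &d_data, length * sizeof(realtype));
  N_Vector v = N_VMake_Cuda(length, h_data, d_data);

  /* allocate vector data in CUDA managed memory */
  N_Vector w = N_VNewManaged_Cuda(length);

  /* ... use the vectors ... */

  N_VDestroy(u);
  N_VDestroy(v); /* the user-supplied arrays are not freed by the vector */
  N_VDestroy(w);
  free(h_data);
  cudaFree(d_data);
}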
The module NVECTOR_CUDA also provides the following user-callable routines:
void N_VSetKernelExecPolicy_Cuda(N_Vector v,
                                 SUNCudaExecPolicy* stream_exec_policy,
                                 SUNCudaExecPolicy* reduce_exec_policy)
This function sets the execution policies which control the kernel parameters utilized when launching the streaming and reduction CUDA kernels. By default the vector is set up to use the SUNCudaThreadDirectExecPolicy and SUNCudaBlockReduceExecPolicy. Any custom execution policy for reductions must ensure that the grid dimension (number of thread blocks) is a multiple of the CUDA warp size (32). See the section The SUNCudaExecPolicy Class below for more information about the SUNCudaExecPolicy class.
Note: All vectors used in a single instance of a SUNDIALS solver must use the same execution policy. It is strongly recommended that this function is called immediately after constructing the vector, and that any subsequent vectors be created by cloning, to ensure consistent execution policies across vectors.
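A sketch of this recommended pattern (the block size of 256 is an arbitrary choice, and the policy objects are assumed to remain valid for as long as the vectors use them):

/* construct the vector, then immediately set the execution policies */
N_Vector v = N_VNew_Cuda(length);
SUNCudaThreadDirectExecPolicy stream_policy(256);
SUNCudaBlockReduceExecPolicy reduce_policy(256);
N_VSetKernelExecPolicy_Cuda(v, &stream_policy, &reduce_policy);

/* create all further vectors by cloning so they share the same policies */
N_Vector w = N_VClone(v);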
N_VSetCudaStream_Cuda(N_Vector v, cudaStream_t *stream)
DEPRECATED This function will be removed in the next major release; users should utilize the N_VSetKernelExecPolicy_Cuda function instead. This function sets the CUDA stream that all vector kernels will be launched on. By default an NVECTOR_CUDA uses the default CUDA stream.

Note: All vectors used in a single instance of a SUNDIALS solver must use the same CUDA stream. It is strongly recommended that this function is called immediately after constructing the vector, and that any subsequent vectors be created by cloning, to ensure consistent execution policies across vectors.
N_VCopyToDevice_Cuda(N_Vector v)
This function copies host vector data to the device.

N_VCopyFromDevice_Cuda(N_Vector v)
This function copies vector data from the device to the host.

N_VPrint_Cuda(N_Vector v)
This function prints the content of a CUDA vector to stdout.

N_VPrintFile_Cuda(N_Vector v, FILE *outfile)
This function prints the content of a CUDA vector to outfile.
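These routines combine into the typical unmanaged-memory workflow, sketched below (length is assumed to be defined, and the initialization value is arbitrary):

N_Vector v = N_VNew_Cuda(length);
realtype *h_data = N_VGetHostArrayPointer_Cuda(v);

/* initialize the data on the host, then push it to the device */
for (sunindextype i = 0; i < length; i++) h_data[i] = 1.0;
N_VCopyToDevice_Cuda(v);

/* ... vector kernels operate on the device data ... */

/* pull the results back to the host before printing */
N_VCopyFromDevice_Cuda(v);
N_VPrint_Cuda(v);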
By default all fused and vector array operations are disabled in the NVECTOR_CUDA module. The following additional user-callable routines are provided to enable or disable fused and vector array operations for a specific vector. To ensure consistency across vectors it is recommended to first create a vector with N_VNew_Cuda(), enable/disable the desired operations for that vector with the functions below, and create any additional vectors from that vector using N_VClone(). This guarantees the new vectors will have the same operations enabled/disabled, since cloned vectors inherit the same enable/disable options as the vector they are cloned from, while vectors created with N_VNew_Cuda() will have the default settings for the NVECTOR_CUDA module.
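For example, a sketch of this pattern, using the first routine described below:

N_Vector v = N_VNew_Cuda(length);

/* enable all fused operations on the template vector */
N_VEnableFusedOps_Cuda(v, SUNTRUE);

/* clones inherit the same enabled/disabled operations */
N_Vector w = N_VClone(v);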
N_VEnableFusedOps_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) all fused and vector array operations in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableLinearCombination_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the linear combination fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableScaleAddMulti_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the scale and add a vector to multiple vectors fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableDotProdMulti_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the multiple dot products fused operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableLinearSumVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the linear sum operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableScaleVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the scale operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableConstVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the const operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableWrmsNormVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the WRMS norm operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableWrmsNormMaskVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the masked WRMS norm operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableScaleAddMultiVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the scale and add a vector array to multiple vector arrays operation in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.

N_VEnableLinearCombinationVectorArray_Cuda(N_Vector v, booleantype tf)
This function enables (SUNTRUE) or disables (SUNFALSE) the linear combination operation for vector arrays in the CUDA vector. The return value is 0 for success and -1 if the input vector or its ops structure are NULL.
Notes

- To access individual components of an N_Vector_Cuda vector v, it is recommended to use the functions N_VGetDeviceArrayPointer_Cuda() or N_VGetHostArrayPointer_Cuda(). However, when using managed memory, the function N_VGetArrayPointer() may also be used.

- Vector operations in NVECTOR_CUDA that take more than one N_Vector argument do not check for consistent internal representations of these vectors. It is the user's responsibility to ensure that such routines are called with N_Vector arguments that were all created with the same internal representations.

The SUNCudaExecPolicy Class

In order to provide maximum flexibility to users, the CUDA kernel execution parameters used by kernels within SUNDIALS are defined by objects of the sundials::CudaExecPolicy abstract class type (this class can be accessed in the global namespace as SUNCudaExecPolicy). Thus, users may provide custom execution policies that fit the needs of their problem. The sundials::CudaExecPolicy class is defined in the header file sundials_cuda_policies.hpp as follows:
class CudaExecPolicy
{
public:
  virtual size_t gridSize(size_t numWorkUnits = 0, size_t blockDim = 0) const = 0;
  virtual size_t blockSize(size_t numWorkUnits = 0, size_t gridDim = 0) const = 0;
  virtual cudaStream_t stream() const = 0;
  virtual CudaExecPolicy* clone() const = 0;
  virtual ~CudaExecPolicy() {}
};
To define a custom execution policy, a user simply needs to create a class that inherits from the abstract class and implements the methods. The SUNDIALS-provided sundials::CudaThreadDirectExecPolicy class (aka in the global namespace as SUNCudaThreadDirectExecPolicy) is a good example of what a custom execution policy may look like:
class CudaThreadDirectExecPolicy : public CudaExecPolicy
{
public:
  CudaThreadDirectExecPolicy(const size_t blockDim, const cudaStream_t stream = 0)
    : blockDim_(blockDim), stream_(stream)
  {}

  CudaThreadDirectExecPolicy(const CudaThreadDirectExecPolicy& ex)
    : blockDim_(ex.blockDim_), stream_(ex.stream_)
  {}

  virtual size_t gridSize(size_t numWorkUnits = 0, size_t blockDim = 0) const
  {
    return (numWorkUnits + blockSize() - 1) / blockSize();
  }

  virtual size_t blockSize(size_t numWorkUnits = 0, size_t gridDim = 0) const
  {
    return blockDim_;
  }

  virtual cudaStream_t stream() const
  {
    return stream_;
  }

  virtual CudaExecPolicy* clone() const
  {
    return static_cast<CudaExecPolicy*>(new CudaThreadDirectExecPolicy(*this));
  }

private:
  const cudaStream_t stream_;
  const size_t blockDim_;
};
In total, SUNDIALS provides 3 execution policies:

- SUNCudaThreadDirectExecPolicy(const size_t blockDim, const cudaStream_t stream = 0) maps each CUDA thread to a work unit. The number of threads per block (blockDim) can be set to anything. The grid size will be calculated so that there are enough threads for one thread per element. If a CUDA stream is provided, it will be used to execute the kernel.

- SUNCudaGridStrideExecPolicy(const size_t blockDim, const size_t gridDim, const cudaStream_t stream = 0) is for kernels that use grid stride loops. The number of threads per block (blockDim) can be set to anything. The number of blocks (gridDim) can be set to anything. If a CUDA stream is provided, it will be used to execute the kernel.

- SUNCudaBlockReduceExecPolicy(const size_t blockDim, const size_t gridDim = 0, const cudaStream_t stream = 0) is for kernels performing a reduction across individual thread blocks. The number of threads per block (blockDim) can be set to any valid multiple of the CUDA warp size. The grid size (gridDim) can be set to any value greater than 0. If it is set to 0, then the grid size will be chosen so that there are enough threads for one thread per work unit. If a CUDA stream is provided, it will be used to execute the kernel.

For example, a policy that uses 128 threads per block and a user-provided stream can be created like so:
cudaStream_t stream;
cudaStreamCreate(&stream);
SUNCudaThreadDirectExecPolicy thread_direct(128, stream);
These default policy objects can be reused for multiple SUNDIALS data structures (e.g. a SUNMatrix and an N_Vector) since they do not hold any modifiable state information.
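For instance, a single pair of policy objects might be shared across several vectors (a sketch; it reuses the stream created above, assumes the policy objects remain valid for as long as the vectors use them, and passes a gridDim of 0 to request the default grid size for reductions):

SUNCudaThreadDirectExecPolicy stream_policy(128, stream);
SUNCudaBlockReduceExecPolicy reduce_policy(128, 0, stream);

/* x and y are existing CUDA vectors; both now share one set of policies */
N_VSetKernelExecPolicy_Cuda(x, &stream_policy, &reduce_policy);
N_VSetKernelExecPolicy_Cuda(y, &stream_policy, &reduce_policy);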