BATCHED – KokkosKernels batched host-level interfaces
BatchedGemm
-
template<typename ArgTransA, typename ArgTransB, typename ArgBatchSzDim, typename BatchedGemmHandleType, typename ScalarType, typename AViewType, typename BViewType, typename CViewType>
inline int KokkosBatched::BatchedGemm(BatchedGemmHandleType *const handle, const ScalarType alpha, const AViewType &A, const BViewType &B, const ScalarType beta, const CViewType &C) Non-blocking solve of general matrix multiply on a batch of uniform matrices.
Note: If a TPL is selected, this interface follows the blocking behavior (either blocking or non-blocking) of the TPL vendor’s API.
Note: To leverage SIMD instructions, 4-rank views must be selected via the template parameters documented below.
Usage Example: BatchedGemm<ArgTransA, ArgTransB, ArgBatchSzDim>(handle, alpha, A, B, beta, C);C = alpha * op(A) * op(B) + beta * C
- Template Parameters:
ArgTransA – Specifies what op does to A:
Trans::NoTranspose for non-transpose Trans::Transpose for transpose Trans::ConjTranspose for conjugate transpose
ArgTransB – Specifies what op does to B:
Trans::NoTranspose for non-transpose Trans::Transpose for transpose Trans::ConjTranspose for conjugate transpose
ArgBatchSzDim – Specifies where the batch dimension is allocated in
AViewType, BViewType, and CViewType: BatchLayout::Left Batch dimension is leftmost BatchLayout::Right Batch dimension is rightmost
ScalarType – Specifies the scalar type of alpha and beta
AViewType – Input matrix, as either a 3-rank Kokkos::View or a 4-rank Kokkos::View for SIMD operations.
BViewType – Input matrix, as either a 3-rank Kokkos::View or a 4-rank Kokkos::View for SIMD operations.
CViewType – Input(RHS)/Output(LHS) matrix, as either a 3-rank Kokkos::View or a 4-rank Kokkos::View for SIMD operations.
- Parameters:
handle – [in] A handle which specifies how to invoke the batched gemm. See struct BatchedGemmHandle for details.
alpha – [in] Input coefficient used for multiplication with A
A – [in] Input matrix, as a 3-rank Kokkos::View
If ArgBatchSzDim == "BatchLayout::Right", matrix A is MxKxB If ArgBatchSzDim == "BatchLayout::Left", matrix A is BxMxK
B – [in] Input matrix, as a 3-rank Kokkos::View
If ArgBatchSzDim == "BatchLayout::Right", matrix B is KxNxB If ArgBatchSzDim == "BatchLayout::Left", matrix B is BxKxN
beta – [in] Input coefficient used for multiplication with C
C – [in/out] Input/Output matrix, as a 3-rank Kokkos::View
If ArgBatchSzDim == "BatchLayout::Right", matrix C is MxNxB If ArgBatchSzDim == "BatchLayout::Left", matrix C is BxMxN
- Returns:
0 upon success, non-zero otherwise
-
class BatchedGemmHandle : public KokkosBatched::BatchedKernelHandle
Handle for selecting runtime behavior of the BatchedGemm interface.
- Param kernelAlgoType:
Specifies which algorithm to use for invocation (default, SQUARE).
Specifies whether to select optimal invocations based on inputs and heuristics: SQUARE select invocations based on square matrix heuristics where M=N TALL select invocations based on tall matrix heuristics where M>N WIDE select invocations based on wide matrix heuristics where M<N Specifies which cmake-enabled TPL algorithm to invoke: ARMPL Invoke the ArmPL TPL interface (Currently UNSUPPORTED) MKL Invoke the MKL TPL interface (Currently UNSUPPORTED) CUBLAS Invoke the CuBLAS TPL interface (Currently UNSUPPORTED) MAGMA Invoke the Magma TPL interface (Currently UNSUPPORTED) Note: Requires that input views for A, B, and C reside on either host or device depending on the TPL selected. Note: If the user selects a TPL, an error will be thrown if: 1. The TPL is not enabled via cmake 2. The input views do not reside on the host/device as needed Specifies which kokkos-kernels (KK) algorithm to invoke: KK_SERIAL Invoke SerialGemm via RangePolicy(BatchSz) KK_TEAM Invoke TeamGemm via TeamPolicy(BatchSz) KK_TEAMVECTOR Invoke TeamVectorGemm via TeamPolicy(BatchSz) KK_SERIALSIMD Invoke SerialGemm via TeamPolicy(BatchSz) KK_TEAMSIMD Invoke TeamGemm via TeamPolicy(BatchSz) KK_SERIAL_RANK0 Invoke SerialGemm via RangePolicy(BatchSz*N*M) Each thread computes one element of C. KK_SERIAL_SHMEM Invoke SerialGemm via TeamPolicy(BatchSz) Copies A and B to shared memory before GEMM. Each vector lane solves one element of C via SerialGemm. KK_DBLBUF Solve GEMM via TeamPolicy(BatchSz*TILES) Uses a tuned functor with tiling and double buffering via shared memory and register buffers. KK_DBLBUF generally performs better on GPUs when M, N >= 24.
- Param teamSz:
Specifies the team size that will affect any KK algorithm which uses TeamPolicy (default, Kokkos::AUTO). Note: Only applied if useAlgo_type == KK_*
- Param vecLen:
Specifies the vector length that will affect any KK algorithm which uses TeamPolicy and Kokkos::ThreadVectorRange or Kokkos::TeamVectorRange (default, Kokkos::AUTO). Note: Only applied if useAlgo_type == KK_*