BATCHED – KokkosKernels batched functor-level interfaces

innerlu

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerLU_Internal.hpp

applypivot

template<typename MemberType, typename ArgSide, typename ArgDirect>
struct TeamVectorApplyPivot

TeamVector

qr_withcolumnpivoting

template<typename MemberType, typename ArgAlgo>
struct TeamVectorQR_WithColumnPivoting

TeamVector QR

addradial

struct SerialAddRadial

Adds tiny values to the diagonal entries so that their absolute values become larger.

Serial AddRadial
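As a rough illustration of what "making the absolute values larger" can mean, the following plain C++ sketch shifts each diagonal entry away from zero by the magnitude of a tiny value. The function name `add_radial`, the flat row-major storage, and the sign rule are assumptions made for this sketch; it does not use Kokkos views and is not the library implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: A is an m x m matrix stored row-major in a flat vector
// (the layout is an assumption of this sketch, not of the library).
void add_radial(std::vector<double>& A, std::size_t m, double tiny) {
  const double abs_tiny = std::abs(tiny);
  for (std::size_t i = 0; i < m; ++i) {
    double& d = A[i * m + i];
    // Move the diagonal entry away from zero so that |d| grows by |tiny|.
    d += (d < 0.0) ? -abs_tiny : abs_tiny;
  }
}
```

Off-diagonal entries are left untouched, and a zero diagonal entry is treated as non-negative.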

template<typename MemberType>
struct TeamAddRadial

Team AddRadial

householder

template<typename ArgSide>
struct SerialHouseholder

Serial Householder

template<typename MemberType, typename ArgSide>
struct TeamVectorHouseholder

TeamVector Householder

set

struct SerialSet

Serial Set

template<typename MemberType>
struct TeamSet

Team Set

template<typename MemberType>
struct TeamVectorSet

TeamVector Set

scale

struct SerialScale

Serial Scale

template<typename MemberType>
struct TeamScale

Team Scale

template<typename MemberType>
struct TeamVectorScale

TeamVector Scale

setidentity

struct SerialSetIdentity

Serial SetIdentity

template<typename MemberType>
struct TeamSetIdentity

Team SetIdentity

template<typename MemberType, typename ArgMode>
struct SetIdentity

Selective Interface
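One plausible reading of a "selective interface" is a single struct that forwards to the Serial or Team variant depending on its ArgMode template argument. The sketch below shows that dispatch pattern in plain C++17. Every name in it (`sketch::Mode`, `SerialOp`, `TeamOp`, `Op`) is a hypothetical stand-in, and the actual dispatch mechanism in the library may differ.

```cpp
#include <string>
#include <type_traits>

namespace sketch {

// Hypothetical mode tags standing in for the ArgMode template argument.
struct Mode {
  struct Serial {};
  struct Team {};
};

// Stand-ins for a Serial and a Team variant of the same operation.
struct SerialOp {
  static std::string invoke() { return "serial"; }
};

template <typename MemberType>
struct TeamOp {
  static std::string invoke(const MemberType&) { return "team"; }
};

// Selective interface: choose the variant at compile time from ArgMode.
template <typename MemberType, typename ArgMode>
struct Op {
  static std::string invoke(const MemberType& member) {
    if constexpr (std::is_same_v<ArgMode, Mode::Serial>) {
      (void)member;  // the serial variant does not need the team member
      return SerialOp::invoke();
    } else {
      static_assert(std::is_same_v<ArgMode, Mode::Team>, "unsupported mode");
      return TeamOp<MemberType>::invoke(member);
    }
  }
};

}  // namespace sketch
```

With this pattern, calling code can be written once against `Op<MemberType, ArgMode>` and switched between variants by changing only the mode tag.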

applyhouseholder

template<typename ArgSide>
struct SerialApplyHouseholder

Serial ApplyHouseholder

template<typename MemberType, typename ArgSide>
struct TeamVectorApplyHouseholder

innermultipledotproduct

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerMultipleDotProduct_Internal.hpp

lu

template<typename ArgAlgo>
struct SerialLU
template<typename MemberType, typename ArgAlgo>
struct TeamLU
template<typename MemberType, typename ArgMode, typename ArgAlgo>
struct LU

Selective Interface

solveutv

template<typename MemberType, typename ArgAlgo>
struct TeamVectorSolveUTV

For given UTV = A P^T, it solves A X = B

  • input:

    • matrix_rank is computed during the UTV factorization

    • U is m x m real matrix (m x matrix_rank is only used)

    • T is m x m real matrix (matrix_rank x matrix_rank is only used)

    • V is m x m real matrix (matrix_rank x m is only used)

    • p is m integer vector including pivot indices

    • X is a solution matrix (or vector)

    • B is a right hand side matrix (or vector)

    • w is B.span() real vector workspace (contiguous)

  • output:

    • B is overwritten with its solutions

When A is full rank, i.e., matrix_rank == m, UTV computes only a QR factorization with column pivoting, where Q is stored in U and R is stored in T.

TeamVector Solve UTV
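The relation between the factorization and the solve can be written out as follows. This is a standard derivation, not taken from the library source. It assumes that P is the permutation matrix encoded by p, that the used part of U has orthonormal columns, that the used part of V has orthonormal rows, and that the used matrix_rank x matrix_rank block of T is invertible.

```latex
A P^T = U T V
\;\Longrightarrow\;
A = U T V P,
\qquad
A X = B
\;\Longrightarrow\;
X = P^T V^T T^{-1} U^T B .
```

Under these assumptions, in the rank-deficient case this expression gives the minimum-norm least-squares solution rather than an exact one.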

utv

template<typename MemberType, typename ArgAlgo>
struct TeamVectorUTV

For given A, it performs UTV factorization i.e., UTV = A P^T

  • input:

    • A is m x m real matrix

    • p is m integer vector

    • U is m x m real matrix

    • V is m x m real matrix

    • w is 3*m real vector workspace (contiguous)

  • output:

    • A is overwritten as lower triangular matrix_rank x matrix_rank real matrix

    • P^T includes pivot indices (note that this is different from permutation indices)

    • U is left orthogonal matrix m x matrix_rank

    • V is right orthogonal matrix matrix_rank x m

When A is full rank, i.e., matrix_rank == m, this computes only a QR factorization with column pivoting

  • output:

    • A is overwritten as upper triangular matrix

    • P^T includes pivot indices (note that this is different from permutation indices)

    • U is an orthogonal matrix m x m

    • V is not touched

For the solution of a rank-deficient problem, it is recommended to use SolveUTV.

TeamVector UTV
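To make the distinction between pivot indices and permutation indices concrete, the sketch below expands a pivot vector into an explicit permutation. It assumes a relative-offset convention in which `piv[i] = k` means "at step i, swap entry i with entry i + k". Both that convention and the function name `pivots_to_permutation` are assumptions for illustration; the encoding used by the library should be checked against its source.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Assumed convention (hypothetical): piv[i] = k means "swap i with i + k" at
// step i. The caller must ensure i + piv[i] stays within bounds.
std::vector<int> pivots_to_permutation(const std::vector<int>& piv) {
  std::vector<int> perm(piv.size());
  std::iota(perm.begin(), perm.end(), 0);  // start from the identity
  for (std::size_t i = 0; i < piv.size(); ++i) {
    std::swap(perm[i], perm[i + static_cast<std::size_t>(piv[i])]);
  }
  return perm;
}
```

The pivot vector records a sequence of swaps, while the returned vector records where each original index ends up after all swaps have been applied.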

inverselu

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InverseLU_Internal.hpp

svd

struct SerialSVD

eigendecomposition

struct SerialEigendecomposition

Given a general nonsymmetric matrix A (m x m), it performs eigendecomposition of the matrix.

Param member:

[in]: Only the Team interface has this argument. Partial specialization can be applied for a different type of team member.

Param A:

[in/out]: Real general nonsymmetric rank 2 view A (m x m). A is first condensed to an upper Hessenberg form. Then the Francis double shift QR algorithm is applied to compute its Schur form. On exit, A stores the quasi upper triangular matrix of the Schur decomposition.

Param er, ei:

[out]: Real and imaginary parts of the eigenvalues, which form er(m)+ei(m)i. For a complex eigenpair, a+bi and a-bi are stored consecutively.

Param UL, UR:

[out]: Left/right eigenvectors, stored in (m x m) matrices. If a zero-span view is provided, the corresponding eigenvectors are not computed. However, UL and UR cannot both have zero span. If only eigenvalues are requested, use the Eigenvalue interface, which simplifies the computation.

Param W:

[out]: 1D contiguous workspace. The minimum size is (2*m*m+5*m), where m is the dimension of matrix A.
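The workspace bound and the split real/imaginary eigenvalue storage can be illustrated with two small helpers. Both function names are hypothetical and the code uses plain standard-library containers rather than Kokkos views. It is a sketch of the bookkeeping only, not of the eigendecomposition itself.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Minimum workspace length stated above: 2*m*m + 5*m.
std::size_t eig_workspace_size(std::size_t m) { return 2 * m * m + 5 * m; }

// Combine the separate real (er) and imaginary (ei) parts into complex
// eigenvalues er[k] + ei[k] i. A complex pair appears as two consecutive
// entries a+bi and a-bi.
std::vector<std::complex<double>> assemble_eigenvalues(
    const std::vector<double>& er, const std::vector<double>& ei) {
  std::vector<std::complex<double>> ev(er.size());
  for (std::size_t k = 0; k < er.size(); ++k) {
    ev[k] = std::complex<double>(er[k], ei[k]);
  }
  return ev;
}
```

For m = 3, for example, the workspace must hold at least 2*9 + 5*3 = 33 entries.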

template<typename MemberType>
struct TeamVectorEigendecomposition

trtri

template<typename ArgUplo, typename ArgDiag, typename ArgAlgo>
struct SerialTrtri

qr

template<typename ArgAlgo>
struct SerialQR

Serial QR

template<typename MemberType, typename ArgAlgo>
struct TeamQR

Team QR

template<typename MemberType, typename ArgAlgo>
struct TeamVectorQR

TeamVector QR

template<typename MemberType, typename ArgMode, typename ArgAlgo>
struct QR

Selective Interface

trmm

template<typename ArgSide, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct SerialTrmm

trsm

template<typename ArgSide, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct SerialTrsm
template<typename MemberType, typename ArgSide, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct TeamTrsm
template<typename MemberType, typename ArgSide, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct TeamVectorTrsm
template<typename MemberType, typename ArgSide, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgMode, typename ArgAlgo>
struct Trsm

Selective Interface

innergemmfixa

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerGemmFixA_Internal.hpp

innergemmfixb

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerGemmFixB_Internal.hpp

innergemmfixc

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerGemmFixC_Internal.hpp

applyq

template<typename ArgSide, typename ArgTrans, typename ArgAlgo>
struct SerialApplyQ

Serial ApplyQ

template<typename MemberType, typename ArgSide, typename ArgTrans, typename ArgAlgo>
struct TeamApplyQ

Team ApplyQ

template<typename MemberType, typename ArgSide, typename ArgTrans, typename ArgAlgo>
struct TeamVectorApplyQ

TeamVector ApplyQ

template<typename MemberType, typename ArgSide, typename ArgTrans, typename ArgMode, typename ArgAlgo>
struct ApplyQ

Selective Interface

copy

template<typename ArgTrans = Trans::NoTranspose, int rank = 2>
struct SerialCopy

Serial Copy

template<typename MemberType, typename ArgTrans = Trans::NoTranspose, int rank = 2>
struct TeamCopy

Team Copy

template<typename MemberType, typename ArgTrans = Trans::NoTranspose, int rank = 2>
struct TeamVectorCopy

TeamVector Copy

template<typename MemberType, typename ArgTrans, typename ArgMode, int rank = 2>
struct Copy

Selective Interface

innertrsm

CodeCleanup-TODO: Move Decl file to dense/impl/KokkosBatched_InnerTrsm_Internal.hpp

solvelu

template<typename ArgTrans, typename ArgAlgo>
struct SerialSolveLU
template<typename MemberType, typename ArgTrans, typename ArgAlgo>
struct TeamSolveLU
template<typename MemberType, typename ArgTrans, typename ArgMode, typename ArgAlgo>
struct SolveLU

Selective Interface

xpay

struct SerialXpay

Serial Batched XPAY: y_l <- x_l + alpha_l * y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for y_1, …, y_N.

No nested parallel_for is used inside of the function.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param alpha:

[in]: input coefficient for Y, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view
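The update rule can be checked against a plain C++ reference. The sketch below uses nested `std::vector` in place of rank 2 views and assumes the first index selects the vector l. The name `xpay_reference` and the `Matrix` alias are hypothetical; this is not the library implementation.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[l] is vector l

// Reference XPAY: y_l <- x_l + alpha_l * y_l for every vector l.
void xpay_reference(const std::vector<double>& alpha, const Matrix& X,
                    Matrix& Y) {
  for (std::size_t l = 0; l < Y.size(); ++l) {
    for (std::size_t i = 0; i < Y[l].size(); ++i) {
      Y[l][i] = X[l][i] + alpha[l] * Y[l][i];
    }
  }
}
```

Note that alpha scales the existing contents of Y, not X, which is what distinguishes XPAY from AXPY.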

template<typename MemberType>
struct TeamXpay

Team Batched XPAY: y_l <- x_l + alpha_l * y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for y_1, …, y_N.

A nested parallel_for with TeamThreadRange is used.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param member:

[in]: TeamPolicy member

Param alpha:

[in]: input coefficient for Y, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view

template<typename MemberType>
struct TeamVectorXpay

TeamVector Batched XPAY: y_l <- x_l + alpha_l * y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for y_1, …, y_N.

Two nested parallel_for with both TeamThreadRange and ThreadVectorRange (or one with TeamVectorRange) are used inside.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param member:

[in]: TeamPolicy member

Param alpha:

[in]: input coefficient for Y, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view

axpy

struct SerialAxpy

Serial Batched AXPY: y_l <- alpha_l * x_l + y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for x_1, …, x_N.

No nested parallel_for is used inside of the function.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param alpha:

[in]: input coefficient for X, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view
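A plain C++ reference for this update is sketched below, again with nested `std::vector` standing in for rank 2 views and the first index assumed to select the vector l. The name `axpy_reference` and the `Matrix` alias are hypothetical and the code is not the library implementation.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[l] is vector l

// Reference AXPY: y_l <- alpha_l * x_l + y_l for every vector l.
void axpy_reference(const std::vector<double>& alpha, const Matrix& X,
                    Matrix& Y) {
  for (std::size_t l = 0; l < Y.size(); ++l) {
    for (std::size_t i = 0; i < Y[l].size(); ++i) {
      Y[l][i] += alpha[l] * X[l][i];
    }
  }
}
```

Here alpha scales X and the result is accumulated into Y.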

template<typename MemberType>
struct TeamAxpy

Team Batched AXPY: y_l <- alpha_l * x_l + y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for x_1, …, x_N.

A nested parallel_for with TeamThreadRange is used.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param member:

[in]: TeamPolicy member

Param alpha:

[in]: input coefficient for X, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view

template<typename MemberType>
struct TeamVectorAxpy

TeamVector Batched AXPY: y_l <- alpha_l * x_l + y_l for all l = 1, …, N where:

  • N is the number of vectors,

  • x_1, …, x_N are the N input vectors,

  • y_1, …, y_N are the N output vectors,

  • alpha_1, …, alpha_N are N scaling factors for x_1, …, x_N.

Two nested parallel_for with both TeamThreadRange and ThreadVectorRange (or one with TeamVectorRange) are used inside.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param member:

[in]: TeamPolicy member

Param alpha:

[in]: input coefficient for X, a rank 1 view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in/out]: Output vector Y, a rank 2 view

gemv

template<typename ArgTrans, typename ArgAlgo>
struct SerialGemv

Serial Gemv

template<typename MemberType, typename ArgTrans, typename ArgAlgo>
struct TeamGemv

Team Gemv

template<typename MemberType, typename ArgTrans, typename ArgAlgo>
struct TeamVectorGemv

TeamVector Gemv

template<typename MemberType, typename ArgTrans, typename ArgMode, typename ArgAlgo>
struct Gemv

Selective Interface

dot

template<typename ArgTrans = Trans::NoTranspose>
struct SerialDot

Serial Batched DOT:

Depending on the ArgTrans template, the dot product is row-based (ArgTrans == Trans::NoTranspose):

dot_l <- (x_l:, y_l:) for all l = 1, …, N where:

  • N is the first dimension of X.

Or column-based: dot_l <- (x_:l, y_:l) for all l = 1, …, n where:

  • n is the second dimension of X.

No nested parallel_for is used inside of the function.

Template Parameters:
  • ArgTrans – type of dot product (Trans::NoTranspose by default)

  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param dot:

[out]: Computed dot product, a rank 1 view
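The two variants differ only in which index is held fixed while the other is summed over. The plain C++ sketch below makes that explicit. The `Trans` enum, the `Matrix` alias, and `dot_reference` are hypothetical stand-ins defined for this example and are not the library types.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[row][column]

enum class Trans { NoTranspose, Transpose };  // stand-in for the ArgTrans tag

// Row-based (NoTranspose): one dot product per row of X and Y.
// Column-based (Transpose): one dot product per column of X and Y.
// Assumes X and Y are rectangular and have the same shape.
std::vector<double> dot_reference(const Matrix& X, const Matrix& Y,
                                  Trans trans) {
  const std::size_t rows = X.size();
  const std::size_t cols = rows ? X[0].size() : 0;
  const bool by_row = (trans == Trans::NoTranspose);
  std::vector<double> dot(by_row ? rows : cols, 0.0);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      dot[by_row ? i : j] += X[i][j] * Y[i][j];
    }
  }
  return dot;
}
```

For a 2 x 2 input X = [[1, 2], [3, 4]] and Y filled with ones, the row-based result is the row sums of X and the column-based result is its column sums.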

template<typename MemberType, typename ArgTrans = Trans::NoTranspose>
struct TeamDot

Team Batched DOT:

Depending on the ArgTrans template, the dot product is row-based (ArgTrans == Trans::NoTranspose):

dot_l <- (x_l:, y_l:) for all l = 1, …, N where:

  • N is the first dimension of X.

Or column-based: dot_l <- (x_:l, y_:l) for all l = 1, …, n where:

  • n is the second dimension of X.

A nested parallel_for with TeamThreadRange is used.

Template Parameters:
  • ArgTrans – type of dot product (Trans::NoTranspose by default)

  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param dot:

[out]: Computed dot product, a rank 1 view

template<typename MemberType, typename ArgTrans = Trans::NoTranspose>
struct TeamVectorDot

TeamVector Batched DOT:

Depending on the ArgTrans template, the dot product is row-based (ArgTrans == Trans::NoTranspose):

dot_l <- (x_l:, y_l:) for all l = 1, …, N where:

  • N is the first dimension of X.

Or column-based: dot_l <- (x_:l, y_:l) for all l = 1, …, n where:

  • n is the second dimension of X.

Two nested parallel_for with both TeamThreadRange and ThreadVectorRange (or one with TeamVectorRange) are used inside.

Template Parameters:
  • ArgTrans – type of dot product (Trans::NoTranspose by default)

  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • alphaViewType – Input type for alpha, needs to be a 1D view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param dot:

[out]: Computed dot product, a rank 1 view

hadamardproduct

struct SerialHadamardProduct

Serial Batched Hadamard Product: v_ij <- x_ij * y_ij for all i = 1, …, n and j = 1, …, N where:

  • n is the number of rows,

  • N is the number of vectors.

No nested parallel_for is used inside of the function.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • VViewType – Input type for V, needs to be a 2D view

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param V:

[out]: Output vector V, a rank 2 view
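The element-wise product can be written as a short plain C++ reference. The name `hadamard_reference` and the `Matrix` alias are hypothetical, and nested `std::vector` stands in for rank 2 views. The code is a sketch of the operation, not the library implementation.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reference Hadamard product: v_ij <- x_ij * y_ij for every entry.
// Assumes X and Y have the same shape.
Matrix hadamard_reference(const Matrix& X, const Matrix& Y) {
  Matrix V(X.size());
  for (std::size_t i = 0; i < X.size(); ++i) {
    V[i].resize(X[i].size());
    for (std::size_t j = 0; j < X[i].size(); ++j) {
      V[i][j] = X[i][j] * Y[i][j];
    }
  }
  return V;
}
```

Unlike the interface described above, this sketch returns V by value instead of writing into a caller-provided output argument.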

template<typename MemberType>
struct TeamHadamardProduct

Team Batched Hadamard Product: v_ij <- x_ij * y_ij for all i = 1, …, n and j = 1, …, N where:

  • n is the number of rows,

  • N is the number of vectors.

A nested parallel_for with TeamThreadRange is used.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • VViewType – Input type for V, needs to be a 2D view

Param member:

[in]: TeamPolicy member

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param V:

[out]: Output vector V, a rank 2 view

template<typename MemberType>
struct TeamVectorHadamardProduct

TeamVector Batched Hadamard Product: v_ij <- x_ij * y_ij for all i = 1, …, n and j = 1, …, N where:

  • n is the number of rows,

  • N is the number of vectors.

Two nested parallel_for with both TeamThreadRange and ThreadVectorRange (or one with TeamVectorRange) are used inside.

Template Parameters:
  • XViewType – Input type for X, needs to be a 2D view

  • YViewType – Input type for Y, needs to be a 2D view

  • VViewType – Input type for V, needs to be a 2D view

Param member:

[in]: TeamPolicy member

Param X:

[in]: Input vector X, a rank 2 view

Param Y:

[in]: Input vector Y, a rank 2 view

Param V:

[out]: Output vector V, a rank 2 view

template<typename MemberType, typename ArgMode>
struct HadamardProduct

vector

CodeCleanup-TODO: Move Decl file to dense/impl/

trsv

template<typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct SerialTrsv

Serial Trsv

template<typename MemberType, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct TeamTrsv

Team Trsv

template<typename MemberType, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgAlgo>
struct TeamVectorTrsv

TeamVector Trsv

template<typename MemberType, typename ArgUplo, typename ArgTrans, typename ArgDiag, typename ArgMode, typename ArgAlgo>
struct Trsv

Selective Interface

gemm

template<typename ArgTransA, typename ArgTransB, typename ArgAlgo>
struct SerialGemm

Serial Gemm

template<typename MemberType, typename ArgTransA, typename ArgTransB, typename ArgAlgo>
struct TeamGemm

Team Gemm

template<typename MemberType, typename ArgTransA, typename ArgTransB, typename ArgAlgo>
struct TeamVectorGemm

TeamVector Gemm

template<typename MemberType, typename ArgTransA, typename ArgTransB, typename ArgMode, typename ArgAlgo>
struct Gemm

Selective Interface