Name Strings
To report problems with this extension, please open a new issue at:
Jeff Bolz, NVIDIA
Karthik Vaidyanathan, NVIDIA
Copyright (c) 2024 NVIDIA Corp.
Last Modified Date | 2024-09-18
Revision | 1
This extension is written against the SPIR-V Specification, Version 1.6, Revision 3, Unified.
This extension requires SPIR-V 1.6.
This extension requires SPV_KHR_cooperative_matrix.
If CooperativeMatrixTensorAddressingNV is used, SPV_NV_tensor_addressing is required.
This extension adds several new features building on the cooperative matrix types added in SPV_KHR_cooperative_matrix. The goal is to add and accelerate features beyond just simple GEMM kernels, including adding support for type/use conversions, reductions, per-element operations, and tensor addressing, and also to improve usability and out-of-the-box performance by adding support for more flexible matrix sizes, and workgroup scope matrices with compiler-managed staging through shared memory.
Extension Name
To use this extension within a SPIR-V module, the following OpExtension must be present in the module:
OpExtension "SPV_NV_cooperative_matrix2"
Modifications to the SPIR-V Specification, Version 1.6
2.16 Validation Rules
Modify section 2.16.1. Universal Validation Rules:
Add OpCooperativeMatrixLoadTensorNV and OpCooperativeMatrixStoreTensorNV to the list of instructions under "It is invalid for a pointer to be an operand to any instruction other than:", when the Logical addressing model is selected and neither the VariablePointers nor the VariablePointersStorageBuffer capability is declared.
If an OpTypeCooperativeMatrixKHR instruction uses a Scope of Workgroup, then the workgroup size must have already been specified in the module, including any constant instructions used by LocalSizeId.
In any function used as a DecodeFunc parameter to OpCooperativeMatrixLoadTensorNV or as a Func parameter to OpCooperativeMatrixPerElementOpNV or as a CombineFunc parameter to OpCooperativeMatrixReduceNV, and any function called directly or indirectly by those functions, tangled instructions are not allowed.
3.26 Memory Operands
Modify Section 3.26, "Memory Operands":
In the description of MakePointerAvailable, change "Not valid with OpLoad" to "Not valid with OpLoad or OpCooperativeMatrixLoadKHR or OpCooperativeMatrixLoadTensorNV".
In the description of MakePointerVisible, change "Not valid with OpStore" to "Not valid with OpStore or OpCooperativeMatrixStoreKHR or OpCooperativeMatrixStoreTensorNV".
3.31 Capabilities
Modify Section 3.31, "Capability", adding these rows to the Capability table:
Value | Capability | Enabling Capabilities
5430 | CooperativeMatrixReductionsNV |
5431 | CooperativeMatrixConversionsNV |
5432 | CooperativeMatrixPerElementOperationsNV |
5433 | CooperativeMatrixTensorAddressingNV |
5434 | CooperativeMatrixBlockLoadsNV |
3.X Tensor Layout and View
Tensor layout and tensor view types are representations of the mapping between matrix coordinates and tensor memory layout. They each have a number of dimensions in the range [1,5], with dimension 0 being the outermost dimension and the last dimension being the innermost. These types have the following logical state:
struct tensorLayoutNV<uint32_t Dim,
                      TensorClampMode Mode = TensorClampModeUndefined>
{
    static constexpr uint32_t LDim = Dim;
    static constexpr TensorClampMode clampMode = Mode;

    uint32_t blockSize[LDim];
    uint32_t layoutDimension[LDim];
    uint32_t stride[LDim];
    int32_t offset[LDim];
    uint32_t span[LDim];
    uint32_t clampValue;
};

struct tensorViewNV<uint32_t Dim, bool hasDimensions, uint32_t p0, ..., uint32_t p<Dim-1>>
{
    static constexpr uint32_t VDim = Dim;
    static constexpr bool hasDim = hasDimensions;
    static constexpr uint32_t permutation[VDim] = {p0, ..., p<Dim-1>};

    uint32_t viewDimension[VDim];
    uint32_t viewStride[VDim];
    uint32_t clipRowOffset, clipRowSpan, clipColOffset, clipColSpan;
};
A tensor layout represents the layout of values in memory (number of dimensions and size), along with a region being accessed (offset and span).
|                      layoutDimension1                      |
|                                                            |
|                                                            |
|                          span1                             |
|                    -----------------                       |
|                    |               |                       |
|                    |               |                       |
|                    |     slice     | span0 layoutDimension0|
|                    |               |                       |
|       offset1      |               |                       |
| -----------------> -----------------                       |
|                                                            |
|                    ^                                       |
|                    |                                       |
|                    | offset0                               |
|                    |                                       |
|                    |                                       |
Figure: A 2D tensor layout, and a slice selecting a region within it.
A tensor view allows reinterpreting the dimensions of the region being accessed, including changing the number of dimensions, reordering the dimensions as they are loaded or stored, and clipping the region of the matrix that is loaded or stored. Often the span will have the same number of elements as the matrix, but in some more advanced uses that may not be the case.
Loads and stores can use either just a tensor layout, or a tensor layout and a tensor view. The addressing starts by treating the matrix itself as a 2D "view" and mapping the (row,col) coordinate to a 1D index. If there is only a tensor layout parameter, then that 1D index is mapped to an N-D coordinate within the slice. If there is both a tensor layout and a tensor view, then the 1D index is first mapped to a coordinate within the view, the coordinate components can be permuted, and the result is converted back to a 1D index, which is then run through the tensor layout addressing calculation.
The tensor view dimensions and stride can be used to do more complex addressing calculations. If the tensor view type has "hasDimensions" false, then the dimensions of the tensor layout span are used instead.
The tensor view "clip" region restricts which elements of the matrix are loaded or stored, and also affects the shape of the implicit 2D "view".
Unlike some other ML APIs, tensor layouts and views only describe addressing calculations and never involve making copies of tensors. For this reason, the functionality is slightly more limited (e.g. there’s no way to slice, then permute, then slice again).
While these calculations may look expensive in their full generality, certain calculations can be skipped when they’re not needed, and the common cases should be quite efficient.
OpTensorLayout and OpTensorView instructions operate by copying existing object state and updating the requested state and returning that as a new result. Some of these instructions initialize multiple related pieces of state, setting some to common default values, so the order of the operations matters.
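The copy-and-update behavior, and why ordering matters, can be sketched in C++ as follows. This is only an illustration: `Layout2D`, `setDimension`, and `setStride` are hypothetical names, and the defaults that `setDimension` resets (packed row-major stride, zero offset, full span) are an assumption made for the example. The exact defaults are defined by the individual instructions, not by this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 2D stand-in for the tensor layout state described above.
struct Layout2D {
    std::array<uint32_t, 2> layoutDimension{};
    std::array<uint32_t, 2> stride{};
    std::array<int32_t, 2>  offset{};
    std::array<uint32_t, 2> span{};
};

// Hypothetical "set dimension" update: takes the old state by value and
// returns a modified copy. ASSUMPTION: it also resets stride to packed
// row-major, offset to zero, and span to the full dimension.
Layout2D setDimension(Layout2D t, uint32_t d0, uint32_t d1) {
    assert(d0 > 0 && d1 > 0);
    t.layoutDimension = {d0, d1};
    t.stride = {d1, 1};
    t.offset = {0, 0};
    t.span = {d0, d1};
    return t;
}

// Hypothetical "set stride" update: only the stride changes in the copy.
Layout2D setStride(Layout2D t, uint32_t s0, uint32_t s1) {
    t.stride = {s0, s1};
    return t;
}
```

Under these assumptions, `setStride(setDimension(base, 4, 8), 16, 1)` keeps the custom stride of 16, while `setDimension(setStride(base, 16, 1), 4, 8)` overwrites it with the packed default of 8. In both cases `base` itself is left unchanged.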
For load and store functions with no TensorView parameter, an element index is computed according to the matrixCoordToTensorElement function for each (row,col) of the matrix, which has M rows and N columns. This converts the (row,col) into a row-major index, converts that index into an N-dimensional coord relative to the span, and uses the span coordinate to compute a location within the tensor.
constexpr uint32_t MAX_DIM = 5;
using Coord = array<uint32_t, MAX_DIM>;

uint32_t matrixCoordToLinear(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = row * N + col;
    return index;
}

Coord linearToSpanCoord(tensorLayoutNV t, uint32_t index)
{
    Coord spanCoord {};
    // peel off coordinates from the innermost dimension outwards
    for (int32_t dim = t.LDim-1; dim >= 0; --dim) {
        spanCoord[dim] = index % t.span[dim];
        index /= t.span[dim];
    }
    return spanCoord;
}

auto spanCoordToTensorCoord(tensorLayoutNV t, Coord spanCoord)
{
    Coord blockCoord {};
    Coord coordInBlock {};

    for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
        int32_t c = spanCoord[dim] + t.offset[dim];

        if (c < 0 || c >= t.layoutDimension[dim]) {
            TensorClampMode clampMode = t.clampMode;

            // For stores, other than Undefined, everything is treated as "discard"
            if (operation is a store && clampMode != Undefined) {
                clampMode = Constant;
            }

            // remainders are computed as defined in OpSMod
            switch (clampMode) {
            case Undefined:
                undefined behavior;
            case Constant:
                For load, set result value to t.clampValue;
                For store, discard the store;
                terminate index calculation;
            case ClampToEdge:
                c = min(max(c, 0), t.layoutDimension[dim]-1);
                break;
            case Repeat:
                c = c % t.layoutDimension[dim];
                break;
            case MirrorRepeat:
                c = c % (2*t.layoutDimension[dim]-2);
                // reflect the second half of the period back into range
                c = (c >= t.layoutDimension[dim]) ? (2*t.layoutDimension[dim]-2-c) : c;
                break;
            }
        }

        coordInBlock[dim] = c % t.blockSize[dim];
        blockCoord[dim] = c / t.blockSize[dim];
    }
    return tuple(blockCoord, coordInBlock);
}

uint32_t tensorCoordToLinear(tensorLayoutNV t, Coord blockCoord)
{
    uint32_t index = 0;
    for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
        index += blockCoord[dim] * t.stride[dim];
    }
    return index;
}

// map (row,col) -> linear index in span -> span coordinate -> tensor coordinate -> linear index in tensor
uint32_t matrixCoordToTensorElement(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = matrixCoordToLinear(t, row, col, N);

    Coord spanCoord = linearToSpanCoord(t, index);

    Coord blockCoord;
    Coord coordInBlock;
    tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);

    index = tensorCoordToLinear(t, blockCoord);

    return index;
}
This index is then multiplied by the size of the component type of the matrix and treated as a byte offset from the Pointer operand. The matrix element is loaded from or stored to this location. The Pointer must be aligned to a multiple of 16 bytes, but the region of elements selected by the span need not be so aligned. If the OpCooperativeMatrixLoadTensorNV instruction has a DecodeFunc operand, then the blockCoord and coordInBlock arrays are passed to it as parameters.
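As a worked illustration of the layout-only path, the following C++ sketch specializes the calculation to two dimensions with a block size of 1 and covers the load case only. `Layout2D`, `Clamp`, `smod`, `clampCoord`, and `elementIndex` are hypothetical names chosen for this example. Two simplifications are assumptions of the sketch rather than part of the specification: out-of-range accesses in Undefined or Constant mode are both reported as "no index", and MirrorRepeat on a dimension of size 1 returns 0 to avoid a zero divisor.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

enum class Clamp { Undefined, Constant, ClampToEdge, Repeat, MirrorRepeat };

// Hypothetical 2D layout with blockSize fixed at 1 in every dimension.
struct Layout2D {
    std::array<uint32_t, 2> layoutDimension;
    std::array<uint32_t, 2> stride;
    std::array<int32_t, 2>  offset;
    std::array<uint32_t, 2> span;
    Clamp clampMode;
};

// Remainder that takes the sign of the divisor, as OpSMod does (d > 0 here).
int32_t smod(int32_t c, int32_t d) {
    assert(d > 0);
    int32_t r = c % d;
    return r < 0 ? r + d : r;
}

// Returns the in-range coordinate, or nullopt when no address is produced
// (Constant mode, and Undefined mode in this sketch).
std::optional<int32_t> clampCoord(int32_t c, int32_t size, Clamp mode) {
    if (c >= 0 && c < size) return c;
    switch (mode) {
    case Clamp::ClampToEdge: return c < 0 ? 0 : size - 1;
    case Clamp::Repeat:      return smod(c, size);
    case Clamp::MirrorRepeat: {
        if (size == 1) return 0; // ASSUMPTION: avoids modulo by zero
        int32_t m = smod(c, 2 * size - 2);
        return m >= size ? 2 * size - 2 - m : m;
    }
    default: return std::nullopt;
    }
}

// (row, col) of a matrix with N columns -> element index in the tensor.
std::optional<uint32_t> elementIndex(const Layout2D& t, uint32_t row, uint32_t col, uint32_t N) {
    assert(t.span[0] > 0 && t.span[1] > 0);
    uint32_t index = row * N + col;              // row-major matrix index
    std::array<uint32_t, 2> spanCoord{};
    for (int dim = 1; dim >= 0; --dim) {         // innermost dimension first
        spanCoord[dim] = index % t.span[dim];
        index /= t.span[dim];
    }
    uint32_t result = 0;
    for (int dim = 0; dim < 2; ++dim) {
        int32_t c = static_cast<int32_t>(spanCoord[dim]) + t.offset[dim];
        auto clamped = clampCoord(c, static_cast<int32_t>(t.layoutDimension[dim]), t.clampMode);
        if (!clamped) return std::nullopt;
        result += static_cast<uint32_t>(*clamped) * t.stride[dim];
    }
    return result;
}
```

For an 8x8 tensor with stride {8, 1}, offset {2, 3}, and span {4, 4}, element (1, 2) of a 4-column matrix maps to tensor coordinate (3, 5) and element index 29. With a 16-bit component type that would correspond to a byte offset of 58 from the Pointer operand.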
For load and store functions with a TensorView parameter, an element index is computed according to the matrixCoordToTensorElementWithView function for each (row,col) of the matrix, which has M rows and N columns. This computes a row-major index relative to the clip region, converts that to an N-dimensional coordinate relative to the permuted view dimensions, computes a linear index from the view coordinate, and then runs that index through the tensor layout calculation.
uint32_t matrixCoordToLinear(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
{
    if (row < v.clipRowOffset ||
        row >= v.clipRowOffset + v.clipRowSpan ||
        col < v.clipColOffset ||
        col >= v.clipColOffset + v.clipColSpan) {
        // Load or store is skipped. For load, the matrix element is unmodified.
        terminate index calculation;
    }

    // make the coordinate relative to the clip region
    row -= v.clipRowOffset;
    col -= v.clipColOffset;

    uint32_t width = min(N, v.clipColSpan);
    uint32_t index = row * width + col;
    return index;
}

Coord linearToViewCoord(tensorLayoutNV t, tensorViewNV v, uint32_t index)
{
    auto &dimensions = v.hasDim ? v.viewDimension : t.span;

    Coord viewCoord {};
    for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
        uint32_t i = v.permutation[dim];
        viewCoord[i] = index % dimensions[i];
        index /= dimensions[i];
    }
    return viewCoord;
}

uint32_t viewCoordToLinear(tensorLayoutNV t, tensorViewNV v, Coord viewCoord)
{
    Coord stride {};
    if (v.hasDim) {
        stride = v.viewStride;
    } else {
        // set stride to match t.span
        stride[v.VDim-1] = 1;
        for (int32_t dim = v.VDim-2; dim >= 0; --dim) {
            stride[dim] = stride[dim+1] * t.span[dim+1];
        }
    }

    uint32_t index = 0;
    for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
        index += viewCoord[dim] * stride[dim];
    }
    return index;
}

// map (row,col) -> linear index in view -> view coordinate -> linear index in span -> span coordinate -> tensor coordinate -> linear index in tensor
uint32_t matrixCoordToTensorElementWithView(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = matrixCoordToLinear(t, v, row, col, N);

    Coord viewCoord = linearToViewCoord(t, v, index);

    index = viewCoordToLinear(t, v, viewCoord);

    Coord spanCoord = linearToSpanCoord(t, index);

    Coord blockCoord;
    Coord coordInBlock;
    tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);

    index = tensorCoordToLinear(t, blockCoord);

    return index;
}
The final result is then multiplied by the size of the component type of the matrix and treated as a byte offset from Pointer. The matrix element is loaded from or stored to this location.
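To make the effect of the view concrete, the sketch below follows the clip and permutation steps for a two-dimensional view whose hasDimensions is false, so the layout span supplies the view dimensions and strides. `Span2D`, `View2D`, `clippedLinear`, and `viewIndexToSpanCoord` are hypothetical names for this illustration, and the layout's offset, clamp, and stride steps are left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical 2D stand-ins for the layout span and the tensor view state.
struct Span2D { std::array<uint32_t, 2> span; };
struct View2D {
    std::array<uint32_t, 2> permutation;
    uint32_t clipRowOffset, clipRowSpan, clipColOffset, clipColSpan;
};

// Row-major index relative to the clip region; nullopt if (row, col) is
// outside the clip region and the access is skipped.
std::optional<uint32_t> clippedLinear(const View2D& v, uint32_t row, uint32_t col, uint32_t N) {
    if (row < v.clipRowOffset || row >= v.clipRowOffset + v.clipRowSpan ||
        col < v.clipColOffset || col >= v.clipColOffset + v.clipColSpan)
        return std::nullopt;
    uint32_t width = N < v.clipColSpan ? N : v.clipColSpan;
    return (row - v.clipRowOffset) * width + (col - v.clipColOffset);
}

// Linear view index -> span coordinate, applying the permutation.
std::array<uint32_t, 2> viewIndexToSpanCoord(const Span2D& t, const View2D& v, uint32_t index) {
    assert(v.permutation[0] < 2 && v.permutation[1] < 2 &&
           v.permutation[0] != v.permutation[1]);
    std::array<uint32_t, 2> viewCoord{};
    for (int dim = 1; dim >= 0; --dim) {
        uint32_t i = v.permutation[dim];
        viewCoord[i] = index % t.span[i];
        index /= t.span[i];
    }
    // Strides matching t.span (packed row-major), then back to a span coordinate.
    uint32_t linear = viewCoord[0] * t.span[1] + viewCoord[1];
    return {linear / t.span[1] % t.span[0], linear % t.span[1]};
}
```

With span {3, 2} and a 2x3 matrix, the identity permutation {0, 1} sends matrix element (0, 1) to span coordinate (0, 1), while the permutation {1, 0} sends it to (1, 0). In this configuration the permuted view behaves like a transposed access of the 3x2 region.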
For OpCooperativeMatrixLoadTensorNV instructions with a DecodeFunc operand, rather than loading a value, the function operand is invoked for each matrix element at least once. The function’s return type must match the component type of the result matrix type. The first parameter must be a pointer type with storage class PhysicalStorageBuffer, and the parameter is filled with a pointer computed by multiplying the index returned by matrixCoordToTensorElement(WithView) by the size of the pointee type. The second and third parameters must each be an array of 32-bit integers whose dimension matches the tensor dimension. The second parameter is filled with the blockCoord, and the third parameter with the coordInBlock, for the matrix element being decoded. The return value is stored in the corresponding element of the result matrix.
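The shape of such a decode function can be illustrated on the host with a block-quantized format. Everything in this sketch is hypothetical: the `QuantBlock` format (four 8-bit values sharing one scale, corresponding to a 2D layout with blockSize {1, 4}), the `decodeBlock` callback, and the `loadElement` helper that emulates a single per-element invocation. A real DecodeFunc is a SPIR-V function and its pointer parameter uses the PhysicalStorageBuffer storage class.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical packed block: four 8-bit values sharing one scale factor.
struct QuantBlock {
    float  scale;
    int8_t q[4];
};

// Mirrors the parameter order described above: pointer to the block, then
// blockCoord, then coordInBlock, each sized to the tensor dimension (2 here).
float decodeBlock(const QuantBlock* block,
                  const std::array<uint32_t, 2>& blockCoord,
                  const std::array<uint32_t, 2>& coordInBlock) {
    (void)blockCoord; // not needed by this particular format
    assert(coordInBlock[1] < 4);
    return block->scale * static_cast<float>(block->q[coordInBlock[1]]);
}

// Emulates one element of the load: blockIndex plays the role of the index
// from the layout calculation, which is scaled by sizeof(QuantBlock) when
// used to address the vector.
float loadElement(const std::vector<QuantBlock>& memory, uint32_t blockIndex,
                  const std::array<uint32_t, 2>& blockCoord,
                  const std::array<uint32_t, 2>& coordInBlock) {
    return decodeBlock(&memory.at(blockIndex), blockCoord, coordInBlock);
}
```

Because the function may be invoked more than once per element, a decode function of this kind should be free of side effects, as `decodeBlock` is here.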
DecodeFunc is not allowed with OpCooperativeMatrixStoreTensorNV. Similarly, a block size larger than 1 must not be used with OpCooperativeMatrixStoreTensorNV because it will lead to data races.
3.X Cooperative Matrix Reduce Mode
New section in 3 "Binary Form".
Value | Cooperative Matrix Reduce Mode | Enabling Capabilities
0x1 | Row |
0x2 | Column |
0x4 | 2x2 |
It is invalid to combine 2x2 with Row or Column. Row and Column can be used together.
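The combination rule can be expressed as a small mask check. This is a sketch: the enumerator names and `isValidReduceMask` are hypothetical, and rejecting an empty mask or unknown bits is an assumption, since the text above only addresses combinations of the three listed bits.

```cpp
#include <cassert>
#include <cstdint>

// Bit values from the Cooperative Matrix Reduce Mode table; the enumerator
// names are hypothetical.
enum ReduceMode : uint32_t {
    ReduceRow    = 0x1,
    ReduceColumn = 0x2,
    Reduce2x2    = 0x4,
};

// True when the mask respects the rule above.
// ASSUMPTION: an empty mask and unknown bits are rejected.
bool isValidReduceMask(uint32_t mask) {
    const uint32_t known = ReduceRow | ReduceColumn | Reduce2x2;
    if (mask == 0 || (mask & ~known) != 0) return false;
    const bool rowOrCol = (mask & (ReduceRow | ReduceColumn)) != 0;
    const bool twoByTwo = (mask & Reduce2x2) != 0;
    return !(twoByTwo && rowOrCol);
}
```

Under this check, `Row | Column` (0x3) and `2x2` alone (0x4) are accepted, while `Row | 2x2` (0x5) is rejected.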
3.X Tensor Addressing Operands
New section in 3 "Binary Form".
This is a literal mask; it can be formed by combining the bits from multiple rows in the table below.
Provides additional operands to the listed memory instructions. Bits that are set indicate whether an additional operand follows, as described by the table. If there are multiple following operands indicated, they are ordered: Those indicated by smaller-numbered bits appear first. An instruction needing two masks must first provide the first mask followed by the first mask’s additional operands, and then provide the second mask followed by the second mask’s additional operands.
Used by:
Value | Tensor Addressing Operands | Enabling Capabilities
0x0 | None |
0x1 | TensorView | CooperativeMatrixTensorAddressingNV
0x2 | DecodeFunc | CooperativeMatrixBlockLoadsNV
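The operand ordering rule (mask first, then one operand per set bit, smallest bit first) can be sketched as a tiny decoder over instruction words. The struct and function names are hypothetical, and the sketch assumes each additional operand occupies exactly one word holding an <id>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Bit values from the Tensor Addressing Operands table.
constexpr uint32_t kTensorViewBit = 0x1;
constexpr uint32_t kDecodeFuncBit = 0x2;

struct TensorAddressingOperands {
    std::optional<uint32_t> tensorView; // <id> of the tensor view, if present
    std::optional<uint32_t> decodeFunc; // <id> of the decode function, if present
    size_t wordsConsumed = 0;           // mask word plus its following operands
};

// Reads the mask at words[pos], then one operand per set bit, lowest bit first.
// ASSUMPTION: every additional operand is a single word.
TensorAddressingOperands parseTensorAddressingOperands(const std::vector<uint32_t>& words, size_t pos) {
    assert(pos < words.size());
    TensorAddressingOperands out;
    const uint32_t mask = words[pos];
    size_t next = pos + 1;
    if (mask & kTensorViewBit) out.tensorView = words.at(next++);
    if (mask & kDecodeFuncBit) out.decodeFunc = words.at(next++);
    out.wordsConsumed = next - pos;
    return out;
}
```

For the word sequence {0x3, 17, 42}, the TensorView operand is 17 and the DecodeFunc operand is 42, because bit 0x1 is numbered lower than bit 0x2.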
3.49.8 Memory Instructions
3.49.13. Arithmetic Instructions
3.49.11 Conversion Instructions
Relax the restrictions on Op{F,S,U,etc.}Convert from SPV_KHR_cooperative_matrix if CooperativeMatrixConversionsNV is enabled to allow Use to mismatch, where the Use of the operand can be MatrixAccumulatorKHR and the Use of the result type can be MatrixAKHR or MatrixBKHR. The restriction on OpBitcast is not relaxed.
3.49.9 Function Instructions
How are matrix type conversions with Use change handled?
Discussion: RESOLVED. We need to support conversions that change both Component Type and Use at the same time, because there is often not a supported intermediate type that matches one but not the other. For example, if converting from f32 MatrixAccumulatorKHR to u8 MatrixAKHR, there may not be support for u8 MatrixAccumulatorKHR or f32 MatrixAKHR. Conversions that change the Component Type should use Op{F,S,U,etc.}Convert even if the Use changes.
We also need to support conversions that only change the Use, for example converting from f16 MatrixAccumulatorKHR to f16 MatrixAKHR. For this, OpFConvert could be confusing/misleading so we add a new OpCooperativeMatrixConvertNV instruction for this case.
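The resolution above amounts to a simple selection rule, sketched here as a hypothetical helper. The `Use` enum, the string component names, and `conversionInstruction` exist only for this illustration. The sketch does not check which source and result Use combinations are actually permitted.

```cpp
#include <cassert>
#include <string>

enum class Use { MatrixA, MatrixB, MatrixAccumulator };

// Hypothetical helper: names the instruction family a conversion would use.
// A change of component type takes precedence, even if the Use also changes.
std::string conversionInstruction(const std::string& srcComponent, Use srcUse,
                                  const std::string& dstComponent, Use dstUse) {
    if (srcComponent != dstComponent) return "Op{F,S,U,etc.}Convert";
    if (srcUse != dstUse) return "OpCooperativeMatrixConvertNV";
    return "none";
}
```

For the two examples from the discussion, f32 MatrixAccumulatorKHR to u8 MatrixAKHR selects the Op{F,S,U,etc.}Convert family, and f16 MatrixAccumulatorKHR to f16 MatrixAKHR selects OpCooperativeMatrixConvertNV.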
Revision History
Rev | Date | Author | Changes
1 | 2024-09-18 | Jeff Bolz | Initial revision of SPV_NV_cooperative_matrix2