SPV_INTEL_subgroup_matrix_multiply_accumulate

Name Strings

SPV_INTEL_subgroup_matrix_multiply_accumulate

Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm

Contributors

Ben Ashbaugh, Intel
Pekka Jääskeläinen, Intel
Jianhui Li, Intel
Victor Mustya, Intel
Yury Plyakhin, Intel
Dmitry Sidorov, Intel

Notice

Status

Complete

Version

Last Modified Date

2025-01-07

Revision

1

Dependencies

This extension is written against the SPIR-V Specification, Version 1.6, Revision 4.

This extension requires SPIR-V 1.0.

Overview

The goal of this extension is to allow programmers to access specialized hardware to compute the matrix product of an M x K matrix (referred to as Matrix A in this extension) with a K x N matrix (Matrix B) and then add an M x N matrix (Matrix C). This is a commonly used building block to compute the product of two large matrices. All invocations in the subgroup cooperate to perform this operation.

The OpSubgroupMatrixMultiplyAccumulateINTEL instruction added by this extension also accepts an optional Matrix Multiply Accumulate Operands literal operand that specifies additional information about the matrix operands, such as ways to reinterpret the bits passed as the matrix operands. This allows the operation to be performed on data types that may or may not have a direct SPIR-V representation.

Many devices will only support a limited set of matrix dimensions and types. Related client API specifications will describe the required set of matrix dimensions and types, or how to query the set of supported matrix dimensions and types for a device.

Extension Name

To use this extension within a SPIR-V module, the appropriate OpExtension must be present in the module:

OpExtension "SPV_INTEL_subgroup_matrix_multiply_accumulate"

Modifications to the SPIR-V Specification, Version 1.6

Capabilities

Modify Section 3.31, Capability, adding rows to the Capability table:

Capability	Implicitly Declares
6236	SubgroupMatrixMultiplyAccumulateINTEL

Matrix Multiply Accumulate Operands

Add a new section to Section 3, Binary Form:

Matrix Multiply Accumulate Operands	Enabling Capabilities
0x0	None
0x1	MatrixASignedComponentsINTEL The components of matrix A are interpreted as signed.
0x2	MatrixBSignedComponentsINTEL The components of matrix B are interpreted as signed.
0x4	MatrixCBFloat16INTEL The components of matrix C are interpreted as bf16 data.
0x8	MatrixResultBFloat16INTEL The components of the result matrix are written as bf16 data.
0x10	MatrixAPackedInt8INTEL The components of matrix A are interpreted as packed 8-bit integer data.
0x20	MatrixBPackedInt8INTEL The components of matrix B are interpreted as packed 8-bit integer data.
0x40	MatrixAPackedInt4INTEL The components of matrix A are interpreted as packed 4-bit integer data.
0x80	MatrixBPackedInt4INTEL The components of matrix B are interpreted as packed 4-bit integer data.
0x100	MatrixATF32INTEL The components of matrix A are interpreted as tf32 data.
0x200	MatrixBTF32INTEL The components of matrix B are interpreted as tf32 data.
0x400	MatrixAPackedFloat16INTEL The components of matrix A are interpreted as packed fp16 (half-precision) data.
0x800	MatrixBPackedFloat16INTEL The components of matrix B are interpreted as packed fp16 (half-precision) data.
0x1000	MatrixAPackedBFloat16INTEL The components of matrix A are interpreted as packed bf16 data.
0x2000	MatrixBPackedBFloat16INTEL The components of matrix B are interpreted as packed bf16 data.
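
As an illustration only, the operand values in the table above can be modeled as a bit mask. The sketch below assumes that the values are combined with a bitwise OR, as their power-of-two encoding suggests; the Python class and the helper encode_operands are hypothetical names and are not part of this extension.

```python
from enum import IntFlag


class MatrixMultiplyAccumulateOperands(IntFlag):
    """Values copied from the table above; the class itself is illustrative."""
    NONE = 0x0
    MatrixASignedComponentsINTEL = 0x1
    MatrixBSignedComponentsINTEL = 0x2
    MatrixCBFloat16INTEL = 0x4
    MatrixResultBFloat16INTEL = 0x8
    MatrixAPackedInt8INTEL = 0x10
    MatrixBPackedInt8INTEL = 0x20
    MatrixAPackedInt4INTEL = 0x40
    MatrixBPackedInt4INTEL = 0x80
    MatrixATF32INTEL = 0x100
    MatrixBTF32INTEL = 0x200
    MatrixAPackedFloat16INTEL = 0x400
    MatrixBPackedFloat16INTEL = 0x800
    MatrixAPackedBFloat16INTEL = 0x1000
    MatrixBPackedBFloat16INTEL = 0x2000


def encode_operands(*flags):
    """OR the selected flags into one literal value (hypothetical helper)."""
    mask = MatrixMultiplyAccumulateOperands.NONE
    for flag in flags:
        mask |= flag
    return int(mask)


# Example: signed, packed 8-bit integer data for both Matrix A and Matrix B.
int8_mask = encode_operands(
    MatrixMultiplyAccumulateOperands.MatrixASignedComponentsINTEL,
    MatrixMultiplyAccumulateOperands.MatrixBSignedComponentsINTEL,
    MatrixMultiplyAccumulateOperands.MatrixAPackedInt8INTEL,
    MatrixMultiplyAccumulateOperands.MatrixBPackedInt8INTEL,
)
```

Calling encode_operands with no arguments yields 0x0, which corresponds to None in the table.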

Instructions

Modify Section 3.42.21, Group Instructions, adding to the end of the list of instructions:

OpSubgroupMatrixMultiplyAccumulateINTEL

Computes the matrix product of two matrix operands and adds a third matrix operand. All invocations in the subgroup cooperate to perform this operation.

Result Type defines the result of the matrix multiply accumulate operation. It must be a scalar or vector of floating-point or integer type. The number of components in Result Type defines the M dimension of the matrix multiply accumulate operation. If Result Type is a scalar type, the M dimension is one.

K Dim defines the K dimension of the matrix multiply accumulate operation. It must come from a constant instruction with scalar 32-bit integer type.

The N dimension of the matrix multiply accumulate operation is implicitly the number of invocations in the subgroup.

Matrix A is the first matrix operand and has M rows and K columns. The type of Matrix A must be a scalar or vector of floating-point or integer type. Multiple invocations in the subgroup may contribute part of the Matrix A operand, depending on the matrix operand size and the subgroup size.

Matrix B is the second matrix operand and has K rows and N columns. It must be a scalar or vector of floating-point or integer type. Each of the invocations in the subgroup contributes part of the Matrix B operand.

Matrix C is the third matrix operand and has M rows and N columns. It must be a scalar or vector of floating-point or integer type. Each of the invocations in the subgroup contributes part of the Matrix C operand.

The multiplication step of the matrix multiply accumulate operation computes the matrix product of Matrix A and Matrix B. The product is a matrix with M rows and N columns. The order of operations to compute the elements of the matrix product is implementation-dependent.

For integer matrices, the operations used for the multiplication of Matrix A and Matrix B and the addition of Matrix C are performed at the precision of the Result Type. The resulting value will equal the low-order W bits of the correct result R, where W is the component width of the Result Type in bits (not to be confused with the N dimension) and R is computed with enough precision to avoid overflow and underflow.
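
The low-order-bits rule for integer matrices can be sketched for a single result element as follows. This is a minimal model, not a description of any implementation: the helper names and the default width of 32 bits are assumptions made for illustration. Python integers have arbitrary precision, so the exact result R is available before truncation.

```python
def wrap_to_width(exact_value, width):
    """Keep only the low-order `width` bits of an exact integer result."""
    return exact_value & ((1 << width) - 1)


def as_signed(bits, width):
    """Reinterpret a `width`-bit pattern as a two's-complement signed value."""
    sign_bit = 1 << (width - 1)
    return (bits ^ sign_bit) - sign_bit


def integer_dot_accumulate(a_row, b_col, c, width=32):
    """One result element: dot(a_row, b_col) + c, truncated to `width` bits."""
    exact = sum(a * b for a, b in zip(a_row, b_col)) + c  # no overflow here
    return wrap_to_width(exact, width)


# 4 * (2**20 * 2**10) == 2**32, which wraps to zero at 32 bits, leaving only c.
wrapped = integer_dot_accumulate([2**20] * 4, [2**10] * 4, 5)

# A negative exact result maps to the matching two's-complement bit pattern.
negative_bits = integer_dot_accumulate([-1], [1], 0)
```

Here wrapped is 5 and negative_bits is 0xFFFFFFFF, which as_signed reads back as -1.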

For floating-point matrices, the precision and the order of operations are implementation-defined.

The accumulation step of the matrix multiply accumulate operation computes the element-wise addition of the matrix product of Matrix A and Matrix B with Matrix C. The final result is a matrix with M rows and N columns, which is assigned to Result.
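
As a reference for the arithmetic only, the multiplication and accumulation steps can be written as a scalar loop nest. The sketch below ignores how data is distributed across invocations, operates on row-major nested lists, and uses a hypothetical function name; it fixes one order of operations, whereas the instruction leaves that order to the implementation.

```python
def matrix_multiply_accumulate(a, b, c):
    """Return a*b + c, where a is M x K, b is K x N, and c is M x N."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "Matrix A must be M x K"
    assert len(c) == m and all(len(row) == n for row in c), "Matrix C must be M x N"
    return [
        [sum(a[i][x] * b[x][j] for x in range(k)) + c[i][j] for j in range(n)]
        for i in range(m)
    ]


# M = 2, K = 2, N = 3
result = matrix_multiply_accumulate(
    [[1, 2], [3, 4]],
    [[1, 0, 2], [0, 1, 3]],
    [[10, 10, 10], [20, 20, 20]],
)
```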

Matrix Multiply Accumulate Operands is an optional literal that specifies additional information about the matrix operands, such as ways to reinterpret the bits passed as the matrix operands. If Matrix Multiply Accumulate Operands is not present, it is the same as specifying the Matrix Multiply Accumulate Operand None.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Capability:
SubgroupMatrixMultiplyAccumulateINTEL

7 + variable	6237	<id> Result Type	<id> Result	<id> K Dim	<id> Matrix A	<id> Matrix B	<id> Matrix C	Optional Matrix Multiply Accumulate Operands
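
The word layout above can be sketched as a small encoder. It relies on the general SPIR-V rule that the first word of an instruction holds the word count in its high-order 16 bits and the opcode in its low-order 16 bits. The function name and the id values used in the example are hypothetical.

```python
OP_SUBGROUP_MATRIX_MULTIPLY_ACCUMULATE_INTEL = 6237


def encode_instruction(result_type, result, k_dim, matrix_a, matrix_b,
                       matrix_c, operands=None):
    """Return the instruction as a list of 32-bit words (illustrative only)."""
    words = [result_type, result, k_dim, matrix_a, matrix_b, matrix_c]
    if operands is not None:
        # The optional literal adds one word to the instruction.
        words.append(operands)
    word_count = 1 + len(words)  # the opcode word itself is counted
    first_word = (word_count << 16) | OP_SUBGROUP_MATRIX_MULTIPLY_ACCUMULATE_INTEL
    return [first_word] + words


# Hypothetical ids 1..6; the first call omits the optional operands.
without_operands = encode_instruction(1, 2, 3, 4, 5, 6)
with_operands = encode_instruction(1, 2, 3, 4, 5, 6, operands=0x3)
```

Without the optional literal the instruction occupies seven words; with it, eight.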

Mapping Matrix Data to Invocations

This section describes how each invocation passes its contribution to the per-subgroup Matrix A, Matrix B, and Matrix C operands, and how the per-subgroup Result matrix is assigned to each invocation.

Recall that the M dimension of the matrix multiply accumulate operation is defined by the number of components in the Result Type, the N dimension is defined implicitly by the number of invocations in the subgroup, and the K dimension is defined by the K Dim operand.

The Matrix A operand has M rows and K columns. The K columns of data are passed by the N invocations in the subgroup, with the lower-numbered invocations (invocations where SubgroupLocalInvocationId is a smaller value) passing the lower-numbered columns. This is the only matrix operand that is not dependent on N, the number of invocations in the subgroup, so there are three scenarios to consider:

- If N is equal to K, then each invocation contributes a single column of Matrix A. The complete contribution for an invocation is passed as a vector of M values, with each component representing one row of the contribution, and the lower-numbered components representing the lower-numbered rows.
- If N is less than K, then each invocation contributes multiple columns of Matrix A. The multiple columns are packed into a single value per row, with data from the lower-numbered columns in the lower bits of the value. The complete contribution for an invocation is passed as a vector of M values, with each component representing one row of the contribution, and the lower-numbered components representing the lower-numbered rows.
- If N is greater than K, then each invocation contributes a fraction of the rows representing a single column of Matrix A. The complete contribution for an invocation is passed as a vector of values, with each component representing one row of the contribution, and the lower-numbered components representing the lower-numbered rows. For example, if N is twice as big as K, then each invocation contributes every other row of a single column of Matrix A, with the lower-numbered invocations contributing the first row and higher-numbered invocations contributing the second row. If the data from the higher-numbered invocations is not needed, such as when there is only one row M, then the data passed by the higher-numbered invocations is ignored.
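
The three Matrix A scenarios can be sketched as a function that splits a row-major matrix into per-invocation contributions. This is an illustrative model under stated assumptions: the function name and the element_bits parameter are hypothetical, K and N are assumed to divide one another evenly, and the N greater than K branch generalizes the "every other row" pattern described for N equal to twice K. Invocations whose data is ignored are modeled as empty lists.

```python
def matrix_a_contributions(a, n, element_bits=16):
    """Return one list of values per invocation for an M x K matrix `a`."""
    m, k = len(a), len(a[0])
    if n == k:
        # One column per invocation; component i holds row i.
        return [[a[i][j] for i in range(m)] for j in range(n)]
    if n < k:
        assert k % n == 0, "sketch assumes N divides K"
        per = k // n  # columns packed into each value
        mask = (1 << element_bits) - 1
        return [
            [
                # Lower-numbered columns go in the lower bits.
                sum((a[i][j * per + t] & mask) << (t * element_bits)
                    for t in range(per))
                for i in range(m)
            ]
            for j in range(n)
        ]
    assert n % k == 0, "sketch assumes K divides N"
    stride = n // k  # invocations sharing one column
    return [[a[i][j % k] for i in range(j // k, m, stride)] for j in range(n)]


# Element value r*16 + c encodes "row r, column c" for readability.
equal = matrix_a_contributions([[r * 16 + c for c in range(4)] for r in range(2)], 4)
packed = matrix_a_contributions([[r * 16 + c for c in range(8)] for r in range(2)], 4)
split = matrix_a_contributions([[r * 16 + c for c in range(2)] for r in range(4)], 4)
single_row = matrix_a_contributions([[7, 9]], 4)
```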

The Matrix B operand has K rows and N columns. Each invocation in the subgroup contributes a single column of Matrix B, with the lower-numbered invocations (invocations where SubgroupLocalInvocationId is a smaller value) passing the lower-numbered columns. For matrix elements that are larger than 16 bits, the column of Matrix B is passed as a vector, with the lower-numbered rows of Matrix B passed as the lower-numbered components of the vector. For matrix elements that are 16 bits or smaller, the column of Matrix B is passed as a vector of 32-bit values, with the lower-numbered rows of Matrix B passed as the lower-numbered components of the vector, and lower-numbered rows passed as the lower-numbered bits of each value. This is sometimes referred to as a packed or VNNI layout.
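
The Matrix B layout can be sketched in the same style. The function name is hypothetical, and the sketch assumes that K is a multiple of the number of elements that fit in one 32-bit value.

```python
def matrix_b_contributions(b, element_bits):
    """Return one list of values per invocation for a K x N matrix `b`."""
    k, n = len(b), len(b[0])
    if element_bits > 16:
        # Elements larger than 16 bits are passed unpacked, one row per component.
        return [[b[row][col] for row in range(k)] for col in range(n)]
    per_word = 32 // element_bits  # rows packed into each 32-bit value
    assert k % per_word == 0, "sketch assumes K is a multiple of the packing factor"
    mask = (1 << element_bits) - 1
    return [
        [
            # Lower-numbered rows go in the lower bits of each 32-bit value.
            sum((b[base + t][col] & mask) << (t * element_bits)
                for t in range(per_word))
            for base in range(0, k, per_word)
        ]
        for col in range(n)
    ]


# Element value r*16 + c encodes "row r, column c" for readability.
b8 = matrix_b_contributions([[r * 16 + c for c in range(4)] for r in range(8)], 8)
b16 = matrix_b_contributions([[r * 16 + c for c in range(4)] for r in range(4)], 16)
b32 = matrix_b_contributions([[r * 16 + c for c in range(4)] for r in range(2)], 32)
```

With 8-bit data and K equal to eight, each invocation passes two 32-bit values holding four rows each; with 16-bit data and K equal to four, two 32-bit values holding two rows each.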

The Matrix C operand has M rows and N columns. Each invocation in the subgroup contributes a single column of Matrix C, with the lower-numbered invocations passing the lower-numbered columns. The column of Matrix C is passed as a vector of values, with the lower-numbered components representing lower-numbered rows.

The Result matrix has M rows and N columns, and is assigned to invocations the same as the Matrix C operand. Each invocation of the subgroup is assigned a single column of the Result matrix, with the lower-numbered invocations assigned the lower-numbered columns. The column of the Result matrix is returned as a vector of values, with the lower-numbered components representing lower-numbered rows.
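
Because Matrix C and the Result matrix share the same column-per-invocation layout, distributing and reassembling them is a transpose of the row-major view. The helper names below are hypothetical and only model the assignment of columns to invocations.

```python
def columns_to_invocations(matrix):
    """Split an M x N matrix so that invocation j holds column j."""
    m, n = len(matrix), len(matrix[0])
    return [[matrix[row][col] for row in range(m)] for col in range(n)]


def invocations_to_matrix(per_invocation):
    """Reassemble an M x N matrix from one column vector per invocation."""
    n, m = len(per_invocation), len(per_invocation[0])
    return [[per_invocation[col][row] for col in range(n)] for row in range(m)]


# M = 2, N = 4
c = [[1, 2, 3, 4], [5, 6, 7, 8]]
per_invocation = columns_to_invocations(c)
round_trip = invocations_to_matrix(per_invocation)
```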

Examples

Passing Matrix A with M equal to two, K equal to four, and N equal to four:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

This is the first Matrix A case, where N is equal to K. Therefore, each invocation contributes a vector representing a single column of Matrix A:
- Invocation 0 passes a vector consisting of the two values 0,0 and 1,0.
- Invocation 1 passes a vector consisting of the two values 0,1 and 1,1.
- Invocation 2 passes a vector consisting of the two values 0,2 and 1,2.
- Invocation 3 passes a vector consisting of the two values 0,3 and 1,3.

Passing Matrix A with M equal to two, K equal to eight, and N equal to four:
0,0	0,1	0,2	0,3	0,4	0,5	0,6	0,7
1,0	1,1	1,2	1,3	1,4	1,5	1,6	1,7

This is the second Matrix A case, where N is less than K. In this case, K is twice N. Therefore, each invocation contributes a vector representing two columns of Matrix A, with the first column in the lower bits of each vector component and the second column in the upper bits:
- Invocation 0 passes a vector consisting of the two values 0,1 | 0,0 and 1,1 | 1,0.
- Invocation 1 passes a vector consisting of the two values 0,3 | 0,2 and 1,3 | 1,2.
- Invocation 2 passes a vector consisting of the two values 0,5 | 0,4 and 1,5 | 1,4.
- Invocation 3 passes a vector consisting of the two values 0,7 | 0,6 and 1,7 | 1,6.

Passing Matrix A with M equal to four, K equal to two, and N equal to four:
0,0	0,1
1,0	1,1
2,0	2,1
3,0	3,1

This is the third Matrix A case, where N is greater than K. In this case, N is twice K. Therefore, each invocation contributes a vector representing every other row of a single column of Matrix A:
- Invocation 0 passes a vector consisting of the two values 0,0 and 2,0.
- Invocation 1 passes a vector consisting of the two values 0,1 and 2,1.
- Invocation 2 passes a vector consisting of the two values 1,0 and 3,0.
- Invocation 3 passes a vector consisting of the two values 1,1 and 3,1.

Passing Matrix A with M equal to one, K equal to two, and N equal to four:
0,0	0,1

This is a special case of the third Matrix A case. In this case, N is greater than K, and there is only one row. Therefore, the data contributed by some invocations is ignored:
- Invocation 0 passes the value 0,0.
- Invocation 1 passes the value 0,1.
- The data from Invocation 2 and Invocation 3 is ignored.

Passing Matrix B with K equal to eight, N equal to four, and 8-bit data:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3
2,0	2,1	2,2	2,3
3,0	3,1	3,2	3,3
4,0	4,1	4,2	4,3
5,0	5,1	5,2	5,3
6,0	6,1	6,2	6,3
7,0	7,1	7,2	7,3

Each invocation contributes a vector of two 32-bit values, where each value consists of 8-bit data from four rows of the invocation’s column data:
- Invocation 0 passes a vector consisting of the two 32-bit values 3,0 | 2,0 | 1,0 | 0,0 and 7,0 | 6,0 | 5,0 | 4,0.
- Invocation 1 passes a vector consisting of the two 32-bit values 3,1 | 2,1 | 1,1 | 0,1 and 7,1 | 6,1 | 5,1 | 4,1.
- Invocation 2 passes a vector consisting of the two 32-bit values 3,2 | 2,2 | 1,2 | 0,2 and 7,2 | 6,2 | 5,2 | 4,2.
- Invocation 3 passes a vector consisting of the two 32-bit values 3,3 | 2,3 | 1,3 | 0,3 and 7,3 | 6,3 | 5,3 | 4,3.

Passing Matrix B with K equal to four, N equal to four, and 16-bit data:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3
2,0	2,1	2,2	2,3
3,0	3,1	3,2	3,3

Each invocation contributes a vector of two 32-bit values, where each value consists of 16-bit data from two rows of the invocation’s column data:
- Invocation 0 passes a vector consisting of the two 32-bit values 1,0 | 0,0 and 3,0 | 2,0.
- Invocation 1 passes a vector consisting of the two 32-bit values 1,1 | 0,1 and 3,1 | 2,1.
- Invocation 2 passes a vector consisting of the two 32-bit values 1,2 | 0,2 and 3,2 | 2,2.
- Invocation 3 passes a vector consisting of the two 32-bit values 1,3 | 0,3 and 3,3 | 2,3.

Passing Matrix B with K equal to two, N equal to four, and 32-bit data:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

Each invocation contributes a vector of two 32-bit values representing the rows of the invocation’s column data:
- Invocation 0 passes a vector consisting of the two 32-bit values 0,0 and 1,0.
- Invocation 1 passes a vector consisting of the two 32-bit values 0,1 and 1,1.
- Invocation 2 passes a vector consisting of the two 32-bit values 0,2 and 1,2.
- Invocation 3 passes a vector consisting of the two 32-bit values 0,3 and 1,3.

Passing Matrix C with M equal to two and N equal to four:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

Each invocation contributes a vector of values representing the rows of the invocation’s column data:
- Invocation 0 passes a vector consisting of the two values 0,0 and 1,0.
- Invocation 1 passes a vector consisting of the two values 0,1 and 1,1.
- Invocation 2 passes a vector consisting of the two values 0,2 and 1,2.
- Invocation 3 passes a vector consisting of the two values 0,3 and 1,3.

Assigning Result with M equal to two and N equal to four:
0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

Each invocation is assigned a vector of values representing the rows of the invocation’s column data:
- Invocation 0 is assigned a vector consisting of the two values 0,0 and 1,0.
- Invocation 1 is assigned a vector consisting of the two values 0,1 and 1,1.
- Invocation 2 is assigned a vector consisting of the two values 0,2 and 1,2.
- Invocation 3 is assigned a vector consisting of the two values 0,3 and 1,3.

Issues

What should this extension be called?

RESOLVED: The name of the extension will be SPV_INTEL_subgroup_matrix_multiply_accumulate, which aligns with the name of the related OpenCL extension cl_intel_subgroup_matrix_multiply_accumulate.

Do we need bits to indicate whether integer matrix C or result matrices are signed vs. unsigned?

RESOLVED: No, we do not currently have a use-case for unsigned interpretations for the matrix C and result matrices, even though in theory there could be a use-case in the future.

Additionally, because we do not support saturating accumulation, the same operation should work for both signed and unsigned integers due to the behavior of two's-complement arithmetic. Note that there are not separate instructions for signed vs. unsigned integer addition, for example, without integer saturation.

Should the default interpretation be signed or unsigned?

RESOLVED: The default interpretation is signed. If needed, we could add interpretations for interpreting matrix C or the result matrix as unsigned in an updated version of this extension or in a layered extension.

Note, though, that the similar type interpretations for cooperative matrices indicate that a matrix is signed; said another way, the default interpretation for cooperative matrices is unsigned.

Do we need different capabilities to gate each of the type interpretations?

RESOLVED: No, we do not need different capabilities to gate each of the type interpretations, and it is OK to only have the top-level SubgroupMatrixMultiplyAccumulateINTEL capability.

It will always be undefined behavior to use an unsupported matrix dimension or type, therefore adding additional capabilities for each type interpretation is not necessary.

Revision History

Rev	Date	Author	Changes
1	2025-01-07	Ben Ashbaugh	Initial revision for publication
