SPV_INTEL_2d_block_io

Name Strings

SPV_INTEL_2d_block_io

Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm

Contributors

Ben Ashbaugh, Intel
Pekka Jääskeläinen, Intel
Victor Mustya, Intel
Yury Plyakhin, Intel

Notice

Status

Complete

Version

Last Modified Date

2025-02-28

Revision

2

Dependencies

This extension is written against the SPIR-V Specification, Version 1.6, Revision 4.

This extension requires SPIR-V 1.0.

This extension interacts with the SPV_KHR_untyped_pointers extension, by accepting untyped pointers as pointer operands.

This extension interacts with the SPV_INTEL_cache_controls extension, by supporting cache control decorations on the pointer operands.

Overview

This extension adds subgroup block load and store instructions that read two-dimensional blocks of data from a two-dimensional region of memory, or write two-dimensional blocks of data to a two-dimensional region of memory. This is an important operation for many machine learning algorithms, which operate on two-dimensional matrix data as part of a matrix multiplication algorithm.

The block sizes that are supported are device-specific. A companion client API specification will describe the block sizes that are supported for a device.

This extension additionally adds support for two pre-processing operations that may be performed when loading a two-dimensional block of data:

The two-dimensional block may be transposed after loading and before it is written to the instruction’s destination.
The two-dimensional block may be transformed after loading and before it is written to the instruction’s destination. The transform operation converts the two-dimensional block from a row-major layout to a packed layout by combining data elements from multiple block rows into 32-bit values. This layout is used by some matrix multiplication instructions.

Extension Name

To use this extension within a SPIR-V module, the appropriate OpExtension must be present in the module:

OpExtension "SPV_INTEL_2d_block_io"

Modifications to the SPIR-V Specification, Version 1.6

Capabilities

Modify Section 3.31, Capability, adding rows to the Capability table:

Capability	Implicitly Declares
6228	Subgroup2DBlockIOINTEL
6229	Subgroup2DBlockTransformINTEL	Subgroup2DBlockIOINTEL
6230	Subgroup2DBlockTransposeINTEL	Subgroup2DBlockIOINTEL
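
For tooling that consumes these values, the capability table can be mirrored as a small enumeration. This is only a sketch: the `Capability` enum and the `implicitlyDeclares` helper are hypothetical names chosen for illustration and are not taken from any SPIR-V header. The numeric values are the ones listed in the table above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Capability enumerants, with the values from the Capability table.
enum class Capability : std::uint32_t {
    Subgroup2DBlockIOINTEL        = 6228,
    Subgroup2DBlockTransformINTEL = 6229,
    Subgroup2DBlockTransposeINTEL = 6230,
};

// Hypothetical helper: returns the capabilities implicitly declared by `cap`.
inline std::vector<Capability> implicitlyDeclares(Capability cap) {
    switch (cap) {
    case Capability::Subgroup2DBlockTransformINTEL:
    case Capability::Subgroup2DBlockTransposeINTEL:
        return {Capability::Subgroup2DBlockIOINTEL};
    default:
        return {};  // Subgroup2DBlockIOINTEL declares nothing implicitly
    }
}
```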

Instructions

Modify Section 3.42.21, Group Instructions, adding to the end of the list of instructions:

OpSubgroup2DBlockLoadINTEL

Loads one or more 2D blocks of data from a 2D row-major region of memory. The 2D blocks of data are loaded collectively, as a subgroup operation.

The Element Size operand specifies the size of one block element, in bytes. The Block Width, Block Height, and Block Count operands specify the total number of elements to load. These operands must be constant instructions with scalar 32-bit integer type.

The Block Width specifies the number of elements in each block row. The Block Height specifies the number of rows in each block. The Block Count specifies the number of blocks to load. If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.

Src Base Pointer is a pointer to the base of the 2D region of memory to load from. It must be a pointer to the CrossWorkgroup storage class.

The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from. These operands must be integer type scalars.

The Memory Width specifies the width of the 2D region of memory, in bytes. The Memory Height specifies the number of rows in the 2D region of memory. The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.

The Coordinate operand specifies the starting location in the 2D region of memory to load from. It must be a vector of two integer type components.

The first component of Coordinate specifies the number of elements to skip, from the start of a row. The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.

Dst Pointer is a pointer to per-invocation storage that will hold the results of the 2D block load. It must be a pointer to the Function storage class.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.

Follows the templated function:

template <typename T, int ElementSize,
          int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    T* dstPointer);

Capability:
Subgroup2DBlockIOINTEL

11	6231	<id> Element Size	<id> Block Width	<id> Block Height	<id> Block Count	<id> Src Base Pointer	<id> Memory Width	<id> Memory Height	<id> Memory Pitch	<id> Coordinate	<id> Dst Pointer

OpSubgroup2DBlockLoadTransposeINTEL

Loads and transposes one or more 2D blocks of data from a 2D row-major region of memory. The 2D blocks of data are loaded collectively, as a subgroup operation.

The Element Size operand specifies the size of one block element, in bytes. The Block Width, Block Height, and Block Count operands specify the total number of elements to load. These operands must be constant instructions with scalar 32-bit integer type.

The Block Width specifies the number of elements in each block row, pre-transpose. The Block Height specifies the number of rows in each block, pre-transpose. The Block Count specifies the number of blocks to load. If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.

Src Base Pointer is a pointer to the base of the 2D region of memory to load from. It must be a pointer to the CrossWorkgroup storage class.

The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from. These operands must be integer type scalars.

The Memory Width specifies the width of the 2D region of memory, in bytes. The Memory Height specifies the number of rows in the 2D region of memory. The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.

The Coordinate operand specifies the starting location in the 2D region of memory to load from. It must be a vector of two integer type components.

The first component of Coordinate specifies the number of elements to skip, from the start of a row. The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.

Dst Pointer is a pointer to per-invocation storage that will hold the results of the transposed 2D block load. It must be a pointer to the Function storage class.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.

Follows the templated function:

template <typename T, int ElementSize,
          int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadTransposeINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    T* dstPointer);

Capability:
Subgroup2DBlockTransposeINTEL

11	6233	<id> Element Size	<id> Block Width	<id> Block Height	<id> Block Count	<id> Src Base Pointer	<id> Memory Width	<id> Memory Height	<id> Memory Pitch	<id> Coordinate	<id> Dst Pointer

OpSubgroup2DBlockLoadTransformINTEL

Loads and transforms one or more 2D blocks of data into a packed format from a 2D row-major region of memory. The transformation combines elements from multiple rows of the 2D region into packed 32-bit values. The 2D blocks of data are loaded and transformed collectively, as a subgroup operation.

The Element Size operand specifies the size of one block element, in bytes. The Block Width, Block Height, and Block Count operands specify the total number of elements to load. These operands must be constant instructions with scalar 32-bit integer type.

The Block Width specifies the number of elements in each block row. The Block Height specifies the number of rows in each block. The Block Count specifies the number of blocks to load. If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.

Src Base Pointer is a pointer to the base of the 2D region of memory to load from. It must be a pointer to the CrossWorkgroup storage class.

The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from. These operands must be integer type scalars.

The Memory Width specifies the width of the 2D region of memory, in bytes. The Memory Height specifies the number of rows in the 2D region of memory. The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.

The Coordinate operand specifies the starting location in the 2D region of memory to load from. It must be a vector of two integer type components.

The first component of Coordinate specifies the number of elements to skip, from the start of a row. The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.

Dst Pointer is a pointer to per-invocation storage that will hold the results of the transformed 2D block load. It must be a pointer to the Function storage class. If it is an OpTypePointer pointer, it must point to a scalar 32-bit integer type.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.

Follows the templated function:

template <typename T, int ElementSize,
          int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadTransformINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    uint* dstPointer);

Capability:
Subgroup2DBlockTransformINTEL

11	6232	<id> Element Size	<id> Block Width	<id> Block Height	<id> Block Count	<id> Src Base Pointer	<id> Memory Width	<id> Memory Height	<id> Memory Pitch	<id> Coordinate	<id> Dst Pointer

OpSubgroup2DBlockPrefetchINTEL

Prefetches one or more blocks of data from a 2D row-major region of memory into a cache. Prefetching does not affect the functionality of a module but may change its performance characteristics. The 2D blocks of data are prefetched collectively, as a subgroup operation.

The Element Size operand specifies the size of one block element, in bytes. The Block Width, Block Height, and Block Count operands specify the total number of elements to prefetch. These operands must be constant instructions with scalar 32-bit integer type.

The Block Width specifies the number of elements in each block row. The Block Height specifies the number of rows in each block. The Block Count specifies the number of blocks to prefetch. If Block Count is greater than one, the blocks are prefetched in row-major order, with the next block beginning immediately after the previous block.

Src Base Pointer is a pointer to the base of the 2D region of memory to prefetch from. It must be a pointer to the CrossWorkgroup storage class.

The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to prefetch. These operands must be integer type scalars.

The Memory Width specifies the width of the 2D region of memory, in bytes. The Memory Height specifies the number of rows in the 2D region of memory. The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.

The Coordinate operand specifies the starting location in the 2D region of memory to prefetch from. It must be a vector of two integer type components.

The first component of Coordinate specifies the number of elements to skip, from the start of a row. The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.

Follows the templated function:

template <typename T, int ElementSize,
          int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockPrefetchINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate);

Capability:
Subgroup2DBlockIOINTEL

10	6234	<id> Element Size	<id> Block Width	<id> Block Height	<id> Block Count	<id> Src Base Pointer	<id> Memory Width	<id> Memory Height	<id> Memory Pitch	<id> Coordinate

OpSubgroup2DBlockStoreINTEL

Stores one or more 2D blocks of data to a 2D region of memory. The 2D blocks of data are stored collectively, as a subgroup operation.

The Element Size operand specifies the size of one block element, in bytes. The Block Width, Block Height, and Block Count operands specify the total number of elements to store. These operands must be constant instructions with scalar 32-bit integer type.

The Block Width specifies the number of elements in each block row. The Block Height specifies the number of rows in each block. The Block Count specifies the number of blocks to store. If Block Count is greater than one, the blocks are stored in row-major order, with the next block beginning immediately after the previous block.

Src Pointer is a pointer to per-invocation storage that holds the data to store. It must be a pointer to the Function storage class.

Dst Base Pointer is a pointer to the base of the 2D region of memory to store to. It must be a pointer to the CrossWorkgroup storage class.

The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to store to. These operands must be integer type scalars.

The Memory Width specifies the width of the 2D region of memory, in bytes. The Memory Height specifies the number of rows in the 2D region of memory. The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.

The Coordinate operand specifies the starting location in the 2D region of memory to store to. It must be a vector of two integer type components.

The first component of Coordinate specifies the number of elements to skip, from the start of a row. The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.

Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.

Behavior is undefined unless Block Width, Block Height, Block Count, Dst Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.

Follows the templated function:

template <typename T, int ElementSize,
          int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockStoreINTEL(
    const T* srcPointer,
    T* dstBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate);

Capability:
Subgroup2DBlockIOINTEL

11	6235	<id> Element Size	<id> Block Width	<id> Block Height	<id> Block Count	<id> Src Pointer	<id> Dst Base Pointer	<id> Memory Width	<id> Memory Height	<id> Memory Pitch	<id> Coordinate
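
The opcodes and word counts in the instruction tables above can be collected in one place. The sketch below is illustrative only: the `Op` enum and helper names are hypothetical. It relies on the standard SPIR-V instruction layout, in which the first word of an instruction holds the word count in its high 16 bits and the opcode in its low 16 bits.

```cpp
#include <cassert>
#include <cstdint>

// Opcodes as listed in the instruction tables of this extension.
enum class Op : std::uint16_t {
    Subgroup2DBlockLoadINTEL          = 6231,
    Subgroup2DBlockLoadTransformINTEL = 6232,
    Subgroup2DBlockLoadTransposeINTEL = 6233,
    Subgroup2DBlockPrefetchINTEL      = 6234,
    Subgroup2DBlockStoreINTEL         = 6235,
};

// Word count: one word for the opcode word plus one word per <id> operand.
// Prefetch has nine operands (no Dst Pointer); the others have ten.
constexpr std::uint16_t wordCount(Op op) {
    return op == Op::Subgroup2DBlockPrefetchINTEL ? 10 : 11;
}

// First word of the encoded instruction: (word count << 16) | opcode.
constexpr std::uint32_t firstWord(Op op) {
    return (static_cast<std::uint32_t>(wordCount(op)) << 16) |
           static_cast<std::uint32_t>(op);
}
```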

Diagram

The diagram below shows the meaning of the 2D block load and store operands.
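
The operand meanings can also be expressed arithmetically. The following sketch computes the byte offset of a single block element from the base pointer, using Memory Pitch for the row stride and Element Size for the column stride. The `Region2D` struct and `elementByteOffset` function are hypothetical names introduced for illustration, and the sketch covers a single block only (Block Count equal to one).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side description of the 2D memory region operands.
struct Region2D {
    std::int64_t memoryWidth;   // width of the region, in bytes
    std::int64_t memoryHeight;  // number of rows in the region
    std::int64_t memoryPitch;   // bytes between the starts of consecutive rows
};

// Byte offset, relative to the base pointer, of block element
// (blockRow, blockCol) for a block that starts at Coordinate (coordX, coordY).
inline std::int64_t elementByteOffset(const Region2D& r, int elementSize,
                                      std::int64_t coordX, std::int64_t coordY,
                                      int blockRow, int blockCol) {
    const std::int64_t row = coordY + blockRow;  // rows skipped from the base
    const std::int64_t col = coordX + blockCol;  // elements skipped in the row
    return row * r.memoryPitch + col * elementSize;
}
```

Note that the first component of Coordinate counts elements while Memory Width and Memory Pitch count bytes, which is why only the column term is scaled by the element size.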

Mapping Block Data to Invocations

This section describes the mapping between the 2D block of data that is loaded or stored and the invocations in the subgroup.

First, the Block Width and Block Height are padded, if necessary. For OpSubgroup2DBlockLoadINTEL, OpSubgroup2DBlockLoadTransformINTEL, and OpSubgroup2DBlockStoreINTEL, the Block Width is padded to the next power-of-two. For OpSubgroup2DBlockLoadTransposeINTEL, the Block Height is padded to the next power-of-two. For OpSubgroup2DBlockLoadTransformINTEL, the Block Height is padded to a multiple of four for 1-byte elements, and a multiple of two for 2-byte elements. For loads, the value of any padded elements is zero. For stores, the value of any padded elements is ignored.
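
The padding rules above can be sketched as follows. The `BlockOp` enum, `PaddedBlock` struct, and function names are hypothetical and exist only for this illustration; prefetch is omitted because the paragraph above does not describe padding for it.

```cpp
#include <cassert>

enum class BlockOp { Load, LoadTranspose, LoadTransform, Store };

struct PaddedBlock {
    int width;   // padded Block Width, in elements
    int height;  // padded Block Height, in rows
};

// Smallest power of two that is greater than or equal to v (v >= 1).
inline int nextPowerOfTwo(int v) {
    int p = 1;
    while (p < v) p *= 2;
    return p;
}

// Rounds v up to the nearest multiple of m.
inline int roundUpToMultiple(int v, int m) {
    return (v + m - 1) / m * m;
}

inline PaddedBlock padBlock(BlockOp op, int elementSize,
                            int blockWidth, int blockHeight) {
    PaddedBlock p{blockWidth, blockHeight};
    if (op == BlockOp::LoadTranspose) {
        p.height = nextPowerOfTwo(blockHeight);  // transpose pads the height
    } else {
        p.width = nextPowerOfTwo(blockWidth);    // load, transform, store pad the width
    }
    if (op == BlockOp::LoadTransform) {
        // Rows are combined into 32-bit values, so pad to whole groups of rows.
        if (elementSize == 1) p.height = roundUpToMultiple(blockHeight, 4);
        if (elementSize == 2) p.height = roundUpToMultiple(blockHeight, 2);
    }
    return p;
}
```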

For OpSubgroup2DBlockLoadTransformINTEL, the loaded block data is then transformed, by combining elements from multiple rows of a single column of the 2D region and packing them into 32-bit values. For 2-byte elements, every two rows are combined into a 32-bit value, with the lower-numbered rows in the lower bits and the higher-numbered rows in the higher bits. For 1-byte elements, every four rows are combined into a 32-bit value, with the lower-numbered rows in the lower bits and the higher-numbered rows in the higher bits. This packed layout is sometimes referred to as a VNNI layout.
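
As an illustration of the packing described above, the sketch below transforms a row-major block held in host memory. It is a reference-style model, not a description of how any device implements the instruction. The function name `packRowsTo32` is hypothetical, and the block is assumed to be already padded so that its height is a multiple of the number of rows combined per value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs a row-major block of 1-byte or 2-byte unsigned elements into 32-bit
// values. Each output value combines elements from consecutive rows of one
// column, with lower-numbered rows in the lower bits.
template <typename T>
std::vector<std::uint32_t> packRowsTo32(const std::vector<T>& block,
                                        int width, int height) {
    static_assert(sizeof(T) == 1 || sizeof(T) == 2,
                  "only 1-byte and 2-byte elements are packed");
    const int rowsPerValue = 4 / static_cast<int>(sizeof(T));
    const int bitsPerElement = 8 * static_cast<int>(sizeof(T));
    const int packedRows = height / rowsPerValue;

    std::vector<std::uint32_t> packed(
        static_cast<std::size_t>(width) * packedRows, 0u);
    for (int r = 0; r < packedRows; ++r) {
        for (int c = 0; c < width; ++c) {
            for (int k = 0; k < rowsPerValue; ++k) {
                const std::uint32_t element =
                    block[(r * rowsPerValue + k) * width + c];
                packed[r * width + c] |= element << (k * bitsPerElement);
            }
        }
    }
    return packed;
}
```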

For OpSubgroup2DBlockLoadTransposeINTEL, the loaded block data is then transposed, by assigning the first column of the 2D block to the first row of the transposed 2D block, and so on.
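
The transposition step can be modeled on the host with a short routine. This is a sketch under the assumption that the block is stored row-major in a flat array; `transposeBlock` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Transposes a row-major block with `height` rows and `width` columns.
// The result has `width` rows and `height` columns: column c of the input
// becomes row c of the output.
template <typename T>
std::vector<T> transposeBlock(const std::vector<T>& block,
                              int width, int height) {
    std::vector<T> out(block.size());
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            out[c * height + r] = block[r * width + c];
        }
    }
    return out;
}
```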

Next, the rows of the 2D block are assigned to invocations in the subgroup. Because the padded block width and the subgroup size are both powers of two, there are three scenarios to consider:

If the padded block width is equal to the subgroup size, each invocation is assigned one element of the block row.
If the padded block width is less than the subgroup size, multiple rows are assigned to the subgroup. The first row is assigned to the first set of invocations, then the next row is assigned to the next set of invocations, and so on.
If the padded block width is greater than the subgroup size, multiple elements of each block row are assigned to each invocation. The first set of elements is assigned to the first invocation, then the next set of elements is assigned to the next invocation, and so on.

In all cases, the lower numbered columns are assigned to the lower numbered invocations.
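
The three scenarios can be summarized by a function that maps a block element to the invocation that owns it. This is a sketch, not normative text: `invocationFor` is a hypothetical name, both sizes are assumed to be powers of two, and the wrap-around of rows in the second scenario is inferred from the worked examples in this section.

```cpp
#include <cassert>

// Returns the subgroup invocation that is assigned the element at
// (row, col) of the (already padded, transposed, or transformed) block.
inline int invocationFor(int row, int col, int paddedWidth, int subgroupSize) {
    if (paddedWidth == subgroupSize) {
        // One element of each block row per invocation.
        return col;
    }
    if (paddedWidth < subgroupSize) {
        // Several rows share the subgroup; later rows wrap around.
        const int rowsPerPass = subgroupSize / paddedWidth;
        return (row % rowsPerPass) * paddedWidth + col;
    }
    // Several consecutive elements of each row per invocation.
    const int elementsPerInvocation = paddedWidth / subgroupSize;
    return col / elementsPerInvocation;
}
```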

Examples

Loading a two row by four column block of elements (Block Width equals four, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:

0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

This is the case where the padded block width is equal to the subgroup size. In this case, each invocation is assigned one element of the block row. Therefore, because there are two rows:
- Invocation 0 is assigned the values 0,0 and 1,0.
- Invocation 1 is assigned the values 0,1 and 1,1.
- Invocation 2 is assigned the values 0,2 and 1,2.
- Invocation 3 is assigned the values 0,3 and 1,3.
Loading a four row by two column block of elements (Block Width equals two, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:

0,0	0,1
1,0	1,1
2,0	2,1
3,0	3,1

This is the case where the padded block width is less than the subgroup size. In this case, the first row is assigned to Invocation 0 and Invocation 1, and the second row is assigned to Invocation 2 and Invocation 3, and so on. Therefore:
- Invocation 0 is assigned the values 0,0 and 2,0.
- Invocation 1 is assigned the values 0,1 and 2,1.
- Invocation 2 is assigned the values 1,0 and 3,0.
- Invocation 3 is assigned the values 1,1 and 3,1.
Loading a two row by eight column block of elements (Block Width equals eight, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:

0,0	0,1	0,2	0,3	0,4	0,5	0,6	0,7
1,0	1,1	1,2	1,3	1,4	1,5	1,6	1,7

This is the case where the padded block width is greater than the subgroup size. In this case, the first set of elements of each block row is assigned to Invocation 0, the next set of elements is assigned to Invocation 1, and so on. Therefore:
- Invocation 0 is assigned the values 0,0, 0,1, 1,0, and 1,1.
- Invocation 1 is assigned the values 0,2, 0,3, 1,2, and 1,3.
- Invocation 2 is assigned the values 0,4, 0,5, 1,4, and 1,5.
- Invocation 3 is assigned the values 0,6, 0,7, 1,6, and 1,7.
Loading a four row by two column block of elements (Block Width equals two, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadTransposeINTEL:
Block data (pre-transpose):

0,0	0,1
1,0	1,1
2,0	2,1
3,0	3,1

After transposition, this is the same as the first example, so:
- Invocation 0 is assigned the values 0,0 and 0,1.
- Invocation 1 is assigned the values 1,0 and 1,1.
- Invocation 2 is assigned the values 2,0 and 2,1.
- Invocation 3 is assigned the values 3,0 and 3,1.
Loading a two row by four column block of two-byte elements (Block Width equals four, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadTransformINTEL:
Block data:

0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3

For two-byte elements, the transform operation combines every two rows together to form a 32-bit value. Therefore:
- Invocation 0 is assigned the 32-bit value 1,0 | 0,0.
- Invocation 1 is assigned the 32-bit value 1,1 | 0,1.
- Invocation 2 is assigned the 32-bit value 1,2 | 0,2.
- Invocation 3 is assigned the 32-bit value 1,3 | 0,3.
Loading a four row by four column block of one-byte elements (Block Width equals four, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadTransformINTEL:
Block data:

0,0	0,1	0,2	0,3
1,0	1,1	1,2	1,3
2,0	2,1	2,2	2,3
3,0	3,1	3,2	3,3

For one-byte elements, the transform operation combines every four rows together to form a 32-bit value. Therefore:
- Invocation 0 is assigned the 32-bit value 3,0 | 2,0 | 1,0 | 0,0.
- Invocation 1 is assigned the 32-bit value 3,1 | 2,1 | 1,1 | 0,1.
- Invocation 2 is assigned the 32-bit value 3,2 | 2,2 | 1,2 | 0,2.
- Invocation 3 is assigned the 32-bit value 3,3 | 2,3 | 1,3 | 0,3.

Out-of-Bounds Behavior

If some or all of the 2D block is out-of-bounds, where the bounds are defined by the Memory Width and Memory Height, the behavior is as follows:

For loads, any out-of-bounds elements are assigned the value zero.
For prefetches and stores, any out-of-bounds elements are ignored.
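
A host-side model of the load behavior might look like the sketch below. The names `inBounds` and `loadOrZero` are hypothetical. The sketch also makes two assumptions that the text above does not spell out: an element counts as in-bounds only if all of its bytes lie within the Memory Width, and negative coordinates are treated as out-of-bounds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if the element at (col, row), measured from the region base, lies
// entirely inside the bounds given by Memory Width (bytes) and Memory Height.
inline bool inBounds(std::int64_t memoryWidth, std::int64_t memoryHeight,
                     int elementSize, std::int64_t col, std::int64_t row) {
    return row >= 0 && row < memoryHeight &&
           col >= 0 && (col + 1) * elementSize <= memoryWidth;
}

// Loads one element, or returns zero when the element is out-of-bounds.
template <typename T>
T loadOrZero(const unsigned char* base, std::int64_t memoryWidth,
             std::int64_t memoryHeight, std::int64_t memoryPitch,
             std::int64_t col, std::int64_t row) {
    const int elementSize = static_cast<int>(sizeof(T));
    if (!inBounds(memoryWidth, memoryHeight, elementSize, col, row)) {
        return T(0);  // out-of-bounds loads produce zero; memory is not read
    }
    T value;
    std::memcpy(&value, base + row * memoryPitch + col * elementSize,
                sizeof(T));
    return value;
}
```

Because Memory Pitch may exceed Memory Width, bytes between the end of a row's width and the next row's start are addressable but still out-of-bounds for this check.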

Restrictions

The following restrictions apply to the 2D block load, store and prefetch instructions added by this extension:

The Element Size must be 1, 2, 4, or 8 bytes.
The Block Width must be a multiple of four for 1-byte elements, or a multiple of two for 2-byte elements.
Behavior is undefined unless:
- the first component of Coordinate is a multiple of four for 1-byte elements, or a multiple of two for 2-byte elements.
- the per-subgroup source or destination base address is cache-line aligned (64 bytes).
- the per-invocation source or destination address is aligned to a multiple of the Element Size.
- the Memory Width is greater than or equal to 64 bytes and less than or equal to 2²⁴ bytes.
- the Memory Width is a multiple of four for 1-byte or 2-byte elements, or a multiple of the element size otherwise.
- the Memory Height is greater than zero and less than or equal to 2²⁴ rows.
- the Memory Pitch is greater than or equal to the Memory Width and a multiple of 16 bytes.
- the SubgroupMaxSize is a power of two.
- the SubgroupSize is equal to the SubgroupMaxSize; in other words, this is a full subgroup.
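
Several of these restrictions can be checked on the host before a kernel is launched. The sketch below is illustrative: `BlockIOParams` and `satisfiesRestrictions` are hypothetical names, and the check deliberately omits the per-invocation alignment and subgroup size conditions, which cannot be verified from these parameters alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the operands that the restrictions refer to.
struct BlockIOParams {
    int elementSize;             // bytes
    int blockWidth;              // elements
    std::uintptr_t baseAddress;  // per-subgroup base address
    std::int64_t memoryWidth;    // bytes
    std::int64_t memoryHeight;   // rows
    std::int64_t memoryPitch;    // bytes
    std::int64_t coordX;         // first component of Coordinate, in elements
};

inline bool satisfiesRestrictions(const BlockIOParams& p) {
    const std::int64_t limit = std::int64_t{1} << 24;  // 2^24

    if (p.elementSize != 1 && p.elementSize != 2 &&
        p.elementSize != 4 && p.elementSize != 8) return false;

    // Block Width and the X coordinate: multiple of 4 (1-byte) or 2 (2-byte).
    const int smallMultiple =
        p.elementSize == 1 ? 4 : (p.elementSize == 2 ? 2 : 1);
    if (p.blockWidth % smallMultiple != 0) return false;
    if (p.coordX % smallMultiple != 0) return false;

    if (p.baseAddress % 64 != 0) return false;  // cache-line alignment

    if (p.memoryWidth < 64 || p.memoryWidth > limit) return false;
    const int widthMultiple = p.elementSize <= 2 ? 4 : p.elementSize;
    if (p.memoryWidth % widthMultiple != 0) return false;

    if (p.memoryHeight <= 0 || p.memoryHeight > limit) return false;

    if (p.memoryPitch < p.memoryWidth || p.memoryPitch % 16 != 0) return false;
    return true;
}
```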

Issues

How should this functionality work with untyped pointers (AKA opaque pointers)?

RESOLVED: Added an Element Size operand to explicitly specify the amount of data to load or store, rather than inferring the element size from typed pointers. Note that this extension does not currently include optional Memory Operands to specify pointer alignment, because the pointer must already be aligned due to hardware restrictions.
Can we use a 32-bit integer-type scalar to represent the memory width, height, and pitch, or should we allow for 64-bit integers for very large matrices?

RESOLVED: We will use 32-bit integer-type scalars to represent the block width, height, and count, but we will allow for 64-bit integers to represent the memory width, height, and pitch, and for the block start coordinates.

The client API environment specs will restrict all of these operands to 32-bit integers initially, however.
Terminology-wise, should we use "width" and "height", or "rows" and "columns"?

RESOLVED: We will use "width" and "height" to describe both the block dimensions and the memory dimensions.
Terminology-wise, how should we describe the coordinate to read?

RESOLVED: The operand will simply be described as a vector coordinate. This avoids needing to describe "X" or "Y" or "Row" or "Column" in the operand names. The first coordinate will be the "X" or "Column" coordinate, and the second coordinate will be the "Y" or "Row" coordinate.
Terminology-wise, should we use "load" and "store", or "read" and "write"?

RESOLVED: We will use "load" and "store" for consistency with the rest of the SPIR-V specification.
What should the behavior be if some or all of the 2D block is out-of-bounds?

RESOLVED: The behavior is well-defined. Specifically, out-of-bounds reads are assigned the value zero, and out-of-bounds prefetches and stores are ignored.

Revision History

Rev	Date	Author	Changes
1	2025-01-07	Ben Ashbaugh	Initial revision for publication
2	2025-02-28	Ben Ashbaugh	Updated restrictions