VK_QCOM_shader_multiple_wait_queues
This document proposes a new extension that exposes deep asynchronous data load queues in shaders.
1. Problem Statement
High-performance compute and graphics shaders are frequently bound by memory latency.
Traditional compiler optimizations, such as instruction scheduling and loop unrolling,
are the standard methods for hiding memory latency by overlapping the gaps with
arithmetic instructions. By unrolling a loop, a compiler can issue memory loads for future iterations
(N+1, N+2) while the ALUs process the current iteration (N).
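As a rough CPU-side analogy (not taken from this proposal), the overlap described above can be written by hand in C. The function name `dot_pipelined` and the dot-product workload are illustrative assumptions. The sketch only shows that the loads for iteration N+1 are issued before the arithmetic for iteration N, which keeps two sets of operands live at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: dot product with the loads for iteration i+1
 * issued before the arithmetic for iteration i. */
float dot_pipelined(const float *a, const float *b, size_t n)
{
    if (n == 0) {
        return 0.0f;
    }
    assert(a != NULL && b != NULL);

    float acc = 0.0f;

    /* Prologue: fetch the operands for the first iteration. */
    float next_a = a[0];
    float next_b = b[0];

    for (size_t i = 0; i < n; ++i) {
        /* Operands for the current iteration were loaded one step earlier. */
        float cur_a = next_a;
        float cur_b = next_b;

        /* Issue the loads for iteration i+1 before doing the arithmetic. */
        if (i + 1 < n) {
            next_a = a[i + 1];
            next_b = b[i + 1];
        }

        /* cur_* and next_* are live at the same time here. */
        acc += cur_a * cur_b;
    }
    return acc;
}
```

Looking further ahead (N+2, N+3, and so on) would need one more pair of live values per extra iteration.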
However, these techniques may not fully hide the latency for every workload. Cooperative matrix workloads are one example:
- Register Pressure
  Cooperative matrix operations may consume significant register resources. Unrolling loops to a depth sufficient to hide load latency can drastically increase the live register count, which can reduce performance if the register pressure becomes too great.
- Monolithic Instructions
  Cooperative matrix operations are coarse-grained, monolithic arithmetic instructions that may execute on dedicated hardware. Fine-grained instruction scheduling may not be possible for the implementation.
A new mechanism is needed to deepen the data fetch pipeline for such workloads.
2. Solution Space
Two primary approaches were considered:
- Implicit Compiler Optimization
  Ideally, the compiler would detect the loop structure and automatically deepen the data fetch pipeline. However, this may not result in optimal performance for every loop structure.
- Explicit Loop Control
  By explicitly tagging a loop with a hint, the application can fine-tune performance by influencing the compiler's decisions.
The latter approach is more robust across use cases. Application developers should sweep the value of the loop control hint to determine the optimal value for each device, especially when using cooperative matrix operations.
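A sweep of the hint value could be organized along the following lines. This is a minimal sketch, not part of the proposal: `measure_fn`, `sweep_wait_queues`, and `synthetic_cost` are hypothetical names, and the callback is assumed to rebuild the shader with the given hint value and return a measured time. The upper bound would presumably come from the `maxShaderWaitQueues` limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: rebuilds the pipeline with the loop hint set to
 * `wait_queues`, runs the workload, and returns the measured time. */
typedef double (*measure_fn)(uint32_t wait_queues, void *user_data);

/* Tries every hint value in [1, max_wait_queues] and returns the fastest.
 * Ties keep the smaller value. Returns 0 if there is nothing to sweep. */
uint32_t sweep_wait_queues(uint32_t max_wait_queues, measure_fn measure,
                           void *user_data)
{
    assert(measure != NULL);
    if (max_wait_queues == 0) {
        return 0;
    }

    uint32_t best_n = 1;
    double best_time = measure(1, user_data);

    uint32_t n = 1;
    while (n < max_wait_queues) {
        ++n;
        double t = measure(n, user_data);
        if (t < best_time) {
            best_time = t;
            best_n = n;
        }
    }
    return best_n;
}

/* Synthetic stand-in so the sketch can run without a GPU. It does not
 * model any real device; its minimum is arbitrarily placed at 3. */
double synthetic_cost(uint32_t wait_queues, void *user_data)
{
    (void)user_data;
    double d = (double)wait_queues - 3.0;
    return 1.0 + d * d;
}
```

A real application would also measure the loop without the hint as a baseline, since the hint is only a suggestion to the compiler.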
3. Proposal
The extension exposes support for a new loop control hint defined by the SPV_QCOM_multiple_wait_queues SPIR-V extension.
3.1. SPIR-V
| Loop Control | | Enabling Capabilities |
|---|---|---|
| 0x4000000 | MultipleWaitQueuesQCOM | MultipleWaitQueuesQCOM |
This functionality is gated behind a new SPIR-V capability:
| Capability | | Implicitly declares |
|---|---|---|
| 4539 | MultipleWaitQueuesQCOM | |
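To make the binary form concrete, the sketch below encodes an OpLoopMerge instruction that carries the new loop control bit. It assumes the hint is followed by a single literal operand, as the disassembly in the Examples section suggests. The enum and function names are hypothetical and are not taken from any SPIR-V header. The opcode 246 is OpLoopMerge in the SPIR-V specification, and 0x4000000 is the loop control value listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    /* OpLoopMerge opcode from the SPIR-V specification. */
    SPV_OP_LOOP_MERGE = 246,
    /* Loop control bit from the table above (hypothetical constant name). */
    LOOP_CONTROL_MULTIPLE_WAIT_QUEUES_QCOM = 0x4000000
};

/* Writes "OpLoopMerge <merge> <continue> MultipleWaitQueuesQCOM <n>" into
 * `out` and returns the number of words written. Assumes the hint takes
 * exactly one literal operand. */
size_t encode_loop_merge(uint32_t out[5], uint32_t merge_id,
                         uint32_t continue_id, uint32_t wait_queues)
{
    assert(out != NULL);
    const uint32_t word_count = 5;

    /* The first word packs the word count (high 16 bits) and the opcode. */
    out[0] = (word_count << 16) | (uint32_t)SPV_OP_LOOP_MERGE;
    out[1] = merge_id;
    out[2] = continue_id;
    out[3] = (uint32_t)LOOP_CONTROL_MULTIPLE_WAIT_QUEUES_QCOM;
    out[4] = wait_queues;
    return word_count;
}

/* Returns nonzero if a loop control mask has the new bit set. */
int has_multiple_wait_queues(uint32_t loop_control)
{
    return (loop_control & (uint32_t)LOOP_CONTROL_MULTIPLE_WAIT_QUEUES_QCOM) != 0;
}
```

Calling `encode_loop_merge(words, 78, 79, 2)` mirrors the `OpLoopMerge %78 %79 MultipleWaitQueuesQCOM 2` line from the Examples section.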
3.2. High Level Language Exposure
The following summarizes how the loop control hint is exposed in GLSL (requires GL_EXT_control_flow_attributes):
Attribute                     | Relevant on | Intent
------------------------------+-------------+------------------------------------
*multiple_wait_queuesQCOM*    | Loop        | Compiler may optimize the loop using multiple wait queues
*multiple_wait_queuesQCOM(N)* | Loop        | Compiler may optimize the loop using at most *N* wait queues
3.3. Features and Properties
A new feature is added with this extension:
typedef struct VkPhysicalDeviceShaderMultipleWaitQueuesFeaturesQCOM {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           shaderMultipleWaitQueues;
} VkPhysicalDeviceShaderMultipleWaitQueuesFeaturesQCOM;
- shaderMultipleWaitQueues indicates that the implementation supports SPIR-V modules declaring the MultipleWaitQueuesQCOM capability.
A new limit is exposed with this extension:
typedef struct VkPhysicalDeviceShaderMultipleWaitQueuesPropertiesQCOM {
    VkStructureType    sType;
    void*              pNext;
    uint32_t           maxShaderWaitQueues;
} VkPhysicalDeviceShaderMultipleWaitQueuesPropertiesQCOM;
- maxShaderWaitQueues is the maximum number of wait queues that can be set for a MultipleWaitQueuesQCOM loop control hint.
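An application might combine the feature bit and the limit when deciding which value to pass to the loop hint. The sketch below is an assumption about application-side policy, not something the proposal mandates. `WaitQueueSupport` is a hypothetical stand-in that copies only the two relevant fields out of the structures above, so the example compiles without the Vulkan headers, and `choose_wait_queue_hint` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding values copied from the feature and
 * property structures after querying the physical device. */
typedef struct WaitQueueSupport {
    uint32_t shaderMultipleWaitQueues; /* VkBool32 in the real structure */
    uint32_t maxShaderWaitQueues;
} WaitQueueSupport;

/* Returns the hint value to use, or 0 to mean "do not emit the hint".
 * Requests above the device limit are reduced to the limit. */
uint32_t choose_wait_queue_hint(const WaitQueueSupport *support,
                                uint32_t requested)
{
    assert(support != NULL);

    if (!support->shaderMultipleWaitQueues ||
        support->maxShaderWaitQueues == 0 ||
        requested == 0) {
        return 0;
    }
    return requested < support->maxShaderWaitQueues
               ? requested
               : support->maxShaderWaitQueues;
}
```

Reducing an oversized request to the limit is one possible policy; an application could equally treat such a request as an error.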
4. Examples
GLSL:
[[multiple_wait_queuesQCOM(2)]]
for (step = 0; step < TOTAL_K; step += TILE_K) {
    uint subMatrixAStart = row * STRIDE_A + step;
    uint subMatrixBStart = col * STRIDE_B + step;
    coopMatLoad(matA, inputA.x, subMatrixAStart, STRIDE_A, gl_CooperativeMatrixLayoutRowMajor);    // Global to CoopMat
    coopMatLoad(matB, inputB.x, subMatrixBStart, STRIDE_B, gl_CooperativeMatrixLayoutColumnMajor); // Global to CoopMat
    matC = coopMatMulAdd(matA, matB, matC);
}
SPIR-V:
%76 = OpLabel
OpLoopMerge %78 %79 MultipleWaitQueuesQCOM 2
OpBranch %80
%80 = OpLabel
%81 = OpLoad %uint %step