Compute Shaders
This chapter is not a "how to use compute shaders" article; there are plenty of resources online around GPGPU and compute. What this chapter covers is all the "Vulkan-isms", terms, etc. that are associated with compute shaders.
There is also a Decoder Ring created to help people transition from other APIs that use different terminology.
> If you want to play around with a simple compute example, we suggest taking a look at the vk-bootstrap sample.
Coming from Vulkan Graphics
For those who are more familiar with graphics in Vulkan, compute will be a simple transition. Basically everything is the same except:
- Call `vkCmdDispatch` instead of `vkCmdDraw`
- Use `vkCreateComputePipelines` instead of `vkCreateGraphicsPipelines`
- Make sure your `VkQueue` supports `VK_QUEUE_COMPUTE_BIT`
- When binding descriptors and pipelines to your command buffer, make sure to use `VK_PIPELINE_BIND_POINT_COMPUTE`
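As a sketch, recording a compute dispatch looks like the following (the handles `command_buffer`, `compute_pipeline`, `pipeline_layout`, and `descriptor_set` are assumed to already exist):

```c
// Bind with the COMPUTE bind point instead of GRAPHICS
vkCmdBindPipeline(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, compute_pipeline);
vkCmdBindDescriptorSets(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE,
                        pipeline_layout, 0, 1, &descriptor_set, 0, NULL);

// Dispatch a 16x16x1 grid of workgroups
vkCmdDispatch(command_buffer, 16, 16, 1);
```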
SPIR-V Terminology
The smallest unit of work that is done is called an invocation. It is a "thread" or "lane" of work.
Multiple invocations are organized into subgroups, where invocations within a subgroup can synchronize and share data with each other efficiently. (See more in the subgroup chapter)
Next we have workgroups, which are the smallest unit of work that an application can define. A workgroup is a collection of invocations that execute the same shader.
> While slightly annoying, Vulkan spec uses
Workgroup Size
Setting the workgroup size can be done in 3 ways in SPIR-V: the `LocalSize` execution mode, a constant decorated with the `WorkgroupSize` builtin, or the `LocalSizeId` execution mode.
A few important things to note:
- The `WorkgroupSize` decoration will take precedence over any `LocalSize` or `LocalSizeId` in the same module.
- `LocalSizeId` was added in the `VK_KHR_maintenance4` extension (made core in Vulkan 1.3) to allow the use of specialization constants to set the size.
- There is a `maxComputeWorkGroupSize` limit on how large the `X`, `Y`, and `Z` sizes can each be. Most implementations support around 1024 for each dimension.
- There is a `maxComputeWorkGroupInvocations` limit on how large the product `X * Y * Z` can be. Most implementations support around 1024.
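In GLSL, the workgroup size is set with a `layout` qualifier, which the compiler lowers to the `LocalSize` execution mode (or `LocalSizeId` when specialization constant ids are used). A minimal sketch:

```glsl
// Fixed 8x8x1 workgroup size, expressed as the LocalSize execution mode in SPIR-V
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Alternative: sizes from specialization constants (LocalSizeId), e.g.
// layout(local_size_x_id = 0, local_size_y_id = 1) in;
```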
Local and Global Workgroups
When `vkCmdDispatch` is called, it sets the number of workgroups to dispatch. This produces a global workgroup space that the GPU will work on. Each single workgroup is a local workgroup. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables, as well as issue memory and control flow barriers to synchronize with other members of the local workgroup.
> There is a `maxComputeWorkGroupCount` limit on the number of local workgroups that can be dispatched in each dimension.
Dispatching size from a buffer
`vkCmdDispatchIndirect` (and the newer `vkCmdDispatchIndirect2KHR`) allows the dispatch size to be read from a buffer. This means the GPU can set the number of workgroups to dispatch.
```c
// ... or any other draw/dispatch that will update the memory on the GPU
vkCmdDispatch(command_buffer, x, y, z);

// Make the shader writes visible to the indirect command read
VkMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,         // src access
    .dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT // dst access
};
vkCmdPipelineBarrier(command_buffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // src stage
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // dst stage
                     0, 1, &barrier, 0, NULL, 0, NULL);

// Reads the VkDispatchIndirectCommand in the buffer to set the number of local workgroups
vkCmdDispatchIndirect(command_buffer, my_buffer, 0);
```
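The buffer read by the indirect dispatch must contain a `VkDispatchIndirectCommand` (three tightly packed `uint32_t` values). A GPU-side producer can be sketched in GLSL like this (the binding numbers and the culling-pass input are made up for illustration):

```glsl
layout(local_size_x = 1) in;

// Matches the layout of VkDispatchIndirectCommand
layout(set = 0, binding = 0) buffer IndirectArgs {
    uint group_count_x;
    uint group_count_y;
    uint group_count_z;
};

layout(set = 0, binding = 1) buffer Visible {
    uint visible_count; // e.g. written by an earlier culling pass
};

void main() {
    // One 64-invocation workgroup per visible item, rounded up
    group_count_x = (visible_count + 63) / 64;
    group_count_y = 1;
    group_count_z = 1;
}
```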
Shared memory
When inside a single local workgroup, "shared memory" can be used. In SPIR-V this is referenced with the `Workgroup` storage class.
Shared memory is essentially "the L1 cache you can control" in your compute shader and an important part of any performant shader.
There is a `maxComputeSharedMemorySize` limit (commonly 32 KiB) that needs to be accounted for.
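A classic use of shared memory is a per-workgroup reduction: each invocation loads one value, then the workgroup cooperatively sums them. A sketch, assuming the buffer bindings shown (the names `values` and `partial_sums` are illustrative):

```glsl
#version 450
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) buffer Data    { float values[]; };
layout(set = 0, binding = 1) buffer Result  { float partial_sums[]; };

shared float scratch[64];

void main() {
    // Each invocation loads one value into shared memory
    scratch[gl_LocalInvocationIndex] = values[gl_GlobalInvocationID.x];
    barrier();

    // Tree reduction within the local workgroup
    for (uint stride = 32; stride > 0; stride /= 2) {
        if (gl_LocalInvocationIndex < stride) {
            scratch[gl_LocalInvocationIndex] += scratch[gl_LocalInvocationIndex + stride];
        }
        barrier();
    }

    // One invocation per workgroup writes the result
    if (gl_LocalInvocationIndex == 0) {
        partial_sums[gl_WorkGroupID.x] = scratch[0];
    }
}
```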
Shared Memory Race Conditions
It is very easy to have race conditions when using shared memory.
The classic example is when multiple invocations initialize something to the same value.
```glsl
shared uint my_var;

void main() {
    // All the invocations in the workgroup are going to try to write to the same memory.
    // RACE CONDITION
    my_var = 0;
}
```
If you are asking "why?", the "technically correct" answer is "because the memory model says so".
When you do a weak store to a memory location, that invocation "owns" that memory location until synchronization occurs. The compiler can use that information and choose to reuse that location as temporary storage for another value.
Luckily the fix is simple: make sure to use atomics.
```glsl
#version 450
// Needed for the scope/semantics arguments to atomicStore
#extension GL_KHR_memory_scope_semantics : enable

shared uint my_var;

void main() {
    atomicStore(my_var, 0u, gl_ScopeWorkgroup, 0, 0);
}
```
Another option is to use an `OpControlBarrier` with `Workgroup` scope.
```glsl
layout(local_size_x = 32) in; // 32x1x1 workgroup

shared uint my_var[32]; // one slot for each invocation

void main() {
    my_var[gl_LocalInvocationIndex] = 0;
    barrier(); // will generate an OpControlBarrier for you
    uint x = my_var[gl_LocalInvocationIndex ^ 1];
}
```
Detecting shared memory data races
Luckily this problem can be caught automatically using the GPU-AV feature in Vulkan Validation Layers!
As of March 2026 (TODO - Add SDK version when released in May), GPU-AV will attempt to detect these races for you. There are a few limitations, but it is highly suggested to try it out if you are having strange issues around your shared memory accesses.
Explicit Layout of shared memory
The `VK_KHR_workgroup_memory_explicit_layout` extension was added to allow explicit layout of shared memory.
Finding the invocation in your shader
There are many SPIR-V built-in values that can be used to find the invocation in your shader.
The following built-ins are well defined in the builtin chapter of the Vulkan spec.
- `GlobalInvocationId`
- `LocalInvocationId`
- `LocalInvocationIndex`
- `NumSubgroups`
- `NumWorkgroups`
- `SubgroupId`
- `WorkgroupId`
For those who want a more "hands on" example, the following GLSL demonstrates using most of these built-ins.
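A minimal sketch of such a shader (the buffer binding and the flat-index scheme here are illustrative, and the GLSL `gl_*` names map onto the SPIR-V built-ins above, e.g. `gl_GlobalInvocationID` is `GlobalInvocationId`):

```glsl
#version 450
layout(local_size_x = 8, local_size_y = 8) in;

layout(set = 0, binding = 0) buffer Output {
    uint data[];
};

void main() {
    // gl_WorkGroupID          : which local workgroup this invocation belongs to
    // gl_LocalInvocationID    : 3D position within the local workgroup
    // gl_LocalInvocationIndex : flattened position within the local workgroup
    // gl_GlobalInvocationID   : gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID

    // Unique flat index of this invocation across the whole global workgroup
    uint global_width = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    uint global_index = gl_GlobalInvocationID.y * global_width + gl_GlobalInvocationID.x;

    data[global_index] = gl_LocalInvocationIndex;
}
```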