Compute Shaders

This chapter is not a "how to use compute shaders" tutorial; there are plenty of resources online covering GPGPU and compute.

Instead, this chapter covers the "Vulkan-isms", terminology, and concepts associated with compute shaders.

There is also a Decoder Ring created to help people transition from other APIs that use different terminology.

If you want to play around with a simple compute example, we suggest taking a look at the vk-bootstrap sample.

Coming from Vulkan Graphics

For those who are more familiar with graphics in Vulkan, compute will be a simple transition. Basically everything is the same except:

  • Call vkCmdDispatch instead of vkCmdDraw

  • Use vkCreateComputePipelines instead of vkCreateGraphicsPipelines

  • Make sure your VkQueue supports VK_QUEUE_COMPUTE_BIT

  • When binding descriptors and pipelines to your command buffer, make sure to use VK_PIPELINE_BIND_POINT_COMPUTE

SPIR-V Terminology

The smallest unit of work that is done is called an invocation. It is a "thread" or "lane" of work.

Multiple invocations are organized into subgroups, where invocations within a subgroup can synchronize and share data with each other efficiently. (See more in the subgroup chapter)

Next we have workgroups, which are the smallest unit of work that an application can define. A workgroup is a collection of invocations that execute the same shader.

While slightly annoying, the Vulkan spec uses WorkGroup while the SPIR-V spec spells it Workgroup. This has no significant meaning, other than being a potential source of typos when moving between the two.

Workgroup Size

Setting the workgroup size can be done in 3 ways in SPIR-V:

  1. Using the WorkgroupSize built-in (example)

  2. Using the LocalSize execution mode (example)

  3. Using the LocalSizeId execution mode (example)

Examples of how to do this in GLSL, HLSL, and Slang
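As a quick GLSL illustration (a sketch; glslang translates these layout qualifiers to the LocalSize and LocalSizeId execution modes respectively):

```glsl
#version 450

// Option 2: a fixed size becomes the LocalSize execution mode
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Option 3: sizes backed by specialization constants become LocalSizeId
// (the values are supplied via VkSpecializationInfo at pipeline creation;
// cannot be combined with the fixed layout above in the same shader):
// layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;

void main() {}
```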

A few important things to note:

  • The WorkgroupSize decoration will take precedence over any LocalSize or LocalSizeId in the same module.

  • LocalSizeId was added in the VK_KHR_maintenance4 extension (made core in Vulkan 1.3) to allow the use of specialization constants to set the size.

  • There is a maxComputeWorkGroupSize limit on how large the X, Y, and Z sizes can each be. Most implementations support around 1024 in each dimension.

  • There is a maxComputeWorkGroupInvocations limit on how large the product X * Y * Z can be. Most implementations support around 1024.

Local and Global Workgroups

When vkCmdDispatch is called, it sets the number of workgroups to dispatch. This produces a global workgroup space that the GPU will work on. Each single workgroup is a local workgroup. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables as well as issue memory and control flow barriers to synchronize with other members of the local workgroup.

There is a maxComputeWorkGroupCount limit as well; some hardware supports only 64k per dimension, but newer hardware can be basically unlimited here.

Dispatching size from a buffer

The vkCmdDispatchIndirect command (and the newer vkCmdDispatchIndirect2KHR) allows the dispatch size to be read from a buffer. This means the GPU itself can set the number of workgroups to dispatch.

// Some earlier dispatch (or any other draw/dispatch) writes the
// VkDispatchIndirectCommand into my_buffer on the GPU
vkCmdDispatch(command_buffer, x, y, z);

// Make the shader write visible before the indirect command is read
VkMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT,
};
vkCmdPipelineBarrier(command_buffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // src stage
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // dst stage
                     0,           // dependency flags
                     1, &barrier, // memory barriers
                     0, NULL,     // buffer memory barriers
                     0, NULL);    // image memory barriers

// Reads the VkDispatchIndirectCommand in the buffer to set the number of workgroups
vkCmdDispatchIndirect(command_buffer, my_buffer, 0);

Shared memory

When inside a single local workgroup, "shared memory" can be used. In SPIR-V this is referenced with the Workgroup storage class.

Shared memory is essentially the "L1 cache you can control" in your compute shader and an important part of any performant shader.

There is a maxComputeSharedMemorySize limit (commonly around 32 KiB) that needs to be accounted for.
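As an illustration of a typical use, here is a GLSL sketch of a classic per-workgroup sum reduction through shared memory (the buffer bindings are hypothetical):

```glsl
#version 450
layout(local_size_x = 64) in;

layout(binding = 0) readonly buffer In   { uint values[]; };
layout(binding = 1) writeonly buffer Out { uint sums[]; };

shared uint partial[64]; // 64 * 4 = 256 bytes, well under maxComputeSharedMemorySize

void main() {
    // Each invocation writes only its own slot: no race condition
    partial[gl_LocalInvocationIndex] = values[gl_GlobalInvocationID.x];
    barrier();

    // Tree reduction within the local workgroup
    for (uint stride = 32u; stride > 0u; stride >>= 1) {
        if (gl_LocalInvocationIndex < stride) {
            partial[gl_LocalInvocationIndex] += partial[gl_LocalInvocationIndex + stride];
        }
        barrier();
    }

    if (gl_LocalInvocationIndex == 0u) {
        sums[gl_WorkGroupID.x] = partial[0];
    }
}
```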

Shared Memory Race Conditions

It is very easy to have race conditions when using shared memory.

The classic example is when multiple invocations initialize something to the same value.

shared uint my_var;
void main() {
    // All the invocations in the workgroup are going to try to write to the same memory.
    // RACE CONDITION
    my_var = 0;
}

If you are asking "why?", the "technically correct" answer is "because the memory model says so".

When you do a weak store to a memory location, that invocation "owns" that memory location until synchronization occurs. The compiler can use that information and choose to reuse that location as temporary storage for another value.

Luckily the fix is simple: make sure to use atomics.

#extension GL_KHR_memory_scope_semantics : enable
shared uint my_var;
void main() {
    atomicStore(my_var, 0u, gl_ScopeWorkgroup, gl_StorageSemanticsNone, gl_SemanticsRelaxed);
}

Another option is to use an OpControlBarrier with Workgroup scope:

layout(local_size_x = 32) in; // 32x1x1 workgroup
shared uint my_var[32]; // one slot for each invocation

void main() {
    my_var[gl_LocalInvocationIndex] = 0;
    barrier(); // will generate an OpControlBarrier for you
    uint x = my_var[gl_LocalInvocationIndex ^ 1];
}

Detecting shared memory data races

Luckily this problem can be caught automatically using the GPU-AV feature in Vulkan Validation Layers!

As of March 2026 (TODO - Add SDK version when released in May), GPU-AV will attempt to detect these races for you. There are a few limitations, but we highly suggest trying it out if you are seeing strange issues around your shared memory accesses.

Explicit Layout of shared memory

The VK_KHR_workgroup_memory_explicit_layout extension was added to allow explicit layout of shared memory.
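On the GLSL side this is exposed via the GL_EXT_shared_memory_block extension, which lets you declare shared memory as blocks with explicit layout; all shared blocks alias the same workgroup memory, so the same bytes can be viewed through different layouts (a sketch):

```glsl
#version 450
#extension GL_EXT_shared_memory_block : enable

layout(local_size_x = 64) in;

// All shared memory blocks alias each other, so BlockA and BlockB
// are two different views of the same 512 bytes of workgroup memory.
shared BlockA {
    uint words[128];
};

shared BlockB {
    uvec4 vectors[32];
};

void main() {}
```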

Finding the invocation in your shader

There are many SPIR-V built-in values that can be used to identify the current invocation in your shader.

The following built-ins are well defined in the builtin chapter of the Vulkan spec.

  • GlobalInvocationId

  • LocalInvocationId

  • LocalInvocationIndex

  • NumSubgroups

  • NumWorkgroups

  • SubgroupId

  • WorkgroupId

For those who want a more "hands-on" example, the following GLSL demonstrates using most of these built-ins.