GPU Performance API Interface
The GPU Performance API (GPA) interface gives access to GPU performance counters, streaming performance monitors (SPM), and thread traces (SQTT), which help analyze the performance and execution characteristics of applications.
GPU Performance API Objects
GPU Performance API (GPA) session objects hold the results of sampled
performance counters and are represented by VkGpaSessionAMD handles:
// Provided by VK_AMD_gpa_interface
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkGpaSessionAMD)
To create a GPA session object, call:
// Provided by VK_AMD_gpa_interface
VkResult vkCreateGpaSessionAMD(
VkDevice device,
const VkGpaSessionCreateInfoAMD* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkGpaSessionAMD* pGpaSession);
- device is the logical device that creates the GPA session object.
- pCreateInfo is a pointer to a VkGpaSessionCreateInfoAMD structure containing information about how the GPA session object is to be created.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pGpaSession is a pointer to a handle in which the resulting GPA session object is returned.
The VkGpaSessionCreateInfoAMD structure is defined as:
// Provided by VK_AMD_gpa_interface
typedef struct VkGpaSessionCreateInfoAMD {
VkStructureType sType;
const void* pNext;
VkGpaSessionAMD secondaryCopySource;
} VkGpaSessionCreateInfoAMD;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- secondaryCopySource is a VkGpaSessionAMD handle whose results can be copied into this session.
When sampling counters inside secondary command buffers, repeated
invocations of the same secondary command buffer cause the previous results
to be overwritten.
To avoid this, a GPA session object's memory layout can be cloned for use
with multiple secondary invocations by specifying its handle in
secondaryCopySource, allowing the new GPA session object to be the
target of a copy using vkCmdCopyGpaSessionResultsAMD.
To destroy a GPA session object, call:
// Provided by VK_AMD_gpa_interface
void vkDestroyGpaSessionAMD(
VkDevice device,
VkGpaSessionAMD gpaSession,
const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the GPA session.
- gpaSession is the handle of the GPA session to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
Beginning, ending, copying, and resetting a session
Performance counters are sampled between calls to vkCmdBeginGpaSessionAMD and vkCmdEndGpaSessionAMD. As long as they are executed in order, vkCmdBeginGpaSessionAMD and vkCmdEndGpaSessionAMD can span multiple command buffers.
If a session is reused after calling vkCmdEndGpaSessionAMD, the session must first be reset using vkResetGpaSessionAMD.
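The ordering rule above (begin, end, then reset before any reuse) can be sketched as a small piece of host-side bookkeeping. The `SessionTracker` type and the `tracker_*` functions are hypothetical helpers, not part of VK_AMD_gpa_interface, and refusing a reset while a session is still open is a conservative assumption of this sketch rather than a stated requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side tracker; not part of VK_AMD_gpa_interface. */
typedef enum SessionState {
    SESSION_READY, /* freshly created, or reset with vkResetGpaSessionAMD */
    SESSION_BEGUN, /* vkCmdBeginGpaSessionAMD has been recorded */
    SESSION_ENDED  /* vkCmdEndGpaSessionAMD has been recorded */
} SessionState;

typedef struct SessionTracker {
    SessionState state;
} SessionTracker;

void tracker_init(SessionTracker *t)
{
    assert(t != NULL);
    t->state = SESSION_READY;
}

/* Returns true if recording vkCmdBeginGpaSessionAMD is allowed now. */
bool tracker_begin(SessionTracker *t)
{
    assert(t != NULL);
    if (t->state != SESSION_READY)
        return false; /* an ended session must be reset before reuse */
    t->state = SESSION_BEGUN;
    return true;
}

/* Returns true if recording vkCmdEndGpaSessionAMD is allowed now. */
bool tracker_end(SessionTracker *t)
{
    assert(t != NULL);
    if (t->state != SESSION_BEGUN)
        return false;
    t->state = SESSION_ENDED;
    return true;
}

/* Returns true if calling vkResetGpaSessionAMD is allowed now.
   Assumption of this sketch: a session that is still open is not reset. */
bool tracker_reset(SessionTracker *t)
{
    assert(t != NULL);
    if (t->state == SESSION_BEGUN)
        return false;
    t->state = SESSION_READY;
    return true;
}
```

An application could consult such a tracker before recording each command, so that a forgotten reset is caught on the host instead of producing an invalid session.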
To begin a GPA session, call:
// Provided by VK_AMD_gpa_interface
VkResult vkCmdBeginGpaSessionAMD(
VkCommandBuffer commandBuffer,
VkGpaSessionAMD gpaSession);
- commandBuffer is the command buffer into which the command will be recorded.
- gpaSession is the handle of the GPA session to begin.
To end a GPA session, call:
// Provided by VK_AMD_gpa_interface
VkResult vkCmdEndGpaSessionAMD(
VkCommandBuffer commandBuffer,
VkGpaSessionAMD gpaSession);
- commandBuffer is the command buffer into which the command will be recorded.
- gpaSession is the handle of the GPA session to end.
Executing a secondary command buffer that records results into a session more than once causes the results of earlier executions to be overwritten. To prevent results from being lost due to subsequent executions, the results can be copied into another session.
To copy the results of a GPA session into another, call:
// Provided by VK_AMD_gpa_interface
void vkCmdCopyGpaSessionResultsAMD(
VkCommandBuffer commandBuffer,
VkGpaSessionAMD gpaSession);
- commandBuffer is the command buffer into which the command will be recorded.
- gpaSession is the handle of the GPA session that is the destination of the copy.
The source of the copy is the GPA session handle provided in
VkGpaSessionCreateInfoAMD::secondaryCopySource when gpaSession was created.
To reset a GPA session for reuse, call:
// Provided by VK_AMD_gpa_interface
VkResult vkResetGpaSessionAMD(
VkDevice device,
VkGpaSessionAMD gpaSession);
- device is the logical device that resets the GPA session.
- gpaSession is the handle of the GPA session to reset.
Resetting a session object has less overhead than destroying and then creating a new one.
Beginning and ending sampling
Once a session has begun, samples can then be captured during the command buffer execution.
To begin a GPA sample, call:
// Provided by VK_AMD_gpa_interface
VkResult vkCmdBeginGpaSampleAMD(
VkCommandBuffer commandBuffer,
VkGpaSessionAMD gpaSession,
const VkGpaSampleBeginInfoAMD* pGpaSampleBeginInfo,
uint32_t* pSampleID);
- commandBuffer is the command buffer into which the command will be recorded.
- gpaSession is the handle of the GPA session to record the sample.
- pGpaSampleBeginInfo is a pointer to a VkGpaSampleBeginInfoAMD structure describing the sample parameters.
- pSampleID is a pointer to a uint32_t in which the unique sample ID for this sample is returned.
The VkGpaSampleBeginInfoAMD structure is defined as:
// Provided by VK_AMD_gpa_interface
typedef struct VkGpaSampleBeginInfoAMD {
VkStructureType sType;
const void* pNext;
VkGpaSampleTypeAMD sampleType;
VkBool32 sampleInternalOperations;
VkBool32 cacheFlushOnCounterCollection;
VkBool32 sqShaderMaskEnable;
VkGpaSqShaderStageFlagsAMD sqShaderMask;
uint32_t perfCounterCount;
const VkGpaPerfCounterAMD* pPerfCounters;
uint32_t streamingPerfTraceSampleInterval;
VkDeviceSize perfCounterDeviceMemoryLimit;
VkBool32 sqThreadTraceEnable;
VkBool32 sqThreadTraceSuppressInstructionTokens;
VkDeviceSize sqThreadTraceDeviceMemoryLimit;
VkPipelineStageFlags timingPreSample;
VkPipelineStageFlags timingPostSample;
} VkGpaSampleBeginInfoAMD;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- sampleType is a VkGpaSampleTypeAMD value specifying the type of sample.
- sampleInternalOperations is a boolean indicating whether internal driver operations are included in the results.
- cacheFlushOnCounterCollection is a boolean indicating whether the driver should insert cache flush and invalidate events before and after every sample.
- sqShaderMaskEnable is a boolean indicating whether sqShaderMask specifies the shader stages to sample. If VK_FALSE, all shader stages are sampled.
- sqShaderMask is a bitmask of VkGpaSqShaderStageFlagBitsAMD values specifying which shader stages to sample. Shader stage bits that are not relevant to the specific device are ignored.
- perfCounterCount is the number of counters in the pPerfCounters array.
- pPerfCounters is a pointer to an array of VkGpaPerfCounterAMD structures specifying the counters to be sampled. If sampleType is VK_GPA_SAMPLE_TYPE_CUMULATIVE_AMD, pPerfCounters specifies the counters that are sampled at the beginning and at the end of the sample period. If sampleType is VK_GPA_SAMPLE_TYPE_TRACE_AMD, the SPM data will be added to the sample's RGP data blob.
- streamingPerfTraceSampleInterval is a value specifying the period for SPM samples in cycles, and is ignored if sampleType is not VK_GPA_SAMPLE_TYPE_TRACE_AMD.
- perfCounterDeviceMemoryLimit is a value specifying the maximum amount of GPU memory that this sample can allocate for SPM data. If sampleType is not VK_GPA_SAMPLE_TYPE_TRACE_AMD, this value is ignored.
- sqThreadTraceEnable is a boolean specifying whether SQTT data should be included. If sampleType is not VK_GPA_SAMPLE_TYPE_TRACE_AMD, this value is ignored.
- sqThreadTraceSuppressInstructionTokens is a boolean specifying whether instruction-level SQTT tokens should be suppressed. If VK_TRUE, the amount of sample data is significantly reduced. If sampleType is not VK_GPA_SAMPLE_TYPE_TRACE_AMD, this value is ignored.
- sqThreadTraceDeviceMemoryLimit is a value specifying the maximum amount of GPU memory in bytes that this sample can allocate for the SQTT buffer. If 0, the maximum size to prevent dropping tokens towards the end of the sample is allocated. If sampleType is not VK_GPA_SAMPLE_TYPE_TRACE_AMD, this value is ignored.
- timingPreSample is a VkPipelineStageFlags mask of pipeline stages specifying where the begin timestamp should take place. If sampleType is not VK_GPA_SAMPLE_TYPE_TIMING_AMD, this value is ignored.
- timingPostSample is a VkPipelineStageFlags mask of pipeline stages specifying where the end timestamp should take place. If sampleType is not VK_GPA_SAMPLE_TYPE_TIMING_AMD, this value is ignored.
Values which can be set in VkGpaSampleBeginInfoAMD::sampleType
are:
// Provided by VK_AMD_gpa_interface
typedef enum VkGpaSampleTypeAMD {
VK_GPA_SAMPLE_TYPE_CUMULATIVE_AMD = 0,
VK_GPA_SAMPLE_TYPE_TRACE_AMD = 1,
VK_GPA_SAMPLE_TYPE_TIMING_AMD = 2,
} VkGpaSampleTypeAMD;
- VK_GPA_SAMPLE_TYPE_CUMULATIVE_AMD specifies that one 64-bit result will be returned per global performance counter, representing the cumulative delta for that counter over the sample period. Cumulative samples must begin and end in the same command buffer.
- VK_GPA_SAMPLE_TYPE_TRACE_AMD specifies that a buffer will be filled with SQTT results data in RGP file format, and/or streaming performance monitor data. Trace samples may span multiple command buffers.
- VK_GPA_SAMPLE_TYPE_TIMING_AMD specifies that two 64-bit results will be recorded to gather timestamp data.
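As a rough illustration of the result sizes these sample types imply, the sketch below derives a byte count from the descriptions above: one 64-bit value per counter for cumulative samples and two 64-bit values for timing samples. The helper name `gpa_expected_result_bytes` is hypothetical, the enum is a local copy so the code compiles without the extension header, and the exact layout of returned data is an assumption. The authoritative size is the one reported by vkGetGpaSessionResultsAMD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the enum above so the sketch compiles without the
   extension header; drop it when the real header is available. */
typedef enum VkGpaSampleTypeAMD {
    VK_GPA_SAMPLE_TYPE_CUMULATIVE_AMD = 0,
    VK_GPA_SAMPLE_TYPE_TRACE_AMD = 1,
    VK_GPA_SAMPLE_TYPE_TIMING_AMD = 2
} VkGpaSampleTypeAMD;

/* Hypothetical helper. Writes the size implied by the sample type to
   *outBytes and returns true, or writes 0 and returns false when the
   size cannot be derived from the type alone (trace samples). */
bool gpa_expected_result_bytes(VkGpaSampleTypeAMD type,
                               uint32_t perfCounterCount,
                               size_t *outBytes)
{
    assert(outBytes != NULL);
    switch (type) {
    case VK_GPA_SAMPLE_TYPE_CUMULATIVE_AMD:
        /* one 64-bit result per counter */
        *outBytes = (size_t)perfCounterCount * sizeof(uint64_t);
        return true;
    case VK_GPA_SAMPLE_TYPE_TIMING_AMD:
        /* two 64-bit results */
        *outBytes = 2 * sizeof(uint64_t);
        return true;
    default:
        /* trace data has a variable size and must be queried */
        *outBytes = 0;
        return false;
    }
}
```

Such an estimate is mainly useful as a sanity check against the size the driver reports, not as a replacement for querying it.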
Bits which can be set in VkGpaSampleBeginInfoAMD::sqShaderMask to control which GPU shader stages are sampled are:
// Provided by VK_AMD_gpa_interface
typedef enum VkGpaSqShaderStageFlagBitsAMD {
VK_GPA_SQ_SHADER_STAGE_PS_BIT_AMD = 0x00000001,
VK_GPA_SQ_SHADER_STAGE_VS_BIT_AMD = 0x00000002,
VK_GPA_SQ_SHADER_STAGE_GS_BIT_AMD = 0x00000004,
VK_GPA_SQ_SHADER_STAGE_ES_BIT_AMD = 0x00000008,
VK_GPA_SQ_SHADER_STAGE_HS_BIT_AMD = 0x00000010,
VK_GPA_SQ_SHADER_STAGE_LS_BIT_AMD = 0x00000020,
VK_GPA_SQ_SHADER_STAGE_CS_BIT_AMD = 0x00000040,
} VkGpaSqShaderStageFlagBitsAMD;
- VK_GPA_SQ_SHADER_STAGE_PS_BIT_AMD specifies the pixel shader stage.
- VK_GPA_SQ_SHADER_STAGE_VS_BIT_AMD specifies the vertex shader stage.
- VK_GPA_SQ_SHADER_STAGE_GS_BIT_AMD specifies the geometry shader stage.
- VK_GPA_SQ_SHADER_STAGE_ES_BIT_AMD specifies the export shader stage.
- VK_GPA_SQ_SHADER_STAGE_HS_BIT_AMD specifies the hull shader stage.
- VK_GPA_SQ_SHADER_STAGE_LS_BIT_AMD specifies the local shader stage.
- VK_GPA_SQ_SHADER_STAGE_CS_BIT_AMD specifies the compute shader stage.
// Provided by VK_AMD_gpa_interface
typedef VkFlags VkGpaSqShaderStageFlagsAMD;
VkGpaSqShaderStageFlagsAMD is a bitmask type for setting a mask of
zero or more VkGpaSqShaderStageFlagBitsAMD.
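A minimal sketch of composing and testing such a mask follows. The type definitions are local copies of the declarations above so the code compiles on its own, and the helper names `gpa_non_compute_stage_mask` and `gpa_mask_has_stage` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies so the sketch compiles without the Vulkan headers. */
typedef uint32_t VkFlags;
typedef VkFlags VkGpaSqShaderStageFlagsAMD;
typedef enum VkGpaSqShaderStageFlagBitsAMD {
    VK_GPA_SQ_SHADER_STAGE_PS_BIT_AMD = 0x00000001,
    VK_GPA_SQ_SHADER_STAGE_VS_BIT_AMD = 0x00000002,
    VK_GPA_SQ_SHADER_STAGE_GS_BIT_AMD = 0x00000004,
    VK_GPA_SQ_SHADER_STAGE_ES_BIT_AMD = 0x00000008,
    VK_GPA_SQ_SHADER_STAGE_HS_BIT_AMD = 0x00000010,
    VK_GPA_SQ_SHADER_STAGE_LS_BIT_AMD = 0x00000020,
    VK_GPA_SQ_SHADER_STAGE_CS_BIT_AMD = 0x00000040
} VkGpaSqShaderStageFlagBitsAMD;

/* Hypothetical helper: every listed stage except compute. */
VkGpaSqShaderStageFlagsAMD gpa_non_compute_stage_mask(void)
{
    return VK_GPA_SQ_SHADER_STAGE_PS_BIT_AMD |
           VK_GPA_SQ_SHADER_STAGE_VS_BIT_AMD |
           VK_GPA_SQ_SHADER_STAGE_GS_BIT_AMD |
           VK_GPA_SQ_SHADER_STAGE_ES_BIT_AMD |
           VK_GPA_SQ_SHADER_STAGE_HS_BIT_AMD |
           VK_GPA_SQ_SHADER_STAGE_LS_BIT_AMD;
}

/* Hypothetical helper: tests whether one stage bit is present in a mask. */
bool gpa_mask_has_stage(VkGpaSqShaderStageFlagsAMD mask,
                        VkGpaSqShaderStageFlagBitsAMD stage)
{
    assert(stage != 0);
    return (mask & (VkFlags)stage) != 0;
}
```

A mask built this way would be stored in sqShaderMask, with sqShaderMaskEnable set to VK_TRUE so that the mask takes effect.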
The VkGpaPerfCounterAMD structure is defined as:
// Provided by VK_AMD_gpa_interface
typedef struct VkGpaPerfCounterAMD {
VkGpaPerfBlockAMD blockType;
uint32_t blockInstance;
uint32_t eventID;
} VkGpaPerfCounterAMD;
- blockType is a VkGpaPerfBlockAMD value specifying the GPU block type to sample.
- blockInstance is a value specifying which instance of the GPU block to sample.
- eventID is a value specifying the hardware-specific identifier of the performance counter to sample.
Values which can be set in VkGpaPerfCounterAMD::blockType are:
// Provided by VK_AMD_gpa_interface
typedef enum VkGpaPerfBlockAMD {
VK_GPA_PERF_BLOCK_CPF_AMD = 0,
VK_GPA_PERF_BLOCK_IA_AMD = 1,
VK_GPA_PERF_BLOCK_VGT_AMD = 2,
VK_GPA_PERF_BLOCK_PA_AMD = 3,
VK_GPA_PERF_BLOCK_SC_AMD = 4,
VK_GPA_PERF_BLOCK_SPI_AMD = 5,
VK_GPA_PERF_BLOCK_SQ_AMD = 6,
VK_GPA_PERF_BLOCK_SX_AMD = 7,
VK_GPA_PERF_BLOCK_TA_AMD = 8,
VK_GPA_PERF_BLOCK_TD_AMD = 9,
VK_GPA_PERF_BLOCK_TCP_AMD = 10,
VK_GPA_PERF_BLOCK_TCC_AMD = 11,
VK_GPA_PERF_BLOCK_TCA_AMD = 12,
VK_GPA_PERF_BLOCK_DB_AMD = 13,
VK_GPA_PERF_BLOCK_CB_AMD = 14,
VK_GPA_PERF_BLOCK_GDS_AMD = 15,
VK_GPA_PERF_BLOCK_SRBM_AMD = 16,
VK_GPA_PERF_BLOCK_GRBM_AMD = 17,
VK_GPA_PERF_BLOCK_GRBM_SE_AMD = 18,
VK_GPA_PERF_BLOCK_RLC_AMD = 19,
VK_GPA_PERF_BLOCK_DMA_AMD = 20,
VK_GPA_PERF_BLOCK_MC_AMD = 21,
VK_GPA_PERF_BLOCK_CPG_AMD = 22,
VK_GPA_PERF_BLOCK_CPC_AMD = 23,
VK_GPA_PERF_BLOCK_WD_AMD = 24,
VK_GPA_PERF_BLOCK_TCS_AMD = 25,
VK_GPA_PERF_BLOCK_ATC_AMD = 26,
VK_GPA_PERF_BLOCK_ATC_L2_AMD = 27,
VK_GPA_PERF_BLOCK_MC_VM_L2_AMD = 28,
VK_GPA_PERF_BLOCK_EA_AMD = 29,
VK_GPA_PERF_BLOCK_RPB_AMD = 30,
VK_GPA_PERF_BLOCK_RMI_AMD = 31,
VK_GPA_PERF_BLOCK_UMCCH_AMD = 32,
VK_GPA_PERF_BLOCK_GE_AMD = 33,
VK_GPA_PERF_BLOCK_GL1A_AMD = 34,
VK_GPA_PERF_BLOCK_GL1C_AMD = 35,
VK_GPA_PERF_BLOCK_GL1CG_AMD = 36,
VK_GPA_PERF_BLOCK_GL2A_AMD = 37,
VK_GPA_PERF_BLOCK_GL2C_AMD = 38,
VK_GPA_PERF_BLOCK_CHA_AMD = 39,
VK_GPA_PERF_BLOCK_CHC_AMD = 40,
VK_GPA_PERF_BLOCK_CHCG_AMD = 41,
VK_GPA_PERF_BLOCK_GUS_AMD = 42,
VK_GPA_PERF_BLOCK_GCR_AMD = 43,
VK_GPA_PERF_BLOCK_PH_AMD = 44,
VK_GPA_PERF_BLOCK_UTCL1_AMD = 45,
VK_GPA_PERF_BLOCK_GE_DIST_AMD = 46,
VK_GPA_PERF_BLOCK_GE_SE_AMD = 47,
VK_GPA_PERF_BLOCK_DF_MALL_AMD = 48,
VK_GPA_PERF_BLOCK_SQ_WGP_AMD = 49,
VK_GPA_PERF_BLOCK_PC_AMD = 50,
VK_GPA_PERF_BLOCK_GL1XA_AMD = 51,
VK_GPA_PERF_BLOCK_GL1XC_AMD = 52,
VK_GPA_PERF_BLOCK_WGS_AMD = 53,
VK_GPA_PERF_BLOCK_EACPWD_AMD = 54,
VK_GPA_PERF_BLOCK_EASE_AMD = 55,
VK_GPA_PERF_BLOCK_RLCUSER_AMD = 56,
VK_GPA_PERF_BLOCK_GE1_AMD = VK_GPA_PERF_BLOCK_GE_AMD,
VK_GPA_PERF_BLOCK_RLCLOCAL_AMD = VK_GPA_PERF_BLOCK_RLCUSER_AMD,
} VkGpaPerfBlockAMD;
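To illustrate how a pPerfCounters array might be populated, the sketch below fills one VkGpaPerfCounterAMD entry per instance of a block, all sampling the same event. The helper `gpa_fill_counters` is hypothetical, the enum is an abbreviated local copy, and the instance counts and event IDs used with it are placeholders: real values are hardware-specific and are not given in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Abbreviated local copies so the sketch compiles without the
   extension header; only two block types are reproduced here. */
typedef enum VkGpaPerfBlockAMD {
    VK_GPA_PERF_BLOCK_SQ_AMD = 6,
    VK_GPA_PERF_BLOCK_TCC_AMD = 11
} VkGpaPerfBlockAMD;

typedef struct VkGpaPerfCounterAMD {
    VkGpaPerfBlockAMD blockType;
    uint32_t blockInstance;
    uint32_t eventID;
} VkGpaPerfCounterAMD;

/* Hypothetical helper: writes one entry per block instance, for
   instances 0..instanceCount-1, truncated to the array capacity.
   Returns the number of entries written, suitable as perfCounterCount. */
uint32_t gpa_fill_counters(VkGpaPerfCounterAMD *out, uint32_t capacity,
                           VkGpaPerfBlockAMD block, uint32_t instanceCount,
                           uint32_t eventID)
{
    assert(out != NULL || capacity == 0);
    uint32_t n = instanceCount < capacity ? instanceCount : capacity;
    for (uint32_t i = 0; i < n; ++i) {
        out[i].blockType = block;
        out[i].blockInstance = i;
        out[i].eventID = eventID;
    }
    return n;
}
```

Sampling every instance of a block and summing the results on the host is one plausible way to obtain a device-wide total for an event.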
To end a GPA sample, call:
// Provided by VK_AMD_gpa_interface
void vkCmdEndGpaSampleAMD(
VkCommandBuffer commandBuffer,
VkGpaSessionAMD gpaSession,
uint32_t sampleID);
- commandBuffer is the command buffer into which the command will be recorded.
- gpaSession is the handle of the GPA session that is recording the sample.
- sampleID is a unique sample ID returned by a previous call to vkCmdBeginGpaSampleAMD.
Controlling GPU clocks
Performance counters and thread traces only produce meaningful results when GPU clocks are stable, so the interface provides commands to control and query them.
To set or query GPU clocks, call:
// Provided by VK_AMD_gpa_interface
VkResult vkSetGpaDeviceClockModeAMD(
VkDevice device,
VkGpaDeviceClockModeInfoAMD* pInfo);
- device is the logical device that sets the clocks.
- pInfo is a pointer to a VkGpaDeviceClockModeInfoAMD structure specifying the clock mode to set or query.
The VkGpaDeviceClockModeInfoAMD structure is defined as:
// Provided by VK_AMD_gpa_interface
typedef struct VkGpaDeviceClockModeInfoAMD {
VkStructureType sType;
const void* pNext;
VkGpaDeviceClockModeAMD clockMode;
float memoryClockRatioToPeak;
float engineClockRatioToPeak;
} VkGpaDeviceClockModeInfoAMD;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- clockMode is a VkGpaDeviceClockModeAMD value specifying which clock mode to set, or whether to query the current clocks.
- memoryClockRatioToPeak is the returned ratio of the current memory clock to the maximum memory clock, if clockMode is VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD.
- engineClockRatioToPeak is the returned ratio of the current engine clock to the maximum engine clock, if clockMode is VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD.
If clockMode is VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD,
memoryClockRatioToPeak and engineClockRatioToPeak are filled
with the ratios of the current memory and engine clocks to their respective
maximums; otherwise they are left unchanged.
Values which can be set in VkGpaDeviceClockModeInfoAMD::clockMode are:
// Provided by VK_AMD_gpa_interface
typedef enum VkGpaDeviceClockModeAMD {
VK_GPA_DEVICE_CLOCK_MODE_DEFAULT_AMD = 0,
VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD = 1,
VK_GPA_DEVICE_CLOCK_MODE_PROFILING_AMD = 2,
VK_GPA_DEVICE_CLOCK_MODE_MIN_MEMORY_AMD = 3,
VK_GPA_DEVICE_CLOCK_MODE_MIN_ENGINE_AMD = 4,
VK_GPA_DEVICE_CLOCK_MODE_PEAK_AMD = 5,
} VkGpaDeviceClockModeAMD;
- VK_GPA_DEVICE_CLOCK_MODE_DEFAULT_AMD specifies that device clocks and other power settings are restored to their default values.
- VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD specifies that the current clock values should be queried, with no new values set.
- VK_GPA_DEVICE_CLOCK_MODE_PROFILING_AMD specifies that clocks are set to a constant level which is known to be power and thermal sustainable. The engine/memory clock ratio will be kept the same as much as possible.
- VK_GPA_DEVICE_CLOCK_MODE_MIN_MEMORY_AMD specifies that the memory clock is set to the lowest available level and the engine clock is set to a thermal and power sustainable level.
- VK_GPA_DEVICE_CLOCK_MODE_MIN_ENGINE_AMD specifies that the engine clock is set to the lowest available level and the memory clock is set to a thermal and power sustainable level.
- VK_GPA_DEVICE_CLOCK_MODE_PEAK_AMD specifies that the clocks are set to maximum when possible and the fans are set to maximum. Under power and thermal constraints, the device will clock down.
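One plausible usage pattern is to switch to the profiling clock mode for the duration of a measurement and restore the defaults afterwards. The sketch below shows that pattern against a callback instead of the real entry point, so it compiles on its own. `SetClockModeFn`, `WorkloadFn`, `gpa_run_with_profiling_clocks`, and the `ClockLog` recorder are all hypothetical; in an application the callback would wrap vkSetGpaDeviceClockModeAMD for a given device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the Vulkan headers. */
typedef int32_t VkResult;
enum { VK_SUCCESS = 0 };
typedef enum VkGpaDeviceClockModeAMD {
    VK_GPA_DEVICE_CLOCK_MODE_DEFAULT_AMD = 0,
    VK_GPA_DEVICE_CLOCK_MODE_QUERY_AMD = 1,
    VK_GPA_DEVICE_CLOCK_MODE_PROFILING_AMD = 2,
    VK_GPA_DEVICE_CLOCK_MODE_MIN_MEMORY_AMD = 3,
    VK_GPA_DEVICE_CLOCK_MODE_MIN_ENGINE_AMD = 4,
    VK_GPA_DEVICE_CLOCK_MODE_PEAK_AMD = 5
} VkGpaDeviceClockModeAMD;

/* Hypothetical callbacks: the first would wrap vkSetGpaDeviceClockModeAMD,
   the second would record and submit the work being measured. */
typedef VkResult (*SetClockModeFn)(void *user, VkGpaDeviceClockModeAMD mode);
typedef void (*WorkloadFn)(void *user);

/* Sets profiling clocks, runs the workload, then restores the defaults.
   Returns the first failing result, or the result of the restore. */
VkResult gpa_run_with_profiling_clocks(SetClockModeFn setMode,
                                       WorkloadFn work, void *user)
{
    assert(setMode != NULL && work != NULL);
    VkResult r = setMode(user, VK_GPA_DEVICE_CLOCK_MODE_PROFILING_AMD);
    if (r != VK_SUCCESS)
        return r; /* clocks were not changed, nothing to restore */
    work(user);
    return setMode(user, VK_GPA_DEVICE_CLOCK_MODE_DEFAULT_AMD);
}

/* Recording stand-ins, used only to exercise the helper above. */
typedef struct ClockLog {
    VkGpaDeviceClockModeAMD modes[4];
    int count;
    int workRuns;
} ClockLog;

VkResult log_set_mode(void *user, VkGpaDeviceClockModeAMD mode)
{
    ClockLog *log = user;
    if (log->count < 4)
        log->modes[log->count++] = mode;
    return VK_SUCCESS;
}

void log_work(void *user)
{
    ClockLog *log = user;
    log->workRuns++;
}
```

Restoring VK_GPA_DEVICE_CLOCK_MODE_DEFAULT_AMD afterwards returns the device to its normal power management once the measurement is complete.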
To query the current GPU clock values and their ratios to their maximums, call:
// Provided by VK_AMD_gpa_interface
VkResult vkGetGpaDeviceClockInfoAMD(
VkDevice device,
VkGpaDeviceGetClockInfoAMD* pInfo);
- device is the logical device whose clocks are queried.
- pInfo is a pointer to a VkGpaDeviceGetClockInfoAMD structure in which the clock values are returned.
The VkGpaDeviceGetClockInfoAMD structure is defined as:
// Provided by VK_AMD_gpa_interface
typedef struct VkGpaDeviceGetClockInfoAMD {
VkStructureType sType;
void* pNext;
float memoryClockRatioToPeak;
float engineClockRatioToPeak;
uint32_t memoryClockFrequency;
uint32_t engineClockFrequency;
} VkGpaDeviceGetClockInfoAMD;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- memoryClockRatioToPeak is the returned ratio of the current memory clock to the maximum memory clock.
- engineClockRatioToPeak is the returned ratio of the current engine clock to the maximum engine clock.
- memoryClockFrequency is the current memory clock frequency in MHz.
- engineClockFrequency is the current engine clock frequency in MHz.
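Because each ratio is defined as the current clock divided by the maximum clock, an approximate peak frequency can be recovered by dividing the reported frequency by its ratio. The sketch below does this for a hypothetical `GpaClockSnapshot` holding only the value fields of VkGpaDeviceGetClockInfoAMD. The helper names are assumptions, and the result is an estimate limited by floating-point precision and the MHz granularity of the reported frequency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the value fields of VkGpaDeviceGetClockInfoAMD. */
typedef struct GpaClockSnapshot {
    float memoryClockRatioToPeak;
    float engineClockRatioToPeak;
    uint32_t memoryClockFrequency; /* MHz */
    uint32_t engineClockFrequency; /* MHz */
} GpaClockSnapshot;

/* peak = current / (current / peak), rounded to the nearest MHz.
   Returns 0 when the ratio is unusable or the result would not fit. */
static uint32_t peak_from_ratio(uint32_t currentMHz, float ratioToPeak)
{
    if (!(ratioToPeak > 0.0f))
        return 0; /* rejects zero, negative values and NaN */
    double peak = (double)currentMHz / (double)ratioToPeak + 0.5;
    if (peak >= 4294967295.0)
        return 0;
    return (uint32_t)peak;
}

uint32_t gpa_estimate_peak_engine_mhz(const GpaClockSnapshot *s)
{
    assert(s != NULL);
    return peak_from_ratio(s->engineClockFrequency, s->engineClockRatioToPeak);
}

uint32_t gpa_estimate_peak_memory_mhz(const GpaClockSnapshot *s)
{
    assert(s != NULL);
    return peak_from_ratio(s->memoryClockFrequency, s->memoryClockRatioToPeak);
}
```

For example, an engine clock reported as 1000 MHz with a ratio of 0.5 corresponds to an estimated peak of 2000 MHz.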
Session status and results querying
Once a GPA session has ended, its status can be queried by calling:
// Provided by VK_AMD_gpa_interface
VkResult vkGetGpaSessionStatusAMD(
VkDevice device,
VkGpaSessionAMD gpaSession);
- device is the logical device that owns the GPA session.
- gpaSession is the session whose status is queried.
A return value of VK_SUCCESS indicates that the results are available to be read using vkGetGpaSessionResultsAMD. If results are not available, VK_NOT_READY is returned.
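The status check lends itself to a bounded polling loop. The sketch below runs against a callback instead of the real entry point so that it is self-contained. `SessionStatusFn`, `gpa_poll_session`, and `fake_status` are hypothetical, with the callback standing in for a call to vkGetGpaSessionStatusAMD on a particular device and session. The VkResult values are the core Vulkan ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the Vulkan headers. */
typedef int32_t VkResult;
enum { VK_SUCCESS = 0, VK_NOT_READY = 1 };

/* Hypothetical callback wrapping vkGetGpaSessionStatusAMD. */
typedef VkResult (*SessionStatusFn)(void *user);

/* Polls at most maxPolls times. Returns VK_SUCCESS once results are
   available, VK_NOT_READY if the budget runs out, or any other result
   (such as an error) unchanged. */
VkResult gpa_poll_session(SessionStatusFn status, void *user, uint32_t maxPolls)
{
    assert(status != NULL);
    for (uint32_t i = 0; i < maxPolls; ++i) {
        VkResult r = status(user);
        if (r != VK_NOT_READY)
            return r;
        /* A real application would yield or do other work here. */
    }
    return VK_NOT_READY;
}

/* Stand-in driver: reports VK_NOT_READY until the counter reaches zero. */
VkResult fake_status(void *user)
{
    uint32_t *remaining = user;
    if (*remaining == 0)
        return VK_SUCCESS;
    --*remaining;
    return VK_NOT_READY;
}
```

Bounding the number of polls keeps a stalled session from blocking the host indefinitely; the caller can decide whether to retry later.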
Once a GPA session’s results are available, they can be obtained by calling:
// Provided by VK_AMD_gpa_interface
VkResult vkGetGpaSessionResultsAMD(
VkDevice device,
VkGpaSessionAMD gpaSession,
uint32_t sampleID,
size_t* pSizeInBytes,
void* pData);
- device is the logical device that owns the GPA session.
- gpaSession is the session whose results are queried.
- sampleID is the sample ID, returned by vkCmdBeginGpaSampleAMD, whose results are to be queried.
- pSizeInBytes is a pointer to a size_t value related to the size of the results data, as described below.
- pData is either NULL or a pointer to an array of *pSizeInBytes bytes where the results will be written.
If pData is NULL, then the number of bytes of data in the results is
returned in pSizeInBytes.
Otherwise, pSizeInBytes must point to a variable set by the
application to the size in bytes of the pData array, and on
return the variable is overwritten with the number of bytes written to
pData.
If the value of pSizeInBytes is less than the size required to write
the results, VK_INCOMPLETE will be returned instead of
VK_SUCCESS, to indicate that the results were not written.
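The size query followed by the data query can be sketched as the usual two-call pattern. To stay self-contained, the code below calls through a callback instead of the real entry point. `GetResultsFn`, `gpa_fetch_results`, and `fake_get_results` are hypothetical, the callback stands in for vkGetGpaSessionResultsAMD bound to one device, session, and sample ID, and the stand-in driver returning two 64-bit values is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins so the sketch compiles without the Vulkan headers. */
typedef int32_t VkResult;
enum { VK_SUCCESS = 0, VK_INCOMPLETE = 5 };

/* Hypothetical callback wrapping vkGetGpaSessionResultsAMD. */
typedef VkResult (*GetResultsFn)(void *user, size_t *pSizeInBytes, void *pData);

/* Queries the size with pData == NULL, allocates, then fetches the data.
   Returns a malloc'd buffer the caller must free, or NULL on failure.
   *outSize receives the number of bytes written. */
void *gpa_fetch_results(GetResultsFn get, void *user, size_t *outSize)
{
    assert(get != NULL && outSize != NULL);
    size_t size = 0;
    *outSize = 0;
    if (get(user, &size, NULL) != VK_SUCCESS || size == 0)
        return NULL;
    void *data = malloc(size);
    if (data == NULL)
        return NULL;
    if (get(user, &size, data) != VK_SUCCESS) {
        free(data);
        return NULL;
    }
    *outSize = size;
    return data;
}

/* Stand-in driver following the rules above; user points at two
   64-bit values that play the role of the sample's results. */
VkResult fake_get_results(void *user, size_t *pSizeInBytes, void *pData)
{
    const size_t needed = 2 * sizeof(uint64_t);
    if (pData == NULL) {
        *pSizeInBytes = needed; /* size query */
        return VK_SUCCESS;
    }
    if (*pSizeInBytes < needed) {
        *pSizeInBytes = 0; /* nothing was written */
        return VK_INCOMPLETE;
    }
    memcpy(pData, user, needed);
    *pSizeInBytes = needed;
    return VK_SUCCESS;
}
```

Querying the size first avoids guessing at buffer sizes, which matters most for trace samples, whose results can be large and variable.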