USM Basic Concept#

Unified Shared Memory (USM) provides a pointer-based alternative to the buffer programming model.

USM enables:

  • Easier integration into existing code bases by representing allocations as pointers rather than buffers, with full support for pointer arithmetic into allocations.

  • Fine-grain control over ownership and accessibility of allocations, to optimally choose between performance and programmer convenience.

  • A simpler programming model, by automatically migrating some allocations between SYCL devices and the host.

See also

SYCL Specification Section 4.8

Unified addressing#

Unified Addressing guarantees that all devices will use a unified address space.

Pointer values in the unified address space will always refer to the same location in memory. The unified address space encompasses the host and one or more devices.

Note that this does not require addresses in the unified address space to be accessible on all devices, just that pointer values will be consistent.

Kinds of USM#

USM builds upon Unified Addressing to define a shared address space where pointer values in this space always refer to the same location in memory. USM defines three types of memory allocations: host, device, and shared.

The following enum is used to refer to the different types of allocations inside of a SYCL program:

namespace sycl::usm {

enum class alloc : /* unspecified */ {
  host,
  device,
  shared,
  unknown
};

} // namespace sycl::usm

USM allocation type

Description

host

Allocations in host memory that are accessible by a device (in addition to the host).

device

Allocations in device memory that are not accessible by the host.

shared

Allocations in shared memory that are accessible by both host and device.

USM is an optional feature which may not be supported by all devices, and devices that support USM may not support all types of USM allocation.

A SYCL application can use the sycl::device::has() function to determine the level of USM support for a device (See Device aspects).

See also

SYCL Specification Section 4.8.2

USM accesses must be within the sycl::context used for allocation#

Each USM allocation has an associated SYCL context, and any access to that memory must use the same context. Specifically, any SYCL kernel function that dereferences a pointer to a USM allocation must be submitted to a sycl::queue that was constructed with the same context that was used to allocate that memory. The explicit memory operation commands that take USM pointers have a similar restriction.

There are no similar restrictions for dereferencing a USM pointer in a host task. This is legal regardless of which queue the host task was submitted to so long as the USM pointer is accessible on the host.

Warning

Each type of USM allocation has different rules for where that memory is accessible. Attempting to dereference a USM pointer on the host or on a device in violation of these rules results in undefined behavior.

Passing a USM pointer to one of the explicit memory functions where the pointer is not accessible to the device generally results in undefined behavior.

Host allocations#

Host allocations allow devices to directly read and write host memory inside of a kernel.

Host allocations must also be obtained using SYCL routines instead of system allocation routines. While a device may remotely read and write a host allocation, the allocation does not migrate to the device - it remains in host memory.

Warning

Users should take care to properly synchronize access to host allocations between host execution and kernels.

The total size of host allocations will be limited by the amount of pinnable-memory on the host on most systems.

Support for host allocations on a specific device can be queried through sycl::aspect::usm_host_allocations.

Support for atomic modification of host allocations on a specific device can be queried through sycl::aspect::usm_atomic_host_allocations.

Device allocations#

Device allocations are used for explicitly managing device memory.

Device allocations are obtained through SYCL device USM allocation routines instead of system allocation routines like std::malloc or C++ new.

Device allocations are not accessible on the host, but the pointer values remain consistent on account of Unified addressing.

With device allocations data is directly allocated in the device memory and it must be explicitly copied between the host and a device.

The size of device allocations will be limited by the amount of memory in a device.

Support for device allocations on a specific device can be queried through sycl::aspect::usm_device_allocations.

The member functions to copy and initialize data are found in sycl::queue shortcut functions and sycl::handler explicit memory operations, and these functions may be used on device allocations if a device supports sycl::aspect::usm_device_allocations.

Example

See usm-example-2.

Shared allocations#

Shared allocations implicitly share data between the host and devices.

Data may move to where it is being used without the programmer explicitly informing the runtime. It is up to the runtime and backends to make sure that a shared allocation is available where it is used.

Shared allocations must also be obtained using SYCL allocation routines instead of the system allocator.

The maximum size of a shared allocation on a specific device, and the total size of all shared allocations in a context, are implementation-defined.

Support for shared allocations on a specific device can be queried through sycl::aspect::usm_shared_allocations.

Warning

Not all devices may support concurrent access of a shared allocation with the host.

If a device does not support this, host execution and device code must take turns accessing the allocation, so the host must not access a shared allocation while a kernel is executing.

Warning

Host access to a shared allocation which is also accessed by an executing kernel on a device that does not support concurrent access results in undefined behavior.

If a device does support concurrent access, both the host and and the device may atomically modify the same data inside an allocation.

Allocations, or pieces of allocations, are now free to migrate to different devices in the same context that also support this capability. Additionally, many devices that support concurrent access may support a working set of shared allocations larger than device memory.

Whether a device supports concurrent access with atomic modification of shared allocations can be queried through the aspect sycl::aspect::usm_atomic_shared_allocations.

Performance hints

  1. Performance hints for shared allocations may be specified by the user by enqueueing prefetch operations on a device. These operations inform the SYCL runtime that the specified shared allocation is likely to be accessed on the device in the future, and that it is free to migrate the allocation to the device. If a device supports concurrent access to shared allocations, then prefetch operations may be overlapped with kernel execution. More about prefetch is found in sycl::queue shortcut functions and sycl::handler explicit memory operations,

  2. Users also may use the mem_advise member function to annotate shared allocations with advice. Valid advice is defined by the device and its associated backend. See sycl::queue shortcut functions and sycl::handler explicit memory operations, for more information.

Example

See usm-example-1.

System allocations#

In the most capable systems, users do not need to use SYCL USM allocation functions to create shared allocations. The system allocator (malloc/new) may instead be used. Likewise, std::free and delete are used instead of sycl::free.

Users may query the device to determine if system allocations are supported for use on the device, through sycl::aspect::usm_system_allocations.

Note

Host and device allocations are unaffected by this change and must still be allocated using their respective USM functions in order to guarantee their behavior.

Example 1#

Example of how shared memory can be used between host and device:

 1#include <sycl/sycl.hpp>
 2
 3#include <iostream>
 4
 5int main() {
 6  //  Create a default queue to enqueue work to the default device
 7  sycl::queue myQueue;
 8
 9  // Allocate shared memory bound to the device and context associated to the
10  // queue Replacing malloc_shared with malloc_host would yield a correct
11  // program that allocated device-visible memory on the host.
12  int *data = sycl::malloc_shared<int>(1024, myQueue);
13
14  myQueue.parallel_for(1024, [=](sycl::id<1> idx) {
15    // Initialize each buffer element with its own rank number starting at 0
16    data[idx] = idx;
17  }); // End of the kernel function
18
19  // Explicitly wait for kernel execution since there is no accessor involved
20  myQueue.wait();
21
22  // Print result
23  for (int i = 0; i < 1024; i++)
24    std::cout << "data[" << i << "] = " << data[i] << std::endl;
25
26  return 0;
27}

Example 2#

Example of using less capable device memory, which requires an explicit copy between the device and the host:

 1#include <sycl/sycl.hpp>
 2
 3#include <iostream>
 4
 5int main() {
 6  // Create a default queue to enqueue work to the default device
 7  sycl::queue myQueue;
 8
 9  // Allocate shared memory bound to the device and context associated to the
10  // queue
11  int *data = sycl::malloc_device<int>(1024, myQueue);
12
13  myQueue.parallel_for(1024, [=](sycl::id<1> idx) {
14    // Initialize each buffer element with its own rank number starting at 0
15    data[idx] = idx;
16  }); // End of the kernel function
17
18  // Explicitly wait for kernel execution since there is no accessor involved
19  myQueue.wait();
20
21  // Create an array to receive the device content
22  int hostData[1024];
23  // Receive the content from the device
24  myQueue.memcpy(hostData, data, 1024 * sizeof(int));
25  // Wait for the copy to complete
26  myQueue.wait();
27
28  // Print result
29  for (int i = 0; i < 1024; i++)
30    std::cout << "hostData[" << i << "] = " << hostData[i] << std::endl;
31
32  return 0;
33}