Tensor Tiling Library
The Tensor Tiling Library (TTL) is an open-source library that enables efficient tiling of, and computation with, tensors.
Please reach out to chris.gearing@mobileye.com or ayal.zaks@mobileye.com for more information.
This document outlines the purpose of this sample implementation and provides build and execution instructions.
Contents
- Purpose
- Example
- Doxygen
- Building And Executing
- CMake
- Tested Supported Systems
- Requirements
- Building the Samples
- Installation
- Preprocessed Header
- Included Unit Tests
- Bug Reporting
Purpose
The purpose of this software package is to provide a simple-to-use, standardized way of tiling tensors to allow performance optimization on heterogeneous machines. Tiling is one of the pillars of performance, and the Tensor Tiling Library is intended to provide a standardized approach.
The library is intended to be general purpose and usable on all architectures.
Currently the Tensor Tiling Library:
- passes its own unit tests
- contains reference implementations
- is optimized
Example
This is a double-buffering example in which data is moved between host and device while the compute is occurring.
__kernel void TTL_double_buffering(__global uchar *restrict ext_base_in, int external_stride_in,
                                   __global uchar *restrict ext_base_out, int external_stride_out,
                                   int width, int height, int tile_width, int tile_height) {
    /* ... */
    TTL_import_double_buffering_t import_db = TTL_start_import_double_buffering(
        l_in1, l_in2, ext_input_tensor, &import_DB_e, TTL_get_tile(0, tiler));
    TTL_export_double_buffering_t export_db = /* ... */;

    for (/* ... */) {
        /* ... */
        compute(imported_to, exported_from);
    }
}
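The buffer rotation behind this example can be illustrated without TTL. The following is a minimal plain-C sketch, not the TTL implementation: `copy_tile`, `process_tiles` and the doubling "compute" step are hypothetical, and `memcpy` stands in for the asynchronous host/device copies, so nothing actually overlaps here. Only the ping-pong order of the two local buffers is shown.

```c
#include <assert.h>
#include <string.h>

enum { TILE_ELEMS = 4 };

/* Hypothetical stand-in for an asynchronous import: copy one tile into a local buffer. */
static void copy_tile(unsigned char *dst, const unsigned char *src, int count) {
    memcpy(dst, src, (size_t)count);
}

/* Process `total` bytes tile by tile using two alternating local buffers.
 * While buffer `cur` is computed on, the next tile is fetched into the other one. */
static void process_tiles(const unsigned char *in, unsigned char *out, int total) {
    unsigned char local[2][TILE_ELEMS];
    int tiles, cur = 0;

    assert(total > 0 && total % TILE_ELEMS == 0);
    tiles = total / TILE_ELEMS;

    copy_tile(local[cur], in, TILE_ELEMS); /* prime the first buffer */

    for (int t = 0; t < tiles; ++t) {
        int next = 1 - cur;
        if (t + 1 < tiles) {
            /* Fetch tile t + 1 (this would be asynchronous on a real device). */
            copy_tile(local[next], in + (t + 1) * TILE_ELEMS, TILE_ELEMS);
        }
        /* Compute on tile t: here, simply double each element. */
        for (int i = 0; i < TILE_ELEMS; ++i) {
            out[t * TILE_ELEMS + i] = (unsigned char)(local[cur][i] * 2);
        }
        cur = next; /* swap the roles of the two buffers */
    }
}
```

On a device with asynchronous copies, the fetch of tile t + 1 and the compute on tile t would run concurrently, which is where the performance gain comes from.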
static void TTL_finish_buffering(TTL_import_double_const_void_tensor_buffering_t *import_double_buffering)
static TTL_int_void_sub_tensor_t TTL_step_buffering(TTL_import_double_const_void_tensor_buffering_t *const db, const TTL_tile_t next_tile)
Wait for the previous import operation to complete before beginning an import of the next tile.
static TTL_export_double_const_void_tensor_buffering_t TTL_start_export_double_buffering(__local void *int_base1, __local void *int_base2, TTL_ext_void_tensor_t ext_tensor, TTL_event_t *event)
Create a TTL_export_double_buffering_t and begin the buffering process.
static TTL_import_double_const_void_tensor_buffering_t TTL_start_import_double_buffering(__local void *int_base1, __local void *int_base2, TTL_const_ext_void_tensor_t ext_tensor, TTL_event_t *event, TTL_tile_t first_tile)
Create a TTL_import_double_buffering_t and begin the buffering process.
TTL_const_ext_void_tensor_t TTL_const_ext_tensor_t
TTL_ext_void_tensor_t TTL_ext_tensor_t
static TTL_ext_void_tensor_t TTL_create_ext_tensor(__global void *base, const TTL_shape_t shape, const TTL_layout_t layout, const TTL_offset_t offset, const TTL_dim_t elem_size)
const and non-const tensor creation functions.
static TTL_const_ext_void_tensor_t TTL_create_const_ext_tensor(__global const void *base, const TTL_shape_t shape, const TTL_layout_t layout, const TTL_offset_t offset, const TTL_dim_t elem_size)
Create a const external tensor.
TTL_int_void_sub_tensor_t TTL_int_sub_tensor_t
static TTL_layout_t TTL_create_layout(void)
Create a 1D Description of a Tensor layout in memory.
static TTL_tile_t TTL_get_tile(const int tile_id, const TTL_tiler_t tiler)
Return the tile_id'th tile of a tile array in row-major order.
static int TTL_number_of_tiles(TTL_tiler_t tiler)
Return the number of tiles that this tiler can produce.
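How a tile id can map to a position is sketched below in plain C. This illustrates row-major tile enumeration under assumed conventions and is not TTL's implementation: the `sketch_` names and struct fields are hypothetical, only two dimensions are handled, and tiles on the right and bottom edges are clipped to the tensor bounds.

```c
#include <assert.h>

/* Hypothetical 2D descriptors (not the TTL types). */
typedef struct { int width, height; } sketch_shape2d_t;
typedef struct { int x, y, width, height; } sketch_tile_t;

/* Ceiling division for positive integers. */
static int sketch_ceil_div(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

/* Number of tiles needed to cover `tensor` with tiles of shape `tile`. */
static int sketch_number_of_tiles(sketch_shape2d_t tensor, sketch_shape2d_t tile) {
    return sketch_ceil_div(tensor.width, tile.width) *
           sketch_ceil_div(tensor.height, tile.height);
}

/* Return the tile_id'th tile in row-major order, clipped at the tensor edges. */
static sketch_tile_t sketch_get_tile(int tile_id, sketch_shape2d_t tensor, sketch_shape2d_t tile) {
    int tiles_per_row = sketch_ceil_div(tensor.width, tile.width);
    sketch_tile_t t;

    assert(tile_id >= 0 && tile_id < sketch_number_of_tiles(tensor, tile));
    t.x = (tile_id % tiles_per_row) * tile.width;
    t.y = (tile_id / tiles_per_row) * tile.height;
    t.width = (t.x + tile.width > tensor.width) ? tensor.width - t.x : tile.width;
    t.height = (t.y + tile.height > tensor.height) ? tensor.height - t.y : tile.height;
    return t;
}
```

For a 10x6 tensor and 4x4 tiles this gives 3 x 2 = 6 tiles; tile 2 starts at x = 8 and is clipped to a width of 2.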
static TTL_tiler_t TTL_create_tiler(const TTL_shape_t shape, const TTL_shape_t tile)
event_t TTL_event_t
TTL_event_t is an alias for the OpenCL event_t type.
#define __global
The OpenCL __global address space qualifier is not supported in C, so it is defined as empty.
#define __local
The OpenCL __local address space qualifier is not supported in C, so it is defined as empty.
unsigned char uchar
OpenCL, and therefore TTL, supports a type called uchar, which is not part of C.
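These definitions are what let kernel-style source compile as plain C. A minimal sketch of the idea follows, assuming a hypothetical `scale_row` helper that is not part of TTL: with the qualifiers defined as empty macros and `uchar` supplied as a typedef, OpenCL-style declarations become ordinary C declarations.

```c
#include <assert.h>

/* In plain C the OpenCL address space qualifiers expand to nothing. */
#define __global
#define __local

/* uchar is built into OpenCL C but has to be supplied in C. */
typedef unsigned char uchar;

/* Hypothetical kernel-style helper: scale `count` bytes from "global"
 * memory into a "local" buffer. In C both are ordinary pointers. */
static void scale_row(__global const uchar *src, __local uchar *dst, int count, int factor) {
    assert(src != 0 && dst != 0 && count >= 0);
    for (int i = 0; i < count; ++i) {
        dst[i] = (uchar)(src[i] * factor);
    }
}
```

The same function body could then be shared between an OpenCL build and a C build, with only the definitions at the top differing.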
static TTL_event_t TTL_get_event()
Return an empty event of type TTL_event_t.
Description of a Tensor layout in memory.
TTL_tiler_t is the basic unit that describes how a tensor is subdivided into tiles.
static TTL_shape_t TTL_create_shape(TTL_dim_t width, TTL_dim_t height, TTL_dim_t depth)
Create a description of a Shape.
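A shape, a layout and an element size together determine where each element lives in memory. The sketch below shows one way such an address calculation could look. It is an assumption-laden illustration rather than TTL's code: the `sketch_` types, the `row_spacing` and `plane_spacing` fields (counted in elements) and `sketch_byte_offset` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptors, loosely modelled on the shape, layout and
 * elem_size parameters of the tensor creation functions. */
typedef struct { int width, height, depth; } sketch_shape3d_t;
typedef struct { int row_spacing, plane_spacing; } sketch_layout_t; /* in elements */

typedef struct {
    unsigned char *base;     /* start of the tensor in memory */
    sketch_shape3d_t shape;  /* logical extent */
    sketch_layout_t layout;  /* distance between rows and planes */
    int elem_size;           /* bytes per element */
} sketch_tensor_t;

/* Byte offset of element (x, y, z) from the tensor base for a strided layout. */
static size_t sketch_byte_offset(const sketch_tensor_t *t, int x, int y, int z) {
    assert(x >= 0 && x < t->shape.width);
    assert(y >= 0 && y < t->shape.height);
    assert(z >= 0 && z < t->shape.depth);
    return ((size_t)z * (size_t)t->layout.plane_spacing +
            (size_t)y * (size_t)t->layout.row_spacing + (size_t)x) *
           (size_t)t->elem_size;
}
```

Keeping the row spacing separate from the width lets a descriptor address a padded buffer, or a tile inside a larger tensor, without copying the data.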
Doxygen
Doxygen is supported and the documentation can be built using the scripts/generate_doxygen.sh script. It is built automatically by GitHub for main, and the latest version can be found at https://github.khronos.org/OpenCL-TTL/
Building And Executing
The sample implementation builds under PoCL on x86 and demonstrates a number of buffering schemes.
It builds using CMake and has been tested on Linux.
CMake
Tested Supported Systems
Requirements
- Python 2.x (Tested with Python 2.7)
- CMake 2.8.12 or higher (should be in PATH)
Building the Samples
- Install PoCL or another environment of your choice.
- Install pyopencl
Execute the following commands:
$ cd opencl/samples
$ ./TTL_sample_runner.py *.cl
$ cd ../c/samples
$ ./TTL_sample_runner.py *.c
Installation
See INSTALL.
Preprocessed Header
TTL is very strongly typed and uses many macros to create variants of each method. While powerful, this can make debugging trickier. For this reason, a script is provided to preprocess TTL and write the result to stdout or to a given file.
$ scripts/preprocess.sh [-f TTL_opencl.h] [-t opencl]
$ scripts/preprocess.sh [-f TTL_c.h] [-t c]
-f defaults to /dev/stdout.
-t defaults to opencl.
This writes a preprocessed, clang-formatted header to stdout or to the file given with -f. Replacing TTL.h with this file can make debugging easier. See also the TTL_PRE_GENERATE option in INSTALL.
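Why macro-generated variants are awkward to debug can be seen in a small sketch. The `DEFINE_SUM` macro and the generated `sum_int` and `sum_float` functions are hypothetical and do not come from TTL; they only illustrate the pattern of stamping out one function per type.

```c
#include <assert.h>

/* Stamp out a typed sum function. The generated function name and body
 * only exist after preprocessing. */
#define DEFINE_SUM(NAME, TYPE)                              \
    static TYPE sum_##NAME(const TYPE *values, int count) { \
        TYPE total = 0;                                     \
        assert(values != 0 && count >= 0);                  \
        for (int i = 0; i < count; ++i) {                   \
            total += values[i];                             \
        }                                                   \
        return total;                                       \
    }

DEFINE_SUM(int, int)     /* generates sum_int */
DEFINE_SUM(float, float) /* generates sum_float */
```

A compiler diagnostic or debugger location inside `sum_float` points at the single `DEFINE_SUM` line rather than at readable code. In a preprocessed file each variant appears as ordinary source that can be stepped through.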
Included Unit Tests
See the test README.
Bug Reporting
Bugs can be reported by filing an issue on GitHub.