
Full predication (main examples): Multiflow TRACE processors, Cydrome Cydra-5, HP Labs Lx / STMicroelectronics ST200, HP-Intel IA64, Philips TriMedia, Texas Instruments VelociTI.

       – Dismissible loads: these instructions enable control speculation of load instructions by suppressing exceptions on address errors, and by ensuring that no side-effects occur in the I/O areas. Additional configuration in the MMU refines their behavior on protection and no-mapping exceptions (a sketch of control speculation with a dismissible load follows this list).

       – No rotating registers: rotating registers rename temporary variables defined inside software pipelines, whose schedule is built while ignoring register antidependences. However, rotating registers add significant ISA and implementation complexity, while temporary variable renaming can be done by the compiler.

       – Widened memory access: widening the memory accesses on a single port is simpler to implement than multiple memory ports, especially when memory address translation is implied. This simplification enables, in turn, the support of misaligned memory accesses, which significantly improves compiler vectorization opportunities.

       – Unification of the scalar and SIMD data paths around a main register file of 64×64-bit registers, for the same motivations as the POWER vector-scalar architecture (Gschwind 2016). Operands for the SIMD instructions map to register pairs (16 bytes) or to register quadruples (32 bytes).
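      As an illustration of how dismissible loads enable control speculation, the C sketch below hoists a load above its guarding test. __dismissible_load_u32() is a hypothetical intrinsic used only for illustration; it does not name an actual compiler interface of the architecture.

#include <stdint.h>

/* Hypothetical intrinsic standing in for a dismissible load: it suppresses
 * exceptions on address errors and has no side effects in I/O areas, so the
 * compiler may schedule it before the test that guards it. */
extern uint32_t __dismissible_load_u32(const uint32_t *p);

static uint32_t lookup(const uint32_t *p, int valid)
{
    /* Non-speculative form: return valid ? *p : 0;
     * there, the load cannot be scheduled above the test on 'valid'. */
    uint32_t speculated = __dismissible_load_u32(p); /* hoisted, never traps */
    return valid ? speculated : 0;                   /* value kept only when valid */
}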

      2.3.4. Coprocessor


      Figure 2.11. Tensor coprocessor data path

      The coprocessor data path is designed by assuming that the activations and weights, respectively, have row-major and column-major layout in memory, in order to avoid the complexities of Morton memory indexing (Rovder et al. 2019). Due to the mixed-precision arithmetic, matrix operands may take one, two or four consecutive registers, with element sizes of one, two, four and eight bytes. In all cases, the coprocessor operations interpret matrix operands as having four rows and a variable number of columns, depending on the number of consecutive registers and the element size. In order to support this invariant, four 32-byte “load-scatter” instructions into coprocessor registers are provided. A load-scatter instruction loads 32 consecutive bytes from memory, interprets these as four 64-bit (8 bytes) blocks and writes each block into a specified quarter of each register that composes the destination operand (Figure 2.12). After executing the four load-scatter variants, a 4×P submatrix of a matrix with row-major order in memory is loaded into a coprocessor register quadruple.
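      The C fragment below is a behavioral model of one load-scatter variant, under the assumption that a coprocessor register can be represented as four 8-byte quarters; it sketches the data movement only, not the actual instruction encoding.

#include <stdint.h>
#include <string.h>

/* A coprocessor register modeled as four 8-byte quarters. */
typedef struct { uint64_t quarter[4]; } vreg_t;

/* One load-scatter variant: read 32 consecutive bytes from memory as four
 * 64-bit blocks and write block i into quarter q of register i of the
 * destination quadruple. */
static void load_scatter(vreg_t dst[4], const void *mem, int q)
{
    uint64_t block[4];
    memcpy(block, mem, sizeof block);     /* 32 consecutive bytes */
    for (int i = 0; i < 4; i++)
        dst[i].quarter[q] = block[i];
}

      Executing the four variants on four consecutive 32-byte rows of a row-major matrix (variant q on row q) fills the register quadruple with a 4×P submatrix, as in Figure 2.12.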


      Figure 2.12. Load-scatter to a quadruple register operand


      Figure 2.13. INT8.32 matrix multiply-accumulate operation
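      As a functional reference for the operation of Figure 2.13, the C code below computes an INT8.32 matrix multiply-accumulate: int8 activations in row-major order and int8 weights in column-major order, accumulated into 32-bit results. The 4×4 output tile and the free dimension K are illustrative choices; the exact operand shapes of the hardware operation are not reproduced here.

#include <stdint.h>

/* acc[i][j] += sum over k of a[i][k] * b[k][j], with 8-bit inputs and 32-bit
 * accumulation; 'a' is row-major (4 rows of K), 'b' is column-major
 * (4 columns of K). */
static void mma_int8_32(int32_t acc[4][4], const int8_t *a, const int8_t *b, int K)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < K; k++)
                acc[i][j] += (int32_t)a[i * K + k] * (int32_t)b[j * K + k];
}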

      2.4.1. High-performance computing

       – an OpenCL device is an offloading target where computations are sent using a command queue;

       – an OpenCL device has a global memory allocated and managed by the host application, and shared by the multiple compute units of the OpenCL device;

       – an OpenCL compute unit comprises several processing elements (PEs) that share the compute unit local memory;

       – each OpenCL PE also has a private memory, and shared access to the device’s global memory without cache coherence across compute units.

      The OpenCL sub-devices are defined as non-intersecting sets of compute units inside a device, which have dedicated command queues while sharing the global memory.
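      For reference, this partitioning is expressible with the standard OpenCL host API; the snippet below splits a device into sub-devices of four compute units each and gives each one its own command queue. The granularity and array sizes are arbitrary choices for illustration, not a description of the MPPA3 runtime.

#include <CL/cl.h>

/* Partition 'device' into sub-devices of 4 compute units each and create one
 * command queue per sub-device; global memory remains shared among them. */
static cl_uint make_subdevice_queues(cl_device_id device, cl_context context,
                                     cl_device_id subdev[], cl_command_queue queue[],
                                     cl_uint max_subdev)
{
    cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };
    cl_uint num_subdev = 0;
    cl_int err = clCreateSubDevices(device, props, max_subdev, subdev, &num_subdev);
    if (err != CL_SUCCESS)
        return 0;
    for (cl_uint i = 0; i < num_subdev; i++)
        queue[i] = clCreateCommandQueueWithProperties(context, subdev[i], NULL, &err);
    return num_subdev;
}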

      On the MPPA3 processor, high-performance computing functions are dispatched to partitions composed of one or more compute clusters, each of which is exposed as an OpenCL sub-device. In the port of the PoCL environment, support for OpenCL sub-devices has been developed, and two offloading modes are provided:

      LWI (Linearized Work Items): all the work items of a work group are executed within a loop on a single PE. This is the default execution mode of PoCL;

      SPMD (Single Program Multiple Data): the work items of a work group are executed concurrently on the PEs of a compute cluster, with the __local OpenCL memory space shared by the PEs and located in the SMEM (Figure 2.14). A conceptual sketch of both modes follows.
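      The contrast between the two modes can be sketched as follows for one work group. This is conceptual C code rather than the PoCL runtime, and kernel_body() stands for the compiled kernel invoked once per work item.

#include <stddef.h>

/* LWI: a single PE executes every work item of the group within a loop. */
static void run_group_lwi(size_t group_size, void (*kernel_body)(size_t item))
{
    for (size_t item = 0; item < group_size; item++)
        kernel_body(item);
}

/* SPMD: each PE of the compute cluster executes its share of the work items
 * concurrently; __local buffers are shared by the PEs and reside in the SMEM. */
static void run_group_spmd(size_t group_size, size_t pe_id, size_t num_pes,
                           void (*kernel_body)(size_t item))
{
    for (size_t item = pe_id; item < group_size; item += num_pes)
        kernel_body(item);
}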
