mappings of the abstract OpenCL machine elements onto the MPPA3 architecture components are presented in Table 2.4. Although the LWI mode appears better suited to the OpenCL-C kernel code written for GPGPU processors, the SPMD mode is preferred for optimizing performance, as it allows the configuration of most of the compute cluster SMEM as OpenCL local memory shared by the work group.

Schematic illustration of OpenCL NDRange execution using the SPMD mode.
OpenCL          MPPA3 component
Device          MPPA processor or MPPA domain
Global memory   External DDR memory
Sub-device      Group of compute cluster(s)
Compute unit    Compute cluster (SPMD) or Processing element (LWI)

      Table 2.4. OpenCL machine elements and MPPA architecture components
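      To make the SPMD mapping concrete, the following OpenCL-C sketch stages data through a __local buffer; under the SPMD mode, such a buffer resides in the compute cluster SMEM and is shared by all work-items of the work group. The kernel name and buffer sizing are illustrative, not taken from the PoCL or Kalray sources.

/* Minimal OpenCL-C sketch: in SPMD mode one work group runs on a compute
 * cluster, so the __local buffer is carved out of the cluster SMEM and
 * shared by every work-item of the group. */
__kernel void scale_tile(__global const float *in,
                         __global float *out,
                         __local float *tile,      /* sized by the host */
                         const float factor)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];               /* stage the element in shared SMEM */
    barrier(CLK_LOCAL_MEM_FENCE);      /* synchronize the work group */

    out[gid] = factor * tile[lid];     /* compute from the local copy */
}

      On the host side, the size of the __local buffer is supplied with the standard clSetKernelArg(kernel, 2, local_bytes, NULL) call, which is how OpenCL sizes local memory arguments.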

      Most often, there is a need to port C/C++ code and to access the high-performance features implemented in the GCC compiler for the Kalray VLIW core. Among these, the C named address space extension defined by ISO/IEC TR 18037:2008 is used to annotate objects and addresses that are accessed using non-temporal (L1D cache bypass) and/or non-trapping loads. In order to call the code compiled by GCC and the MPPA communication libraries (Hascoët et al. 2017) from OpenCL-C kernels, the LLVM OpenCL-C compiler and PoCL have been extended to understand function declarations annotated with __attribute__((mppa_native)). Whenever such a reference is seen in OpenCL-C source code, the PoCL linking stage assumes that the symbol is resolved, and the MPPA3 compute cluster run-time environment dynamically loads and links the native function before starting the execution of the kernel.
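      As an illustration, a GCC-compiled native function can be declared in the OpenCL-C source as sketched below. Only the mppa_native attribute comes from the text above; the function name and signature are hypothetical.

/* Declaration side (OpenCL-C): PoCL treats the symbol as resolved, and the
 * compute cluster run-time loads and links the GCC-compiled definition
 * before the kernel starts. */
float dot_product(const float *a, const float *b, int n)
    __attribute__((mppa_native));

__kernel void call_native(__global const float *a,
                          __global const float *b,
                          __global float *result,
                          const int n)
{
    /* The native definition may use GCC features such as the ISO/IEC
     * TR 18037 named address spaces mentioned above. */
    result[get_group_id(0)] = dot_product(a, b, n);
}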

      This native function extension also enables kernels to access other services such as a lightweight lock-free POSIX multi-threading environment, fast inter-PE hardware synchronizations, dynamic local memory allocation and remoting of system calls to the host OS, including FILE I/O.
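      For instance, thanks to the remoting of system calls, a native function can use plain C FILE I/O, as in the hypothetical helper below (the function and file name are illustrative):

#include <stdio.h>

/* Native-side helper, declared with __attribute__((mppa_native)) in the
 * OpenCL-C source; the fopen/fprintf system calls are remoted to the host
 * OS by the compute cluster run-time environment. */
void log_progress(int layer, float milliseconds)
{
    FILE *f = fopen("trace.log", "a");   /* illustrative file name */
    if (f != NULL) {
        fprintf(f, "layer %d: %.3f ms\n", layer, milliseconds);
        fclose(f);
    }
}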

      2.4.2. KaNN code generator


      Figure 2.15. KaNN inference code generator workflow

      Following the import of the input model into an intermediate representation, optimizations are applied to the compute graph:

       – elimination of channel concatenation and slicing copies;

       – padding of input activations of convolutional layers;

       – folding of batch normalizations, scalings and additions into a single pointwise fused multiply-add operator (a folding sketch follows this list);

       – fusion of convolutions with ReLU activation functions;

       – adaptation of arithmetic representations.
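      The batch-normalization folding relies on a standard identity: y = gamma*(x - mean)/sqrt(var + eps) + beta collapses to y = scale*x + bias per channel. The C sketch below computes the folded coefficients; variable names are illustrative and not taken from the KaNN sources.

#include <math.h>

/* Fold a batch normalization into per-channel multiply-add coefficients,
 * so that y = scale[c]*x + bias[c] replaces the four original parameters. */
void fold_batch_norm(int channels,
                     const float *gamma, const float *beta,
                     const float *mean, const float *var, float eps,
                     float *scale, float *bias)
{
    for (int c = 0; c < channels; ++c) {
        scale[c] = gamma[c] / sqrtf(var[c] + eps);
        bias[c]  = beta[c] - scale[c] * mean[c];
    }
}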

      The output activations of each layer are then split across the compute clusters of the target sub-device, either along the spatial dimensions or along the channel dimension (Figure 2.16; a code sketch follows the figure):

       – In case of spatial splitting of the output activations, each compute cluster only accesses an input activation tile and its shadow region, while all the operator parameters are required; these are read once from the DDR memory and multicast to all the target compute clusters.

       – In case of channel splitting of the output activations, the full input layer must be replicated in the local memory of each compute cluster, but only the corresponding slice of parameters is read from the DDR memory.

      In all cases, activations are computed once, laid out sequentially along the channel dimension and possibly copied to other local memories.


      Figure 2.16. Activation splitting across MPPA3 compute clusters
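      A hedged C sketch of the two splitting strategies follows; the cluster count, the function names and the stride-1 convolution assumption are all illustrative.

#define NB_CLUSTERS 4   /* illustrative sub-device size */

/* Spatial splitting: cluster 'c' produces a band of output rows and must
 * read the matching input rows plus a halo (the shadow region) of k/2 rows
 * on each side for a k x k stride-1 convolution; the operator parameters
 * are multicast to all clusters. */
void spatial_input_rows(int c, int height, int k, int *begin, int *end)
{
    int band = (height + NB_CLUSTERS - 1) / NB_CLUSTERS;
    *begin = c * band - k / 2;
    *end   = (c + 1) * band + k / 2;
    if (*begin < 0) *begin = 0;
    if (*end > height) *end = height;
}

/* Channel splitting: every cluster replicates the full input layer but
 * reads only its own slice of the parameters from the DDR memory. */
void channel_param_slice(int c, int out_channels, int *begin, int *end)
{
    int slice = (out_channels + NB_CLUSTERS - 1) / NB_CLUSTERS;
    *begin = c * slice;
    *end   = (c + 1) * slice;
    if (*end > out_channels) *end = out_channels;
}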

      For any compute cluster in the target sub-device, the code generation process defines and implements a local schedule for:

       – local memory buffer allocations/deallocations;

       – DDR memory read/multicast of parameters;

       – execution of operator computations;

       – inter-cluster activation exchanges;

       – inter-cluster synchronizations.

      This process is backed by the computation graph (Figure 2.17) augmented with parameter read tasks (yellow) and activation production tasks (blue).

      The result of KaNN code generation is a collection of OpenCL binary kernels, where each kernel interprets the contents of a static data block composed of a sequence of records. Each record contains its length, a native compute function pointer and a structure containing arguments for the compute function. For each record, the OpenCL kernel calls the native compute function with the pointer to the structure. The kernel ends after the interpretation of the last record.
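      A hedged C rendering of this record layout and interpretation loop is sketched below; the field and type names are assumptions, and only the record contents (length, native compute function pointer, argument structure) come from the description above.

#include <stddef.h>
#include <stdint.h>

typedef void (*compute_fn_t)(void *args);

typedef struct {
    uint32_t     length;    /* total record size in bytes, header included */
    compute_fn_t compute;   /* native compute function, linked at load time */
    /* the argument structure for the compute function follows in-line */
} record_t;

/* Interpret the static data block record by record; the kernel returns
 * after the last record has been interpreted. */
void interpret_block(const uint8_t *block, size_t block_size)
{
    size_t offset = 0;
    while (offset < block_size) {
        const record_t *rec = (const record_t *)(block + offset);
        void *args = (uint8_t *)block + offset + sizeof(record_t);
        rec->compute(args);      /* call with the pointer to its arguments */
        offset += rec->length;   /* advance to the next record */
    }
}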

      Figure 2.17. KaNN augmented computation graph
