Efficient Processing of Deep Neural Networks
Vivienne Sze
Synthesis Lectures on Computer Architecture
is determined by the flexibility of the architecture, for instance the on-chip network, to support the layer shapes in the DNN model.

      Within the constraints of the on-chip network, the number of active PEs is also determined by the specific allocation of work to PEs by the mapping process. The mapping process involves the placement and scheduling in space and time of every MAC operation (including the delivery of the appropriate operands) onto the PEs. Mapping can be thought of as a compiler for the DNN hardware. The design of on-chip networks and mappings is discussed in Chapters 5 and 6.
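      As a minimal illustration (a hypothetical sketch, not the book's mapper), the following Python fragment places the outputs of a small fully connected layer onto a fixed set of PEs in space and then schedules each output's MACs in time; the round-robin placement rule and the layer sizes are assumptions for illustration only.

```python
# Hypothetical sketch of a mapping: place each output of a small fully
# connected layer (M outputs, K inputs) on one of num_pes PEs in space,
# then schedule that output's K MAC operations in time.

def map_fc_layer(M, K, num_pes):
    schedule = {pe: [] for pe in range(num_pes)}
    for m in range(M):                  # placement: round-robin over PEs
        pe = m % num_pes
        for k in range(K):              # scheduling: MACs execute in order
            schedule[pe].append(("mac", m, k))  # accumulate W[m][k] * x[k]
    return schedule

sched = map_fc_layer(M=6, K=4, num_pes=4)
for pe, ops in sched.items():
    print(f"PE {pe}: {len(ops)} MACs")
# PEs 0 and 1 receive two outputs each while PEs 2 and 3 receive one,
# so this mapping leaves PEs 2 and 3 idle for half the time.
```

      Even this toy mapping exhibits the work imbalance discussed below: an uneven allocation of outputs to PEs directly lowers the utilization of some PEs.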

      The utilization of the active PEs is largely dictated by the timely delivery of work to the PEs such that the active PEs do not become idle while waiting for the data to arrive. This can be affected by the bandwidth and latency of the (on-chip and off-chip) memory and network. The bandwidth requirements can be affected by the amount of data reuse available in the DNN model and the amount of data reuse that can be exploited by the memory hierarchy and dataflow. The dataflow determines the order of operations and where data is stored and reused. The amount of data reuse can also be increased using a larger batch size, which is one of the reasons why increasing batch size can increase throughput. The challenges of data delivery and memory bandwidth are discussed in Chapters 5 and 6. The utilization of the active PEs can also be affected by the imbalance of work allocated across PEs, which can occur when exploiting sparsity (i.e., avoiding unnecessary work associated with multiplications by zero); PEs with less work become idle and thus have lower utilization.
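      To see why a larger batch size increases data reuse, consider this back-of-the-envelope sketch (the function name and numbers are assumptions for illustration): in a fully connected layer, each weight is used once per input in the batch, so the weight traffic per MAC shrinks as the batch size grows.

```python
# Back-of-the-envelope: each weight of a fully connected layer is reused
# once per input in the batch, so the average weight traffic per MAC
# falls as 1/N (assuming the weight can be held on chip across the batch).

def weight_bytes_per_mac(batch_size, bytes_per_weight=1.0):
    return bytes_per_weight / batch_size

for n in (1, 8, 64):
    print(f"batch size {n:3d}: {weight_bytes_per_mac(n):.4f} weight bytes per MAC")
```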


      Figure 3.1: The roofline model. The peak operations per second is indicated by the bold line; when the operation intensity, which is dictated by the amount of compute per byte of data, is low, the operations per second is limited by the data delivery. The design goal is to operate as close as possible to the peak operations per second for the operation intensity of a given workload.
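      A minimal sketch of the roofline computation (with assumed, illustrative peak and bandwidth values) makes the two regimes explicit: below the ridge point, performance is limited by data delivery; above it, by peak operations per second.

```python
# Roofline sketch with assumed, illustrative hardware numbers.
PEAK_OPS = 2e12      # peak operations per second (compute roof)
BANDWIDTH = 100e9    # bytes per second of data delivery

def attainable_ops_per_sec(intensity):
    # intensity: operations per byte of data moved
    return min(PEAK_OPS, intensity * BANDWIDTH)

for oi in (1, 10, 20, 100):
    print(f"{oi:4d} ops/byte -> {attainable_ops_per_sec(oi):.2e} ops/sec")
# The ridge point is PEAK_OPS / BANDWIDTH = 20 ops/byte: below it the
# workload is bandwidth bound, above it the compute roof is reachable.
```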

      There is also an interplay between the number of PEs and the utilization of PEs. For instance, one way to reduce the likelihood that a PE needs to wait for data is to store some data locally near or within the PE. However, this requires increasing the chip area allocated to on-chip storage, which, given a fixed chip area, would reduce the number of PEs. Therefore, a key design consideration is how much area to allocate to compute (which increases the number of PEs) versus on-chip storage (which increases the utilization of PEs).
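      The following toy model (the area numbers and the utilization curve are assumptions, not measurements) illustrates this compute-versus-storage trade-off under a fixed chip area.

```python
# Toy model: a fixed die area is split between PEs and on-chip storage.
# More storage means fewer PEs but (in this assumed model) higher
# utilization, because operands are more often available locally.

CHIP_AREA = 100.0   # arbitrary area units
PE_AREA = 1.0       # area per PE

def utilization(storage_fraction):
    # Assumed curve: diminishing returns as storage grows.
    return min(1.0, 0.4 + storage_fraction)

def effective_throughput(storage_fraction):
    num_pes = (CHIP_AREA * (1.0 - storage_fraction)) / PE_AREA
    return num_pes * utilization(storage_fraction)

for s in (0.1, 0.3, 0.5, 0.7):
    print(f"storage fraction {s:.1f}: {effective_throughput(s):.1f} effective PEs")
# Throughput peaks at an intermediate split (here around 0.3), not at
# either extreme, which is the essence of the design consideration.
```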

      The impact of these factors can be captured using Eyexam, which is a systematic way of understanding the performance limits for DNN processors as a function of specific characteristics of the DNN model and accelerator design. Eyexam includes and extends the well-known roofline model [119]. The roofline model, as illustrated in Figure 3.1, relates average bandwidth demand and peak computational ability to performance. Eyexam is described in Chapter 6.

      While the number of operations per inference in Equation (3.1) depends on the DNN model, the operations per second depends on both the DNN model and the hardware. For example, designing DNN models with efficient layer shapes (also referred to as efficient network architectures), as described in Chapter 9, can reduce the number of MAC operations in the DNN model and consequently the number of operations per inference. However, such DNN models can result in a wide range of layer shapes, some of which may have poor utilization of PEs and therefore reduce the overall operations per second, as shown in Equation (3.2).

      A deeper consideration of operations per second is that not all operations are created equal, and therefore cycles per operation may not be constant. For example, if we consider the fact that anything multiplied by zero is zero, some MAC operations are ineffectual (i.e., they do not change the accumulated value). The number of ineffectual operations is a function of both the DNN model and the input data. These ineffectual MAC operations can require fewer cycles or no cycles at all. Conversely, we only need to process effectual (or non-zero) MAC operations, where both inputs are non-zero; this is referred to as exploiting sparsity, which is discussed in Chapter 8.
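      A small sketch (with assumed example values) shows how effectual MACs are counted: a MAC is effectual only when both its weight and its activation are non-zero.

```python
# A MAC is effectual only when both operands are non-zero; everything
# else multiplies by zero and leaves the accumulated value unchanged.

weights     = [0, 3, 0, -1, 2, 0, 5, 0]   # assumed example values
activations = [1, 0, 0,  4, 2, 7, 3, 0]

total = len(weights)
effectual = sum(1 for w, a in zip(weights, activations) if w != 0 and a != 0)
print(f"{effectual} effectual MACs out of {total}; "
      f"{total - effectual} could be skipped by exploiting sparsity")
```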

      Equation (3.4) shows how operations per cycle can be decomposed into

      1. the number of effectual operations plus unexploited ineffectual operations per cycle, which remains somewhat constant for a given hardware accelerator design;

      2. the ratio of effectual operations over effectual operations plus unexploited ineffectual operations, which refers to the ability of the hardware to exploit ineffectual operations (ideally unexploited ineffectual operations should be zero, and this ratio should be one); and

      3. the number of effectual operations out of (total) operations, which is related to the amount of sparsity and depends on the DNN model.

      As the amount of sparsity increases (i.e., the number of effectual operations out of (total) operations decreases), the operations per cycle increases, which subsequently increases operations per second, as shown in Equation (3.2):

$$
\frac{\text{operations}}{\text{second}} = \frac{\text{effectual operations} + \text{unexploited ineffectual operations}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{effectual operations}}{\text{effectual operations} + \text{unexploited ineffectual operations}} \times \frac{1}{\left(\frac{\text{effectual operations}}{\text{operations}}\right)} \tag{3.3}
$$

$$
\frac{\text{operations}}{\text{cycle}} = \frac{\text{effectual operations} + \text{unexploited ineffectual operations}}{\text{cycle}} \times \frac{\text{effectual operations}}{\text{effectual operations} + \text{unexploited ineffectual operations}} \times \frac{1}{\left(\frac{\text{effectual operations}}{\text{operations}}\right)} \tag{3.4}
$$
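      As a numeric sanity check of the decomposition in Equation (3.4) (the operand counts below are assumed for illustration), the three terms multiply back to the total operations per cycle:

```python
# Assumed counts for one processing pass.
effectual   = 60    # MACs with both operands non-zero
unexploited = 10    # ineffectual MACs the hardware still spends cycles on
total_ops   = 100   # all MACs in the workload
cycles      = 35    # cycles spent on effectual + unexploited operations

ops_per_cycle = ((effectual + unexploited) / cycles          # term 1
                 * effectual / (effectual + unexploited)     # term 2
                 * 1.0 / (effectual / total_ops))            # term 3
assert abs(ops_per_cycle - total_ops / cycles) < 1e-9
print(f"{ops_per_cycle:.3f} operations per cycle")   # 2.857 = 100 / 35
```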

      However, exploiting sparsity requires additional hardware to identify when inputs are zero so that unnecessary MAC operations can be avoided. The additional hardware can increase the critical path, which decreases cycles per second, and can also reduce the area density of the PEs, which reduces the number of PEs for a given area. Both of these factors can reduce the operations per second, as shown in Equation (3.2). Therefore, the complexity of the additional hardware results in a trade-off between reducing the number of unexploited ineffectual operations and increasing the critical path or reducing the number of PEs.

      Finally, designing hardware and DNN models that support reduced precision (i.e., fewer bits per operand and per operation), which is discussed in Chapter 7, can also increase the number of operations per second. Fewer bits per operand means that the memory bandwidth required to support a given operation is reduced,
