Efficient Processing of Deep Neural Networks. Vivienne Sze

Чтение книги онлайн.

Читать онлайн книгу Efficient Processing of Deep Neural Networks - Vivienne Sze страница 21

Efficient Processing of Deep Neural Networks - Vivienne Sze Synthesis Lectures on Computer Architecture

Скачать книгу

to not only report the energy efficiency and power consumption of the chip, but also the energy efficiency and power consumption of the off-chip memory (e.g., DRAM) or the amount of off-chip accesses (e.g., DRAM accesses) if no specific memory technology is specified; for the latter, it can be reported in terms of the total amount of data that is read and written off-chip per inference.

      Reducing the joules per MAC operation itself can be achieved by reducing the switching activity and/or capacitance at a circuit level or micro-architecture level. This can also be achieved by reducing precision (e.g., reducing the bit width of the MAC operation), as shown in Figure 3.3 and discussed in Chapter 7. Note that the impact of reducing precision on accuracy must also be considered.

      For instruction-based systems such as CPUs and GPUs, this can also be achieved by reducing instruction bookkeeping overhead. For example, using large aggregate instructions (e.g., single-instruction, multiple-data (SIMD)/Vector Instructions; single-instruction, multiple-threads (SIMT)/Tensor Instructions), a single instruction can be used to initiate multiple operations.

      Similar to the throughput metric discussed in Section 3.2, the number of operations per inference depends on the DNN model, however the operations per joules may be a function of the ability of the hardware to exploit sparsity to avoid performing ineffectual MAC operations. Equation (3.9) shows how operations per joule can be decomposed into:

      1. the number of effectual operations plus unexploited ineffectual operations per joule, which remains somewhat constant for a given hardware architecture design;

      2. the ratio of effectual operations over effectual operations plus unexploited ineffectual operations, which refers to the ability of the hardware to exploit ineffectual operations (ideally unexploited ineffectual operations should be zero, and this ratio should be one); and

      3. the number of effectual operations out of (total) operations, which is related to the amount of sparsity and depends on the DNN model.

image

      For hardware that can exploit sparsity, increasing the amount of sparsity (i.e., decreasing the number of effectual operations out of (total) operations) can increase the number of operations per joule, which subsequently increases inferences per joule, as shown in Equation (3.6). While exploiting sparsity has the potential of increasing the number of (total) operations per joule, the additional hardware will decrease the effectual operations plus unexploited ineffectual operations per joule. In order to achieve a net benefit, the decrease in effectual operations plus unexploited ineffectual operations per joule must be more than offset by the decrease of effectual operations out of (total) operations.

      In summary, we want to emphasize that the number of MAC operations and weights in the DNN model are not sufficient for evaluating energy efficiency. From an energy perspective, all MAC operations or weights are not created equal. This is because the number of MAC operations and weights do not reflect where the data is accessed and how much the data is reused, both of which have a significant impact on the operations per joule. Therefore, the number of MAC operations and weights is not necessarily a good proxy for energy consumption and it is often more effective to design efficient DNN models with hardware in the loop. Techniques for designing DNN models with hardware in the loop are discussed in Chapter 9.

      In order to evaluate the energy efficiency and power consumption of the entire system, it is critical to not only report the energy efficiency and power consumption of the chip, but also the energy efficiency and power consumption of the off-chip memory (e.g., DRAM) or the amount of off-chip accesses (e.g., DRAM accesses) if no specific memory technology is specified; for the latter, it can be reported in terms of the total amount of data that is read and written off-chip per inference. As with throughput and latency, the evaluation should be performed on clearly specified, ideally widely used, DNN models.

      One of the key factors that affect cost is the chip area (e.g., square millimeters, mm2) in conjunction with the process technology (e.g., 45 nm CMOS), which constrains the amount of on-chip storage and amount of compute (e.g., the number of PEs for custom DNN accelerators, the number of cores for CPUs and GPUs, the number of digital signal processing (DSP) engines for FPGAs, etc.). To report information related to area, without specifying a specific process technology, the amount of on-chip memory (e.g, storage capacity of the global buffer) and compute (e.g., number of PEs) can be used as a proxy for area.

      Another important factor is the amount of off-chip bandwidth, which dictates the cost and complexity of the packaging and printed circuit board (PCB) design (e.g., High Bandwidth Memory (HBM) [122] to connect to off-chip DRAM, NVLink to connect to other GPUs, etc.), as well as whether additional chip area is required for a transceiver to handle signal integrity at high speeds. The off-chip bandwidth, which is typically reported in gigabits per second (Gbps), sometimes including the number of I/O ports, can be used as a proxy for packaging and PCB cost.

      There is also an interplay between the costs attributable to the chip area and off-chip bandwidth. For instance, increasing on-chip storage, which increases chip area, can reduce off-chip bandwidth. Accordingly, both metrics should be reported in order to provide perspective on the total cost of the system.

      Of course reducing cost alone is not the only objective. The design objective is invariably to maximize the throughput or energy efficiency for a given cost, specifically, to maximize inferences per second per cost (e.g., $) and/or inferences per joule per cost. This is closely related to the previously discussed property of utilization; to be cost efficient, the design should aim to utilize every PE to increase inferences per second, since each PE increases the area and thus the cost of the chip; similarly, the design should aim to effectively utilize all the on-chip storage to reduce off-chip bandwidth, or increase operations per off-chip memory access as expressed by the roofline model (see Figure 3.1), as each byte of on-chip memory also increases cost.

      The merit of a DNN accelerator is also a function of its flexibility. Flexibility refers to the range of DNN models that can be supported on the DNN processor and the ability of the software environment (e.g., the mapper) to maximally exploit the capabilities of the hardware for any desired DNN model. Given the fast-moving pace of DNN research and deployment, it is increasingly important that DNN processors support a wide range of DNN models and tasks.

      We can define support in two tiers: the first tier requires that the hardware only needs to be able to functionally support different DNN models (i.e., the DNN model can run on the hardware). The second tier requires that the hardware should also maintain efficiency (i.e., high throughput

Скачать книгу