Efficient Processing of Deep Neural Networks. Vivienne Sze

Чтение книги онлайн.

Читать онлайн книгу Efficient Processing of Deep Neural Networks - Vivienne Sze страница 18

Efficient Processing of Deep Neural Networks - Vivienne Sze Synthesis Lectures on Computer Architecture

Скачать книгу

specify what should be reported for a given metric to enable proper evaluation.

      Finally, we will provide a case study on how one might bring all these metrics together for a holistic evaluation of a given approach. But first, we will discuss each of the metrics.

      Accuracy is used to indicate the quality of the result for a given task. The fact that DNNs can achieve state-of-the-art accuracy on a wide range of tasks is one of the key reasons driving the popularity and wide use of DNNs today. The units used to measure accuracy depend on the task. For instance, for image classification, accuracy is reported as the percentage of correctly classified images, while for object detection, accuracy is reported as the mean average precision (mAP), which is related to the trade off between the true positive rate and false positive rate.

      Achieving high accuracy on difficult tasks or datasets typically requires more complex DNN models (e.g., a larger number of MAC operations and more distinct weights, increased diversity in layer shapes, etc.), which can impact how efficiently the hardware can process the DNN model.

      Throughput is used to indicate the amount of data that can be processed or the number of executions of a task that can be completed in a given time period. High throughput is often critical to an application. For instance, processing video at 30 frames per second is necessary for delivering real-time performance. For data analytics, high throughput means that more data can be analyzed in a given amount of time. As the amount of visual data is growing exponentially, high-throughput big data analytics becomes increasingly important, particularly if an action needs to be taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis or drug discovery). Throughput is often generically reported as the number of operations per second. In the case of inference, throughput is reported as inferences per second or in the form of runtime in terms of seconds per inference.

      Latency measures the time between when the input data arrives to a system and when the result is generated. Low latency is necessary for real-time interactive applications, such as augmented reality, autonomous navigation, and robotics. Latency is typically reported in seconds.

      There are several factors that affect throughput and latency. In terms of throughput, the number of inferences per second is affected by

image

      where the number of operations per second is dictated by both the DNN hardware and DNN model, while the number of operations per inference is dictated by the DNN model.

      When considering a system comprised of multiple processing elements (PEs), where a PE corresponds to a simple or primitive core that performs a single MAC operation, the number of operations per second can be further decomposed as follows:

image

      The first term reflects the peak throughput of a single PE, the second term reflects the amount of parallelism, while the last term reflects degradation due to the inability of the architecture to effectively utilize the PEs.

      Since the main operation for processing DNNs is a MAC, we will use number of operations and number of MAC operations interchangeably.

      One can increase the peak throughput of a single PE by increasing the number of cycles per second, which corresponds to a higher clock frequency, by reducing the critical path at the circuit or micro-architectural level, or the number of cycles per operations, which can be affected by the design of the MAC (e.g., a non-pipelined multi-cycle MAC would have more cycles per operation).

      While the above approaches increase the throughput of a single PE, the overall throughput can be increased by increasing the number of PEs, and thus the maximum number of MAC operations that can be performed in parallel. The number of PEs is dictated by the area density of the PE and the area cost of the system. If the area cost of the system is fixed, then increasing the number of PEs requires either increasing the area density of the PE (i.e., reduce the area per PE) or trading off on-chip storage area for more PEs. Reducing on-chip storage, however, can affect the utilization of the PEs, which we will discuss next.

      Increasing the density of PEs can also be achieved by reducing the logic associated with delivering operands to a MAC. This can be achieved by controlling multiple MACs with a single piece of logic. This is analogous to the situation in instruction-based systems such as CPUs and GPUs that reduce instruction bookkeeping overhead by using large aggregate instructions (e.g., single-instruction, multiple-data (SIMD)/Vector Instructions; single-instruction, multiple-threads (SIMT)/Tensor Instructions), where a single instruction can be used to initiate multiple operations.

      The number of PEs and the peak throughput of a single PE only indicate the theoretical maximum throughput (i.e., peak performance) when all PEs are performing computation (100% utilization). In reality, the achievable throughput depends on the actual utilization of those PEs, which is affected by several factors as follows:

image

      The first term reflects the ability to distribute the workload to PEs, while the second term reflects how efficiently those active PEs are processing the workload.

      The number of active PEs is the number of PEs that receive work; therefore, it is desirable to distribute the workload to as many PEs as possible. The ability to distribute the

Скачать книгу