The hardware should not rely on certain properties of the DNN models to achieve efficiency, as those properties cannot be guaranteed. For instance, a DNN accelerator that can efficiently support the case where the entire DNN model (i.e., all the weights) fits on-chip may perform extremely poorly when the DNN model grows larger, which is likely given that the size of DNN models continues to increase over time, as discussed in Section 2.4.1; a more flexible processor would be able to efficiently handle a wide range of DNN models, even those that exceed on-chip memory.
The degree of flexibility provided by a DNN accelerator is a complex trade-off with accelerator cost. Specifically, additional hardware usually needs to be added in order to flexibly support a wider range of workloads and/or improve their throughput and energy efficiency. Since specialization improves efficiency, the design objective is to reduce the overhead (e.g., area cost and energy consumption) of supporting flexibility while maintaining efficiency across the wide range of DNN models. Thus, evaluating flexibility entails ensuring that the extra hardware is a net benefit across multiple workloads.
Flexibility has become increasingly important as we factor in the many techniques being applied to DNN models with the promise of making them more efficient, since they increase the diversity of workloads that need to be supported. These techniques include DNNs with different network architectures (i.e., different layer shapes, which impact the amount of required storage and compute as well as the available data reuse that can be exploited), as described in Chapter 9, different levels of precision (i.e., different numbers of bits across layers and data types), as described in Chapter 7, and different degrees of sparsity (i.e., number of zeros in the data), as described in Chapter 8. There are also different types of DNN layers and computations beyond MAC operations (e.g., activation functions) that need to be supported.
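As a rough, purely illustrative sketch of how layer shape drives compute, storage, and data reuse, consider a single CONV layer; the layer dimensions and the conv_layer_stats helper below are hypothetical and only meant to show the kind of back-of-the-envelope analysis involved:

```python
# Back-of-the-envelope model of how a CONV layer's shape determines its
# compute, storage, and weight reuse (hypothetical dimensions, stride 1, no padding).

def conv_layer_stats(H, W, C, R, S, M):
    """H x W x C input processed by M filters of size R x S x C."""
    P, Q = H - R + 1, W - S + 1          # output feature map height/width
    weights = M * R * S * C              # weights to store
    macs = M * P * Q * R * S * C         # total MAC operations
    weight_reuse = macs / weights        # times each weight is used (= P * Q)
    return weights, macs, weight_reuse

# Example: a 56x56x64 input processed by 128 filters of size 3x3x64
w, m, reuse = conv_layer_stats(56, 56, 64, 3, 3, 128)
print(f"weights={w:,}  MACs={m:,}  weight reuse={reuse:.0f}x")
```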
Actually getting a performance or efficiency benefit from these techniques invariably requires additional hardware, since a simpler DNN accelerator design may not be able to exploit them. Again, it is important that the overhead of the additional hardware does not exceed the benefits of these techniques. This encourages a hardware and DNN model co-design approach.
To date, exploiting the flexibility of DNN hardware has relied on mapping processes that act like static per-layer compilers. As the field moves to DNN models that change dynamically, mapping processes will need to dynamically adapt at runtime to changes in the DNN model or input data, while still maximally exploiting the flexibility of the hardware to improve efficiency.
In summary, to assess the flexibility of a DNN processor, its efficiency (e.g., inferences per second, inferences per joule) should be evaluated on a wide range of DNN models. The MLPerf benchmarking workloads are a good start; however, additional workloads may be needed to represent efficient techniques such as efficient network architectures, reduced precision, and sparsity. The workloads should match the desired application. Ideally, since there can be many possible combinations, it would also be beneficial to define the range and limits of DNN models that can be efficiently supported on a given platform (e.g., maximum number of weights per filter or per DNN model, minimum amount of sparsity, required structure of the sparsity, levels of precision such as 8-bit, 4-bit, 2-bit, or 1-bit, types of layers and activation functions, etc.).
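One way such an evaluation might be organized is sketched below; the workload list and the measure_on_accelerator placeholder are hypothetical stand-ins for real measurements on the platform under test:

```python
# Sketch of a flexibility evaluation sweep: run a set of representative
# workloads and record throughput and energy efficiency for each.

def measure_on_accelerator(model, precision_bits, sparsity):
    """Placeholder: return (inferences/second, inferences/joule) from a real run."""
    return 1000.0, 50.0  # dummy values; replace with measured results

workloads = [
    {"model": "resnet50",     "precision_bits": 8, "sparsity": 0.0},
    {"model": "mobilenet_v2", "precision_bits": 8, "sparsity": 0.0},
    {"model": "resnet50",     "precision_bits": 4, "sparsity": 0.5},  # reduced precision + sparsity
]

for w in workloads:
    ips, ipj = measure_on_accelerator(w["model"], w["precision_bits"], w["sparsity"])
    print(f"{w}: {ips:.0f} inf/s, {ipj:.0f} inf/J")
```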
3.6 SCALABILITY
Scalability has become increasingly important due to the wide range of use cases for DNNs and the emerging technologies used for scaling up, not just increasing the size of a single chip, but also building systems with multiple chips (often referred to as chiplets) [123] or even wafer-scale chips [124]. Scalability refers to how well a design can be scaled up to achieve higher throughput and energy efficiency when increasing the amount of resources (e.g., the number of PEs and on-chip storage). This evaluation is done under the assumption that the system does not have to be significantly redesigned (e.g., the design only needs to be replicated) since major design changes can be expensive in terms of time and cost. Ideally, a scalable design can be used for low-cost embedded devices and high-performance devices in the cloud simply by scaling up the resources.
Ideally, the throughput would scale linearly and proportionally with the number of PEs. Similarly, the energy efficiency would also improve with more on-chip storage; however, this improvement would likely be nonlinear (e.g., increasing the on-chip storage such that the entire DNN model fits on-chip would result in an abrupt improvement in energy efficiency). In practice, achieving such scaling is often challenging due to factors such as the reduced utilization of PEs and the increased cost of data movement due to long-distance interconnects.
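A toy analytical model, with purely illustrative utilization numbers (none of these values come from the text), can show how the gap between ideal and achieved throughput grows with the number of PEs:

```python
# Toy scaling model: throughput vs. number of PEs when utilization degrades
# with scale (illustrative numbers only, not measured data).

def throughput(num_pes, macs_per_pe_per_sec=1e9, utilization=1.0):
    return num_pes * macs_per_pe_per_sec * utilization

for n, util in [(64, 0.95), (256, 0.85), (1024, 0.60)]:
    ideal = throughput(n)                     # linear scaling
    actual = throughput(n, utilization=util)  # utilization drops as PEs increase
    print(f"{n:5d} PEs: ideal {ideal:.2e} MAC/s, actual {actual:.2e} MAC/s "
          f"({actual/ideal:.0%} of linear)")
```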
Scalability can be connected with cost efficiency by considering how inferences per second per cost (e.g., $) and inferences per joule per cost change with scale. For instance, if throughput increases linearly with the number of PEs, then the inferences per second per cost would remain constant. It is also possible for the inferences per second per cost to improve super-linearly with an increasing number of PEs, due to increased sharing of data across PEs.
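A minimal numerical sketch, assuming cost scales linearly with the number of PEs and using hypothetical prices, throughputs, and sharing factors, illustrates both the constant and super-linear cases:

```python
# Illustrative cost-efficiency check (hypothetical numbers): if cost grows
# linearly with PE count, inferences/sec per dollar stays flat when throughput
# also scales linearly; extra data sharing across PEs could push it super-linear.

base_pes, base_ips, base_cost = 64, 1000.0, 50.0  # PEs, inferences/sec, dollars

for scale, sharing_boost in [(1, 1.00), (4, 1.10), (16, 1.25)]:
    pes = base_pes * scale
    cost = base_cost * scale          # assume cost scales linearly with PEs
    ips = base_ips * scale            # ideal linear throughput scaling
    print(f"{pes:5d} PEs: {ips / cost:.1f} inf/s/$ (linear), "
          f"{ips * sharing_boost / cost:.1f} inf/s/$ (with data sharing)")
```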
In summary, to understand the scalability of a DNN accelerator design, it is important to report its performance and efficiency metrics as the number of PEs and storage capacity increases. This may include how well the design might handle technologies used for scaling up, such as inter-chip interconnect.
3.7 INTERPLAY BETWEEN DIFFERENT METRICS
It is important that all metrics are accounted for in order to fairly evaluate all the design tradeoffs. For instance, without the accuracy given for a specific dataset and task, one could run a simple DNN and easily claim low power, high throughput, and low cost—however, the processor might not be usable for a meaningful task; alternatively, without reporting the off-chip bandwidth, one could build a processor with only multipliers and easily claim low cost, high throughput, high accuracy, and low chip power—however, when evaluating system power, the off-chip memory access would be substantial. Finally, the test setup should also be reported, including whether the results are measured or obtained from simulation and how many images were tested.
In summary, the evaluation process for whether a DNN system is a viable solution for a given application might go as follows:
1. the accuracy determines if it can perform the given task;
2. the latency and throughput determine if it can run fast enough and in real time;
3. the energy and power consumption will primarily dictate the form factor of the device where the processing can operate;
4. the cost, which is primarily dictated by the chip area and external memory bandwidth requirements, determines how much one would pay for this solution;
5. flexibility determines the range of tasks it can support; and
6. the scalability determines whether the same design effort can be amortized for deployment in multiple domains (e.g., in the cloud and at the edge), and whether the system can be efficiently scaled with DNN model size.
1 Ideally, robustness and fairness should be considered in conjunction with accuracy, as there is also an interplay between these factors; however, these are areas of on-going research and beyond the scope of this book.
2 As an analogy, getting 9 out of 10