1 The DNN research community often refers to the shape and size of a DNN as its “network architecture.” However, to avoid confusion with the use of the word “architecture” by the hardware community, we will talk about “DNN models” and their shape and size in this book.
2 CONV layers use a specific type of weight sharing, which will be described in Section 2.4.
3 Connections can come from the immediately preceding layer or an earlier layer. Furthermore, connections from a layer can go to multiple later layers.
4 For simplicity, in this chapter, we will refer to an array of partial sums as an output feature map. However, technically, the output feature map would be composed of the values of the partial sums after they have gone through a nonlinear function (i.e., the output activations).
5 In some literature, K is used rather than M to denote the number of 3-D filters (also referred to as kernels), which determines the number of output feature map channels. We opted not to use K to avoid confusion with yet other communities that use it to refer to the number of dimensions. We have also adopted the convention of using P and Q as the dimensions of the output to align with other publications, and because our prior use of E and F caused a conflict with the use of "F" to represent filter weights. Note that some literature also uses X and Y to denote the spatial dimensions of the input rather than W and H.
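To make this dimension notation concrete, the following minimal sketch (plain Python; the stride and padding values are illustrative assumptions, not from the text) computes the output feature map shape of a CONV layer using the symbols adopted in this book:

```python
# Illustrative sketch of the CONV layer shape notation used in this book.
# Input feature map:  N x C x H x W   (batch, input channels, height, width)
# Filters:            M x C x R x S   (M 3-D filters, a.k.a. kernels)
# Output feature map: N x M x P x Q   (one output channel per filter)

def conv_output_shape(N, C, H, W, M, R, S, stride=1, pad=0):
    """Return the output feature map shape (N, M, P, Q) of a CONV layer."""
    P = (H - R + 2 * pad) // stride + 1   # output height
    Q = (W - S + 2 * pad) // stride + 1   # output width
    return (N, M, P, Q)

# Example: 224x224 RGB input, 64 filters of size 3x3, stride 1, padding 1
print(conv_output_shape(N=1, C=3, H=224, W=224, M=64, R=3, S=3, stride=1, pad=1))
# -> (1, 64, 224, 224)
```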
6 Note that many of the values in the CONV layer tensors are zero, making the tensors sparse. The origins of this sparsity, and approaches for performing the resulting sparse tensor algebra, are presented in Chapter 8.
7 Note that Albert Einstein popularized a similar notation for tensor algebra which omits any explicit specification of the summation variable.
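As a small illustration of this convention (not from the book), NumPy's einsum follows the same rule of implicitly summing over indices that are repeated in the inputs but absent from the output; the sketch below uses it to compute a fully connected layer O[n][m] = Σ_c I[n][c] · W[m][c] with arbitrary layer sizes:

```python
import numpy as np

# Einstein-style notation: the summation over the repeated index c is implied
# by the subscript string, mirroring the tensor-algebra notation in the text.
N, C, M = 4, 16, 8                     # illustrative layer sizes (assumptions)
I = np.random.rand(N, C)               # input activations
W = np.random.rand(M, C)               # filter weights

# O[n][m] = sum_c I[n][c] * W[m][c]
O = np.einsum('nc,mc->nm', I, W)

# Equivalent explicit matrix multiplication
assert np.allclose(O, I @ W.T)
```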
8 In addition to being simple to implement, ReLU also increases the sparsity of the output activations, which can be exploited by a DNN accelerator to increase throughput, reduce energy consumption and reduce storage cost, as described in Section 8.1.1.
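As a toy illustration of this effect (tensor sizes are illustrative, not from the book), the sketch below applies ReLU to zero-mean random partial sums and measures the fraction of output activations that become zero:

```python
import numpy as np

np.random.seed(0)
# Zero-mean random partial sums (stand-in for pre-activation values)
partial_sums = np.random.randn(1, 64, 56, 56)

# ReLU: max(0, x)
output_activations = np.maximum(partial_sums, 0.0)

sparsity = np.mean(output_activations == 0.0)
print(f"Output activation sparsity: {sparsity:.2%}")   # ~50% for zero-mean inputs
```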
9 In the literature, this is often referred to as dense prediction.
10 There are two versions of unpooling: (1) zero insertion is applied in a regular pattern, as shown in Figure 2.6a [60]—this is most commonly used; and (2) unpooling is paired with a max pooling layer, where the location of the max value during pooling is stored, and during unpooling the location of the non-zero value is placed in the location of the max value before pooling [61].
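A minimal sketch of the first (zero-insertion) variant, assuming an upsampling factor of 2:

```python
import numpy as np

def unpool_zero_insert(fmap, factor=2):
    """Upsample a 2-D feature map by inserting zeros in a regular pattern
    (variant 1 above); each input value lands in the top-left position of
    its factor x factor block."""
    H, W = fmap.shape
    out = np.zeros((H * factor, W * factor), dtype=fmap.dtype)
    out[::factor, ::factor] = fmap
    return out

x = np.array([[1, 2],
              [3, 4]])
print(unpool_zero_insert(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```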
11 It has been recently reported that the reason batch normalization enables faster and more stable training is that it makes the optimization landscape smoother, resulting in more predictive and stable behavior of the gradient [67]; this is in contrast to the popular belief that batch normalization stabilizes the distribution of the input across layers. Nonetheless, batch normalization continues to be widely used for training and thus needs to be supported during inference.
12 During training, parameters σ and μ are computed per batch, and γ and β are updated per batch based on the gradient; therefore, training with different batch sizes will result in different σ and μ parameters, which can impact accuracy. Note that each channel has its own set of σ, μ, γ, and β parameters. During inference, all parameters are fixed, where σ and μ are computed from the entire training set. To avoid performing an extra pass over the entire training set to compute σ and μ, they are usually implemented as the running average of the per-batch σ and μ computed during training.
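For reference, a minimal sketch of per-channel batch normalization at inference time, using the fixed parameters described above (γ and β learned during training; μ and σ² taken as running averages over the training batches); the tensor shapes and ε value are illustrative assumptions:

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Apply per-channel batch normalization to an N x C x H x W tensor
    using fixed (inference-time) parameters."""
    # Reshape per-channel parameters for broadcasting over N, H, W
    mean = running_mean.reshape(1, -1, 1, 1)
    var = running_var.reshape(1, -1, 1, 1)
    g = gamma.reshape(1, -1, 1, 1)
    b = beta.reshape(1, -1, 1, 1)
    return g * (x - mean) / np.sqrt(var + eps) + b

# Illustrative shapes: batch of 2, 4 channels, 8x8 spatial
x = np.random.randn(2, 4, 8, 8)
gamma, beta = np.ones(4), np.zeros(4)
running_mean, running_var = np.zeros(4), np.ones(4)
y = batch_norm_inference(x, gamma, beta, running_mean, running_var)
```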
13 Note that variants of the up CONV layer with different types of upsampling include the deconvolution layer, sub-pixel or fractional convolutional layer, transposed convolutional layer, and backward convolution layer [69].
14 This grouped convolution approach is applied more aggressively when performing co-design of algorithms and hardware to reduce complexity, which will be discussed in Chapter 9.
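To see why grouping reduces complexity, the following back-of-the-envelope sketch (layer sizes are illustrative assumptions, not from the text) compares the number of MACs in a standard CONV layer with the same layer split into G groups:

```python
def conv_macs(M, C, R, S, P, Q, groups=1):
    """MACs for a CONV layer with M output channels, C input channels,
    RxS filters, and a PxQ output feature map, split into `groups` groups.
    Each group connects M/groups output channels to C/groups input channels."""
    return (M // groups) * (C // groups) * R * S * P * Q * groups

# Illustrative layer: 256 -> 256 channels, 3x3 filters, 56x56 output
standard = conv_macs(M=256, C=256, R=3, S=3, P=56, Q=56, groups=1)
grouped  = conv_macs(M=256, C=256, R=3, S=3, P=56, Q=56, groups=8)
print(standard // grouped)   # -> 8: MACs drop by a factor of the group count
```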
15 v2 is very similar to v3.
16 Note that in some parts of the book we use Top-1 and Top-5 error. The error can be computed as 100% minus accuracy.
17 This was demonstrated on Google’s internal JFT-300M dataset with 300M images and 18,291 classes, which is two orders of magnitude larger than ImageNet. However, performing four iterations across the entire training set using 50 K-80 GPUs required two months of training, which further emphasizes that compute is one of the main bottlenecks in the advancement of DNN research.
PART II
Design of Hardware for Processing DNNs
CHAPTER 3
Key Metrics and Design Objectives
Over the past few years, there has been a significant amount of research on efficient processing of DNNs. Accordingly, it is important to discuss the key metrics that one should consider when comparing and evaluating the strengths and weaknesses of different designs and proposed techniques, and that should be incorporated into design considerations. While efficiency is often associated only with the number of operations per second per Watt (e.g., floating-point operations per second per Watt as FLOPS/W, or tera-operations per second per Watt as TOPS/W), it is actually composed of many more metrics, including accuracy, throughput, latency, energy consumption, power consumption, cost, flexibility, and scalability. Reporting a comprehensive set of these metrics is important in order to provide a complete picture of the trade-offs made by a proposed design or technique.
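To make the units concrete, a brief sketch with hypothetical numbers (not taken from any particular design):

```python
# Hypothetical numbers, for illustration only: a design sustaining 4 tera-ops/s
# at 2 W of power delivers 2 TOPS/W -- a figure that by itself says nothing
# about accuracy, latency, cost, flexibility, or scalability.
throughput_tops = 4.0        # tera-operations per second (assumed)
power_watts = 2.0            # Watts (assumed)
efficiency_tops_per_watt = throughput_tops / power_watts
print(efficiency_tops_per_watt)   # -> 2.0 TOPS/W
```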
In this chapter, we will
• discuss the importance of each of these metrics;
• break down the factors that affect each metric and, when feasible, present equations that describe the relationship between the factors and the metrics;
• describe how these metrics can be incorporated into design considerations for both the DNN hardware and the DNN model (i.e., workload); and