Efficient Processing of Deep Neural Networks. Vivienne Sze

Efficient Processing of Deep Neural Networks - Vivienne Sze Synthesis Lectures on Computer Architecture

60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs’ first commercial success, as it was deployed in ATMs to recognize digits for check deposits.

      Overfeat [72] has a very similar architecture to AlexNet with five CONV layers followed by three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first FC layer only has 3072 channels rather than 4096, and the input size is 231×231 rather than 227×227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower Top-5 error rate than the fast model at the cost of 1.9× more MACs.

      VGG-16 [73] goes deeper to 16 layers consisting of 13 CONV layers followed by 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5×5) are built from multiple smaller filters (e.g., 3×3), which have fewer weights, to achieve the same effective receptive fields, as shown in Figure 2.9a. As a result, all CONV layers have the same filter size of 3×3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224×224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower Top-5 error rate than VGG-16 at the cost of 1.27× more MACs.
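      The filter decomposition described above can be checked with a little arithmetic. The following is a minimal sketch, not code from the book: the function names are hypothetical, and it assumes stride-1 CONV layers, no bias terms, and the same number of channels between the stacked layers.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Effective receptive field (one side) of stacked stride-1 CONV layers.

    Each extra layer widens the field by (kernel_size - 1) pixels.
    """
    return 1 + num_layers * (kernel_size - 1)


def weights_per_channel_pair(kernel_size: int, num_layers: int = 1) -> int:
    """Weights per (input channel, output channel) pair, ignoring biases."""
    return num_layers * kernel_size * kernel_size


if __name__ == "__main__":
    # Two stacked 3x3 filters cover the same 5x5 window as one 5x5 filter.
    print(stacked_receptive_field(3, 2))    # 5
    print(weights_per_channel_pair(3, 2))   # 18 weights (two 3x3 filters)
    print(weights_per_channel_pair(5, 1))   # 25 weights (one 5x5 filter)
```

      Under these assumptions, two 3×3 filters use 18 weights where a single 5×5 filter uses 25, and three 3×3 filters cover a 7×7 window with 27 weights rather than 49.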

      Figure 2.8: An example of dividing a feature map into two grouped convolutions. Each filter requires 2× fewer weights and multiplications.
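      The 2× saving in the caption follows from each filter seeing only the input channels of its own group. A minimal sketch of the weight count, assuming no bias terms; the function name and the layer shape used in the example (96 input channels, 256 filters of size 5×5) are illustrative choices, not values taken from the figure.

```python
def grouped_conv_weights(c_in: int, m_out: int, kernel: int, groups: int = 1) -> int:
    """Number of weights in a CONV layer split into `groups` groups (no biases).

    Each of the m_out filters spans only c_in / groups input channels.
    """
    if c_in % groups != 0 or m_out % groups != 0:
        raise ValueError("channel counts must be divisible by the number of groups")
    return m_out * (c_in // groups) * kernel * kernel


if __name__ == "__main__":
    dense = grouped_conv_weights(96, 256, 5, groups=1)
    split = grouped_conv_weights(96, 256, 5, groups=2)
    print(dense, split, dense // split)  # 614400 307200 2
```

      Because every weight is used once per output position, the number of multiplications shrinks by the same factor as the number of weights.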

      Figure 2.9: Decomposing larger filters into smaller filters.

      Figure 2.10: Inception module from GoogLeNet [74] with example channel lengths. Note that each CONV layer is followed by a ReLU (not drawn).

      ResNet [24], also known as Residual Net, uses feed-forward connections that connect to layers beyond the immediate next layer (often referred to as residual, skip, or identity connections); these connections enable a DNN with many layers (e.g., 34 or more) to be trainable. It was the first CNN entry in the ImageNet Challenge to exceed human-level accuracy, with a Top-5 error rate below 5%. One of the challenges with deep networks is the vanishing gradient during training [78]: as the error backpropagates through the network, the gradient shrinks, which impairs the ability to update the weights in the earlier layers of very deep networks. ResNet introduces a “shortcut” module, which contains an identity connection such that the weight layers (i.e., CONV layers) can be skipped, as shown in Figure 2.12. Rather than learning the desired mapping H(x) directly, the weight layers learn the residual mapping F(x) = H(x) − x, so that the output of the addition is F(x) + x. Initially, F(x) is zero and the identity connection is taken; then, gradually during training, the actual forward connection through the weight layers is used. ResNet also uses the “bottleneck” approach of using 1×1 filters to reduce the number of weights. As a result, the two layers in the shortcut module are replaced by three layers (1×1, 3×3, 1×1), where the first 1×1 layer reduces the number of activations, and thus weights, in the 3×3 layer, and the last 1×1 layer restores the number of activations in the output of the third layer. ResNet-50 consists of one CONV layer, followed by 16 shortcut layers (each of which is 3 CONV layers deep), and 1 FC layer; it requires 25.5M weights and 3.9G MACs per image. There are various versions of ResNet with multiple depths (e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152). The ResNet with 152 layers was the winner of the ImageNet Challenge, requiring 11.3G MACs and 60M weights. Compared to ResNet-50, it reduces the Top-5 error by around 1% at the cost of 2.9× more MACs and 2.5× more weights.
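      The behavior of the shortcut module can be illustrated on a plain vector of activations. This is a minimal sketch rather than the ResNet implementation: the weight layers are stood in for by an arbitrary function, and all names (`shortcut_module`, `zero_weight_layers`, `half_scale_layers`) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def relu(values: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def shortcut_module(x: Vector, weight_layers: Callable[[Vector], Vector]) -> Vector:
    """Compute ReLU(F(x) + x), where F is the function of the weight layers."""
    residual = weight_layers(x)
    if len(residual) != len(x):
        raise ValueError("F(x) must have the same shape as x for the addition")
    # The ReLU is applied after the addition.
    return relu([f + xi for f, xi in zip(residual, x)])


def zero_weight_layers(x: Vector) -> Vector:
    """Stand-in for weight layers whose output is zero (F(x) = 0)."""
    return [0.0 for _ in x]


def half_scale_layers(x: Vector) -> Vector:
    """Stand-in for weight layers that have learned a nonzero residual."""
    return [0.5 * v for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 0.5]  # non-negative, as if produced by an earlier ReLU
    print(shortcut_module(x, zero_weight_layers))  # [1.0, 2.0, 0.5]
    print(shortcut_module(x, half_scale_layers))   # [1.5, 3.0, 0.75]
```

      With F(x) = 0 the module passes its input through unchanged, which is the sense in which the identity connection is taken; a nonzero F(x) adds a correction on top of the input.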

      Figure 2.11: Applying a 1×1×C filter (usually referred to as 1×1) captures cross-channel correlation, but not spatial correlation. This bottleneck approach reduces the number of channels in the next layer, assuming the number of filters applied (M) is less than the original number of channels (C).
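      To see how much the bottleneck saves, one can count weights for a module built from two 3×3 layers and for a 1×1, 3×3, 1×1 module. A minimal sketch with hypothetical function names, assuming no bias terms; the channel counts C = 256 and M = 64 are example values chosen for illustration.

```python
def conv_weights(c_in: int, m_out: int, kernel: int) -> int:
    """Weights in a CONV layer with m_out filters of size kernel x kernel x c_in."""
    return kernel * kernel * c_in * m_out


def plain_module_weights(channels: int) -> int:
    """Two 3x3 CONV layers that keep the number of channels at C."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_module_weights(channels: int, reduced: int) -> int:
    """1x1 (C -> M), 3x3 (M -> M), 1x1 (M -> C) CONV layers."""
    return (
        conv_weights(channels, reduced, 1)    # reduce the number of channels
        + conv_weights(reduced, reduced, 3)   # spatial filtering on fewer channels
        + conv_weights(reduced, channels, 1)  # restore the number of channels
    )


if __name__ == "__main__":
    print(plain_module_weights(256))           # 1179648
    print(bottleneck_module_weights(256, 64))  # 69632
```

      For these example values the bottleneck module needs roughly 17× fewer weights, because the 3×3 layer operates on M rather than C channels.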

      Figure 2.12: Shortcut module from ResNet [24]. Note that the ReLU following the last CONV layer in the shortcut module is applied after the addition.

      Several trends can be observed in the popular CNNs shown in Table 2.2. Increasing the depth of the network tends to provide higher accuracy. Controlling for number of
