Multi-Processor System-on-Chip 1. Liliana Andrade
This example demonstrates the flexibility of the AGUs for complex data addressing patterns, including 2D accesses using two modifiers for the input data, as well as sign extension and replication of weights. A typical approach for calculating convolution layers, popularized by Caffe, is to apply an additional image-to-column (im2col) transformation. Although such transformations are helpful on some processors because they simplify the subsequent convolution calculations, this comes at the price of significant memory and performance overhead. The advanced AGUs, as used in Figure 1.8, make these transformations obsolete, thereby supporting efficient embedded implementations.
Figure 1.8. Assembly code generated from MLI C-code for 2D convolution of 16-bit input data and 8-bit weights
From the user’s point of view, the embARC MLI library provides ease of use, allowing the construction of efficient machine learning inference engines without requiring in-depth knowledge of the processor architecture and software optimization details. The embARC MLI library provides a broad set of optimized functions, so that the user can concentrate on the application and write embedded code using familiar high-level constructs for machine learning inference.
1.3.4. Example machine learning applications and benchmarks
The embARC MLI library is available from embarc.org (embARC Open Software Platform 2019), together with a number of example applications that demonstrate the usage of the library, such as:
– CIFAR-10 low-resolution object classifier: CNN graph;
– face detection: CNN graph;
– human activity recognition (HAR): LSTM-based network;
– keyword spotting: graph with CNN and LSTM layers trained on the Google speech command dataset.
The CIFAR-10 (Krizhevsky 2009) example application is based on the Caffe (Jia et al. 2014) tutorial. The CIFAR-10 dataset is a set of 60,000 low-resolution RGB images (32 × 32 pixels) of objects in 10 classes, such as “cat”, “dog” and “ship”. This dataset is widely used as a “Hello World” example in machine learning and computer vision. The objective is to train the classifier using 50,000 of these images, so that the other 10,000 images of the dataset can be classified with high accuracy. We used the CIFAR-10 CNN graph in Figure 1.9 for training and inference. This graph matches the CIFAR-10 graph from the Caffe tutorial, including the two fully connected layers towards the end of the graph.
Figure 1.9. CNN graph of the CIFAR-10 example application
We used the CIFAR-10 example application with 8-bit precision for both feature data and weights to benchmark the performance of machine learning inference on the ARC EM9D processor. The code of this CIFAR-10 application, built using the embARC MLI library, is illustrated in Figure 1.10.
Figure 1.10. MLI code of the CIFAR-10 inference application
As the code in Figure 1.10 shows, each layer in the graph is implemented by calling a function from the embARC MLI library. Before executing the first convolution layer, we call a permute function from the embARC MLI library to transform the RGB image into CHW format so that neighboring data elements are from the same color plane. The code further shows that a ping-pong scheme with two buffers, ir_X and ir_Y, is used for buffering input and output maps.
A very similar CIFAR-10 CNN graph has been used by others for benchmarking machine learning inference on their embedded processors, with performance numbers published in (Lai et al. 2018) and (Croome 2018). Table 1.3 presents the model parameters of the CIFAR-10 CNN graph that we used, with performance data for the ARC EM9D processor and two other embedded processors presented in Table 1.4.
Table 1.3. Model parameters of the CIFAR-10 CNN graph
| # | Layer type      | Weights tensor shape | Output tensor shape | Coefficients |
|---|-----------------|----------------------|---------------------|--------------|
| 0 | Permute         | –                    | 3 × 32 × 32         | 0            |
| 1 | Convolution     | 32 × 3 × 5 × 5       | 32 × 32 × 32 (32K)  | 2400         |
| 2 | Max Pooling     | –                    | 32 × 16 × 16 (8K)   | 0            |
| 3 | Convolution     | 32 × 32 × 5 × 5      | 32 × 16 × 16 (8K)   | 25600        |
| 4 | Avg Pooling     | –                    | 32 × 8 × 8 (2K)     | 0            |
| 5 | Convolution     | 64 × 32 × 5 × 5      | 64 × 8 × 8 (4K)     | 51200        |
| 6 | Avg Pooling     | –                    | 64 × 4 × 4 (1K)     | 0            |
| 7 | Fully-connected | 64 × 1024            | 64                  | 65536        |