The performance data for processor A is published in (Lai et al. 2018) in terms of milliseconds for a processor running at a clock frequency of 216 MHz. The cycle counts for processor A in Table 1.4 have been calculated by multiplying the published millisecond numbers by this clock frequency. The CIFAR-10 CNN graph reported in (Lai et al. 2018) has the same convolution and pooling layers as listed in Table 1.3, but uses a single fully connected layer with a 4x4x64x10 filter shape to transform the 64x4x4 input map directly into 10 output values. This modification of the Caffe CNN graph reduces the size of the weight data considerably, but requires retraining of the graph. Its impact on the total cycle count is marginal.
The performance data for the RISC-V processor published in (Croome 2018) reports a total of 1.5 Mcycles for executing the CIFAR-10 graph on a highly parallel 8-core RISC-V architecture. To estimate the total number of cycles on a single RISC-V core, we observe that the performance is dominated by the cycles spent on 5x5 convolutions, which constitute more than 98% of the compute operations in this graph. For these 5x5 convolutions, (Croome 2018) reports a speed-up from a 1-core system to an 8-core system of 18.5/2.2 = 8.2. Hence, a reasonable estimate for the total number of cycles on a single RISC-V core is 1.5 x 8.2 = 12.3 Mcycles.
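As a sanity check on these derivations, the sketch below reproduces the two back-of-the-envelope conversions: milliseconds-to-cycles for processor A and the single-core estimate for processor B. It is purely illustrative; the helper name is ours, and the 31.4 ms input is simply the execution time that corresponds to the 6.78 Mcycles entry of Table 1.4 at 216 MHz.

```c
#include <stdio.h>

/* Convert a published execution time (ms) to a cycle count, given the
 * clock frequency in MHz: Mcycles = ms * MHz / 1000. */
static double ms_to_mcycles(double time_ms, double freq_mhz)
{
    return time_ms * freq_mhz / 1000.0;
}

int main(void)
{
    /* Processor A: a layer reported at roughly 31.4 ms on a 216 MHz core
     * corresponds to 31.4 * 216 / 1000 = 6.78 Mcycles (layer 1, Table 1.4). */
    printf("Processor A, layer 1: %.2f Mcycles\n", ms_to_mcycles(31.4, 216.0));

    /* Processor B: scale the published 8-core total of 1.5 Mcycles by the
     * speed-up factor of 8.2 quoted above to estimate a 1-core total. */
    printf("Processor B, 1-core estimate: %.1f Mcycles\n", 1.5 * 8.2);

    return 0;
}
```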
Table 1.4. Performance data for the CIFAR-10 CNN graph
#     | Layer type      | ARC EM9D [Mcycles] | Processor A [Mcycles] | Processor B (RISC-V ISA) [Mcycles] |
0     | Permute         | 0.01               | –                     | –                                  |
1     | Convolution     | 1.63               | 6.78                  | –                                  |
2     | Max Pooling     | 0.14               | 0.34                  | –                                  |
3     | Convolution     | 3.46               | 9.25                  | –                                  |
4     | Avg Pooling     | 0.09               | 0.09                  | –                                  |
5     | Convolution     | 1.76               | 4.88                  | –                                  |
6     | Avg Pooling     | 0.07               | 0.04                  | –                                  |
7     | Fully-connected | 0.03               | 0.02                  | –                                  |
8     | Fully-connected | 0.001              | –                     | –                                  |
Total |                 | 7.2                | 21.4                  | 12.3                               |
From Table 1.4, we conclude that the ARC EM9D processor spends about 3x fewer cycles than processor A and about 1.7x fewer cycles than the RISC-V core (processor B) for the same machine learning inference task, without using any dedicated accelerators. Thanks to this cycle efficiency, the ARC EM9D processor can be clocked at a low frequency, which helps to save power in a smart IoT edge device.
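To make the low-frequency argument concrete, the sketch below converts the cycle totals of Table 1.4 into the minimum clock frequency needed to sustain a given inference rate. The target of 10 inferences per second is an assumed example, not a figure from the chapter.

```c
#include <stdio.h>

/* Minimum clock frequency (MHz) needed to sustain a given inference rate,
 * for a workload measured in Mcycles per inference (Mcycles/s == MHz). */
static double min_freq_mhz(double mcycles_per_inference, double inferences_per_s)
{
    return mcycles_per_inference * inferences_per_s;
}

int main(void)
{
    /* Cycle totals from Table 1.4; 10 inferences/s is an assumed rate. */
    printf("ARC EM9D:    %.0f MHz\n", min_freq_mhz(7.2, 10.0));   /*  72 MHz */
    printf("Processor A: %.0f MHz\n", min_freq_mhz(21.4, 10.0));  /* 214 MHz */
    printf("Processor B: %.0f MHz\n", min_freq_mhz(12.3, 10.0));  /* 123 MHz */
    return 0;
}
```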
1.4. Conclusion
Smart IoT edge devices that interact intelligently with their users are appearing in many application areas. These devices have diverse compute requirements, including a mixture of control processing, DSP and machine learning. Versatile processors are required to efficiently execute these different types of workloads. Furthermore, these processors must allow for easy customization to improve their efficiency for a specific application. Configurability and extensibility are two key mechanisms that provide such customization. Increasingly, IoT edge devices apply machine learning technology for processing captured sensor data, so that smart actions can be taken based on recognized patterns. We presented key processor features and a software library for the efficient implementation of low/mid-end machine learning inference. More specifically, we highlighted several processor capabilities, such as vector MAC instructions and XY memory with advanced AGUs, that are key to the efficient implementation of machine learning inference. The ARC EM9D processor is a universal processor for low-power IoT applications which is both configurable and extensible. The complete and highly optimized embARC MLI library makes effective use of the ARC EM9D processor to efficiently support a wide range of low/mid-end machine learning applications. We demonstrated this efficiency with excellent results for the CIFAR-10 benchmark.
1.5. References
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L.V., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., Zhu, Z. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML-16, 173–182.
Croome, M. (2018). Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices [Online]. Available at: https://content.riscv.org/wp-content/uploads/2018/07/Shanghai-1325_GreenWaves_Shanghai-2018-MC-V2.pdf.
Dutt, N. and Choi, K. (2003). Configurable processors for embedded computing. IEEE Computer, 36(1), 120–123.