Multi-Processor System-on-Chip 1. Liliana Andrade

      IoT edge devices that use machine learning inference typically perform different types of processing, as shown in Figure 1.2.

      Figure 1.2. Different types of processing in machine learning inference applications

      These devices typically perform pre-processing and feature extraction on the sensor input data before running the actual neural network processing for the trained model. For example, a smart speaker with voice control capabilities may first pre-process the voice signal by performing acoustic echo cancellation and multi-microphone beam-forming. It may then apply FFTs to extract the spectral features for use in the neural network, which has been trained to recognize a vocabulary of voice commands.

      1.3.1.1. Neural network processing

      For each layer in a neural network, the input data must be transformed into output data. A commonly used transformation is the convolution, which convolves, or, more precisely, correlates, the input data with a set of trained weights. This transformation is used in convolutional neural networks (CNNs), which are widely applied in image and video recognition.

      Figure 1.3. 2D convolution applying a weight kernel to input data to calculate a value in the output map

      Input and output maps are often three-dimensional. That is, they have a width, a height and a depth, with different planes in the depth dimension typically referred to as channels. For input maps with a depth > 1, an output value can be calculated using a dot-product operation on input data from multiple input channels. For output maps with a depth > 1, a convolution must be performed for each output channel, using a different weight kernel for each output channel. Depthwise convolution is a special convolution layer type for which the number of input and output channels is the same, with each output channel calculated from the single input channel at the same depth index as the output channel. Yet another layer type is the fully connected layer, which performs a dot-product operation for each output value, using as many weights as there are samples in the input map.

      The key operation in the layer types described above is the dot-product operation on input samples and weights. It is therefore a requirement for a processor to implement such dot-product operations efficiently. This involves efficient computation, for example, using MAC instructions, as well as efficient access to input data, weight kernels and output data.
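      As a sketch of how a convolution output value reduces to such a dot product, the following Python fragment (our own illustration, not from the book; names and sizes are hypothetical) computes one output value from a (channels, height, width) input map:

```python
import numpy as np

def conv2d_single_output(in_map, weights, y, x):
    """Compute one output value of a 2D convolution as a dot product.

    in_map:  input map of shape (C, H, W) -- channels, height, width
    weights: weight kernel of shape (C, KH, KW) for one output channel
    The output value at (y, x) is the dot product of the kernel with the
    input region it overlays (no padding, stride 1), i.e. a sequence of
    multiply-accumulate (MAC) operations.
    """
    c, kh, kw = weights.shape
    region = in_map[:, y:y + kh, x:x + kw]
    # Dot product of input samples and weights -- the key MAC operation
    return float(np.sum(region * weights))

# Tiny example: 1 input channel, 5x5 input map, 3x3 kernel of ones
in_map = np.arange(25, dtype=np.float32).reshape(1, 5, 5)
kernel = np.ones((1, 3, 3), dtype=np.float32)
out_00 = conv2d_single_output(in_map, kernel, 0, 0)  # sum of top-left 3x3 region
```

      A full output map repeats this dot product for every (y, x) position and, for output depth > 1, for every output channel with its own kernel.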

      Recurrent neural networks (RNNs) are applied to sequences of input data, such as successive frames of a speech signal. A basic RNN cell calculates its output as

      ht = f(Wx · xt + Wh · ht−1 + b)

      where xt is frame t in the input sequence, ht is the output for xt, Wx and Wh are weight sets, b is a bias, and f() is an output activation function. Thus, the calculation of an output involves a dot product of one set of weights with new input data and another dot product of another set of weights with the previous output data. Therefore, for RNNs as well, the dot product is a key operation that must be implemented efficiently. The long short-term memory (LSTM) cell is another well-known RNN cell. The LSTM cell has a more complicated structure than the basic RNN cell discussed above, but the dot product is again the dominant operation.
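      A minimal sketch of one basic RNN cell step, assuming the standard formulation ht = f(Wx·xt + Wh·ht−1 + b) described above (function names and dimensions are our own illustration):

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W_x, W_h, b, f=np.tanh):
    """One step of a basic RNN cell: h_t = f(W_x @ x_t + W_h @ h_prev + b).

    Each output involves two dot products: one of the input weights with
    the new input frame, and one of the recurrent weights with the
    previous output.
    """
    return f(W_x @ x_t + W_h @ h_prev + b)

# Tiny example: 3 input features, 2 hidden units, 3-frame sequence
rng = np.random.default_rng(0)
W_x = rng.standard_normal((2, 3))
W_h = rng.standard_normal((2, 2))
b = np.zeros(2)
h = np.zeros(2)
for x_t in np.eye(3):        # feed a short sequence, one frame at a time
    h = rnn_cell_step(x_t, h, W_x, W_h, b)
```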

      Activation functions are used in neural networks to transform data by performing some nonlinear mapping. Examples are rectified linear units (ReLU), sigmoid and hyperbolic tangent (TanH). The activation functions operate on a single data value and produce a single result. Hence, for an activation layer, the size of the output map is equal to the size of the input map.
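      These element-wise activations can be sketched as follows (a toy illustration; TanH is available directly as np.tanh):

```python
import numpy as np

def relu(x):
    # Rectified linear unit: clamp negative values to zero
    return np.maximum(x, 0.0)

def sigmoid(x):
    # Squash each value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Each function maps every value independently, so the output map
# always has the same shape as the input map.
x = np.array([-2.0, 0.0, 2.0])
```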

      Figure 1.4. Example pooling operations: max pooling and average pooling

      Neural networks may also have pooling layers that transform an input map into a smaller output map by calculating single output values for (small) regions of the input data. Figure 1.4 shows two examples: max pooling and average pooling. Effectively, the pooling layers downsample the data in the width and height dimensions. The depth of the output map is the same as the depth of the input map.
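      A minimal sketch of both pooling variants on a (channels, height, width) map, assuming non-overlapping windows (function name and sizes are our own illustration):

```python
import numpy as np

def pool2d(in_map, size=2, mode="max"):
    """Downsample each channel of a (C, H, W) map with non-overlapping
    size x size windows. The depth (C) of the output equals that of
    the input; only width and height shrink."""
    c, h, w = in_map.shape
    blocks = in_map.reshape(c, h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(2, 4))
    return blocks.mean(axis=(2, 4))

x = np.array([[[1, 2, 5, 6],
               [3, 4, 7, 8],
               [0, 0, 1, 1],
               [0, 0, 1, 1]]], dtype=np.float32)
mx = pool2d(x, 2, "max")   # shape (1, 2, 2)
av = pool2d(x, 2, "avg")   # shape (1, 2, 2)
```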

      To obtain sufficiently accurate results when implementing machine learning inference, appropriate data types must be used. During the training phase, data and weights are typically represented as 32-bit floating-point values. The common practice for deploying models for inference in embedded systems is to work with integer or fixed-point representations of quantized data and weights (Jacob et al. 2017). The potential impact of quantization errors can be taken into account in the training phase to avoid a notable effect on model performance. Elements of input maps, output maps and weight kernels can typically be represented using 16-bit or smaller data types. For example, in a voice application, data samples are typically represented by 16-bit data types. In image or video applications, a 12-bit or smaller data type is often sufficient for representing the input samples. Precision requirements can differ per layer of the neural network.

      Special attention should be paid to the data types of the accumulators that sum up the partial results of dot-product operations. These accumulators should be wide enough to avoid overflow of the (intermediate) results of the dot-product operations performed on the (quantized) weights and input samples.
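      A toy sketch of scale-only 8-bit quantization with a wide (32-bit) accumulator; the scheme and names are our own illustration, not the book's or any particular library's:

```python
import numpy as np

def quantize(x, scale):
    # Map floats to int8: round to the nearest step and saturate
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def quantized_dot(xq, wq):
    # Accumulate in int32: each int8*int8 product fits in 16 bits,
    # so summing a few hundred of them cannot overflow the accumulator.
    return np.dot(xq.astype(np.int32), wq.astype(np.int32))

x = np.linspace(-1.0, 1.0, 256)
w = np.linspace(1.0, -1.0, 256)
sx, sw = 1.0 / 127, 1.0 / 127            # quantization scales
acc = quantized_dot(quantize(x, sx), quantize(w, sw))
approx = acc * sx * sw                   # dequantized result
exact = float(np.dot(x, w))              # float reference
```

      The dequantized result closely tracks the floating-point dot product, while all per-sample arithmetic uses 8-bit operands.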

      Memory requirements for low/mid-end machine learning inference are typically modest, thanks to limited input data rates and the use of neural networks of limited complexity. Input and output maps are often of limited size, i.e. a few tens of kB or less, and the number and size of the weight kernels are also relatively small. Using the smallest possible data types for input maps, output maps and weight kernels further reduces memory requirements.
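      As an illustrative back-of-envelope calculation (the layer dimensions below are our own example, not taken from the book):

```python
# Memory estimate for one small CNN layer with int8 data types
in_c, in_h, in_w = 16, 32, 32      # input map: 16 channels of 32x32
out_c, kh, kw = 32, 3, 3           # 32 output channels, 3x3 kernels

bytes_per_sample = 1               # int8 representation
input_bytes = in_c * in_h * in_w * bytes_per_sample       # 16 KB
weight_bytes = out_c * in_c * kh * kw * bytes_per_sample  # 4.5 KB
```

      Both figures sit in the "few tens of kB or less" range; doubling to 16-bit data types would double them, which is why small data types matter.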

      In summary, low/mid-end machine learning inference
