SAS includes a second variant of the MSRA initialization, called MSRA2. Similar to the MSRA initialization, the MSRA2 method draws weights from a random Gaussian distribution, but with a standard deviation of

σ = √(2 / fan-out)

where fan-out is the number of outgoing connections. That is, MSRA2 penalizes only for outgoing (fan-out) information.
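For illustration, the MSRA2 initialization can be requested through the INIT= option of an ADDLAYER statement. The following is a sketch only, assuming that INIT= accepts the value 'MSRA2' and using placeholder model and layer names (the ADDLAYER syntax is described later in this chapter):

AddLayer/model='DLNN' name="HLayer1" layer={type='FULLCONNECT' n=30
act='ELU' init='MSRA2'} srcLayers={"data"};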
Note: Weight initializations have less impact on model performance if batch normalization is used because batch normalization standardizes information passed between hidden layers. Batch normalization is discussed later in this chapter.
Consider the following simple example where unit y is being derived from 25 randomly initialized weights. The variance of unit y is larger when the standard deviation is held constant at 1. This means that the values for y are more likely to venture into a saturation region when a nonlinear activation function is incorporated. On the other hand, Xavier’s initialization penalizes the variance for the incoming and outgoing connections, constraining the value of y to less treacherous regions of the activation function. See Figures 1.7 and 1.8, noting that these examples assume that there are 25 incoming and outgoing connections.
Figure 1.7: Constant Variance (Standard Deviation = 1)
Figure 1.8: Xavier-Constrained Variance (Standard Deviation = √(2 / (25 + 25)) = 0.2)
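To make the contrast in Figures 1.7 and 1.8 concrete, the following DATA step is a small simulation sketch (the data set and variable names are illustrative, not part of any SAS deep learning interface). Unit y is formed as a weighted sum of 25 standard normal inputs, first with weights drawn with a standard deviation of 1 and then with the Xavier-style standard deviation of √(2 / (25 + 25)). The sample variance of y is roughly 25 in the first case and roughly 1 in the second.

/* Simulation sketch: variance of unit y under two weight initializations */
data variance_demo;
   call streaminit(802);
   do rep = 1 to 10000;
      y_constant = 0;
      y_xavier   = 0;
      do i = 1 to 25;
         x = rand('NORMAL');                                        /* input value   */
         y_constant = y_constant + rand('NORMAL', 0, 1) * x;        /* std. dev. = 1 */
         y_xavier   = y_xavier + rand('NORMAL', 0, sqrt(2/50)) * x; /* Xavier-style  */
      end;
      output;
   end;
   keep y_constant y_xavier;
run;

/* Compare the variance of y under the two schemes */
proc means data=variance_demo var;
   var y_constant y_xavier;
run;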
Regularization
Regularization is a process of introducing or removing information to stabilize an algorithm’s understanding of data. Regularizations such as early stopping, L1, and L2 have been used extensively in neural networks for many years. These regularizations are still widely used in deep learning, as well. However, there have been advancements in the area of regularization that work particularly well when combined with multi-hidden layer neural networks. Two of these advancements, dropout and batch normalization, have shown significant promise in deep learning models. Let’s begin with a discussion of dropout and then examine batch normalization.
Training an ensemble of deep neural networks with several hundred thousand parameters each might be infeasible. As seen in Figure 1.9, dropout adds noise to the learning process so that the model is more generalizable.
Figure 1.9: Regularization Techniques
The goal of dropout is to approximate an ensemble of many possible model structures through a process that perturbs the learning in an attempt to prevent weights from co-adapting. For example, imagine we are training a neural network to identify human faces, and one of the hidden units used in the model sufficiently captures the mouth. All other hidden units are now relying, at least in some part, on this hidden unit to help identify a face through the presence of the mouth. Removing the hidden unit that captures the mouth forces the remaining hidden units to adjust and compensate. This process pushes each hidden unit to be more of a “generalist” than a “specialist” because each hidden unit must reduce its reliance on other hidden units in the model.
During the process of dropout, hidden units or inputs (or both) are randomly removed from training for a period of weight updates. Removing the hidden unit from the model is as simple as multiplying the unit’s output by zero. The removed unit’s weights are not lost but rather frozen. Each time that units are removed, the resulting network is referred to as a thinned network. After several weight updates, all hidden and input units are returned to the network. Afterward, a new subset of hidden or input units (or both) is randomly selected and removed for several weight updates. The process is repeated until the maximum training iterations are reached or the optimization procedure converges.
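Conceptually, removing a unit is just a multiplication by zero. The following DATA step sketch (the data set name, unit values, and the 5% rate are illustrative only) mimics one thinned network by zeroing out randomly selected hidden-unit outputs:

/* Conceptual sketch of a thinned network: zero out units at random */
data thinned;
   call streaminit(802);
   array h[30];                                  /* outputs of 30 hidden units     */
   do j = 1 to 30;
      h[j] = rand('NORMAL');                     /* placeholder hidden-unit output */
      if rand('UNIFORM') < 0.05 then h[j] = 0;   /* unit dropped for these updates */
   end;
   drop j;
run;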
In SAS Viya, you can specify the DROPOUT= option in an ADDLAYER statement to implement dropout. DROPOUT=ratio specifies the dropout ratio of the layer.
Below is an example of dropout implementation in an ADDLAYER statement.
AddLayer/model='DLNN' name="HLayer1" layer={type='FULLCONNECT' n=30
act='ELU' init='xavier' dropout=.05} srcLayers={"data"};
Note: The ADDLAYER syntax is described shortly and further expanded upon throughout this book.
Batch Normalization
The batch normalization (Ioffe and Szegedy, 2015) operation normalizes information passed between hidden layers per mini-batch by applying a standardizing calculation to each piece of input data. The standardizing calculation subtracts the mini-batch mean of the data and then divides by the mini-batch standard deviation. The result is then multiplied by the value of one learned constant and added to the value of another learned constant.

Thus, the normalization formula is

y = γ · (x − μ) / σ + β

where gamma (γ) and beta (β) are the learned constants, μ is the mini-batch mean, and σ is the mini-batch standard deviation.
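As a small worked sketch, suppose a mini-batch contains the values 2, 4, and 6 for one input, and assume learned constants gamma = 1.5 and beta = 0.5 (all of these values are hypothetical). The calculation can be reproduced in a DATA step:

/* Worked batch normalization sketch with assumed values */
data batchnorm_demo;
   array xb[3] _temporary_ (2 4 6);          /* hypothetical mini-batch values */
   gamma = 1.5;                              /* assumed learned scale constant */
   beta  = 0.5;                              /* assumed learned shift constant */
   mu    = mean(of xb[*]);                   /* mini-batch mean                */
   sigma = std(of xb[*]);                    /* mini-batch standard deviation  */
   do i = 1 to 3;
      x = xb[i];
      y = gamma * (x - mu) / sigma + beta;   /* normalization formula          */
      output;
   end;
   drop i;
run;

proc print data=batchnorm_demo;
run;

With these assumed values, the mini-batch mean is 4 and the standard deviation is 2, so the outputs are -1, 0.5, and 2. (The STD function returns the sample standard deviation, which can differ slightly from the statistic computed internally by a batch normalization layer.)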
Some deep learning practitioners have dismissed the use of sigmoidal activations in the hidden units. Their dismissal might have been premature, however, given the discovery of batch normalization. Without batch normalization, each hidden layer is, in essence, learning from information that is constantly changing when multiple hidden layers are present in a neural network. That is, a weight update relies on second-order, third-order (and so on) effects (the weights in the other layers). This phenomenon is known as internal covariate shift (ICS) (Ioffe and Szegedy, 2015).
There are two schools of thought as to why batch normalization improves the learning process. The first comes from Ioffe and Szegedy, who believe that batch normalization reduces ICS. The second comes from Santurkar, Tsipras, Ilyas, and Madry, who argue that batch normalization is not really reducing ICS but is instead smoothing the error landscape (Santurkar, Tsipras, Ilyas, and Madry 2018). Regardless of which thought prevails, batch normalization has been empirically shown to improve the learning process and reduce neuron saturation.
In the SAS deep learning actions, batch normalization is implemented as a separate layer type and can be placed anywhere after the input layer and before the output layer.
Note: With regard to convolutional neural networks, the batch normalization layer is typically inserted after a convolution or pooling layer.
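For example, a batch normalization layer gets its own ADDLAYER statement. The following sketch uses placeholder model and layer names ('DLNN' and 'ConVLayer1') and assumes a BATCHNORM layer type that follows the same pattern as the earlier ADDLAYER example:

AddLayer/model='DLNN' name="BatchNorm1" layer={type='BATCHNORM' act='RELU'}
srcLayers={"ConVLayer1"};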
Batch Normalization with Mini-Batches
In the case where the source layer to a batch normalization layer contains feature maps, the batch normalization layer computes statistics based on all of the pixels in each feature map, over all of the observations in a mini-batch. For example, suppose that your network is configured for a mini-batch size of 3, and the input to the batch normalization layer consists of two 5 x 5 feature maps. In this case, the batch normalization layer computes two means and two standard deviations. The first mean would be the mean of all of the pixel values in the first feature map, computed across all three observations (3 × 25 = 75 values); the second mean would be the corresponding mean for the second feature map.
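The following sketch (data set and variable names are illustrative, independent of the deep learning actions) reproduces this bookkeeping: one mean and one standard deviation per feature map, each computed over the 3 x 25 = 75 pixel values in the mini-batch.

/* Sketch: per-feature-map statistics over a mini-batch of 3 observations */
data minibatch_long;
   call streaminit(1);
   do obs = 1 to 3;                   /* mini-batch size = 3          */
      do map = 1 to 2;                /* two feature maps             */
         do pixel = 1 to 25;          /* 5 x 5 pixels per feature map */
            value = rand('NORMAL');   /* placeholder pixel value      */
            output;
         end;
      end;
   end;
run;

proc means data=minibatch_long mean std;
   class map;                         /* one mean and one std per feature map */
   var value;
run;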