Deep Learning for Computer Vision with SAS. Robert Blanchard
● activation functions that are more resistant to saturation than conventional activation functions
● fast-moving gradient-based optimizations such as Stochastic Gradient Descent and Adam
● weight initializations that consider the amount of incoming information
● new regularization techniques such as dropout and batch normalization
● innovations in distributed computing.
The elements outlined above are included in today’s SAS software and are described below. Needless to say, deep learning has shown impressive promise in solving problems that were previously considered infeasible.
The process of deep learning is to formulate an outcome by engineering new glimpses (projections) of the input space and then re-engineering those projections in the next hidden layer. This process is repeated for each hidden layer until the output layers are reached. The output layers reconcile the final layer of incoming hidden-unit information to produce a set of outputs. The classic example of this process is facial recognition. The first hidden layer captures shades of the image. The next hidden layer combines the shades to formulate edges. The next hidden layer combines these edges to create projections of ears, mouths, noses, and other distinct aspects that define a human face. The next layer combines these distinct formulations to create a projection of a more complete human face, and so on. A brief comparison of traditional neural networks and deep learning is shown in Table 1.1.
Table 1.1: Traditional Neural Networks versus Deep Learning
Aspect | Traditional | Deep Learning
Hidden activation function(s) | Hyperbolic Tangent (tanh) | Rectified Linear (ReLU) and other variants
Gradient-based learning | Batch GD and BFGS | Stochastic GD, Adam, and L-BFGS
Weight initialization | Constant Variance | Normalized Variance
Regularization | Early Stopping, L1, and L2 | Early Stopping, L1, L2, Dropout, and Batch Normalization
Processor | CPU | CPU or GPU
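As a rough illustration of the layer-by-layer re-projection described above, the following minimal Python/NumPy sketch (not SAS code; the layer sizes, random weights, and tanh hidden activation are arbitrary, illustrative choices) passes one input vector through a stack of hidden layers and an output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 64 inputs, three hidden layers, 2 outputs
layer_sizes = [64, 32, 16, 8, 2]
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.random(64)                      # one flattened "image" (illustrative)
h = x
for W, b in zip(weights[:-1], biases[:-1]):
    h = np.tanh(h @ W + b)              # each hidden layer re-projects the previous layer's output
output = h @ weights[-1] + biases[-1]   # the output layer reconciles the final hidden-unit information
```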
Deep learning incorporates activation functions that are more resistant to neuron saturation than conventional activation functions. One of the classic characteristics of traditional neural networks was the infamous use of sigmoidal transformations in hidden units. Sigmoidal transformations are problematic for gradient-based learning because the sigmoid has two asymptotic regions that can saturate (that is, where the gradient of the output is near zero). See Figure 1.2, where the red or deeper shaded outer areas represent the areas of saturation.
Figure 1.2: Hyperbolic Tangent Function
On the other hand, a linear transformation such as the identity poses little issue for gradient-based learning because its gradient is constant. However, using only linear transformations negates the benefit provided by nonlinear transformations (that is, the ability to approximate nonlinear relationships).
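To make the saturation concrete, here is a small numeric illustration (plain Python/NumPy, not SAS code): the derivative of the hyperbolic tangent, 1 - tanh(x)^2, is nearly zero in the outer regions of Figure 1.2, whereas the identity transformation has a constant gradient of 1.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)**2
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))   # 1.0      -> active region: a useful gradient
print(tanh_grad(2.0))   # ~0.07    -> gradient is shrinking
print(tanh_grad(5.0))   # ~0.0002  -> saturated region: gradient is nearly zero
# The identity (linear) transformation, by contrast, has a gradient of 1 everywhere.
```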
The rectified linear transformation (ReLU) is piecewise linear; units that use it, when combined, can approximate nonlinear functions. (See Figure 1.3.)
Figure 1.3: Rectified Linear Function
In the case of ReLU, the derivative of the transformation is 1 in the active region and 0 in the inactive region. The inactive region can be viewed as a weakness of the transformation because it prevents the unit from contributing to gradient-based learning.
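The following is a minimal Python/NumPy sketch (not SAS code) of the rectified linear transformation and its derivative, showing the gradient of 1 in the active region and 0 in the inactive region.

```python
import numpy as np

def relu(z):
    # max(0, z): linear in the active region, 0 in the inactive region
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for active inputs (z > 0) and 0 for inactive inputs
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z))       # negative inputs are clipped to 0; positive inputs pass through unchanged
print(relu_grad(z))  # gradient is 0 for the two negative inputs and 1 for the two positive ones
```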
The saturation of ReLU can be somewhat mitigated by cleverly initializing the weights to avoid negative output values. For example, consider a business scenario of modeling image data, where each unstandardized input pixel value ranges between 0 and 255. In this case, the weights could be initialized and constrained to be strictly positive so that the combination function cannot produce negative values, keeping the unit out of the inactive region of the ReLU.
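To make this idea concrete, here is a small Python sketch (not SAS code) in which the weights for nonnegative pixel inputs are initialized to strictly positive values; the layer sizes and weight scale are arbitrary, illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

n_pixels, n_hidden = 784, 16                                  # e.g., a flattened 28 x 28 image (illustrative)
pixels = rng.integers(0, 256, size=n_pixels).astype(float)    # unstandardized pixel values in 0-255

# Strictly positive initial weights: absolute values of small Gaussian draws
W = np.abs(rng.normal(scale=0.01, size=(n_pixels, n_hidden)))
b = np.zeros(n_hidden)

pre_activation = pixels @ W + b                   # nonnegative inputs times positive weights
print((pre_activation >= 0).all())                # True: the weighted combinations cannot be negative,
                                                  # so no unit starts in ReLU's inactive region
```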
Other variants of the rectified linear transformation exist that permit learning to continue when the combination function resolves to a negative value. Most notable of these is the exponential linear activation transformation (ELU) as shown in Figure 1.4.
Figure 1.4: Exponential Linear Function
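A minimal Python/NumPy sketch (not SAS code) of the exponential linear transformation: negative inputs produce small negative outputs with a nonzero gradient, so learning can continue where ReLU would be inactive. The default alpha = 1.0 is an assumption made here for illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    # z for z > 0; alpha * (exp(z) - 1) for z <= 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def elu_grad(z, alpha=1.0):
    # 1 for z > 0; alpha * exp(z) (always nonzero) for z <= 0
    return np.where(z > 0, 1.0, alpha * np.exp(z))

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(elu(z))        # negative inputs map into (-alpha, 0) rather than exactly 0
print(elu_grad(z))   # the gradient stays nonzero for negative inputs, unlike ReLU
```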
In some cases, SAS researchers have observed better performance when ELU is used instead of ReLU in convolutional neural networks. SAS includes other popular activation functions that are not shown here, such as softplus and leaky ReLU. Additionally, you can create your own activation functions in SAS using the SAS Function Compiler (FCMP).
Note: Convolutional neural networks (CNNs) are a class of artificial neural networks. CNNs are widely used in image recognition and classification. Like regular neural networks, a CNN consists of multiple layers and a number of neurons. CNNs are well suited for image data, but they can also be used for other problems such as natural language processing. CNNs are detailed in Chapter 2.
The error function defines a surface in the parameter space. For a linear model fit by least squares, the error surface is convex with a unique minimum. However, for a nonlinear model, this error surface is often a complex landscape consisting of numerous deep valleys, steep cliffs, and long-reaching plateaus.
To efficiently search this landscape for an error minimum, optimization must be used. The optimization methods use local features of the error surface to guide their descent. Specifically,
the parameters associated with a given error minimum are located using the following procedure:
1. Initialize the weight vector to small random values, w(0).
2. Use an optimization method to determine the update vector, δ(t).
3. Add the update vector to the weight values from the previous iteration to generate new estimates: w(t+1) = w(t) + δ(t).
4. If none of the specified convergence criteria have been achieved, then go back to step 2.
Here are the three conditions under which convergence is declared (a minimal sketch of the full procedure follows this list):
1. when the specified error function stops improving
2. when the gradient is essentially zero (implying that a minimum has been reached)
3. when the magnitude of the parameters stops changing substantially
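As a concrete illustration of steps 1 through 4 and the three convergence checks, here is a minimal Python sketch (not SAS code); the quadratic error function, learning rate, and tolerances are arbitrary choices, and the update vector is computed by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

def error(w):
    # Illustrative convex error surface: E(w) = sum(w**2)
    return float(np.sum(w ** 2))

def gradient(w):
    # Gradient of the illustrative error function
    return 2.0 * w

learning_rate, tol, max_iters = 0.1, 1e-6, 1000

w = rng.normal(scale=0.01, size=3)         # step 1: small random initial weights w(0)
prev_error = error(w)

for t in range(max_iters):
    delta = -learning_rate * gradient(w)   # step 2: determine the update vector delta(t)
    w = w + delta                          # step 3: w(t+1) = w(t) + delta(t)

    new_error = error(w)
    if (abs(prev_error - new_error) < tol          # 1. error stops improving
            or np.linalg.norm(gradient(w)) < tol   # 2. gradient is essentially zero
            or np.linalg.norm(delta) < tol):       # 3. parameters stop changing substantially
        break                                      # convergence declared
    prev_error = new_error                         # step 4: otherwise, repeat from step 2
```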
Batch Gradient Descent
Reinvented several times, the backpropagation (backprop) algorithm initially just used gradient descent to determine an appropriate set of weights. The gradient,