Deep Learning for Computer Vision with SAS. Robert Blanchard

to a neural network provides little benefit without deep learning methods that underpin the efficient extraction of information. For example, SAS software has had the capability to build neural networks with many hidden layers using the NEURAL procedure for several decades. However, a case can be made to suggest that SAS has not had deep learning because the key elements that enable learning to persist in the presence of many hidden layers had not been discovered. These elements include the use of the following:

      ● activation functions that are more resistant to saturation than conventional activation functions

      ● fast-moving gradient-based optimizations such as stochastic gradient descent and Adam

      ● weight initializations that consider the amount of incoming information

      ● new regularization techniques such as dropout and batch normalization

      ● innovations in distributed computing.

      The elements outlined above are included in today’s SAS software and are described below. Needless to say, deep learning has shown impressive promise in solving problems that were previously considered infeasible.

      The process of deep learning is to formulate an outcome by engineering new representations (glimpses) of the input space and then re-engineering these projections with the next hidden layer. This process is repeated for each hidden layer until the output layers are reached. The output layers reconcile the final layer of incoming hidden unit information to produce a set of outputs. The classic example of this process is facial recognition. The first hidden layer captures shades of the image. The next hidden layer combines the shades to formulate edges. The next hidden layer combines these edges to create projections of ears, mouths, noses, and other distinct aspects that define a human face. The next layer combines these distinct formulations to create a projection of a more complete human face. And so on. A brief comparison of traditional neural networks and deep learning is shown in Table 1.1.

      Table 1.1: Traditional Neural Networks versus Deep Learning

Aspect | Traditional | Deep Learning
Hidden activation function(s) | Hyperbolic Tangent (tanh) | Rectified Linear (ReLU) and other variants
Gradient-based learning | Batch GD and BFGS | Stochastic GD, Adam, and LBFGS
Weight initialization | Constant Variance | Normalized Variance
Regularization | Early Stopping, L1, and L2 | Early Stopping, L1, L2, Dropout, and Batch Normalization
Processor | CPU | CPU or GPU

      Deep learning incorporates activation functions that are more resistant to neuron saturation than conventional activation functions. One of the classic characteristics of traditional neural networks was the infamous use of sigmoidal transformations in hidden units. Sigmoidal transformations are problematic for gradient-based learning because the sigmoid has two asymptotic regions that can saturate (that is, the gradient of the output is near zero). In Figure 1.2, the red (deeper shaded) outer areas represent regions of saturation.

      Figure 1.2: Hyperbolic Tangent Function
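
      The saturation problem is easy to see numerically. The following is a minimal NumPy sketch (illustrative Python, not SAS code) of the hyperbolic tangent and its derivative; in the asymptotic regions shown in Figure 1.2, the derivative collapses toward zero.

import numpy as np

def tanh(x):
    """Hyperbolic tangent activation."""
    return np.tanh(x)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)**2, which approaches 0 as |x| grows."""
    return 1.0 - np.tanh(x) ** 2

# In the saturated (asymptotic) regions the gradient is nearly zero,
# so gradient-based weight updates for the unit become vanishingly small.
for x in (0.0, 2.0, 5.0):
    print(f"x = {x:4.1f}   tanh = {tanh(x):+.4f}   gradient = {tanh_grad(x):.6f}")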

      On the other hand, a linear transformation such as the identity poses little issue for gradient-based learning because the gradient is a constant. However, the use of linear transformations negates the benefit provided by nonlinear transformations (that is, the ability to approximate nonlinear relationships).

      Rectified linear transformation (or ReLU) consists of piecewise linear transformations that, when combined, can approximate nonlinear functions. (See Figure 1.3.)

      Figure 1.3: Rectified Linear Function

      In the case of ReLU, the derivative of the transformation is 1 in the active region and 0 in the inactive region. The inactive region of the ReLU transformation can be viewed as a weakness because it inhibits the unit from contributing to gradient-based learning.
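
      A minimal sketch of ReLU and its derivative (again illustrative Python rather than the SAS implementation):

import numpy as np

def relu(x):
    """Rectified linear transformation: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 in the active region (x > 0), 0 in the inactive region."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1.]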

      The saturation of ReLU can be somewhat mitigated by cleverly initializing the weights to avoid negative output values. For example, consider a business scenario of modeling image data. Each unstandardized input pixel value ranges between 0 and 255. In this case, the weights could be initialized and constrained to be strictly positive so that the combination function cannot produce negative values, keeping the unit out of the inactive region of the ReLU.
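
      As a purely hypothetical illustration of that idea (a sketch under the assumption of non-negative pixel inputs, not a SAS initialization option), strictly positive weights could be drawn as follows:

import numpy as np

rng = np.random.default_rng(0)
n_inputs = 784  # for example, a 28 x 28 image flattened to pixel values in [0, 255]

# Hypothetical strictly positive initialization: taking the absolute value of a
# small random draw guarantees that non-negative pixel inputs cannot drive the
# combination function negative, keeping the unit out of ReLU's inactive region.
weights = np.abs(rng.normal(loc=0.0, scale=0.01, size=n_inputs))
print(weights.min() >= 0)  # True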

      Other variants of the rectified linear transformation exist that permit learning to continue when the combination function resolves to a negative value. Most notable of these is the exponential linear activation transformation (ELU) as shown in Figure 1.4.

      Figure 1.4: Exponential Linear Function
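
      A minimal sketch of ELU and its derivative (illustrative Python; the alpha parameter reflects the common formulation and is an assumption, not necessarily the SAS default). Because the derivative is nonzero for negative inputs, learning continues when the combination function resolves to a negative value.

import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear transformation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise (nonzero, so learning continues)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(elu(x))
print(elu_grad(x))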

      In some cases, SAS researchers have observed better performance when ELU is used instead of ReLU in convolutional neural networks. SAS includes other popular activation functions that are not shown here, such as softplus and leaky ReLU. Additionally, you can create your own activation functions in SAS using the SAS Function Compiler (or FCMP).

      Note: Convolutional neural networks (CNNs) are a class of artificial neural networks. CNNs are widely used in image recognition and classification. Like regular neural networks, a CNN consists of multiple layers and a number of neurons. CNNs are well suited for image data, but they can also be used for other problems such as natural language processing. CNNs are detailed in Chapter 2.

      The error function defines a surface in the parameter space. For a linear model fit by least squares, the error surface is convex with a unique minimum. For a nonlinear model, however, the error surface is often a complex landscape consisting of numerous deep valleys, steep cliffs, and long-reaching plateaus.

      To efficiently search this landscape for an error minimum, optimization must be used. The optimization methods use local features of the error surface to guide their descent. Specifically, the parameters associated with a given error minimum are located using the following procedure:

      1. Initialize the weight vector to small random values, w(0).

      2. Use an optimization method to determine the update vector, δ(t).

      3. Add the update vector to the weight values from the previous iteration to generate new estimates: w(t+1) = w(t) + δ(t).

      4. If none of the specified convergence criteria have been achieved, then go back to step 2.

      Here are the three conditions under which convergence is declared:

      1. when the specified error function stops improving

      2. when the gradient is effectively zero (implying that a minimum has been reached)

      3. when the magnitude of the parameters stops changing substantially
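
      The procedure and convergence criteria above can be expressed as a short conceptual sketch (illustrative Python with a hypothetical error function and gradient; this is plain gradient descent, not the optimizer used by SAS):

import numpy as np

def gradient_descent(error_fn, grad_fn, n_weights, learning_rate=0.1,
                     max_iters=1000, tol=1e-6, seed=0):
    """Minimal gradient descent loop following steps 1-4 and the convergence checks above."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, size=n_weights)     # step 1: small random initial weights w(0)
    prev_error = error_fn(w)
    for _ in range(max_iters):
        delta = -learning_rate * grad_fn(w)       # step 2: compute the update vector delta(t)
        w = w + delta                             # step 3: w(t+1) = w(t) + delta(t)
        error = error_fn(w)
        # step 4 / convergence: the error stops improving, the gradient is near zero,
        # or the parameters stop changing substantially
        if (abs(prev_error - error) < tol
                or np.linalg.norm(grad_fn(w)) < tol
                or np.linalg.norm(delta) < tol):
            break
        prev_error = error
    return w

# Usage on a simple convex error surface, E(w) = ||w - 3||^2:
w_min = gradient_descent(error_fn=lambda w: float(np.sum((w - 3.0) ** 2)),
                         grad_fn=lambda w: 2.0 * (w - 3.0),
                         n_weights=2)
print(w_min)  # approaches [3. 3.]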

      Reinvented several times, the backpropagation (backprop) algorithm initially just used gradient descent to determine an appropriate set of weights. The gradient, ∇E(w), is the vector of partial derivatives of the error function with respect to the weights, and it points in the steepest direction uphill. (See Figure 1.5.)

      Figure 1.5
