
Figure 8 shows the information-passing process. At $t=1$, network $N$ takes in a randomly initialized vector $\boldsymbol{h}^{(0)}$ together with $\boldsymbol{x}^{(1)}$ and outputs $\boldsymbol{h}^{(1)}$; then at $t=2$, $N$ takes in both $\boldsymbol{x}^{(2)}$ and $\boldsymbol{h}^{(1)}$ and outputs $\boldsymbol{h}^{(2)}$. This process is repeated over all data points in the input sequence.

[Figure 7] [Figure 8: information passing in a recurrent neural network]

      Though multiple network blocks are shown on the right side of Figure 8, they share the same structure and weights. A simple example of the process can be written as

      (9) $\boldsymbol{h}^{(t)} = \sigma\!\left(\boldsymbol{W}_1 \boldsymbol{x}^{(t)} + \boldsymbol{W}_2 \boldsymbol{h}^{(t-1)} + \boldsymbol{b}\right)$

      where $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$ are the weight matrices of network $N$, $\sigma(\cdot)$ is an activation function, and $\boldsymbol{b}$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $\boldsymbol{h}^{(T)}$ can be passed into another network to make a prediction. For a sequence-to-sequence model, $\hat{\boldsymbol{y}}^{(t)}$ can be generated based on $\boldsymbol{h}^{(t)}$ and then compared with $\boldsymbol{y}^{(t)}$.
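
To make the recurrence of Equation (9) concrete, here is a minimal NumPy sketch of the forward pass; it is not the authors' implementation, and the dimensions, weight scale, random seed, choice of $\sigma = \tanh$, and random initialization of $\boldsymbol{h}^{(0)}$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the text)
d_x, d_h, T = 3, 4, 10

# Shared parameters of network N, reused at every time step
W1 = rng.normal(scale=0.5, size=(d_h, d_x))     # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(d_h, d_h))     # hidden-to-hidden weights
b = np.zeros(d_h)                               # bias vector

def rnn_forward(xs, h0):
    """Unroll Equation (9) with sigma = tanh."""
    h, hs = h0, []
    for x in xs:                                # one step per sequence element
        h = np.tanh(W1 @ x + W2 @ h + b)        # h^(t) from x^(t) and h^(t-1)
        hs.append(h)
    return hs                                   # hidden states h^(1), ..., h^(T)

xs = [rng.normal(size=d_x) for _ in range(T)]   # input sequence x^(1), ..., x^(T)
h0 = rng.normal(size=d_h)                       # randomly initialized h^(0)
hs = rnn_forward(xs, h0)
print(hs[-1])                                   # h^(T), e.g., input to a classifier head
```

In this sketch, a classification task would feed only the final state `hs[-1]` to a further network, while a sequence-to-sequence model would generate a prediction $\hat{\boldsymbol{y}}^{(t)}$ from each `hs[t]`.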

      However, a drawback of the RNN is that it has trouble "remembering" remote information. In an RNN, long-term memory is reflected in the weights of the network, which memorize remote information via shared weights. Short-term memory takes the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length $T$ is large, the optimization of the RNN suffers from the vanishing gradient problem. For example, if the loss $\ell^{(T)}$ is evaluated at $t=T$, the gradient with respect to $\boldsymbol{W}_1$ calculated via backpropagation can be written as

      (10) $\displaystyle \frac{\partial \ell^{(T)}}{\partial \boldsymbol{W}_1} = \sum_{t=0}^{T} \frac{\partial \ell^{(T)}}{\partial \boldsymbol{h}^{(T)}} \left( \prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} \right) \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W}_1}$

      where $\prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}}$ is the reason for the vanishing gradient. In RNNs, the tanh function is commonly used as the activation function, so

      (11) $\boldsymbol{h}^{(j)} = \tanh\!\left(\boldsymbol{W}_1 \boldsymbol{x}^{(j)} + \boldsymbol{W}_2 \boldsymbol{h}^{(j-1)} + \boldsymbol{b}\right)$
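
Because $\tanh' = 1 - \tanh^2 \le 1$, each factor in the product of Equation (10) is $\partial\boldsymbol{h}^{(j)}/\partial\boldsymbol{h}^{(j-1)} = \mathrm{diag}\!\left(1 - (\boldsymbol{h}^{(j)})^2\right)\boldsymbol{W}_2$, so the repeated product tends to shrink. The sketch below only illustrates this effect under assumed small random weights (sizes, scales, and seed are hypothetical); with large recurrent weights the same product can instead explode.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, T = 3, 4, 50                        # illustrative sizes (assumed)

W1 = rng.normal(scale=0.25, size=(d_h, d_x))  # input-to-hidden weights
W2 = rng.normal(scale=0.25, size=(d_h, d_h))  # hidden-to-hidden weights
b = np.zeros(d_h)

h = rng.normal(size=d_h)                      # h^(0)
J = np.eye(d_h)                               # running product of step Jacobians
for t in range(1, T + 1):
    x = rng.normal(size=d_x)
    h = np.tanh(W1 @ x + W2 @ h + b)          # Equation (11)
    # d h^(t) / d h^(t-1) = diag(1 - h^(t)**2) @ W2, since tanh' = 1 - tanh**2
    J = np.diag(1.0 - h**2) @ W2 @ J
    if t % 10 == 0:
        # Frobenius norm of the accumulated product; with these small assumed
        # weights it decays toward 0, which is the vanishing factor in Equation (10)
        print(t, np.linalg.norm(J))
```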

      6.3 Long Short‐Term Memory Networks
