Figure 8 shows the information passing process. At $t_1$, the network $A$ takes in a randomly initialized vector $h_0$ together with the first input $x_1$ and outputs the hidden state $h_1$; then at $t_2$, $A$ takes in both $h_1$ and $x_2$ and outputs $h_2$. This process is repeated over all data points in the input sequence.
Figure 8 Architecture of recurrent neural network (RNN).
Though multiple network blocks are shown on the right side of Figure 8, they share the same structure and weights. A simple example of the process can be written as
$h_t = \sigma\left(W_h h_{t-1} + W_x x_t + b\right)$ (9)
where $W_h$ and $W_x$ are weight matrices of the network $A$, $\sigma$ is an activation function, and $b$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $h_n$ can be passed into another network to make the prediction. For a sequence‐to‐sequence model, an output $y_t$ can be generated based on $h_t$ and then compared with the corresponding target $\hat{y}_t$.
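As a concrete illustration of Equation (9), the following minimal NumPy sketch unrolls the shared-weight recurrence over a toy sequence; the dimensions, the random initialization scale, and the variable names (`W_h`, `W_x`, `b`, `h0`) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
input_dim, hidden_dim, seq_len = 4, 8, 10

# Parameters of network A, shared across all time steps (Equation 9).
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
b = np.zeros(hidden_dim)                                    # bias vector


def rnn_forward(xs, h0):
    """Unroll h_t = tanh(W_h h_{t-1} + W_x x_t + b) over the sequence xs."""
    h, states = h0, []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # Equation (9) with sigma = tanh
        states.append(h)
    return states


xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence x_1, ..., x_n
h0 = rng.normal(size=hidden_dim)            # randomly initialized vector h_0
hidden_states = rnn_forward(xs, h0)
print(len(hidden_states), hidden_states[-1].shape)  # 10 (8,)
```

In this sketch, a classification task would feed the last element of `hidden_states` into a separate output network, while a sequence‐to‐sequence model would map each hidden state to an output $y_t$.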
However, a drawback of the RNN is that it has trouble “remembering” remote information. In an RNN, long‐term memory is reflected in the weights of the network, which memorize remote information via weight sharing. Short‐term memory takes the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length is large, the optimization of the RNN suffers from the vanishing gradient problem. For example, if the loss $L$ is evaluated at time step $t_n$, the gradient with respect to $h_1$ calculated via backpropagation can be written as
$\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_n} \prod_{t=2}^{n} \dfrac{\partial h_t}{\partial h_{t-1}}$ (10)
where the repeated factor $\partial h_t / \partial h_{t-1}$ is the reason for the vanishing gradient. In RNNs, the tanh function is commonly used as the activation function, so
$\dfrac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - \tanh^2\left(W_h h_{t-1} + W_x x_t + b\right)\right) W_h$ (11)
Therefore, each factor in the product contains the term $1 - \tanh^2(\cdot)$, which lies in $(0, 1]$ and is almost always smaller than 1. When $n$ becomes larger, the gradient gets closer to zero, making it hard to train the network and update the weights with remote information. However, relevant information can be far apart in the sequence, so how to leverage the remote information of a long sequence is an important question.
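The short sketch below (reusing the toy RNN from the previous example, with assumed sizes and small random weights) makes this effect concrete: it accumulates the per-step Jacobians of Equation (11) and prints how the norm of their product, and hence the gradient that can reach the early time steps in Equation (10), shrinks as the distance grows.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 50  # assumed toy sizes

# Small random weights, chosen only to illustrate the shrinking product.
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
h = rng.normal(size=hidden_dim)  # h_0

# Accumulate the product of Jacobians dh_t/dh_{t-1} = diag(1 - h_t^2) W_h (Equation 11).
jacobian_product = np.eye(hidden_dim)
for t, x_t in enumerate(xs, start=1):
    h = np.tanh(W_h @ h + W_x @ x_t + b)                 # Equation (9)
    step_jacobian = np.diag(1.0 - h**2) @ W_h            # Equation (11)
    jacobian_product = step_jacobian @ jacobian_product  # now equals dh_t/dh_0
    if t % 10 == 0:
        # This norm bounds how much gradient signal can flow back t steps (Equation 10).
        print(f"t = {t:2d}, ||dh_t/dh_0|| = {np.linalg.norm(jacobian_product):.2e}")
```

The printed norms decay rapidly with $t$, illustrating why gradients from a distant loss barely update the weights with remote information.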