Figure 8 shows the information passing process. At $t_1$, the network $A$ takes in a randomly initialized vector $h_0$ together with the first input $x_1$ and outputs the hidden state $h_1$; then at $t_2$, $A$ takes in both $h_1$ and $x_2$ and outputs $h_2$. This process is repeated over all data points in the input sequence.
Figure 8 Architecture of recurrent neural network (RNN).
Though multiple network blocks are shown on the right side of Figure 8, they share the same structure and weights. A simple example of the process can be written as
$h_t = \sigma\left(W_h h_{t-1} + W_x x_t + b\right)$ (9)
where $W_h$ and $W_x$ are weight matrices of the network $A$, $\sigma$ is an activation function, and $b$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $h_n$ can be passed into another network to make the prediction. For a sequence‐to‐sequence model, an output $y_t$ can be generated based on $h_t$ and then compared with the corresponding target $\hat{y}_t$.
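As a concrete illustration of Equation (9), the following minimal NumPy sketch unrolls the shared-weight recurrence over a toy sequence; the dimensions, the random initialization scale, and the variable names (`W_h`, `W_x`, `b`, `h0`) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
input_dim, hidden_dim, seq_len = 4, 8, 10

# Parameters of network A, shared across all time steps (Equation 9).
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
b = np.zeros(hidden_dim)                                    # bias vector


def rnn_forward(xs, h0):
    """Unroll h_t = tanh(W_h h_{t-1} + W_x x_t + b) over the sequence xs."""
    h, states = h0, []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # Equation (9) with sigma = tanh
        states.append(h)
    return states


xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence x_1, ..., x_n
h0 = rng.normal(size=hidden_dim)            # randomly initialized vector h_0
hidden_states = rnn_forward(xs, h0)
print(len(hidden_states), hidden_states[-1].shape)  # 10 (8,)
```

In this sketch, a classification task would feed the last element of `hidden_states` into a separate output network, while a sequence‐to‐sequence model would map each hidden state to an output $y_t$.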
However, a drawback of the RNN is that it has trouble “remembering” remote information. In an RNN, long‐term memory is reflected in the weights of the network, which memorize remote information via weight sharing. Short‐term memory takes the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length is large, the optimization of the RNN suffers from the vanishing gradient problem. For example, if the loss $L$ is evaluated at time step $t_n$, the gradient with respect to $h_1$ calculated via backpropagation can be written as
$\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_n} \prod_{t=2}^{n} \dfrac{\partial h_t}{\partial h_{t-1}}$ (10)
where the repeated factor $\partial h_t / \partial h_{t-1}$ is the reason for the vanishing gradient. In RNNs, the tanh function is commonly used as the activation function, so
$\dfrac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - \tanh^2\left(W_h h_{t-1} + W_x x_t + b\right)\right) W_h$ (11)
Therefore, each factor in the product contains the term $1 - \tanh^2(\cdot)$, which lies in $(0, 1]$ and is almost always smaller than 1. When $n$ becomes larger, the gradient gets closer to zero, making it hard to train the network and update the weights with remote information. However, relevant information can be far apart in the sequence, so how to leverage the remote information of a long sequence is an important question.
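The short sketch below (reusing the toy RNN from the previous example, with assumed sizes and small random weights) makes this effect concrete: it accumulates the per-step Jacobians of Equation (11) and prints how the norm of their product, and hence the gradient that can reach the early time steps in Equation (10), shrinks as the distance grows.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 50  # assumed toy sizes

# Small random weights, chosen only to illustrate the shrinking product.
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
h = rng.normal(size=hidden_dim)  # h_0

# Accumulate the product of Jacobians dh_t/dh_{t-1} = diag(1 - h_t^2) W_h (Equation 11).
jacobian_product = np.eye(hidden_dim)
for t, x_t in enumerate(xs, start=1):
    h = np.tanh(W_h @ h + W_x @ x_t + b)                 # Equation (9)
    step_jacobian = np.diag(1.0 - h**2) @ W_h            # Equation (11)
    jacobian_product = step_jacobian @ jacobian_product  # now equals dh_t/dh_0
    if t % 10 == 0:
        # This norm bounds how much gradient signal can flow back t steps (Equation 10).
        print(f"t = {t:2d}, ||dh_t/dh_0|| = {np.linalg.norm(jacobian_product):.2e}")
```

The printed norms decay rapidly with $t$, illustrating why gradients from a distant loss barely update the weights with remote information.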