Читать онлайн книгу - Computational Statistics in Data Science. Группа авторов. Математика. LiveLib

Новинки Лучшее Рекомендации

Информация о книге:

Название:

Автор:

Жанр:

Серия:

Издательство:

Computational Statistics in Data Science - Группа авторов

Скачать книгу

alt="tau left-parenthesis dot right-parenthesis"/> is also a nonlinear transformation with a range in the interval

left-parenthesis 0 comma 1 right-parenthesis

More formally, the map provided by an MLP from a sample to can be written as follows:

StartLayout 1st Row 1st Column delta left-parenthesis bold-italic x Subscript i Baseline comma bold-italic theta Subscript script upper M Baseline right-parenthesis equals modifying above y with caret Subscript i Baseline equals tau left-parenthesis bold-italic upper V Superscript upper T Baseline gamma left-parenthesis bold-italic upper W Superscript upper T Baseline bold-italic x Subscript i Baseline right-parenthesis right-parenthesis 2nd Column Blank EndLayout

where , , , and and are nonlinear functions.

Figure 1 An MLP with three layers.

Most often, and are chosen to be the logistic function . This function is often chosen for the following desirable properties: (i) it is highly nonlinear, (ii) it is monotonically increasing, (iii) it is asymptotically bounded at some finite value in both the negative and positive directions, and (iv) its output lies in the interval , so that it stays relatively close to 0. However, Yann LeCun recommends that a different function be used: . This function retains all of the desirable properties of the logistic function and has the additional advantage of being symmetric about the origin, which results in outputs closer to 0 than the logistic function.

3.3 Training an MLP

We want to choose the weights and biases in such a way that they minimize the sum of squared errors within a given dataset. Similar to the general supervised learning approach, we want to find an optimal prediction such that

(1)

where , and is cross‐entropy loss.

(2)

where is the total number of classes; if the th sample belongs to class , otherwise it is equal to 0; and is the predicted score of the th sample belonging to class .

Function (1) cannot be minimized through differentiation, so we must use gradient descent. The application of gradient descent to MLPs leads to an algorithm known as backpropagation. Most often, we use stochastic gradient descent as that is far faster. Note that backpropagation can be used to train different types of neural networks, not just MLP.

We would like to address the issue of possibly being trapped in local minima, as backpropagation is a direct application of gradient descent to neural networks, and gradient descent is prone to finding local minima, especially in high‐dimensional spaces. It has been observed in practice that backpropagation actually does not typically get stuck in local minima and generally reaches the global minimum. There do, however, exist pathological data examples in which backpropagation will not converge to the global minimum, so convergence to the global minimum is certainly not an absolute guarantee. It remains a theoretical mystery why backpropagation does in fact generally converge to the global minimum, and under what conditions it will do so. However, some theoretical results have been developed to address this question. In particular, Gori and Tesi [7] established that for linearly separable data, backpropagation will always converge to the global solution.

So far, we have discussed a simple MLP with three layers aimed at classification problems. However, there are many extensions to the simple

Скачать книгу

Computational Statistics in Data Science. Группа авторов

Чтение книги онлайн.

Читать онлайн книгу Computational Statistics in Data Science - Группа авторов страница 39

Информация о книге:

3.3 Training an MLP