l"/> can be any loss function that evaluates the distance between delta left-parenthesis bold-italic x Subscript i Baseline comma bold-italic theta Subscript script upper M Baseline right-parenthesis and y Subscript i, such as cross‐entropy loss and square loss.

      2.3 Gradient Descent

      The form of the function $\delta$ will usually be fairly complex, so attempting to find $\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ via direct differentiation will not be feasible. Instead, we use gradient descent to minimize the error function.

      Gradient descent is a general optimization algorithm for finding a minimizer of a given differentiable function. We pick an arbitrary starting point, and then at each iteration we take a small step in the direction of steepest decrease, which is given by the negative of the gradient. The idea is that if we repeat this procedure, we will eventually arrive at a minimum. The algorithm guarantees a local minimum, but not necessarily a global one [4]; see Algorithm 1.

      [Algorithm 1: Gradient descent]
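
      Since the graphic for Algorithm 1 is not reproduced in this excerpt, the following is a minimal Python sketch of the generic gradient-descent update described above; the step size, stopping tolerance, iteration cap, and the toy objective in the usage example are illustrative assumptions rather than details from the text.

import numpy as np

def gradient_descent(grad, theta0, step=0.01, tol=1e-6, max_iter=10000):
    # grad: function returning the gradient of the objective at theta
    # theta0: arbitrary starting point
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        theta_new = theta - step * g                  # step against the gradient (steepest decrease)
        if np.linalg.norm(theta_new - theta) < tol:   # stop once updates become negligible
            return theta_new
        theta = theta_new
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
# theta_min is approximately [3.0], a local (here also global) minimum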

      3.1 Introduction

      A feedforward neural network, also known as a multilayer perceptron (MLP), is a popular supervised learning method that provides a parameterized form for the nonlinear map $\delta$ from an input to a predicted label [6]. The form of $\delta$ here can be depicted graphically as a directed layered network, where the directed edges go upward from nodes in one layer to nodes in the next layer. Neural networks have been shown to be very powerful models, as they can approximate any Borel measurable function to an arbitrary degree of accuracy, provided that the parameters are chosen appropriately.

      3.2 Model Description

      The bottom layer of a three-layer MLP is called the input layer, with each node representing the respective element of an input vector. The top layer is known as the output layer and represents the final output of the model, a predicted vector; each node in the output layer represents the predicted score of one of the classes. The middle layer is called the hidden layer and captures the unobserved latent features of the input. This is the only layer whose number of nodes is determined by the user of the model rather than by the problem itself.

      The directed edges in the network represent weights from a node in one layer to a node in the next layer. We denote the weight from a node $x^i$ in the input layer to a node $h^j$ in the hidden layer by $w_{ij}$. The weight from a node $h^j$ in the hidden layer to a node $\hat{y}^k$ in the output layer will be denoted $v_{jk}$. In each of the input and hidden layers, we introduce intercept nodes, denoted $x^0$ and $h^0$, respectively. Weights from them to any other node are called biases. Each node in a given layer is connected by a weight to every node in the layer above except the intercept node.

      The value of each node in the hidden and output layers is determined as a nonlinear transformation of the linear combination of the values of the nodes in the previous layer and the weights from each of those nodes to the node of interest. That is, the value of $h^j$, $j = 1, \ldots, m$, is given by $\gamma(\boldsymbol{w}_j^T \boldsymbol{x})$, where $\boldsymbol{w}_j = (w_{0j}, \ldots, w_{pj})^T$, $\boldsymbol{x} = (1, x^1, \ldots, x^p)^T$, and $\gamma(\cdot)$ is a nonlinear transformation with range in the interval $(0, 1)$. Similarly, the value of $\hat{y}^k$, $k = 1, \ldots, c$, is given by $\tau(\boldsymbol{v}_k^T \boldsymbol{h})$, where $\boldsymbol{v}_k = (v_{0k}, \ldots, v_{mk})^T$, $\boldsymbol{h} = (1, h^1, \ldots, h^m)^T$, and

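      To make the notation above concrete, here is a minimal Python sketch of the forward pass it describes. The logistic sigmoid stands in for $\gamma$ and, because the excerpt is cut off before $\tau$ is defined, also for $\tau$; that choice, the weight-matrix layout, and the example dimensions are assumptions for illustration only.

import numpy as np

def sigmoid(z):
    # nonlinear transformation with range (0, 1), playing the role of gamma
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W, V):
    # x: input vector of length p
    # W: (p + 1) x m matrix; column j holds w_j = (w_0j, ..., w_pj)^T
    # V: (m + 1) x c matrix; column k holds v_k = (v_0k, ..., v_mk)^T
    x_aug = np.concatenate(([1.0], x))   # prepend intercept node x^0 = 1
    h = sigmoid(W.T @ x_aug)             # hidden-layer values h^1, ..., h^m
    h_aug = np.concatenate(([1.0], h))   # prepend intercept node h^0 = 1
    y_hat = sigmoid(V.T @ h_aug)         # output-layer scores for the c classes
    return y_hat

# Example with p = 3 inputs, m = 4 hidden nodes, c = 2 output classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))              # (p + 1) x m
V = rng.normal(size=(5, 2))              # (m + 1) x c
print(mlp_forward(np.array([0.5, -1.0, 2.0]), W, V))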