model structure and, thus, apply to a wider swath of target distributions or objective functions “out of the box”. Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand. But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central processing units (CPU), graphics processing units (GPU), and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.

      2.1 Big $N$

      Having a large number of observations makes different computational methods difficult in different ways. In the worst case, the exact permutation test requires the production of $N!$ datasets. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets [8, 10]. When, say, population means are of interest, each Monte Carlo iteration requires summations involving $N$ expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression [16, 17]; it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale as $\mathcal{O}(N^3)$. As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression.
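To make the per-iteration cost concrete, the following is a minimal Python sketch of a Monte Carlo permutation test for a difference in means. It is illustrative rather than the book's implementation; the function name `mc_permutation_test` and the add-one smoothing of the p-value are assumptions of the sketch. Each resample performs an $\mathcal{O}(N)$ shuffle plus $\mathcal{O}(N)$ summations over the pooled data.

```python
import numpy as np

def mc_permutation_test(x, y, n_resamples=10_000, seed=None):
    """Monte Carlo permutation test for a difference in means.

    Each iteration relabels the pooled data at random and recomputes the
    statistic, so the total cost grows with both N and n_resamples.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    observed = np.mean(x) - np.mean(y)
    exceed = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)                 # O(N) shuffle and copy
        stat = perm[:n_x].mean() - perm[n_x:].mean()   # O(N) summations
        if abs(stat) >= abs(observed):
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)            # add-one smoothed p-value
```

With $N$ in the tens of thousands and on the order of $10^5$ resamples, this loop already touches on the order of $10^9$ array elements, which is the kind of cost the passage points to.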

      To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck. We are interested in performing Bayesian inference [18] based on a large vector of observations $\mathbf{x} = (x_1, \ldots, x_N)$. We specify our model for the data with a likelihood function $\pi(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \pi(x_n \mid \boldsymbol{\theta})$ and use a prior distribution with density function $\pi(\boldsymbol{\theta})$ to characterize our belief about the value of the $P$-dimensional parameter vector $\boldsymbol{\theta}$ a priori. The target of Bayesian inference is the posterior distribution of $\boldsymbol{\theta}$ conditioned on $\mathbf{x}$,
$$
\pi(\boldsymbol{\theta} \mid \mathbf{x}) = \frac{\pi(\mathbf{x} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})}{\int \pi(\mathbf{x} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}.
$$
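To illustrate why each evaluation of this posterior is expensive when $N$ is large, here is a short Python sketch of the unnormalized log-posterior for a toy model that is an assumption of the example, not the book's: i.i.d. observations $x_n \sim \mathrm{Normal}(\theta, 1)$ with a $\mathrm{Normal}(0, 1)$ prior on a scalar $\theta$. The sum over all $N$ log-likelihood terms is the cost paid on every evaluation.

```python
import numpy as np
from scipy import stats

def log_unnormalized_posterior(theta, x):
    """log pi(x | theta) + log pi(theta), dropping the intractable normalizer.

    Toy model (an illustrative assumption, not the book's): x_n ~ Normal(theta, 1)
    i.i.d., with a Normal(0, 1) prior on the scalar theta. The likelihood is a
    product over all N observations, so every evaluation touches every x_n.
    """
    log_likelihood = np.sum(stats.norm.logpdf(x, loc=theta, scale=1.0))  # sum of N terms
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=1.0)
    return log_likelihood + log_prior
```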

      The denominator's multidimensional integral quickly becomes impractical as $P$ grows large, so we choose to use the Metropolis–Hastings (M–H) algorithm to generate a Markov chain with stationary distribution $\pi(\boldsymbol{\theta} \mid \mathbf{x})$ [19, 20]. We begin at an arbitrary position $\boldsymbol{\theta}^{(0)}$ and, for each iteration $s = 0, \ldots, S$, randomly generate the proposal state $\boldsymbol{\theta}^{\ast}$ from the transition distribution with density $q(\boldsymbol{\theta}^{\ast} \mid \boldsymbol{\theta}^{(s)})$. We then accept the proposal state $\boldsymbol{\theta}^{\ast}$ with probability
$$
a = \min\!\left(1,\ \frac{\pi(\boldsymbol{\theta}^{\ast} \mid \mathbf{x})\, q(\boldsymbol{\theta}^{(s)} \mid \boldsymbol{\theta}^{\ast})}{\pi(\boldsymbol{\theta}^{(s)} \mid \mathbf{x})\, q(\boldsymbol{\theta}^{\ast} \mid \boldsymbol{\theta}^{(s)})}\right),
$$
setting $\boldsymbol{\theta}^{(s+1)} = \boldsymbol{\theta}^{\ast}$ if the proposal is accepted and $\boldsymbol{\theta}^{(s+1)} = \boldsymbol{\theta}^{(s)}$ otherwise. Because the intractable normalizing constant cancels in this ratio, only the unnormalized posterior $\pi(\mathbf{x} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})$ must be evaluated at each iteration.
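Below is a minimal Python sketch of such a sampler, under the assumption of a symmetric Gaussian random-walk proposal, so that the $q$ terms cancel in the acceptance ratio; the function name `metropolis_hastings`, the step size, and the chain length are illustrative choices, not the book's.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter=5_000, step_size=0.5, seed=None):
    """Random-walk Metropolis-Hastings targeting the density exp(log_post).

    The Gaussian proposal is symmetric, q(a | b) = q(b | a), so the proposal
    densities cancel and the acceptance ratio involves only the (unnormalized)
    posterior evaluated at the current and proposed states.
    """
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    current_lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for s in range(n_iter):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, pi(proposal | x) / pi(theta | x)).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            theta, current_lp = proposal, proposal_lp
        chain[s] = theta
    return chain
```

Pairing it with an unnormalized log-posterior such as the earlier sketch, for example `metropolis_hastings(lambda t: log_unnormalized_posterior(t[0], x), theta0=0.0)`, yields draws whose empirical distribution approximates $\pi(\boldsymbol{\theta} \mid \mathbf{x})$.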
