Multicore CPU processing is effective for parallel completion of multiple, mostly independent tasks that do not require intercommunication. One might generate 2 to, say, 72 independent Markov chains on a desktop computer or shared cluster. A positive aspect is that the tasks need not involve the same instruction sets at all; a negative is latency, that is, the slowest process dictates overall runtime. It is possible to further speed up CPU computing with single instruction, multiple data (SIMD) or vector processing. A small number of vector processing units (VPUs) in each CPU core can carry out a single set of instructions on data stored within an extended‐length register. Intel's streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX‐512 allow operations on 128‐, 256‐, and 512‐bit registers, respectively. In the context of 64‐bit double precision, the theoretical speedups for SSE, AVX, and AVX‐512 are two‐, four‐, and eightfold. For example, if a computational bottleneck exists within a for‐loop, one can unroll the loop and perform operations on, say, four consecutive loop bodies at once using AVX [21, 22]. Conveniently, extensions such as OpenMP [97] make SIMD loop optimization transparent to the user [98]. Importantly, SIMD and multicore optimization play well together, providing multiplicative speedups.
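The following is a minimal sketch of how the two levels of parallelism combine, assuming a C compiler with OpenMP support (e.g., gcc -fopenmp -O2); the Gaussian log‐likelihood kernel and the function names are hypothetical, chosen only to illustrate the pattern.

/* Hypothetical per-chain log-likelihood with a SIMD-friendly inner loop. */
double chain_loglik(const double *x, const double *mu, int n) {
    double total = 0.0;
    /* The simd pragma asks the compiler to pack consecutive iterations
       into vector registers -- with AVX, four doubles per 256-bit
       register; reduction(+:total) keeps per-lane partial sums correct. */
    #pragma omp simd reduction(+:total)
    for (int i = 0; i < n; i++) {
        double r = x[i] - mu[i];
        total += -0.5 * r * r;
    }
    return total;
}

/* Coarse-grained multicore parallelism across independent chains;
   combined with the vectorized loop above, the speedups multiply. */
void all_chains_loglik(const double *x, double *const *mu,
                       double *out, int n_chains, int n) {
    #pragma omp parallel for
    for (int c = 0; c < n_chains; c++)
        out[c] = chain_loglik(x, mu[c], n);
}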
While a CPU may have tens of cores, GPUs accomplish fine‐grained parallelization with thousands of cores that apply a single instruction set to distinct data within smaller workgroups of tens or hundreds of cores. Quick communication and shared cache memory within each workgroup balance full parallelization across groups, and dynamic on‐ and off‐loading of the many tasks hides the latency that is so problematic for multicore computing. Originally designed for the efficiently parallelized matrix calculations arising from image rendering and transformation, GPUs easily speed up tasks that are tensor‐multiplication intensive, such as deep learning [99], but general‐purpose GPU applications abound. Holbrook et al. [21] provide a broader review of parallel computing within computational statistics. The same paper reports a GPU providing 200‐fold speedups over single‐core processing and 10‐fold speedups over 12‐core AVX processing for likelihood and gradient calculations while sampling from a Bayesian multidimensional scaling posterior using HMC at scale. Holbrook et al. [22] report similar speedups for inference based on spatiotemporal Hawkes processes. Neither application involves matrix or tensor manipulations.
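As a concrete, if simplified, illustration of this fine‐grained model, the sketch below offloads an element‐wise kernel to a GPU using OpenMP's target directives rather than a dedicated GPU language; it assumes a toolchain built with device offloading (e.g., clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda), and the squared‐residual kernel is hypothetical.

/* Hypothetical element-wise kernel offloaded to a GPU. "teams" creates
   workgroups and "distribute parallel for" spreads iterations across
   the cores within them, mirroring the workgroup model described above. */
void squared_residuals(const double *x, const double *mu,
                       double *out, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n], mu[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; i++) {
        double r = x[i] - mu[i];
        out[i] = r * r;   /* each GPU thread handles one element */
    }
}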
A quantum computer acts on complex data vectors of magnitude 1 called qubits with gates that are mathematically equivalent to unitary operators [100]. Assuming that engineers overcome the tremendous difficulties involved in building a practical quantum computer (where practicality entails simultaneous use of many quantum gates with little additional noise), twenty‐first century statisticians might have access to quadratic or even exponential speedups for extremely specific statistical tasks. We are particularly interested in the following four quantum algorithms: quantum search [101], or finding a single 1 amid a collection of 0s, only requires O(√N) function evaluations for a collection of N elements, a quadratic speedup over the O(N) evaluations required classically.
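To make the qubit formalism concrete, the textbook example below (an illustration added here, not taken from the source) writes a single qubit as a unit‐norm vector in the complex plane C² and shows the Hadamard gate, a standard unitary operator:

% A single qubit is a unit-norm complex vector; gates act as unitaries.
\[
|\psi\rangle = \alpha|0\rangle + \beta|1\rangle,
\qquad |\alpha|^{2} + |\beta|^{2} = 1,
\]
% Example gate: the Hadamard H is unitary and maps |0> to an equal
% superposition of the two basis states.
\[
H = \frac{1}{\sqrt{2}}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
H^{\dagger}H = I,
\qquad
H|0\rangle = \frac{|0\rangle + |1\rangle}{\sqrt{2}} .
\]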