Computational Statistics in Data Science


featured in a range of statistical and data science applications [46]. Traditionally, such techniques were commonly applied in the $N \le P$ setting, and correspondingly computational algorithms focused on this situation [47], especially within the Bayesian literature [48].

      Due to a growing number of large-scale data collection initiatives and new types of scientific inquiry made possible by emerging technologies, however, datasets that are big-$N$ and big-$P$ at the same time are increasingly common. For example, modern observational studies using health-care databases routinely involve $N \approx 10^5 \sim 10^6$ patients and $P \approx 10^4 \sim 10^5$ clinical covariates [49]. The UK Biobank provides brain imaging data on $N = 100{,}000$ patients, with $P = 100 \sim 200{,}000$, depending on the scientific question of interest [50]. Single-cell RNA sequencing can generate datasets with $N$ (the number of cells) in the millions and $P$ (the number of genes) in the tens of thousands, with the trend indicating further growth in data size to come [51].

      3.1.1 Continuous shrinkage: alleviating big $M$

$$\theta_p \mid \lambda_p, \tau \sim \mathcal{N}\!\left(0, \tau^2 \lambda_p^2\right), \qquad \lambda_p \sim \pi_{\mathrm{local}}(\cdot), \qquad \tau \sim \pi_{\mathrm{global}}(\cdot)$$

      The idea is that the global scale parameter $\tau \le 1$ would shrink most $\theta_p$'s toward zero, while the local scales $\lambda_p$, with their heavy-tailed prior $\pi_{\mathrm{local}}(\cdot)$, allow a small number of $\tau\lambda_p$'s, and hence $\theta_p$'s, to be estimated away from zero. While motivated by two different conceptual frameworks, the spike-and-slab can be viewed as a subset of global–local priors in which $\pi_{\mathrm{local}}(\cdot)$ is chosen as a mixture of delta masses placed at $\lambda_p = 0$ and $\lambda_p = \sigma/\tau$. Continuous shrinkage mitigates the multimodality of spike-and-slab by smoothly bridging small and large values of $\lambda_p$.
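As a concrete illustration, the sketch below draws coefficients from this global–local hierarchy. It is not code from the chapter: the half-Cauchy choice for $\pi_{\mathrm{local}}$ (as in the horseshoe prior) and all function names are assumptions made for illustration, including a delta-mass local "prior" that recovers the spike-and-slab special case described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_global_local(P, tau, local_sampler):
    """Draw theta_p | lambda_p, tau ~ N(0, tau^2 * lambda_p^2).

    `local_sampler` draws P local scales lambda_p from pi_local;
    this helper name is hypothetical, chosen for illustration.
    """
    lam = local_sampler(P)
    theta = rng.normal(0.0, tau * lam)  # elementwise scale tau * lambda_p
    return theta, lam

# Heavy-tailed local prior: half-Cauchy, as in the horseshoe prior (an
# assumed example choice of pi_local, not the chapter's specific prior).
half_cauchy = lambda P: np.abs(rng.standard_cauchy(P))

def spike_slab_local(P, w=0.1, sigma=1.0, tau=0.1):
    """Spike-and-slab as a special case of pi_local: delta masses at
    lambda_p = 0 (spike, prob 1 - w) and lambda_p = sigma / tau (slab)."""
    in_spike = rng.random(P) > w
    return np.where(in_spike, 0.0, sigma / tau)

theta_hs, _ = sample_global_local(10_000, tau=0.1, local_sampler=half_cauchy)
theta_ss, _ = sample_global_local(10_000, tau=0.1,
                                  local_sampler=lambda P: spike_slab_local(P, tau=0.1))

# With a small global scale, most draws sit near zero, while the heavy
# tail of the half-Cauchy lets a few coefficients escape shrinkage.
print(np.median(np.abs(theta_hs)), np.max(np.abs(theta_hs)))
```

Note how the slab choice $\lambda_p = \sigma/\tau$ makes the product $\tau\lambda_p = \sigma$, so slab coefficients get the usual $\mathcal{N}(0, \sigma^2)$ prior, while spike coefficients are exactly zero; the continuous half-Cauchy version instead bridges these two regimes smoothly.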

      On the other hand, the use of continuous shrinkage priors does not address the increasing computational burden from growing $N$ and $P$ in modern applications. Sparse regression posteriors under global–local priors are amenable to an effective Gibbs sampler, a popular class of MCMC algorithms we describe further in Section 4.1. Under the linear and logistic models, the computational bottleneck of this Gibbs sampler stems from the need for repeated updates of $\boldsymbol{\theta}$ from its conditional distribution