Computational Statistics in Data Science. Various Authors
5 Rise of Data Science
Core Challenges 4 and 5 – fast, flexible, and user‐friendly algo‐ware and hardware‐optimized inference – embody an increasing emphasis on application and implementation in the age of data science. Previously undervalued contributions in statistical computing, for example, hardware utilization, database methodology, computer graphics, statistical software engineering, and the human–computer interface [76], are slowly taking on greater importance within the (rather conservative) discipline of statistics. There is perhaps no better illustration of this trend than Dr. Hadley Wickham's winning the prestigious COPSS Presidents' Award for 2019:
[for] influential work in statistical computing, visualization, graphics, and data analysis; for developing and implementing an impressively comprehensive computational infrastructure for data analysis through R software; for making statistical thinking and computing accessible to large audience; and for enhancing an appreciation for the important role of statistics among data scientists [106].
This success is all the more impressive because Presidents' Awardees have historically been recognized for contributions to statistical theory and methodology, not for scientific software development of the kind Dr. Wickham has carried out for data manipulation [107–109] and visualization [110, 111].
All of this might lead one to ask: does the success of data science portend the declining significance of computational statistics and its Core Challenges? Not at all! At the most basic level, data science's emphasis on application and implementation underscores the need for computational thinking in statistics. Moreover, the scientific breadth of data science brings new applications and models to the attention of statisticians, and these models may require or inspire novel algorithmic techniques. Indeed, we look forward to a golden age of computational statistics, in which statisticians labor within the intersections of mathematics, parallel computing, database methodologies, and software engineering with impact on the entirety of the applied sciences. After all, significant progress toward conquering the Core Challenges of computational statistics requires that we use every tool at our collective disposal.
Acknowledgments
AJH is supported by NIH grant K25AI153816. MAS is supported by NIH grant U19AI135995 and NSF grant DMS1264153.
Notes
1 Statistical inference is an umbrella term for hypothesis testing, point estimation, and the generation of (confidence or credible) intervals for population functionals (mean, median, correlations, etc.) or model parameters.
2 We present the problem of phylogenetic reconstruction in Section 3.2 as one such example arising from the field of molecular epidemiology.
3 The use of “N” and “P” to denote observation and parameter count is common. We have taken liberties in coining the use of “M” to denote mode count.
4 A more numerically stable approach has the same complexity [24].
5 The matrix parameter coincides with for linear regression and for auxiliary Pólya‐Gamma parameter for logistic regression [56, 57].
6 See Nishimura and Suchard [57] and references therein for the role and design of a preconditioner.
References
1 Davenport, T.H. and Patil, D. (2012) Data scientist. Harvard Bus. Rev., 90, 70–76.
2 Google Trends (2020) Data source: Google trends. https://trends.google.com/trends (accessed 12 July 2020).
3 American Statistical Association (2020) Statistics Degrees Total and By Gender, https://ww2.amstat.org/misc/StatTable1987-Current.pdf (accessed 01 June 2020).
4 Cleveland, W.S. (2001) Data science: an action plan for expanding the technical areas of the field of statistics. Int. Stat. Rev., 69, 21–26.
5 Donoho, D. (2017) 50 Years of data science. J. Comput. Graph. Stat., 26, 745–766.
6 Fisher, R.A. (1936) Design of experiments. Br. Med. J., 1 (3923), 554.
7 Fisher, R.A. (1992) Statistical methods for research workers, in Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics) (eds S. Kotz and N.L. Johnson), Springer, New York, NY (especially Section 21.02). doi: 10.1007/978-1-4612-4380-9_6.
8 Wald, A. and Wolfowitz, J. (1944) Statistical tests based on permutations of the observations. Ann. Math. Stat., 15, 358–372.
9 Efron, B. (1992) Bootstrap methods: another look at the jackknife, in Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics) (eds S. Kotz and N.L. Johnson), Springer, New York, NY, pp. 569–593. doi: 10.1007/978-1-4612-4380-9_41.
10 Efron, B. and Tibshirani, R.J. (1994) An Introduction to the Bootstrap, CRC Press.
11 Bliss, C.I. (1935) The comparison of dosage‐mortality data. Ann. Appl. Biol., 22, 307–333 (Fisher introduces his scoring method in appendix).
12 McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd edn, Chapman and Hall, London. Standard book on generalized linear models.
13 Tierney, L. (1994) Markov chains for exploring posterior distributions. Ann. Stat., 22, 1701–1728.
14 Brooks, S., Gelman, A., Jones, G., and Meng, X.‐L. (2011) Handbook of Markov Chain Monte Carlo, CRC Press.
15 Chavan, V. and Phursule, R.N. (2014) Survey paper on big data. Int. J. Comput. Sci. Inf. Technol., 5, 7932–7939.
16 Williams, C.K. and Rasmussen, C.E. (1996) Gaussian processes for regression. Advances in Neural Information Processing Systems, pp. 514–520.
17 Williams, C.K. and Rasmussen, C.E. (2006) Gaussian Processes for Machine Learning, vol. 2, MIT Press, Cambridge, MA.
18 Gelman, A., Carlin, J.B., Stern, H.S. et al. (2013) Bayesian Data Analysis, CRC Press.
19 Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N. et al. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087–1092.
20