Computational Statistics in Data Science. Group of authors
Lastly, the package Statsmodels facilitates data exploration, estimation, and statistical testing [8]. Built at an even higher level than the other packages discussed, Statsmodels employs NumPy, SciPy, Pandas, and Matplotlib. It provides many statistical models, including linear regression, generalized linear models, and time series models, as well as probability distributions. See http://www.statsmodels.org/stable/index.html for the full feature list.
In addition to the four libraries discussed above, Python features numerous other packages tailored to particular tasks. For ML, the TensorFlow and PyTorch packages are widely used, and for Bayesian inference, Pyro and NumPyro are becoming popular (see more on these packages in Section 4). For big data computations, PySpark provides scalable tools to handle memory and computation time issues. For advanced data visualization, pyplot, seaborn, and plotnine may be worth adopting for a Python‐inclined data scientist.
Python's easy‐to‐learn syntax, speed, and versatility make it a favorite among programmers. Moreover, the packages listed above transform Python into a well‐developed vehicle for data science. We see Python's popularity only increasing in the future. Some believe that Python will eventually eliminate the need for R. However, we feel that the immediate future lies in a Python + R paradigm. Thus, R users may well consider exploring what Python offers as the languages have complementary features.
2.3 SAS®
SAS was born during the late 1960s, within the Department of Experimental Statistics at North Carolina State University. As the software developed, the SAS Institute was formed in 1976. Since its infancy, SAS has evolved into an integrated system for data analysis and exploration. The SAS system has been used in numerous business areas and academic institutions worldwide.
SAS provides packages to support various data analytic tasks. The SAS/STAT component contains capabilities one normally associates with data analysis. SAS/STAT supports analysis of variance (ANOVA), regression, categorical data analysis, multivariate analysis, survival analysis, psychometric analysis, cluster analysis, and nonparametric analysis. The SAS/INSIGHT package implements visualization strategies. Visualizations can be linked across multiple windows to uncover trends, spot outliers, and readily discern subtle patterns. Finally, SAS provides the user with a matrix‐programming language via the SAS/IML system. The matrix‐based language allows custom statistical algorithm development.
Recently, SAS's popularity has diminished [4]; yet, it remains widely used. Open‐source competitors threaten SAS's previous overall market dominance. We do not expect SAS to disappear entirely; rather, we see it becoming a niche product. For now, SAS expertise remains in demand in certain roles and industries.
2.4 SPSS®
Norman H. Nie, C. Hadlai (Tex) Hull, and Dale Bent developed SPSS in the late 1960s. The trio were Stanford University graduate students at the time. SPSS was founded in 1968 and incorporated in 1975. SPSS became publicly traded in 1993. Now, IBM owns the rights to SPSS. Originally, developers designed SPSS for mainframe use. In 1984, SPSS introduced SPSS/PC
SPSS features a wide variety of analytic capabilities, including regression, classification trees, table creation, exact tests, categorical analysis, trend analysis, conjoint analysis, missing value analysis, map‐based analysis, and complex samples analysis. In addition, SPSS supports numerous stand‐alone products including Amos™ (a structural equation modeling package), SPSS Text Analysis for Surveys™ (a survey analysis package utilizing natural language processing (NLP) methodology), SPSS Data Entry™ (a web‐based data entry package; see Web Based Data Management in Clinical Trials), AnswerTree® (a market segment targeting package), SmartViewer® Web Server™ (a report‐generation and dissemination package), SamplePower® (a sample size calculation package), DecisionTime® and What if?™ (a scenario analysis package for the nonspecialist), SmartViewer® for Windows (a graph/report sharing utility), SPSS WebApp Framework (a web‐based analytics package), and the Dimensions Development Library (a data capture library).
SPSS remains popular, especially in scholarly work [4]. For many researchers who apply standard models, SPSS gets the job done. We see SPSS remaining a useful tool for practitioners across many fields.
3 Noteworthy Statistical Software and Related Tools
Next, we discuss noteworthy statistical software, aiming to provide essential details for a fairly complete survey of the most commonly used statistical software and related tools.
3.1 BUGS/JAGS
The BUGS (Bayesian inference using Gibbs sampling) project led to some of the most popular general‐purpose Bayesian posterior sampling programs: WinBUGS [10] and, later, OpenBUGS, the open‐source equivalent. BUGS began in 1989 in the MRC Biostatistics Unit, Cambridge University. The project in part led to a rapid expansion of applied Bayesian statistics due to its pioneering timing, relative ease of use, and broad range of applicable models.
JAGS (Just Another Gibbs Sampler) [11] was developed as a cross‐platform engine for the BUGS modeling language. A secondary goal was to provide extensibility, allowing user‐specific functions, distributions, and sampling algorithms. The BUGS/JAGS approach to specifying probabilistic models has become standard in other related software (e.g., NIMBLE). Both BUGS and JAGS are still widely used and are well suited for tasks of small‐to‐medium complexity. However, for highly complex models and big data problems, similar but more powerful Bayesian inference engines are emerging, for example, Stan and Pyro (see Section 4 for more details).
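To give a feel for the Gibbs sampling idea behind the BUGS and JAGS names, here is a minimal hand-written sketch in Python. It is not BUGS or JAGS code, and it does not reflect how those engines choose their samplers. The target distribution (a standard bivariate normal with correlation rho), the function name, and the parameter values are all assumptions chosen because the full conditionals are known in closed form: x given y is normal with mean rho*y and variance 1 - rho^2, and symmetrically for y given x.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Hypothetical helper for illustration: alternately draws each coordinate
    from its full conditional given the current value of the other.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of each full conditional
    x, y = 0.0, 0.0                       # arbitrary starting point
    draws = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)   # draw x | y
        y = rng.gauss(rho * x, cond_sd)   # draw y | x
        if i >= burn_in:                  # discard the warm-up iterations
            draws.append((x, y))
    return draws


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
xs = [d[0] for d in draws]
ys = [d[1] for d in draws]

# Sample means and correlation, computed by hand to stay in the stdlib
n = len(draws)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
var_x = sum((a - mean_x) ** 2 for a in xs) / n
var_y = sum((b - mean_y) ** 2 for b in ys) / n
corr = cov_xy / math.sqrt(var_x * var_y)

print(f"mean_x={mean_x:.3f}, mean_y={mean_y:.3f}, corr={corr:.3f}")
```

In BUGS or JAGS the user would instead write only the model specification and let the engine derive and run suitable samplers, which is a large part of the ease of use noted above.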
3.2 C++