Computational Statistics in Data Science. Group of authors


efficient methods for data sorting, splicing, merging, grouping, and indexing. Pandas implements robust input/output tools – supporting flat files, Excel files, databases, and HDF files. Additionally, Pandas provides visualization methods via Matplotlib [9].
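
The operations named above can be sketched in a few lines. This is only a minimal illustration, assuming Pandas is installed; the two tables, their column names, and their values are hypothetical and not taken from the text.

```python
import pandas as pd

# Hypothetical toy data, made up purely for illustration.
sales = pd.DataFrame({
    "store": ["A", "B", "A", "C", "B"],
    "units": [10, 4, 7, 3, 6],
})
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "region": ["North", "South", "North"],
})

# Merging: attach each store's region to its sales rows.
merged = sales.merge(stores, on="store", how="left")

# Grouping: total units sold per region.
units_by_region = merged.groupby("region")["units"].sum()

# Sorting: rows ordered by units, largest first.
ranked = merged.sort_values("units", ascending=False)

# Indexing: label-based lookup on the grouped result.
north_total = units_by_region.loc["North"]
```

Here `merge` performs a left join, so every sales row is kept even if a store had no matching region.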

      Lastly, the package Statsmodels facilitates data exploration, estimation, and statistical testing [8]. Built at an even higher level than the other packages discussed, Statsmodels employs NumPy, SciPy, Pandas, and Matplotlib. Many statistical models are available, such as linear regression, generalized linear models, probability distributions, and time series. See http://www.statsmodels.org/stable/index.html for the full feature list.

      In addition to the four libraries discussed above, Python features numerous other packages tailored to particular tasks. For ML, the TensorFlow and PyTorch packages are widely used, and for Bayesian inference, Pyro and NumPyro are becoming popular (see more on these packages in Section 4). For big data computations, PySpark provides scalable tools to handle memory and computation time issues. For advanced data visualization, pyplot, seaborn, and plotnine may be worth adopting for a Python‐inclined data scientist.

      Python's easy‐to‐learn syntax, speed, and versatility make it a favorite among programmers. Moreover, the packages listed above transform Python into a well‐developed vehicle for data science. We see Python's popularity only increasing in the future. Some believe that Python will eventually eliminate the need for R. However, we feel that the immediate future lies in a Python + R paradigm. Thus, R users may well consider exploring what Python offers as the languages have complementary features.

      2.3 SAS®

      SAS was born during the late 1960s, within the Department of Experimental Statistics at North Carolina State University. As the software developed, the SAS Institute was formed in 1976. Since its infancy, SAS has evolved into an integrated system for data analysis and exploration. The SAS system has been used in numerous business areas and academic institutions worldwide.

      Recently, SAS's popularity has diminished [4]; yet, it remains widely used. Open‐source competitors threaten SAS's previous overall market dominance. We expect SAS to become a niche product rather than disappear entirely. For now, however, SAS expertise remains in demand in certain roles and industries.

      2.4 SPSS®

      Norman H. Nie, C. Hadlai (Tex) Hull, and Dale Bent developed SPSS in the late 1960s. The trio were Stanford University graduate students at the time. SPSS was founded in 1968 and incorporated in 1975. SPSS became publicly traded in 1993. Now, IBM owns the rights to SPSS. Originally, developers designed SPSS for mainframe use. In 1984, SPSS introduced SPSS/PCplus for computers running MS‐DOS, followed by a UNIX release in 1988 and a Macintosh version in 1990. SPSS features an intuitive point‐and‐click interface. This design empowers a broad user base to conduct standard analyses.

      SPSS features a wide variety of analytic capabilities, including regression, classification trees, table creation, exact tests, categorical analysis, trend analysis, conjoint analysis, missing value analysis, map‐based analysis, and complex samples analysis. In addition, SPSS supports numerous stand‐alone products including Amos™ (a structural equation modeling package), SPSS Text Analysis for Surveys™ (a survey analysis package utilizing natural language processing (NLP) methodology), SPSS Data Entry™ (a web‐based data entry package; see Web Based Data Management in Clinical Trials), AnswerTree® (a market segment targeting package), SmartViewer® Web Server™ (a report‐generation and dissemination package), SamplePower® (a sample size calculation package), DecisionTime® and What if?™ (a scenario analysis package for the nonspecialist), SmartViewer® for Windows (a graph/report sharing utility), SPSS WebApp Framework (a web‐based analytics package), and the Dimensions Development Library (a data capture library).

      SPSS remains popular, especially in scholarly work [4]. For many researchers who apply standard models, SPSS gets the job done. We see SPSS remaining a useful tool for practitioners across many fields.

      Next, we discuss noteworthy statistical software, aiming to provide essential details for a fairly complete survey of the most commonly used statistical software and related tools.

      3.1 BUGS/JAGS

      JAGS (Just Another Gibbs Sampler) [11] was developed as a cross‐platform engine for the BUGS modeling language. A secondary goal was to provide extensibility, allowing user‐specific functions, distributions, and sampling algorithms. The BUGS/JAGS approach to specifying probabilistic models has become standard in other related software (e.g., NIMBLE). Both BUGS and JAGS are still widely used and are well suited for tasks of small‐to‐medium complexity. However, for highly complex models and big data problems there are similar, more‐powerful Bayesian inference engines emerging, for example, STAN and Pyro (see Section 4 for more details).
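
To make the idea behind a Gibbs sampler concrete, here is a minimal hand-written sketch in plain Python. It is not JAGS or BUGS code and does not use the BUGS modeling language; the target distribution (a standard bivariate normal with correlation `rho`), the function name, and all parameter values are assumptions chosen for illustration. Each step draws one coordinate from its conditional distribution given the other, using the fact that x given y is normal with mean rho*y and variance 1 - rho^2.

```python
import random
import statistics

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5  # sd of each full conditional
    x, y = 0.0, 0.0                    # arbitrary starting point
    draws = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # draw x | y
        y = rng.gauss(rho * x, cond_sd)  # draw y | x
        if i >= burn_in:                 # discard warm-up iterations
            draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
xs = [d[0] for d in draws]
ys = [d[1] for d in draws]

# Summaries of the draws: sample mean of x and sample correlation.
mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / len(xs)
sample_corr = cov_xy / (statistics.pstdev(xs) * statistics.pstdev(ys))
```

With enough iterations, the sample mean should be near 0 and the sample correlation near the chosen `rho`. Engines such as JAGS automate the derivation of these conditional updates from a model specification.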

      3.2 C++

      C++ is a general‐purpose, high‐performance programming language. Unlike scripting languages for statistics such as R and Python, C++ is a compiled language – adding complexity (such as memory management) and strict syntax requirements. As such, C++'s design may complicate prototyping. Thus, data scientists typically turn to C++ to optimize/scale a developed algorithm at the production level.

      C++'s standard libraries lack many mathematical and statistical operations. However, since C++ can be compiled cross‐platform, developers often interface C++ functions from different languages (e.g., R and Python). Thus, C++ can be used to develop libraries across languages, offering
