Computational Statistics in Data Science. Group of authors
2.1.1 Why use R over Python or Minitab?
R is tailored to working with data and performing statistical analysis in a way that is more consistent and extensible than Python. The syntax for accessing data in lists and data frames is convenient, with tab completion showing which elements an object contains. Creating documents, reports, notebooks, presentations, and web pages is possible through R Markdown and RStudio.
Through the use of the metapackage tidyverse or the library data.table, working with tabular data is direct, efficient, and intuitive. Because R is a scripted language, reproducible workflows are possible, and steps in the process of extracting and transforming data are easy to go back and modify without disrupting the analysis. While this is a virtue shared among all scripting languages, the nature of reproducible results and modular code saves time compared to a point‐and‐click interface like that of Excel or Minitab.
2.1.2 Where can users find R support?
R has a large online support community and built-in documentation within the software itself. Most libraries provide documentation and examples for their functions and objects, which can be accessed with the ? operator at the command line (e.g., type ?glm for help about creating a generalized linear model). These help documents are displayed directly in the console or, if using RStudio, in the help panel with extra links to related functions. For more in-depth documentation, some developers provide vignettes for their packages. Vignettes are long-form documentation that demonstrates how to use the functionality in a package and ties it together with a working example.
The online R community is lively, and the people are often helpful. Searching for any question about R or its packages will often lead you to a post on Stack Overflow (https://stackoverflow.com/) or Reddit (either r/rstats or r/RStudio). There is also the RStudio Community (https://community.rstudio.com/), where you can ask questions about features specific to the IDE. It is rare to encounter an R programming challenge that has not been addressed somewhere online; when that does happen, a well-posed question posted on such forums is usually answered quickly. Twitter also has an active community of developers who sometimes respond directly (such as @rstudio or @hadleywickham).
2.1.3 How easy is R to develop?
Developing packages and analyses in R is becoming steadily easier. This is largely due to the efforts of RStudio, which releases polished new tools and support software on a regular basis. Their software “combine robust and reproducible data analysis with tools to effectively share data products.” One package that integrates well with RStudio is devtools, written by Dr Hadley Wickham, the chief scientist at RStudio. devtools provides a plethora of tools to create, test, and export R packages. It has grown so comprehensive that its developers have split the project into several smaller packages, such as testthat (for writing tests), roxygen2 (for writing R documentation), usethis (for automating package setup, data, imports, etc.), and a few others that provide convenient tools for building and testing packages.
2.1.4 What is the downside of R?
R is slow. Or at least that is the perception, and sometimes it is the case. This is because R is an interpreted rather than a compiled language, so flow-control constructs such as for-loops are not optimized. This shortcoming can often be circumvented by taking advantage of the vectorization offered through built-in functions, such as those from the apply family in R, but these faster techniques often go unused, either through lack of proficiency or because it is easier to write a for-loop. Intrinsically slow functions can be written in C++ and run via Rcpp, but that negates the simplicity of writing R. This is one area where Python can surpass R. Python is also a scripted language, but through the use of NumPy and numba it gains fast vectorized operations and loops, and it can make use of a just-in-time (JIT) compiler. As a result, many performance shortcomings of Python code can be addressed by adding a decorator.
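To illustrate the contrast between an interpreted loop and a vectorized operation, here is a minimal sketch in Python using NumPy. The function names and the sum-of-squares task are hypothetical choices made for illustration and do not come from the text. The numba route mentioned above would instead decorate the loop version with a JIT decorator; it is left out here so that the sketch depends only on NumPy.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Accumulate the sum of squares one element at a time (interpreted loop)."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Compute the same quantity with a single vectorized NumPy call."""
    arr = np.asarray(values, dtype=float)
    # The dot product of a vector with itself is its sum of squares;
    # the element-wise work runs inside NumPy's compiled code.
    return float(np.dot(arr, arr))


data = np.arange(1, 6, dtype=float)  # 1.0, 2.0, 3.0, 4.0, 5.0
loop_result = sum_of_squares_loop(data)
vec_result = sum_of_squares_vectorized(data)
```

Both functions return the same value; on large arrays the vectorized version is typically much faster because the per-element work is not executed by the interpreter.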
Packages are often not written by programmers, or at least not by programmers by trade or education. A great many libraries for R are written by researchers and analysts who needed a tool and created it themselves. Because of this, there is often fragmentation in syntax, incompatibility between packages, or a general lack of best practices that leads to poorly performing code or, in the most drastic cases, code that simply gives erroneous results.
2.1.5 Summary of R
R is firmly entrenched as a premier statistical software package. Its open-source, community-based approach has taken the statistical software scene by storm. R's interactive and scripting programming style makes it an attractive and flexible analytic tool. R does lack the speed and flexibility of other languages; yet, for a specialist in statistics, R provides a near-complete solution. RStudio's efforts further solidify R as a key player in the modern statistical software ecosystem. We see the popularity of R continuing; however, the demands of big data could force R programmers to adopt other tools in conjunction with R if companies and developers fail to keep up with tomorrow's challenges.
2.2 Python
Created by Guido van Rossum and released in 1991, Python is a hugely popular programming language [4]. Python features readable code, an interactive workflow, and an object-oriented design. Python's architecture affords rapid application development from prototyping to production. Additionally, many tools integrate nicely with Python, facilitating complex workflows. Python also possesses speed, as most of its high-performance libraries are implemented in C/C++.
Python's core distribution lacks statistical features, prompting developers to create supplementary libraries. Below, we detail four well‐supported statistical and mathematical libraries: NumPy [5], SciPy [6], Pandas [7], and Statsmodels [8].
NumPy is a general and fundamental package for scientific computing [5]. NumPy provides functions for operations on large arrays and matrices, optimized for speed via a C implementation. The package features a dense, homogeneous array called ndarray. ndarray provides computational efficiency and flexibility. Developers consider NumPy a low‐level tool as only foundational functions are available. To enhance capabilities, other statistical libraries and packages use NumPy to provide richer features.
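As a small illustration of the dense, homogeneous ndarray described above, the following sketch builds a two-dimensional array and applies a few foundational operations. The variable names and values are made up for the example.

```python
import numpy as np

# A 2 x 3 array; every element shares one dtype (homogeneous storage).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]], dtype=np.float64)

# Column means, computed along axis 0 (down the rows).
col_means = matrix.mean(axis=0)

# Broadcasting subtracts the length-3 vector of means from each row.
centered = matrix - col_means

# Mixing integers and floats yields a single common dtype for the array.
mixed = np.array([1, 2.5])
```

Here `matrix.dtype` and `matrix.shape` describe the element type and dimensions, and the centered columns sum to zero. Operations such as `mean` and the broadcast subtraction run in compiled code rather than in a Python-level loop.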
One widely used higher-level package, SciPy, builds on NumPy to support engineering and data science work [6]. SciPy contains modules addressing standard problems in scientific computing, such as numerical integration, linear algebra, optimization, statistics, clustering, and image and signal processing.
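The sketch below touches three of the SciPy modules listed above: integration, optimization, and statistics. The specific problems (integrating the sine function, minimizing a quadratic, and looking up a normal quantile) are illustrative choices, not examples taken from the text.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Numerical integration: the integral of sin(x) over [0, pi] is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a one-dimensional quadratic whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Statistics: the 97.5th percentile of the standard normal distribution,
# the familiar critical value of about 1.96.
z_critical = stats.norm.ppf(0.975)
```

Each call returns its answer as plain Python or NumPy values (`area`, `result.x`, and `z_critical`), so the results can be passed straight into further NumPy computations.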
Another higher level Python package built upon NumPy, Pandas, is designed particularly for data analysis, providing standard models and cohesive frameworks [7]. Pandas implements a data type named DataFrame – a concept