Computational Statistics in Data Science - Group of authors


Y Y Y Open source, interactive data science
RStudio Y Y Y Excellent at creating reproducible reports/docs

1.1 Extensible Text Editors: Emacs and Vim

GNU Emacs (https://www.gnu.org/software/emacs/) is free software and offers a powerful environment for working with statistical software. Emacs is an extensible and customizable text editor that can be used for most computer‐based tasks. Once a user has committed the keyboard‐centric interface to muscle memory, editing text for reports or code becomes rapid and outpaces point‐and‐click approaches. Emacs runs on all major operating systems and offers near‐seamless interaction with Linux‐based computing clusters. Its extensibility ensures that, as tools develop and change, your interface remains constant. This stability gives users the confidence to adopt new tools and adapt to new trends in software.

For statistical computing in particular, we note the excellent add‐on package Emacs Speaks Statistics (ESS), which offers a unified user interface for R, S‐Plus, SAS, Stata, and OpenBUGS/JAGS, among other popular statistical packages. An easy‐to‐use package manager provides quick ESS installation. Once ESS is installed, a basic workflow is to open a file of an associated type (.R, .Rmarkdown, etc.) to trigger ESS mode. In ESS mode, code is highlighted, tab completion is enabled for rapid code generation and editing, and help documentation is integrated. Code can be evaluated interactively in separate processes (e.g., one or several R sessions) or run noninteractively through shell processes displayed in Emacs. Statistical visualizations are displayed in separate windows for easy plot development. As mentioned above, one can work seamlessly on remote servers (using TRAMP mode), which greatly reduces the inefficiencies inherent in switching between local and remote machines.

We also mention another popular extensible text editor, Vim (https://www.vim.org/). Vim offers many of the same benefits as Emacs. The relative superiority of Vim and Emacs is a matter of constant debate. We avoid this discussion here and simply admit that the first author is an Emacs user, which explains the emphasis above. This is not a vote of confidence for Emacs over Vim but simply a reflection of familiarity.

1.2 Jupyter Notebooks

The Jupyter Project is an effort to develop open‐source software and services for interactive computing across a variety of popular programming languages such as Python, R, Julia, and C++. The interactive environment is based on notebooks, which contain text cells and code cells. Text cells can mix plain text, Markdown, and LaTeX rendered through the MathJax engine. Code cells can be run, modified, and rerun in any order. This functionality makes it easy to perform data analyses and document your work as you go.
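To make the cell structure concrete, the following is a minimal sketch of how a notebook with one text cell and one code cell might be laid out on disk. It assumes the version 4 notebook JSON layout and uses only the Python standard library; the helper name `make_notebook` and the cell contents are hypothetical, chosen purely for illustration.

```python
import json


def make_notebook(cells):
    """Assemble a notebook dict following the nbformat 4 JSON layout.

    `cells` is a list of (cell_type, source) pairs, where cell_type is
    "markdown" (a text cell) or "code". Hypothetical helper for illustration.
    """
    nb_cells = []
    for kind, source in cells:
        cell = {"cell_type": kind, "metadata": {}, "source": source}
        if kind == "code":
            # Code cells additionally store their outputs and a run counter.
            cell["outputs"] = []
            cell["execution_count"] = None
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = make_notebook([
    # Text cell: Markdown with an inline LaTeX expression.
    ("markdown", "# Sample mean\nThe estimator is $\\bar{x} = \\frac{1}{n}\\sum_i x_i$."),
    # Code cell: can be run, edited, and rerun independently of the text.
    ("code", "xs = [2.0, 4.0, 9.0]\nsum(xs) / len(xs)"),
])

# Serialized form; writing this string to a file ending in .ipynb
# should give a document that Jupyter can open.
notebook_json = json.dumps(notebook, indent=1)
print([cell["cell_type"] for cell in notebook["cells"]])
```

Because text and code live in separate cells of the same file, the narrative and the analysis travel together, which is what makes documenting work "as you go" practical.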

The Jupyter IDE (integrated development environment) runs locally in a web browser and can be configured for remote and multiuser workflows. Because reproducible data science is a core goal of the Jupyter Project, notebooks can be exported and shared online as interactive documents or as static HTML or PDF documents. Services such as mybinder.org let a user upload and run notebooks online so that an analysis is instantly reproducible by anyone.

1.3 RStudio and Rmarkdown

RStudio is an organization that develops free and enterprise‐ready tools for working with the R language. Their IDE (also called RStudio) integrates the R console, file browser, script editor, and more in one unified user interface. Through the use of project‐associated directories and files, entire projects are nearly self‐contained and easily shared among different systems.

Similar to Jupyter Notebooks, RStudio supports a file format called Rmarkdown that allows code to be embedded and executed in a markdown‐style document. The basic setup is a YAML (https://yaml.org/) header, markdown text, and code chunks. This simple structure can be built upon with the knitr package, which can build PDF, HTML, or XML (MS Word) documents and, via the R package rticles, journal‐style documents from the same basic file format. Knitr can also create slideshows just by changing a parameter in the YAML header. This flexibility in document creation is a huge (and unique) advantage of Rmarkdown, and it is easily exploited from the RStudio IDE. Notably, Rmarkdown supports many other programming engines besides R, such as Python, C++, and Julia.
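The three‐part layout described above (YAML header, markdown text, code chunk) can be sketched by assembling a small Rmarkdown document as a string. This is an illustration under stated assumptions, not a recommended workflow: the helper `build_rmd` and the document contents are hypothetical, the chunk uses the Python engine mentioned in the text, and `html_document` and `ioslides_presentation` are assumed here to be the output format names provided by the rmarkdown package.

```python
# Chunk delimiter (three backticks), built programmatically so that this
# listing does not itself contain a literal code fence.
FENCE = "`" * 3


def build_rmd(title, output_format, text, chunk_code, engine="python"):
    """Return the text of a minimal Rmarkdown file (hypothetical helper)."""
    # 1. YAML header: document metadata, including the output format.
    header = "\n".join([
        "---",
        f'title: "{title}"',
        f"output: {output_format}",
        "---",
    ])
    # 2. Markdown text is passed through unchanged.
    # 3. Code chunk: a fenced block whose opening line names the engine.
    chunk = "\n".join([f"{FENCE}{{{engine}}}", chunk_code, FENCE])
    return "\n\n".join([header, text, chunk]) + "\n"


body = "The chunk below is executed when the document is built."
code = "print(1 + 1)"

# Same text and code; only the `output` field in the header differs.
report = build_rmd("Demo", "html_document", body, code)
slides = build_rmd("Demo", "ioslides_presentation", body, code)

print(report)
```

The two strings differ only in the `output:` line of the header, which is the sense in which a report can become a slideshow by changing a single parameter.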

2 Popular Statistical Software

With introductory matters behind us, we now turn to the most popular statistical computing languages. We begin with R, our preferred statistical programming language. This preference leads to an unbalanced discussion relative to the other popular statistical software (Python, SAS, and SPSS); nevertheless, we hope to provide objective recommendations despite the unequal coverage.

2.1 R

R [1] began at the University of Auckland, New Zealand, in the early 1990s. Ross Ihaka and Robert Gentleman needed a statistical environment to use in their teaching lab. At the time, their computer labs featured only Macintosh computers that lacked suitable software. Ihaka and Gentleman decided to implement a language based on an S‐like syntax [2]. R's initial versions were provided to Statlib at Carnegie Mellon University, and the user feedback indicated a positive reception.

R's success encouraged its release under the Open Source Initiative (https://opensource.org/). Developers released the first version in June 1995. A software system under the open‐source paradigm benefits from having “many pairs of eyes to develop the software.” R developed a huge following, and it soon became difficult for the developers to maintain. As a response, a 10‐member core group was formed in 1997. The core team handles any changes to the R source code. The massive R community provides support via online mailing lists (https://www.r-project.org/mail.html) and statistical computing forums – such as Talk Stats (http://www.talkstats.com/), Cross Validated (https://stats.stackexchange.com/), and Stack Overflow (https://stackoverflow.com/). Often users receive responses within a matter of minutes.

From humble beginnings, R has developed into a popular, complete, and flexible statistical computing environment that is appreciated by academia, industry, and government. R's main benefits include support on all major operating systems and comprehensive package archives. Further, R integrates well with document formats (such as LaTeX (https://www.latex-project.org/), HTML, and Microsoft Word) through R Markdown (https://rmarkdown.rstudio.com/) and other file formats to enhance literate programming and reproducible data analysis.

R provides extensive statistical capacity. Nearly any method is available as an R package – the trick is locating the software. The base package and default included packages perform most standard analyses and computation. If the included packages