Computational Statistics in Data Science. Group of authors
Among the C
Eigen is a high‐level, header‐only library developed by Guennebaud et al. [16]. Eigen provides classes for vectors, arrays, and dense, sparse, and large matrices. It also supports matrix decompositions and geometric features. Eigen uses single instruction, multiple data (SIMD) vectorization and avoids dynamic memory allocation for fixed‐size matrices. Eigen also implements further performance optimizations, including loop unrolling and processor‐cache utilization. Eigen itself takes little advantage of parallel hardware, currently supporting parallel processing only for general matrix–matrix products. However, since Eigen can use BLAS‐compatible libraries, users can combine Eigen with an external BLAS library for parallel computing. Python and R users can call Eigen functions using the minieigen and RcppEigen packages.
National ICT Australia (NICTA) developed the open‐source library Armadillo to facilitate science and engineering [17]. Armadillo provides a fast, easy‐to‐use matrix library with MATLAB‐like syntax. Armadillo employs template meta‐programming techniques to avoid unnecessary operations and increase performance. Further, Armadillo supports 3D objects and provides numerous utilities for matrix manipulation and decomposition. Armadillo automatically uses open multiprocessing (OpenMP) [19] to increase speed. Its developers designed Armadillo to balance speed and ease of use. Armadillo is widely used in ML, pattern recognition, signal processing, and bioinformatics. R users may call Armadillo functions through the RcppArmadillo package.
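One common template meta‐programming technique for avoiding unnecessary operations is the expression template, in which an operator such as + returns a lightweight description of the computation rather than a finished result. The sketch below is only an illustration of that idea under stated assumptions: it uses the C++ standard library alone, the names Vec and Sum are hypothetical, and it is not Armadillo's actual code or API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration types; these are NOT Armadillo classes.

// Sum<L, R> records "lhs + rhs" without computing anything yet.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    // Element i of the sum is computed only when it is requested.
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;

    explicit Vec(std::size_t n, double value = 0.0) : data(n, value) {}

    // Building a Vec from an expression evaluates the whole expression
    // in one loop, so chained sums create no intermediate vectors.
    template <typename L, typename R>
    Vec(const Sum<L, R>& expr) : data(expr.size()) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = expr[i];
        }
    }

    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Vec + Vec builds an expression node instead of a new vector.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    return {a, b};
}

// (expression) + Vec extends the expression, which covers left-to-right
// chains such as a + b + c.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    assert(a.size() == b.size());
    return {a, b};
}
```

With these definitions, a statement such as `Vec r = a + b + c;` runs a single loop over the elements, whereas a naive operator+ that returns a vector would allocate a temporary for a + b first. The expression nodes hold references, so they should be consumed within the same statement that creates them.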
Blaze is a high‐performance math library for dense and sparse arithmetic developed by Iglberger et al. [18]. Blaze makes extensive use of LAPACK functions for computing tasks such as matrix decomposition and inversion. Blaze supports High Performance ParalleX (HPX) [20] and OpenMP to enable parallel computing.
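To give a sense of the kind of loop‐level parallelism OpenMP offers, the following minimal sketch parallelizes a dense matrix–vector product across rows. It is a hand‐written illustration using only the standard library, not code from Blaze or Armadillo; the function name matvec and the row‐major storage layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes y = A x for a dense matrix A stored
// row-major in a flat vector of length rows * cols.
std::vector<double> matvec(const std::vector<double>& a,
                           std::size_t rows, std::size_t cols,
                           const std::vector<double>& x) {
    assert(a.size() == rows * cols);
    assert(x.size() == cols);

    std::vector<double> y(rows, 0.0);

    // Each row's dot product is independent of the others, so the rows
    // can be divided among threads. When the compiler is not invoked with
    // OpenMP enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(rows); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        double sum = 0.0;  // private to each iteration, hence to each thread
        for (std::size_t j = 0; j < cols; ++j) {
            sum += a[row * cols + j] * x[j];
        }
        y[row] = sum;  // distinct rows write to distinct elements
    }
    return y;
}
```

The loop index is a signed integer because older OpenMP implementations require it. Libraries typically apply such parallelism only above some problem size, since thread start‐up costs can outweigh the gain for small matrices.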
The difficulty to develop C
3.3 Microsoft Excel/Spreadsheets
Much statistical work today involves Microsoft Excel and other spreadsheet‐style applications (Google Sheets, Apple Numbers, etc.). A spreadsheet application provides a simple and interactive way to collect data, which makes it appealing for any manual data entry process. The sheets are easy to share, both through traditional file sharing (e.g., e‐mail attachments) and cloud‐based solutions (Google Drive, Dropbox, etc.). Simple numeric summaries and plots are easy to construct. More advanced macros and scripts are possible, but most data scientists would prefer to switch to a more full‐featured environment (such as R or Python). Still, because nearly all computer workers have some familiarity with spreadsheets, they remain hugely popular and ubiquitous in organizations. Thus, we wager that spreadsheet applications will likely always be involved in statistical software and posit that they can be quite efficient for appropriate tasks.
3.4 Git
Very briefly, we mention Git, a free and open‐source distributed version control system (https://git-scm.com/). As the complexity of modern data science workflows increases, statistical programmers increasingly rely on some type of version control system, with Git being the most popular. Git's branching scheme fosters experimentation within a project while still allowing the work to converge to a final product. By recording a complete history of a project, Git makes data analyses transparent and supports reproducible research. Further, projects and software can be shared easily via web‐based repositories, such as GitHub (https://github.com/).
3.5 Java
Java is one of the most popular programming languages (according to the TIOBE index, www.tiobe.com/tiobe-index/), partially due to its extensive library ecosystem. Java's design appeals to programmers: it is simple, object oriented, and portable. Java applications run on a wide range of machines, from personal laptops to high‐performance supercomputers, and even game consoles and internet of things (IoT) devices. Notably, Android development (based on Java) has driven recent Java innovations. Java's “write once, run anywhere” adage reflects this versatility, which has generated interest even at the research level.
Developers may prefer Java for intensive calculations that run slowly in scripting languages (e.g., R). For speed‐up purposes, Java's cross‐platform design could even be preferred to C/C
Popular sources of native Java statistical and mathematical functionality are the JSC (Java Statistical Classes) and Apache Commons Math application programming interfaces (APIs) (http://commons.apache.org/proper/commons-math/). The JSC and Apache Commons Math libraries implement many methods, including univariate statistics, parametric and nonparametric tests (
Additionally, Java offers an extensive number of machine‐learning packages and big data capabilities. For example, the WEKA [21] tool and the JSAT library [22] are written in Java, and the TensorFlow framework [23] provides a Java API. Moreover, Java developers have access to one of the most widely used big data analysis tools, Apache Spark [24], which runs on the Java virtual machine. Spark provides ML support through the Spark MLlib library [25].
As with other discussed software, Java APIs often require importing other packages/libraries. For example, developers commonly use external matrix‐operation libraries, such as JAMA (Java matrix package, https://math.nist.gov/javanumerics/jama/) or EJML (efficient Java matrix library, http://ejml.org/wiki/). Such packages allow for routine computation – for example, matrix decomposition and dense/sparse