Sports Analytics in Practice with R. Ted Kwartler
Next, Anup B, one of the most brilliant and supportive leaders I have worked for. Your passion for cricket also helped open my eyes to a noteworthy and enjoyable sport. Losing you to the pandemic was a disturbing blow felt by many people who were touched by your intelligence, humor, and positivity.
This entire book would not have been possible without the fine professors at the University of Notre Dame who put me on my own professional journey. I fondly remember building my first logistic regression predicting March Madness after learning these techniques from Dr. Keating, the late Dr. Gilbride, and Dr. Devaraj.
Further, I would like to acknowledge my parents, Anatol and Trish, and my endearing wife, Meghan. Your support and patience have been significant. Writing a book is no small undertaking, with much of the logistical burden falling to each of you. Completing this book is a shared victory.
Lastly, my sincerest gratitude to the wonderful team at Wiley, particularly Kimberly Monroe-Hill. Your patience and flexibility with late submissions and delayed seasons stemming from the unusual 2020 year in sports (among other more important hardships) have been greatly appreciated. I was ready to give up on the project, yet your e-mails demonstrated a commitment from Wiley that I cherish.
1 Introduction to R
Objectives
Learn about R as a programming language
Define Integrated Development Environment
Define objects
Learn the assignment operator
Define functions
Execute a loop
Learn logical operators
Learn about R data types
Learn about object classes
Index data objects
Extend R functionality with packages
Write a custom function
Create a scatter plot with sports data
Create a heatmap with sports data
R Libraries
ggplot2, ggthemes, RCurl, tidyr
R Functions
+, plot, <-, round, class, as.factor, as.character, c, cbind, rbind, data.frame, as.matrix, as.data.frame, install.packages, library, getURL, read.csv, dim, names, head, tail, summary, table, qplot, pivot_longer, geom_tile, scale_fill_gradient, xlab, ggtitle, theme, theme_hc
The R Programming Language
R is an open-source, freely available programming language used throughout this book. R is a powerful and longstanding language, first developed more than 20 years ago. It is a derivative of the “S” programming language for statistics, which originated in the mid-1970s at Bell Labs (then part of AT&T and later of Lucent Technologies). Unlike general-purpose programming languages, R is optimized specifically for statistics, including but not limited to simulation, machine learning, visualization, and traditional statistical modeling (such as linear regression) and tests. Because R is open source, many developers, academics, and enthusiasts have contributed to its development for their specific needs. As a result, the language is extensible, meaning it can be easily adapted to various purposes. For example, through R Markdown, simple websites and presentations can be created. In another use case, R can be used for traditional linear modeling or machine learning and can draw upon various data types for analysis, including audio files, digital images, text, and numeric data. Thus, R is widely used and is not specialized beyond being an analysis language. This differs from languages such as Ruby, which specializes in web development, or Python, which has extended its functionality to building applications, not just analysis.
In this textbook, the R language is applied specifically to sports contexts. Of course, the code in this book can be used to extend your understanding of sports analytics. It may give you insights into a particular sport or an analytical aspect within the sport itself, such as which statistics to focus on to win a basketball game. However, learning the code in this book can also help open up a world of analytical capabilities beyond sports. One of the benefits of learning statistics, programming, and various analysis methods with sports data is that the data is widely available and outcomes are known. This means that your analysis, models, and visualizations can be applied, and you can review the outcomes as you expand upon what is covered in this book. This differs from other programming and statistical examples, which may resort to boring, synthetic data to illustrate an analytical result. Using sports data is realistic and can be future-oriented, making the learning more challenging yet engaging. Modeling the survivors of the Titanic pales in comparison since you cannot change the historical outcome or save future cruise ship mates. Thus, modeling which team will win a match or which player is a good draft pick is a superior learning experience.
If you are new to programming, don’t be intimidated. R is a forgiving language in that things like spacing and indentation are largely ignored. Further, R is well supported by its community, and a simple online search of any error message usually finds an answer quickly on any number of sites.
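As a minimal sketch of that forgiveness, the three assignments below differ only in spacing, indentation, and line breaks, yet R evaluates them identically. The variable names are illustrative only and do not come from the book.

```r
# Conventional spacing around the operators.
total_a <- 2 + 3 * 4

# No spaces at all; R parses this the same way.
total_b<-2+3*4

# Spread over several oddly indented lines. R keeps reading
# because each line ends with an unfinished expression.
total_c <-
        2 +
    3 * 4

# All three objects hold the same value.
print(c(total_a, total_b, total_c))  # [1] 14 14 14
```

Spacing is not ignored everywhere, though: `x <- 1` assigns the value 1 to `x`, whereas `x < - 1` asks whether `x` is less than negative one.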
To begin your R and sports analytics journey, please download the “base-R” distribution for your operating system. The “Comprehensive R Archive Network,” CRAN, is the home of the official R distribution as well as officially supported packages (more on that in a bit). The site to download base-R is https://cran.r-project.org.
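Once base-R is installed, one simple way to confirm that it works is to start R and ask it which version is running. The snippet below is only a sketch of such a check; the object names are illustrative, and the version text you see will depend on the release you downloaded.

```r
# R.version.string is a built-in character value describing the running release.
version_text <- R.version.string
print(version_text)

# R.version is a built-in list; its major and minor elements hold the version parts.
version_parts <- paste(R.version$major, R.version$minor, sep = ".")
print(version_parts)
```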
Unfortunately, base-R, having started in the nineties, looks abysmal and lacks some modern-day functionality. Thus, you will next need to download the R-Studio Integrated Development Environment, or IDE. An IDE is software that consolidates many of the components needed to code into one place. For example, you will need a place to write code (which could be done in a simple notepad-like program), a place to execute the code written, a place to view the plots output by the code, and so on. These individual components are assembled into the IDE for ease of use and fast development. R and many other languages have IDEs. In fact, R has multiple IDEs optimized for the type of analysis you are performing, such as biostatistics, or for working with another language like Java. The most popular and best-supported IDE for base-R is the R-Studio software. There are server and desktop versions available. The code in this book should work on either version, but installation of base-R and R-Studio on a server is not covered. Therefore, please download the R-Studio desktop IDE by navigating to https://www.rstudio.com/products/rstudio.
The R-Studio IDE, or Integrated Development Environment, adds functionality and a modern user interface to base-R. The IDE aggregates common functionality used for software development and statistical analysis.
Essentially, R-Studio sits on top of base-R. The IDE provides the modern GUI expected by today’s computer users while also adding functionality, including version control, terminal access, and, perhaps most importantly, an easy way to create and view visualizations