Applied Univariate, Bivariate, and Multivariate Statistics Using Python. Daniel J. Denis
Advice for Instructors
The book can be used at either the advanced undergraduate or graduate level, or for self-study. The book is ideal for a 16-week course, for instance one in a Fall or Spring semester, and may prove especially useful for programs that only have the space or desire to feature a single data-analytic course for students. Instructors can use the book as a primary text or as a supplement to a more theoretical book that unpacks the concepts featured in this book. Exercises at the end of each chapter can be assigned weekly and can be discussed in class or reviewed by a teaching assistant in lab. The goal of the exercises should be to get students thinking critically and creatively, not simply arriving at the “right answer.”
It is hoped that you enjoy this book as a gentle introduction to the world of applied statistics using Python. Please feel free to contact me at [email protected] or [email protected] should you have any comments or corrections. For data files and errata, please visit www.datapsyc.com.
Daniel J. Denis
March, 2021
1 A Brief Introduction and Overview of Applied Statistics
CHAPTER OBJECTIVES
How probability is the basis of statistical and scientific thinking.
Examples of statistical inference and thinking in the COVID-19 pandemic.
Overview of how null hypothesis significance testing (NHST) works.
The relationship between statistical inference and decision-making.
Error rates in statistical thinking and how to minimize them.
The difference between a point estimator and an interval estimator.
The difference between a continuous and a discrete variable.
Appreciating a few of the more salient philosophical underpinnings of applied statistics and science.
Understanding scales of measurement: nominal, ordinal, interval, and ratio.
Data analysis, data science, and “big data” distinctions.
The goal of this first chapter is to provide a global overview of the logic behind statistical inference and how it is the basis for analyzing data and addressing scientific problems. Statistical inference, in one form or another, has existed at least going back to the Greeks, even if it was only relatively recently formalized into a complete system. What unifies virtually all of statistical inference is probability. Without probability, statistical inference could not exist, and thus much of modern-day statistics would not exist either (Stigler, 1986).
When we speak of the probability of an event occurring, we are seeking to know the likelihood of that event. Of course, that explanation is not useful, since all we have done is replace the word probability with the word likelihood. What we need is a more precise definition. Kolmogorov (1903–1987) established basic axioms of probability and was thus influential in the mathematics of modern-day probability theory. An axiom in mathematics is basically a statement that is assumed to be true without requiring any proof or justification. This is unlike a theorem in mathematics, which is only considered true if it can be rigorously justified, usually by way of other, related mathematical results. Though the axioms help establish the mathematics of probability, they surprisingly do not help us define exactly what probability actually is. Some statisticians, scientists, and philosophers hold that probability is a relative frequency, while others find it more useful to consider probability as a degree of belief. An example of a relative frequency would be flipping a coin 100 times and observing the number of heads that result. If that number is 40, then we might estimate the probability of heads on the coin to be 0.40, that is, 40/100. However, this number can also reflect our degree of belief in the probability of heads, a belief that in this case is based on a relative frequency. There are cases, however, in which relative frequencies are not so easily obtained or are virtually impossible to estimate, such as the probability that COVID-19 will become a seasonal disease. Often, experts in the area have to provide good guesstimates based on prior knowledge and their clinical opinion. These probabilities are best considered subjective probabilities as they reflect a degree of belief or disbelief in a theory rather than a strict relative frequency.
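The relative-frequency estimate in the coin example can be sketched in a few lines of Python. This is only an illustration, not code from the book: the function name `relative_frequency`, the made-up record of 40 heads in 100 flips, and the simulated fair coin with an arbitrary seed are all assumptions introduced here.

```python
import random
from collections import Counter


def relative_frequency(outcomes, event):
    """Return the proportion of trials in which `event` occurred."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return Counter(outcomes)[event] / len(outcomes)


# Hypothetical record of 100 flips matching the text: 40 heads, 60 tails.
flips = ["H"] * 40 + ["T"] * 60
p_heads = relative_frequency(flips, "H")
print(p_heads)  # 0.4, that is, 40/100

# Simulating 100 flips of a fair coin (an assumption; the seed is arbitrary).
# The estimate will generally differ somewhat from 0.5 in any one sample.
rng = random.Random(2021)
simulated = [rng.choice("HT") for _ in range(100)]
p_heads_simulated = relative_frequency(simulated, "H")
print(p_heads_simulated)
```

Changing the seed or the number of flips shows how much a relative-frequency estimate can vary from one sample to the next, which is part of why the estimate is best read as an approximation to the underlying probability rather than the probability itself.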
Historically, scholars who hold that probability can be nothing more than a relative frequency are often called frequentists, while those who believe it is a degree of belief are usually called Bayesians, since Bayesian statistics regularly employs subjective probabilities in its development and operations. A discussion of Bayesian statistics is well beyond the scope of this chapter and book. For an excellent introduction, as well as a general introduction to the rudiments of statistical theory, see Savage (1972).
When you think about it for a moment, virtually all things in the world are probabilistic. As a recent example, consider the COVID-19 pandemic of 2020. Since the start of the outbreak, questions involving probability were front and center in virtually all media discussions. That is, the undertones of probability, science, and statistical inference were present virtually everywhere the pandemic was discussed. Concepts of probability could not be avoided. The following are just a few of the questions asked during the pandemic:
What is the probability of contracting the virus, and does this probability vary as a function of factors such as pre-existing conditions or age? In this latter case, we might be interested in the conditional probability of contracting COVID-19 given a pre-existing condition or advanced age. For example, if someone suffers from heart disease, is that person at greater risk of acquiring the infection? That is, what is the probability of COVID-19 infection conditional on someone already suffering from heart disease or other ailments?
What proportion of the general population has the virus? Ideally, researchers wanted to know how many people world-wide had contracted the virus. This constituted a case of parameter estimation, where the parameter of interest was the proportion of cases world-wide having