Experimental Design and Statistical Analysis for Pharmacology and the Biomedical Sciences. Paul J. Mitchell
The Central Limit Theorem
Luckily, however, the small differences that arise as a result of taking samples from a population are not a huge issue thanks to what is known as the Central Limit Theorem. This states that, given a large enough sample size, the sampling distribution of the sample mean will approximate to a normal distribution regardless of the variable's distribution in the given population. I know I have not yet described or explained the nature of the normal distribution (sorry!), but have a quick look at Figure 4.7 later in this chapter and compare its shape to the distributions of the data sets shown in Figures 5.3 and 5.4 in Chapter 5; can you see the differences in shape?
So, what does this theorem mean? Well, for any set of observations we can easily produce a scatterplot of the magnitude of the observation on the x‐axis against the frequency of occurrence for each value on the y‐axis. The resulting scatterplot is called the frequency distribution for that variable. Interestingly, the values of a variable in any given population may follow different distributions, including a normal distribution (Figure 4.7) or distributions that show a right or left skew (Figures 5.3 and 5.4, respectively) in the frequency distribution scatterplot.
However, if we take a sufficiently large random sample of observations from a population and record the mean of that sample (this is what is known as the sample mean), and then repeat this process many times (making sure we replace the sampled values each time to maintain the population size and distribution), then the distribution of the sample means (if plotted as a histogram) will approximate to a normal distribution, irrespective of the inherent distribution of the values in the original population. The shape of the resulting histogram is known as the sampling distribution of the mean.
Unfortunately, the shape of the sampling distribution depends on the number of observations in each sample taken from the population (i.e. the sample size). In most cases a sample size of 30 is sufficient for the sampling distribution of the mean to approximate to a normal distribution. However, with smaller sample sizes, the resulting sampling distribution generally differs from the normal distribution and instead approximates to a t‐distribution, whose shape depends on the sample size (see Figure 4.9; notice that as the sample size increases, the shape of the curve approaches that of a normal distribution!). The Central Limit Theorem is important in statistics since it links the distribution of the variable in the population to the sampling distribution of the mean. Furthermore, it is vital to understand the theorem when we start to consider the confidence intervals of different statistical parameters (see Chapter 22).
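The repeated-sampling process described above can be tried out in a few lines of Python. What follows is only a minimal sketch and not a method taken from this book: the right-skewed exponential population, the fixed random seed, the number of repeats, and the helper name `sample_means` are all assumptions chosen for illustration. It also compares the spread of the sample means with the population standard deviation divided by the square root of the sample size, which is a standard result rather than something stated in the text above.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Return the means of n_samples random samples drawn with replacement."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed (assumption) so the sketch is repeatable

# Hypothetical right-skewed population: exponential values with a mean near 1.0
population = [rng.expovariate(1.0) for _ in range(100_000)]

# 5000 repeated samples, each of 30 observations
means = sample_means(population, sample_size=30, n_samples=5_000, rng=rng)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# For a normal distribution roughly 68% of values lie within 1 SD of the mean
within_one_sd = sum(abs(m - mean_of_means) <= sd_of_means for m in means) / len(means)

print(f"population mean          = {pop_mean:.3f}")
print(f"mean of sample means     = {mean_of_means:.3f}")
print(f"SD of sample means       = {sd_of_means:.3f}")
print(f"population SD / sqrt(30) = {pop_sd / 30 ** 0.5:.3f}")
print(f"fraction within 1 SD     = {within_one_sd:.3f}")
```

Plotting `means` as a histogram should give a roughly bell-shaped curve even though the population itself is strongly skewed; reducing `sample_size` to, say, 3 should make the residual skew much easier to see.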
Types of data
In the majority of our experiments, the data we obtain will be numerical in nature. However, it is important to carefully distinguish the nature of the data being analysed since not all data may be treated similarly. Consequently, the type of statistical analysis we employ depends on the type of data obtained; i.e. statistical tests are generally specific for the kind of data we wish to analyse.
In general terms, there are three kinds of data, although as can be seen below, there may be further differences within measurement data depending on form and scale.
1 Nominal (categorical) data: Such data are where either numerals are applied to attributes or categories that are not strictly measures but allow accurate identification, or where the number of observations in a category may be recorded. For example, hair colour may be a category and the frequency of individuals with black, brown, red, blonde, or brunette hair is recorded. The results of survey data are typically categorical.
2 Ordinal data: Such data are where a scale with ranks is employed to order the observations. The rationale behind the ranks is that the values may be placed in order of magnitude (which is what makes it an ordinal scale). Data obtained from well‐being scales are examples of ordinal (ranked) data.
3 Measurement data: Numerical data may exist in two forms and in three types of scale.
Form of measurement data
1 Discrete data (aka meristic) are generally counts and may only take discrete values, normally represented by integers.
2 In contrast, continuous data are those observations or measurements where the precision is only limited by the experimenter and the equipment used.
Types of scale
1 An interval scale is where the values are measured on a scale where the differences are uniform but the ratios are not meaningful, because the zero point of the scale is arbitrary. For example, on the Celsius temperature scale, the difference between 5° and 10° is the same as between 10° and 15°, but the ratio between 5° and 15° does not imply that the latter is three times as warm as the former.
2 A ratio scale is where the values have a meaningful zero point, so that ratios as well as differences are meaningful. Examples here include length, weight, and volume. Thus, 15 cm is three times longer than 5 cm, 2 kg is twice as heavy as 1 kg, and 200 ml is four times the volume of 50 ml.
3 A circular scale may be used when one measures annual dates, clock times, etc. Generally, neither differences nor ratios of data obtained from circular scales are sensible or useful derivatives, and consequently special methods are employed for such data; such methods are outside the scope of this book.
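The distinction between interval and ratio scales can be checked with a little arithmetic. The sketch below is purely illustrative and is not taken from this book: it assumes the kelvin scale as an example of a temperature scale with a true zero, and the function name `celsius_to_kelvin` is simply a label chosen here. It shows that a difference in temperature survives a change of scale whereas a ratio does not, while a ratio of lengths is unaffected by a change of units.

```python
def celsius_to_kelvin(temp_c):
    """Convert degrees Celsius (interval scale) to kelvin (true zero point)."""
    return temp_c + 273.15


cold_c, warm_c = 5.0, 15.0

# Differences are the same whichever temperature scale is used...
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# ...but the apparent "three times as warm" ratio disappears in kelvin
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Length is a ratio scale: changing units (cm to mm) leaves the ratio unchanged
ratio_cm = 15.0 / 5.0
ratio_mm = (15.0 * 10) / (5.0 * 10)

print(f"temperature difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (kelvin)")
print(f"temperature ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (kelvin)")
print(f"length ratio:           {ratio_cm:.3f} (cm) vs {ratio_mm:.3f} (mm)")
```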
A further issue we need to consider before deciding which statistical methods are appropriate to apply to our experimental data concerns the distribution of our data sets.
Classification of data distributions
If we collect a large number of observations in an experiment and from these data produce a histogram plot (where the frequency of the observations is plotted against magnitude), then the resulting figure represents a summary of the distribution of the data. With a sufficient number of observations, the frequency of occurrence of each observation is closely related to the probability that future observations will have a particular value. Furthermore, the distributions created by our data often map to distributions that are mathematically generated. Each distribution is defined by an equation, and this allows the probability of a given score to be calculated. Probability distributions depend on the form of the data obtained (see form of measurement data, above) and consequently are either discrete or continuous; in both cases, they are statistical functions that provide a way of mapping out the likelihood that an observation will have a given value.
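To see how observed frequencies relate to a mathematically defined distribution, the following minimal sketch may help. It is an illustration only: the choice of the binomial distribution, the simulated "experiment" (counting successes in 10 trials with a success probability of 0.5), the random seed, and the function name `binomial_pmf` are all assumptions made here rather than anything specified in the text. The equation used is the standard binomial probability formula.

```python
import math
import random
from collections import Counter


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


rng = random.Random(1)  # fixed seed (assumption) so the sketch is repeatable
n_trials, p_success, n_experiments = 10, 0.5, 20_000

# Each observation is a discrete count: the number of successes in 10 trials
observations = [
    sum(rng.random() < p_success for _ in range(n_trials))
    for _ in range(n_experiments)
]
counts = Counter(observations)

# Compare the observed relative frequency with the probability from the equation
print(" k  observed  theoretical")
for k in range(n_trials + 1):
    observed = counts[k] / n_experiments
    print(f"{k:2d}  {observed:8.4f}  {binomial_pmf(k, n_trials, p_success):11.4f}")
```

With this many simulated observations the two columns should agree closely; with only a few dozen observations the observed frequencies would be expected to wander much further from the theoretical values.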
The different types of theoretical mathematical distributions are summarised in Table 4.1. Some of these probability distributions are outside the remit of this book and are only included here for completeness.
Table 4.1 Classification of probability distributions.
Probability distributions |
---|
Discrete distributions |