Biostatistics Decoded. A. Gouveia Oliveira


1.13 The n − 1 divisor of the sum of squares.

      Using a computer’s random number generator, we obtained random samples of a variable with variance equal to 1. This is the population variance of that variable. Starting with samples of size 2, we obtained 10 000 random samples and computed their sample variances using the n divisor. Next, we computed the average of those 10 000 sample variances and retained the result. We then repeated the procedure with samples of size 3, 4, 5, and so on up to 100.
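The experiment just described can be sketched in a few lines of Python. This is a minimal sketch, not the book's actual program: the use of a standard normal variable (which also has variance 1), the seed, and the helper names `var_n` and `mean_var_n` are all assumptions made for illustration.

```python
import random

random.seed(0)

def var_n(sample):
    """Sample variance computed with the n divisor."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def mean_var_n(n, reps=10_000):
    """Average of reps sample variances (n divisor) for samples of size n."""
    return sum(var_n([random.gauss(0, 1) for _ in range(n)])
               for _ in range(reps)) / reps

# Population variance is 1; the averages fall short of it, and more so
# for small samples (on average they are about (n - 1)/n).
for n in (2, 5, 20, 100):
    print(f"n={n:3d}  average variance (n divisor) = {mean_var_n(n):.3f}")
```

Averaging over many samples makes the systematic shortfall visible: the average is close to (n − 1)/n rather than 1.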

      The plot of the averaged value of sample variances against sample size is represented by the solid line in Figure 1.12. It can clearly be seen that, regardless of the sample size, the variance computed with the n divisor is on average less than the population variance, and the deviation from the true variance increases as the sample size decreases.

      Now let us repeat the procedure, exactly as before, but this time using the n − 1 divisor. The plot of the average sample variance against sample size is shown in Figure 1.13. The solid line is now exactly over 1, the value of the population variance, for all sample sizes.
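The same sketch can be repeated with the n − 1 divisor. The standard library's `statistics.variance` already divides by n − 1, so it can stand in for the corrected formula; the seed and sample sizes below are again arbitrary choices, not the book's settings.

```python
import random
import statistics

random.seed(1)

def mean_var_nm1(n, reps=10_000):
    """Average of reps sample variances (n - 1 divisor) for samples of size n."""
    return sum(statistics.variance([random.gauss(0, 1) for _ in range(n)])
               for _ in range(reps)) / reps

# With the n - 1 divisor the averages stay close to the population
# variance of 1 for every sample size.
for n in (2, 5, 20):
    print(f"n={n:2d}  average variance (n-1 divisor) = {mean_var_nm1(n):.3f}")
```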

      This experiment clearly illustrates that, unlike the sample variance computed with the n divisor, the sample variance computed with the n − 1 divisor is an unbiased estimator of the population variance.

      Degrees of freedom is a central notion in statistics that applies to all problems of estimating population quantities from observations made on samples. The number of degrees of freedom is the number of values in the calculation of a quantity that are free to vary. The general rule for finding the number of degrees of freedom of any statistic that estimates a population quantity is to take the number of independent values used in the calculation and subtract the number of population quantities that were replaced by sample quantities during the calculation.

      In the calculation of the variance, instead of summing the squared differences of each value from the population mean, we summed the squared differences from the sample mean. We therefore replaced a population quantity by a sample quantity and, because of that, lost one degree of freedom. Hence, the number of degrees of freedom of the sample variance is n − 1.
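The n − 1 rule can be checked by hand on a small sample: the squared deviations are taken from the sample mean, and the sum of squares is divided by n − 1. The data values below are purely illustrative assumptions.

```python
import statistics

data = [4.0, 7.0, 6.0, 5.0, 8.0]
n = len(data)
m = sum(data) / n                      # sample mean: 6.0
ss = sum((x - m) ** 2 for x in data)   # sum of squares: 10.0
s2 = ss / (n - 1)                      # divide by n - 1, not n
print(s2)                              # 2.5, same as statistics.variance(data)
```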

      As a binary variable is a numeric variable, in addition to calculating a mean, which in binary variables is called a proportion, we can also calculate a variance. The computation is the same as for interval variables: the differences of each observation from the mean are squared, then summed and divided by the number of observations. With binary variables there is no need to correct the denominator, and the sum of squares is divided by n.
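For a binary variable this computation, with the n divisor, works out algebraically to p(1 − p), where p is the proportion of ones. The short sketch below illustrates the identity on made-up data (the sample values are assumptions).

```python
data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(data)
p = sum(data) / n                                 # the proportion (the mean)
var_binary = sum((x - p) ** 2 for x in data) / n  # variance with the n divisor
print(p, var_binary, p * (1 - p))                 # 0.7 0.21 0.21
```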

      Means and variances have some interesting properties that deserve mention. Knowledge of these properties will be very helpful when analyzing data and will be required several times in the following sections. In any case, they are all intuitive and easy to understand.

      With a computer, we generated random numbers between 0 and 1, representing observations from a continuous attribute with uniform distribution, which we will call variable A. This attribute is called a random variable because it can take any value from a set of possible distinct values, each with a given probability. In this case, variable A can take any value from the set of real numbers between 0 and 1, all with equal probability. Hence the probability distribution of variable A is called the uniform distribution.

[Figure: two random variables with uniform distribution.] [Figure: properties of means and variances.]

      When observations from two independent random variables are added or subtracted, the mean of the resulting variable will be, respectively, the sum or the difference of the means of the two variables. In both cases, however, the variance of the new variable will be the sum of the variances of the two variables. The right graph in Figure 1.16
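These properties can be verified numerically. The sketch below adds and subtracts two independent uniform variables on (0, 1), each with mean 0.5 and variance 1/12; the sample size and seed are arbitrary assumptions.

```python
import random

random.seed(2)
N = 100_000
a = [random.random() for _ in range(N)]
b = [random.random() for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

s = [x + y for x, y in zip(a, b)]   # A + B
d = [x - y for x, y in zip(a, b)]   # A - B

# Means add or subtract (about 1.0 and 0.0); variances add in both
# cases (about 1/12 + 1/12 = 1/6).
print(mean(s), mean(d))
print(var(s), var(d))
```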
