Biostatistics Decoded. A. Gouveia Oliveira

1.8 Inference with interval attributes.

      Let us return to the situation of a sample size of one and suppose that we want to estimate another characteristic of the balls in the population, for example, the average weight. This characteristic, or attribute, has an important difference from the color attribute, because weight can take many different values, not just two.

      Let us see if we can apply the same reasoning in the case of attributes taking many different values. To do so, we take a ball at random and measure its weight. Let us say that we get a weight of 60 g. What can we conclude about the average weight in the population? Now the answer is not so simple. If we knew that the balls were all about the same weight, we could say that the average weight in the population should be a value between, say, 50 and 70 g. If it were below or above those limits, it would be unlikely that a ball sampled at random would weigh 60 g.
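
      The reasoning above can be made concrete with a small sketch. It is only an illustration under assumptions that are not in the text: the weights are taken to follow a normal distribution, and "about the same weight" is read as a spread of 5 g around the average (`ASSUMED_SD`). For several candidate population averages, the function computes how likely it is that a ball sampled at random would lie at least as far from the average as the 60 g ball does.

```python
from statistics import NormalDist

OBSERVED = 60.0    # weight of the single sampled ball, in grams
ASSUMED_SD = 5.0   # assumption: balls differ from the average by about 5 g

def chance_as_far_as_observed(candidate_mean):
    """Probability that a random ball lies at least as far from the
    candidate average as the observed ball does (normal model assumed)."""
    distance = abs(OBSERVED - candidate_mean)
    model = NormalDist(mu=candidate_mean, sigma=ASSUMED_SD)
    return 2 * (1 - model.cdf(candidate_mean + distance))

# Candidate population averages, in grams.
chances = {m: chance_as_far_as_observed(m) for m in (40, 50, 60, 70, 80)}
for mean, chance in chances.items():
    print(f"population average {mean} g: chance {chance:.4f}")
```

      Under these assumptions the chance falls to roughly 5% when the candidate average is 50 or 70 g and becomes negligible further out, which is the sense in which averages outside those limits would make a 60 g ball unlikely.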

An illustration of inference with interval attributes I.

      In summary, in order that the modern approach to sampling be valid, sampling must be at random. The representativeness of a sample is primarily determined by the sampling method used, not by the sample size. Sample size determines only the precision of the population estimates obtained with the sample.
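
      The link between sample size and precision can be checked with a short simulation. This is a minimal sketch, not part of the original text: the population of ball weights is made up (normally distributed around 60 g), and the helper `spread_of_sample_means` is a hypothetical name. It draws many random samples of a given size and measures how much the sample averages differ from one another.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repeats=2000, seed=0):
    """Standard deviation of the averages of many random samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)

# Hypothetical population: 10,000 ball weights in grams (made-up values).
rng = random.Random(42)
population = [rng.gauss(60, 5) for _ in range(10_000)]

spreads = {n: spread_of_sample_means(population, n) for n in (5, 50, 500)}
for n, spread in spreads.items():
    print(f"sample size {n:>3}: sample averages vary by about {spread:.2f} g")
```

      Every sample here is drawn at random, so all sizes are equally representative. Only the variation of the estimates shrinks as the sample grows.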

      Now, if sample size has no relationship to representativeness, does this mean that sample size has no influence at all on the validity of the estimates? No, it does not. Sample size is of importance to validity because large sample sizes offer protection against accidental errors during sample selection and data collection, which might have an impact on our estimates. Examples of such errors are selecting an individual who does not actually belong to the population under study, measurement errors, transcription errors, and missing values.
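
      A small arithmetic sketch shows why larger samples are more forgiving of an accidental error. The numbers are invented for illustration: every ball is assumed to weigh 60 g, and one value is mistakenly transcribed as 600 g. The hypothetical helper `shift_from_one_error` reports how far that single error moves the sample average.

```python
import statistics

def shift_from_one_error(sample_size, true_weight=60.0, wrong_weight=600.0):
    """How far the sample average moves when one value is recorded wrongly."""
    clean = [true_weight] * sample_size
    corrupted = clean[:-1] + [wrong_weight]  # last value transcribed wrongly
    return statistics.fmean(corrupted) - statistics.fmean(clean)

shifts = {n: shift_from_one_error(n) for n in (10, 100, 1000)}
for n, shift in shifts.items():
    print(f"sample size {n:>4}: average shifted by {shift:.2f} g")
```

      The shift equals the size of the error divided by the sample size, so the same mistake matters ten times less each time the sample is ten times larger.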

An illustration of inference with interval attributes II.

      We have eliminated a lot of subjectivity by putting the notion of sample representativeness within a convenient framework. Now we must try to eliminate the remaining subjectivity in two other statements. First, we need to find a way to determine, objectively and reliably, the limits for population proportions and averages that are consistent with the samples. Second, we need to be more specific when we say that we are confident about those limits. Terms like confident, very confident, or quite confident lack objectivity, so it would be very useful if we could express quantitatively our degree of confidence in the estimates. In order to do that, as we have seen, we need a measure of the variation of the values of an attribute.

      The first problem can be solved by using the difference between the maximum and minimum values, a quantity commonly called the range, but this will not solve the problem of instability.
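
      The instability of the range is easy to see with a few made-up weights. In this sketch (the values and the helper `value_range` are illustrative, not from the text), adding a single extreme ball is enough to change the range completely.

```python
def value_range(values):
    """Range: difference between the maximum and minimum values."""
    return max(values) - min(values)

weights = [52, 55, 58, 60, 61, 63, 66]       # made-up weights in grams

range_before = value_range(weights)          # 66 - 52
range_after = value_range(weights + [95])    # 95 - 52, one extreme ball added

print(f"range without the extreme ball: {range_before} g")
print(f"range with the extreme ball:    {range_after} g")
```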

      The second problem can be minimized if, instead of using the minimum and maximum to describe the dispersion of values, we use the other measures of location, the lower and upper quartiles. The lower quartile (also called the 25th percentile) is the value below which one‐quarter, or 25%, of all the values in the dataset lie. The upper quartile (or 75th percentile) is the value below which three‐quarters, or 75%, of all the values in the dataset lie (note, incidentally, that the median is the same as the 50th percentile). The advantage of the quartiles over the limits is that they are more stable because the addition of one or two extreme values to the dataset will probably not change the quartiles.
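
      The stability of the quartiles can be illustrated with Python's `statistics.quantiles`. The weights below are made up, and the exact quartile values depend on the interpolation convention used (this sketch relies on the function's default method), so the numbers should be read as one reasonable choice rather than the only one.

```python
import statistics

weights = [48, 52, 55, 57, 58, 60, 61, 63, 65, 68, 72]  # made-up weights, g

def quartiles(values):
    """Lower quartile, median, and upper quartile of the values."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3

before = quartiles(weights)
after = quartiles(weights + [120])   # one extreme ball added

print("quartiles without the extreme ball:", before)
print("quartiles with the extreme ball:   ", after)
print("maximum moved from", max(weights), "to", max(weights + [120]))
```

      With these values the maximum jumps by 48 g when the extreme ball is added, while each quartile moves by only a gram or two.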

An illustration of the measures of dispersion derived from measures of location.

      However, we still have the problem of having to deal with two values, which is certainly not as practical and easy to remember, and to reason with, as if we had just one value. One way around this could be to use the difference between the upper quartile and the lower quartile to describe the dispersion. This is called the interquartile range, but the interpretation of this value is not straightforward: it is not amenable to mathematical treatment and therefore it is not a very popular measure,
