Probability with R. Jane M. Horgan
Mean: The mean is the sum of all values divided by the number of cases, excluding the missing values.
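As a minimal sketch of this definition, the following R lines compute a mean by hand for a small made-up vector and compare it with the built-in mean function. The name x_demo and its values are illustrative assumptions, not data from the examples.

```r
# Hypothetical values, used only to illustrate the definition
x_demo <- c(12, 30, 25, 33)

# Sum of all values divided by the number of cases
by_hand <- sum(x_demo) / length(x_demo)

by_hand        # 25
mean(x_demo)   # the built-in function returns the same value
```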
To obtain the mean of the data in Example 1.1 stored in downtime, write
mean(downtime)
[1] 25.04348
So the average downtime of all the computers in the laboratory is just over 25 minutes.
Going back to the original data in Exercise 1.1 stored in marks, to obtain the mean, write
mean(marks)
which gives
[1] 57.44
To obtain the mean marks for females, write
mean(marks[1:23])
[1] 65.86957
For males,
mean(marks[24:50])
[1] 50.25926
illustrating that the female average is substantially higher than the male average.
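The index ranges 1:23 and 24:50 work because the females occupy the first 23 positions in marks. When a grouping variable is available, tapply gives the group means without relying on positions. The sketch below uses made-up marks and a made-up grouping vector (marks_demo and gender_demo are hypothetical names), since the full data are not reproduced here.

```r
# Hypothetical marks and genders, for illustration only
marks_demo  <- c(70, 60, 80, 40, 50)
gender_demo <- c("f", "f", "f", "m", "m")

# Group means by position, as in the text
mean(marks_demo[1:3])   # females: 70
mean(marks_demo[4:5])   # males: 45

# The same group means using the grouping vector
group_means <- tapply(marks_demo, gender_demo, mean)
group_means
```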
To obtain the mean of the corrected data in Exercise 1.1, recall that the mark of 86 for the 34th student on the list was an error, and that it should have been 46. We changed it with
marks[34] <- 46
The new overall average is
mean(marks)
[1] 56.64
and the new male average is
mean(marks[24:50])
[1] 48.77778
increasing the gap between the male and female averages even further.
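The corrected averages can also be checked by arithmetic, without recomputing from the raw data: lowering one mark by 40 lowers a mean taken over n values by 40/n. The sketch below uses only figures quoted above (50 students in total, 27 males in positions 24 to 50); the variable names are illustrative.

```r
correction <- 86 - 46   # the mark was overstated by 40

new_overall <- 57.44    - correction / 50   # 50 students in total
new_male    <- 50.25926 - correction / 27   # 27 males (positions 24 to 50)

round(new_overall, 2)   # 56.64
round(new_male, 5)      # 48.77778
```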
If we perform a similar operation for the variables in the examination data given in Example 1.2, we run into trouble. Suppose we want the mean mark for Architecture in Semester 1. In R
mean(arch1)
gives
[1] NA
Recall that, in the results file, we recorded the missing marks with the special value NA. To have R skip the missing values, insert the argument na.rm = T or na.rm = TRUE (not available, remove) into the function.
For arch1, writing
mean(arch1, na.rm = TRUE)
yields
[1] 63.56897
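The behaviour is easy to reproduce on a small made-up vector (x_na is a hypothetical name, and its values are not taken from the results file). A single NA makes the plain mean NA, while na.rm = TRUE averages only the values that are present.

```r
# Hypothetical marks with one missing value
x_na <- c(60, NA, 70, 80)

mean(x_na)                 # NA, because one value is missing
mean(x_na, na.rm = TRUE)   # 70, the mean of 60, 70 and 80

sum(is.na(x_na))           # number of missing values: 1

# Same result by hand: sum and count of the non-missing values only
by_hand_na <- sum(x_na, na.rm = TRUE) / sum(!is.na(x_na))
by_hand_na                 # 70
```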
To obtain the mean of all the variables in the results file, we use the R function sapply.
sapply(results, mean, na.rm = T)
yields
gender    arch1    prog1    arch2    prog2
    NA 63.56897 59.01709 51.97391 53.78378
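Notice that an NA is returned for gender, since it does not hold numeric marks. Restricting the calculation to columns 2 to 5,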
sapply(results[2:5], mean, na.rm = TRUE)
gives
   arch1    prog1    arch2    prog2
63.56897 59.01709 51.97391 53.78378
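Rather than typing the column positions, the numeric columns can be picked out programmatically. The following is a sketch on a small made-up data frame (results_demo and its values are assumptions standing in for the results file), using is.numeric to choose the columns before applying mean.

```r
# Hypothetical stand-in for the results file
results_demo <- data.frame(
  gender = c("m", "f", "f", "m"),
  arch1  = c(50, 60, NA, 70),
  prog1  = c(40, NA, 60, 80)
)

# TRUE for columns that hold numbers, FALSE otherwise
is_num <- sapply(results_demo, is.numeric)

# Column means over the numeric columns only, skipping missing marks
col_means <- sapply(results_demo[is_num], mean, na.rm = TRUE)
col_means
```

This avoids hard-coding 2:5, at the cost of also including any other numeric column the data frame might contain.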
Median: The median is the middle value of the data set; 50% of the observations are less than this value and 50% are more.
In R
median(downtime)
yields
[1] 25
which means that 50% of the computers experienced less than 25 minutes of downtime, while 50% experienced more than 25 minutes of downtime.
Also,
median(marks)
[1] 55.5
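With 50 marks, an even count, there is no single middle value, and median averages the two middle values of the sorted data; this is how a result such as 55.5 can arise when the individual marks are whole numbers. The sketch below spells out that rule with a hypothetical helper, middle_value, on made-up data without missing values, and is meant only to illustrate what median computes.

```r
# Hypothetical helper: the middle value of the sorted data
# (assumes x has no missing values)
middle_value <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even count: average of the two middle values
  }
}

odd_data  <- c(7, 1, 5)       # sorted: 1 5 7
even_data <- c(7, 1, 5, 3)    # sorted: 1 3 5 7

middle_value(odd_data)    # 5, same as median(odd_data)
middle_value(even_data)   # 4, same as median(even_data)
```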
In both of these examples (downtime and marks), the median is close to the mean.
The median is particularly useful when there are extreme values in the data. Let us look at another example.
Examining the nine apps with greatest usage on your smartphone, you may find the usage statistics (in MB) are
App | Usage (MB) |
Facebook | 39.72 |
Chrome | 35.37 |
 | 5.73 |
 | 5.60 |
System Account | 3.30 |
 | 3.22 |
Gmail | 2.52 |
Messenger | 1.71 |
Maps | 1.55 |
To enter the data, write
usage <- c(39.72, 35.27, 5.73, 5.6, 3.3, 3.22, 2.52, 1.71, 1.55)
The mean is
mean(usage)
[1] 10.95778
while the median is
median(usage)
[1] 3.3
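How far apart the two summaries are can be quantified from the values just entered. The sketch below re-enters usage exactly as in the command above so that it runs on its own; ratio and top_two_share are illustrative names.

```r
usage <- c(39.72, 35.27, 5.73, 5.6, 3.3, 3.22, 2.52, 1.71, 1.55)

# How many times larger the mean is than the median
ratio <- mean(usage) / median(usage)
ratio            # about 3.32

# Share of the total usage accounted for by the first two values
top_two_share <- sum(usage[1:2]) / sum(usage)
top_two_share    # about 0.76
```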
Unlike the previous examples, where the mean and median were similar, here the mean is more than three times the median. Looking at the data again, you will notice that the usage of the first two apps, Facebook and Chrome, is much larger than the usages of the other apps in the data set. These values are the cause of the mean being so high. Such values are often designated as outliers and are analyzed separately. Omitting them and calculating the mean and