Probability with R. Jane M. Horgan
to get Fig. 3.18.
Figure 3.18 The Line of Best Fit for the Training Data
Next, we use the testing data to decide on the suitability of the line.
The coefficients of the line are obtained in R with
Call:
lm(formula = y_train ~ x_train)

Coefficients:
(Intercept)      x_train
    -0.9764       4.9959
The estimated values are obtained with

y_est <- -0.9764 + 4.9959 * x_test
round(y_est, 1)
which gives
y_est
41.5 46.0 26.0 57.5 31.5 50.5 62.5 54.0 76.0 13.0
We now compare these estimated values with the observed values.
y_test
49.4 43.0 19.3 56.4 28.7 53.7 58.1 54.0 80.7 13.6

plot(x_test, y_test, main = "Testing Data", font.main = 1)
abline(lm(y_train ~ x_train))           # plot the line of best fit
segments(x_test, y_test, x_test, y_est) # vertical segments from points to line
gives Fig. 3.19. Here, segments plots vertical lines between (x_test, y_test) and (x_test, y_est).
Figure 3.19 shows the observed values, together with the vertical differences between them and the values estimated from the line of best fit.
Figure 3.19 Differences Between Observed and Estimated
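The agreement between observed and estimated values can also be summarized numerically. A minimal sketch, assuming y_test and y_est have been computed as above:

```r
errors <- y_test - y_est   # prediction errors on the testing data
round(errors, 1)           # individual differences, observed minus estimated
mean(errors^2)             # mean squared prediction error
```

A small mean squared error indicates that the line fitted to the training data also describes the testing data well.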
The line of best fit is the simplest regression model; it uses just one independent variable for prediction. In real‐life situations, many more independent variables, or other models such as a quadratic, may be required, but for supervised learning the approach is always the same:
Determine if there is a relationship between the dependent variable and the independent variables;
Fit the model to the training data;
Test the suitability of the model by predicting the y‐values in the testing data from the model and by comparing the observed and predicted y‐values.
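The three steps above can be sketched end to end in R. The data here are simulated purely for illustration and are not taken from the text:

```r
set.seed(1)
x <- runif(50, 0, 20)                 # hypothetical independent variable
y <- 5 * x - 1 + rnorm(50, sd = 3)    # hypothetical dependent variable

plot(x, y)                            # step 1: check for a relationship
train <- sample(1:50, 40)             # hold back 10 observations for testing
fit <- lm(y[train] ~ x[train])        # step 2: fit the model to the training data
y_pred <- coef(fit)[1] + coef(fit)[2] * x[-train]
mean((y[-train] - y_pred)^2)          # step 3: compare observed and predicted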
The predictions from these models assume that the trend observed in the analyzed data continues to hold. Should the trend change, for example when a house‐pricing model is estimated from data collected before an economic crash, the predictions will not be valid.
Regression analysis is just one of the many techniques from the area of Probability and Statistics that machine learning invokes. We will encounter more in later chapters. Should you wish to go into this topic more deeply, we recommend the book, A First Course in Machine Learning by Girolami (2015).
3.7 GRAPHICAL DISPLAYS VERSUS SUMMARY STATISTICS
Before we finish, let us look at a simple, classic example of the importance of using graphical displays to provide insight into the data. The example is that of Anscombe (1973), who provides four data sets, given in Table 3.3 and often referred to as the Anscombe Quartet. Each data set consists of two variables on which there are 11 observations.
TABLE 3.3 The Anscombe Quartet
| x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|----|------|----|------|----|-------|----|-------|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
| 12 | 10.84 |
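R includes the quartet as the built‐in data frame anscombe, so the near‐identical summary statistics of the four data sets can be checked directly. A short sketch:

```r
data(anscombe)
sapply(anscombe, mean)              # means of x1..x4 and y1..y4
sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]],
      anscombe[[paste0("y", i)]]))  # correlations, all about 0.816
```

Although these numerical summaries agree almost exactly, scatter plots of the four pairs reveal strikingly different patterns, which is precisely Anscombe's point about the importance of graphical displays.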