Probability with R. Jane M. Horgan
to get Fig. 3.18.
Figure 3.18 The Line of Best Fit for the Training Data
Next, we use the testing data to decide on the suitability of the line.
The coefficients of the line are obtained in R with
Call:
lm(formula = y_train ~ x_train)

Coefficients:
(Intercept)      x_train
    -0.9764       4.9959
The estimated values are obtained with

y_est <- -0.9764 + 4.9959 * x_test
round(y_est, 1)
which gives
y_est
41.5 46.0 26.0 57.5 31.5 50.5 62.5 54.0 76.0 13.0
We now compare these estimated values with the observed values.
y_test
49.4 43.0 19.3 56.4 28.7 53.7 58.1 54.0 80.7 13.6

plot(x_test, y_test, main = "Testing Data", font.main = 1)
abline(lm(y_train ~ x_train))           # plot the line of best fit
segments(x_test, y_test, x_test, y_est) # vertical segments from points to line
gives Fig. 3.19. Here, segments plots vertical lines between (x_test, y_test) and (x_test, y_est).
Figure 3.19 shows the observed values, together with the vertical differences between them and the values estimated from the line of best fit.
Figure 3.19 Differences Between Observed and Estimated
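The agreement between observed and estimated values can also be summarized numerically. A minimal sketch, assuming y_test and y_est have been computed as above:

```r
errors <- y_test - y_est   # prediction errors on the testing data
round(errors, 1)           # individual differences, observed minus estimated
mean(errors^2)             # mean squared prediction error
```

A small mean squared error indicates that the line fitted to the training data also describes the testing data well.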
The line of best fit is the simplest regression model; it uses just one independent variable for prediction. In real‐life situations, many more independent variables, or other models such as a quadratic, may be required, but for supervised learning the approach is always the same:
Determine if there is a relationship between the dependent variable and the independent variables;
Fit the model to the training data;
Test the suitability of the model by predicting the y‐values in the testing data from the model and by comparing the observed and predicted y‐values.
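The three steps above can be sketched end to end in R. The data here are simulated purely for illustration and are not taken from the text:

```r
set.seed(1)
x <- runif(50, 0, 20)                 # hypothetical independent variable
y <- 5 * x - 1 + rnorm(50, sd = 3)    # hypothetical dependent variable

plot(x, y)                            # step 1: check for a relationship
train <- sample(1:50, 40)             # hold back 10 observations for testing
fit <- lm(y[train] ~ x[train])        # step 2: fit the model to the training data
y_pred <- coef(fit)[1] + coef(fit)[2] * x[-train]
mean((y[-train] - y_pred)^2)          # step 3: compare observed and predicted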
The predictions from these models assume that the trend observed in the analyzed data continues to hold. Should the trend change, for example when a house‐pricing model is estimated from data collected before an economic crash, the predictions will not be valid.
Regression analysis is just one of the many techniques from the area of Probability and Statistics that machine learning invokes. We will encounter more in later chapters. Should you wish to go into this topic more deeply, we recommend the book, A First Course in Machine Learning by Girolami (2015).
3.7 GRAPHICAL DISPLAYS VERSUS SUMMARY STATISTICS
Before we finish, let us look at a simple, classic example of the importance of using graphical displays to provide insight into the data. The example is that of Anscombe (1973), who provides four data sets, given in Table 3.3 and often referred to as the Anscombe Quartet. Each data set consists of two variables on which there are 11 observations.
TABLE 3.3 The Anscombe Quartet
| x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|----|------|----|------|----|-------|----|-------|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
| 12 | 10.84 |
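R includes the quartet as the built‐in data frame anscombe, so the near‐identical summary statistics of the four data sets can be checked directly. A short sketch:

```r
data(anscombe)
sapply(anscombe, mean)              # means of x1..x4 and y1..y4
sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]],
      anscombe[[paste0("y", i)]]))  # correlations, all about 0.816
```

Although these numerical summaries agree almost exactly, scatter plots of the four pairs reveal strikingly different patterns, which is precisely Anscombe's point about the importance of graphical displays.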