Introduction to Linear Regression Analysis. Douglas C. Montgomery
The magnitude of R2 also depends on the range of variability in the regressor variable. Generally R2 will increase as the spread of the x’s increases and decrease as the spread of the x’s decreases, provided the assumed model form is correct. By the delta method (also see Hahn 1973), one can show that the expected value of R2 from a straight-line regression is approximately

$$E(R^2) \approx \frac{\beta_1^2 S_{xx}/(n-1)}{\beta_1^2 S_{xx}/(n-1) + \sigma^2}$$
Clearly the expected value of R2 will increase (decrease) as Sxx (a measure of the spread of the x’s) increases (decreases). Thus, a large value of R2 may result simply because x has been varied over an unrealistically large range. On the other hand, R2 may be small because the range of x was too small to allow its relationship with y to be detected.
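This dependence of R2 on the spread of the x’s is easy to demonstrate by simulation. The sketch below (not from the book; the parameter values and variable names are invented for illustration) fits the same true straight-line model to two designs that differ only in the spread of the regressor:

```python
import numpy as np

def r_squared(x, y):
    """R^2 for a least-squares straight-line fit of y on x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

rng = np.random.default_rng(1)
beta0, beta1, sigma, n = 10.0, 2.0, 5.0, 50  # illustrative values

# Same true model and error variance; only the spread of the x's differs.
x_narrow = rng.uniform(0, 2, n)    # small S_xx
x_wide   = rng.uniform(0, 20, n)   # large S_xx
r2_narrow = r_squared(x_narrow, beta0 + beta1 * x_narrow + rng.normal(0, sigma, n))
r2_wide   = r_squared(x_wide,   beta0 + beta1 * x_wide   + rng.normal(0, sigma, n))
print(r2_narrow, r2_wide)  # the wide design yields the much larger R^2
```

Nothing about the underlying relationship changes between the two fits; only Sxx does, which is exactly the point of the expected-value approximation above.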
There are several other misconceptions about R2. In general, R2 does not measure the magnitude of the slope of the regression line. A large value of R2 does not imply a steep slope. Furthermore, R2 does not measure the appropriateness of the linear model, for R2 will often be large even though y and x are nonlinearly related. For example, R2 for the regression equation in Figure 2.3b will be relatively large even though the linear approximation is poor. Remember that although R2 is large, this does not necessarily imply that the regression model will be an accurate predictor.
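The point that R2 can be large even when the straight-line model is inappropriate can be checked directly. In this small sketch (invented data, not Figure 2.3b), y is an exact quadratic function of x with no noise at all, yet the straight-line fit still produces a high R2:

```python
import numpy as np

# y is an exact quadratic function of x -- the straight-line model is
# clearly wrong -- yet the least-squares line still achieves a large R^2.
x = np.linspace(0, 10, 50)
y = x ** 2

b1, b0 = np.polyfit(x, y, 1)  # straight-line (degree-1) fit
resid = y - (b0 + b1 * x)
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)  # large (about 0.94) despite the wrong functional form
```

A residual plot would expose the curvature immediately, which is why model adequacy must be judged by the diagnostics of Chapter 4, not by R2 alone.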
2.7 A SERVICE INDUSTRY APPLICATION OF REGRESSION
A hospital is implementing a program to improve service quality and productivity. As part of this program the hospital management is attempting to measure and evaluate patient satisfaction. Table B.17 contains some of the data that have been collected on a random sample of 25 recently discharged patients. The response variable is satisfaction, a subjective response measure on an increasing scale. The potential regressor variables are patient age, severity (an index measuring the severity of the patient’s illness), an indicator of whether the patient is a surgical or medical patient (0 = surgical, 1 = medical), and an index measuring the patient’s anxiety level. We start by building a simple linear regression model relating the response variable satisfaction to severity.
Figure 2.6 is a scatter diagram of satisfaction versus severity. There is a relatively mild indication of a potential linear relationship between these two variables. The output from JMP for fitting a simple linear regression model to these data is shown in Figure 2.7. JMP is a menu-based PC statistics package from SAS with an extensive array of regression modeling and analysis capabilities.
At the top of the JMP output is the scatter plot of the satisfaction and severity data, along with the fitted regression line. The straight line fit looks reasonable although there is considerable variability in the observations around the regression line. The second plot is a graph of the actual satisfaction response versus the predicted response. If the model were a perfect fit to the data all of the points in this plot would lie exactly along the 45-degree line. Clearly, this model does not provide a perfect fit. Also, notice that while the regressor variable is significant (the ANOVA F statistic is 17.1114 with a P value that is less than 0.0004), the coefficient of determination R2 = 0.43. That is, the model only accounts for about 43% of the variability in the data. It can be shown by the methods discussed in Chapter 4 that there are no fundamental problems with the underlying assumptions or measures of model adequacy, other than the rather low value of R2.
Figure 2.6 Scatter diagram of satisfaction versus severity.
Figure 2.7 JMP output for the simple linear regression model for the patient satisfaction data.
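For simple linear regression the ANOVA F statistic and R2 are tied together by the identity F = (n − 2)R2/(1 − R2), so each can be recovered from the other. A quick check against the values reported in the JMP output (R2 is rounded to 0.43, so the match is only approximate):

```python
n = 25      # sample size for the patient satisfaction data
r2 = 0.43   # R^2 as rounded in the JMP output

# For simple linear regression: F = MSR/MSE = (n - 2) * R^2 / (1 - R^2)
F = (n - 2) * r2 / (1 - r2)
print(round(F, 2))  # about 17.35, consistent with JMP's reported
                    # F = 17.1114 once the rounding of R^2 is considered
```

Running the identity in reverse, F = 17.1114 gives R2 = F/(F + n − 2) ≈ 0.4266, which rounds to the 0.43 shown in the output.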
Low values of R2 occur occasionally in practice. The model is significant, there are no obvious problems with assumptions or other indications of model inadequacy, but the proportion of variability explained by the model is low. This is not an entirely disastrous situation. There are many situations where explaining 30 to 40% of the variability in y with a single predictor provides information of considerable value to the analyst. Sometimes a low value of R2 results from a lot of variability in the measurements of the response, due perhaps to the type of measuring instrument being used or the skill of the person making the measurements. Here the variability in the response probably arises because the response is an expression of opinion, which can be very subjective. Also, the measurements are taken on human patients, and there can be considerable variability both within and between people. Sometimes a low value of R2 is the result of a poorly specified model. In these cases the model can often be improved by the addition of one or more predictor or regressor variables. We see in Chapter 3 that the addition of another regressor results in considerable improvement of this model.
2.8 DOES PITCHING WIN BASEBALL GAMES?
Part of the never-ending debate about what makes a winning baseball team is the theory that a team cannot win consistently without good pitching. Baseball is a sport deeply rooted in analytics, and there is an ever-growing body of data available. Major League Baseball now has a system in all stadiums that uses cameras and other sensors to collect data on every pitch, including pitch speed, player positions, launch angle for balls hit in the air, and so forth. Table B.22 contains a summary of performance for 2016 for all National and American League baseball teams. The response variable is the number of games won, and among the various statistics listed is the team earned run average (ERA), a standard measure of pitching performance; low values of team ERA are generally attributed to an outstanding pitching staff. Figure 2.8 is the JMP output from fitting a linear regression model to wins versus the ERA data. The plots of wins versus ERA and actual versus predicted show that there is definitely a linear relationship between these variables. The model is significant, and it seems that one point of team ERA is equivalent to about 18.6 wins. JMP also produces a plot of residuals versus the predicted number of wins. Residual plots such as this are useful in assessing model adequacy. We will discuss this in detail later, but this plot indicates that there is no structure in the relationship between the residuals and the predicted number of wins. This is one indication that the model fit is satisfactory. However, the model only explains about 63% of the variability in the response. While this is not bad, it does suggest that there may be other useful explanatory variables.
Figure 2.8 JMP output for the model relating team wins to team ERA for the 2016 baseball season.
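The kind of fit shown in Figure 2.8 can be sketched in a few lines. The numbers below are hypothetical (ERA, wins) pairs invented for illustration, not the actual 2016 data from Table B.22, but they produce the same qualitative picture: a negative slope of roughly the magnitude reported, with a strong linear relationship:

```python
import numpy as np

# Hypothetical (team ERA, wins) pairs -- illustrative only, NOT the
# actual 2016 season data from Table B.22.
era  = np.array([3.2, 3.5, 3.8, 4.0, 4.3, 4.6, 4.9, 5.2])
wins = np.array([95, 90, 86, 81, 77, 70, 66, 58])

b1, b0 = np.polyfit(era, wins, 1)  # slope, intercept of least-squares line
resid = wins - (b0 + b1 * era)
r2 = 1 - np.sum(resid ** 2) / np.sum((wins - wins.mean()) ** 2)
print(b1, r2)  # negative slope: each added point of team ERA costs wins
```

Because the slope is negative, "one point of team ERA is equivalent to about 18.6 wins" means a team that lowers its ERA by one run can expect roughly 18.6 additional wins, other things being equal.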
2.9 USING SAS®