Handbook of Regression Analysis With Applications in R. Samprit Chatterjee
(1.6)
or
1.2.3 ASSUMPTIONS
The least squares criterion will not necessarily yield sensible results unless certain assumptions hold. One is given in (1.1) — the linear model should be appropriate. In addition, the following assumptions are needed to justify using least squares regression.
1 The expected value of the errors is zero ($E(\varepsilon_i) = 0$ for all $i$). That is, it cannot be true that for certain observations the model is systematically too low, while for others it is systematically too high. A violation of this assumption will lead to difficulties in estimating $\beta_0$. More importantly, this reflects that the model does not include a necessary systematic component, which has instead been absorbed into the error terms.
2 The variance of the errors is constant ($V(\varepsilon_i) = \sigma^2$ for all $i$). That is, it cannot be true that the strength of the model is greater for some parts of the population (smaller $\sigma$) and less for other parts (larger $\sigma$). This assumption of constant variance is called homoscedasticity, and its violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and better estimates are available. More importantly, it also results in poorly calibrated confidence and (especially) prediction intervals.
3 The errors are uncorrelated with each other. That is, it cannot be true that knowing that the model underpredicts (for example) for one particular observation says anything at all about what it does for any other observation. This violation most often occurs in data that are ordered in time (time series data), where errors that are near each other in time are often similar to each other (such time‐related correlation is called autocorrelation). Violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and more importantly, its presence can lead to very misleading assessments of the strength of the regression.
4 The errors are normally distributed. This is needed if we want to construct any confidence or prediction intervals, or hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.
Since violation of these assumptions can potentially lead to completely misleading results, a fundamental part of any regression analysis is to check them using various plots, tests, and diagnostics.
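As a rough illustration of what such checks can look like, the sketch below fits a single-predictor least squares line in plain Python and computes two simple residual summaries: the Durbin-Watson statistic (values well below 2 suggest positive autocorrelation) and a ratio of mean squared residuals between the upper and lower halves of the predictor range (values far from 1 suggest nonconstant variance). The data, the error patterns, and the helper names `fit_line`, `residuals`, `durbin_watson`, and `spread_ratio` are all hypothetical and chosen only for illustration; they are not taken from the text.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Observed minus fitted values from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def durbin_watson(res):
    """Sum of squared successive differences over sum of squares."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(r ** 2 for r in res)
    return num / den


def spread_ratio(x, res):
    """Mean squared residual for the larger x values divided by
    the mean squared residual for the smaller x values."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    low = mean(r ** 2 for r in ordered[:half])
    high = mean(r ** 2 for r in ordered[half:])
    return high / low


x = list(range(1, 11))

# Hypothetical errors that come in runs (positive autocorrelation).
runs = [-1, -1, 1, 1, 1, 1, 1, 1, -1, -1]
# Hypothetical errors whose size grows with x (nonconstant variance).
growing = [0.1 * i * (-1) ** (i + 1) for i in x]

res_runs = residuals(x, [2 + 3 * xi + e for xi, e in zip(x, runs)])
res_growing = residuals(x, [2 + 3 * xi + e for xi, e in zip(x, growing)])

print("Durbin-Watson, errors in runs:", durbin_watson(res_runs))
print("Spread ratio, growing errors: ", spread_ratio(x, res_growing))
print("Mean residual, errors in runs:", mean(res_runs))
```

Note that the residuals from a least squares fit that includes an intercept always average to zero, so the zero-mean assumption cannot be checked from the overall mean residual. Violations show up instead as systematic patterns, such as the runs of same-signed residuals in the first example.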
1.3 Methodology
1.3.1 INTERPRETING REGRESSION COEFFICIENTS
The least squares regression coefficients have very specific meanings. They are often misinterpreted, so it is important to be clear on what they mean (and do not mean). Consider first the intercept, $\hat{\beta}_0$.
$\hat{\beta}_0$: The estimated expected value of the target variable when the predictors are all equal to zero.
Note that this might not have any physical interpretation, since a zero value for the predictor(s) might be impossible, or might never come close to occurring in the observed data. In that situation, it is pointless to try to interpret this value. If all of the predictors are centered to have zero mean, then $\hat{\beta}_0$ necessarily equals $\bar{Y}$, the sample mean of the target values.
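The effect of centering on the intercept can be checked numerically. When the predictors are centered to have zero mean, the least squares intercept equals the sample mean of the target, while the slope is unchanged. The sketch below checks this for a single predictor. The data values and the helper name `fit_line` are hypothetical, and the one-predictor setting is a simplifying assumption.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 5.0, 4.0, 10.0]

# Center the predictor so that it has zero mean.
x_centered = [xi - mean(x) for xi in x]

b0_raw, b1_raw = fit_line(x, y)
b0_centered, b1_centered = fit_line(x_centered, y)

print("intercept, raw predictor:     ", b0_raw)
print("intercept, centered predictor:", b0_centered)
print("sample mean of target:        ", mean(y))
print("slopes (raw, centered):       ", b1_raw, b1_centered)
```

After centering, the intercept describes the estimated expected target value at the average predictor value, which is usually a more meaningful quantity than the value at a predictor of zero.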
The estimated coefficient for the $j$th predictor, $\hat{\beta}_j$, is interpreted in the following way.
$\hat{\beta}_j$: The estimated expected change in the target variable associated with a one unit change in the $j$th predicting variable, holding all else in the model fixed.
There are several noteworthy aspects to this interpretation. First, note the word associated — we cannot say that a change in the target variable is caused by a change in the predictor, only that they are associated with each other. That is, correlation does not imply causation.
Another key point is the phrase “holding all else in the model fixed,” the implications of which are often ignored. Consider the following hypothetical example. A random sample of college students at a particular university is taken in order to understand the relationship between college grade point average (GPA) and other variables. A model is built with college GPA as a function of high school GPA and the standardized Scholastic Aptitude Test (SAT), and in the resultant least squares fit the estimated coefficient for SAT score is negative.
It is tempting to say (and many people would say) that the coefficient for SAT score has the “wrong sign,” because it says that higher values of SAT are associated with lower values of college GPA. This is not correct. The problem is that it is likely in this context that what an analyst would find intuitive is the marginal relationship between college GPA and SAT score alone (ignoring all else), one that we would indeed expect to be a direct (positive) one. The regression coefficient does not say anything about that marginal relationship. Rather, it refers to the conditional (sometimes called partial) relationship that takes the high school GPA as fixed, which is apparently that higher values of SAT are associated with lower values of college GPA, holding high school GPA fixed. High school GPA and SAT are no doubt related to each other, and it is quite likely that this relationship between the predictors would complicate any understanding of, or intuition about, the conditional relationship between college GPA and SAT score. Multiple regression coefficients should not be interpreted marginally; if you really are interested in the relationship between the target and a single predictor alone, you should simply do a regression of the target on only that variable. This does not mean that multiple regression coefficients are uninterpretable, only that care is necessary when interpreting them.
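The difference between the marginal and conditional relationships can be made concrete with a small constructed data set. In the sketch below, all numbers are hypothetical and are not the fit from the example above: college GPA is built as an exact linear function of high school GPA (positive coefficient) and SAT score in hundreds (negative coefficient), and the two predictors are positively related to each other. The helper names `centered_cross`, `fit_one`, and `fit_two` are likewise illustrative assumptions.

```python
from statistics import mean


def centered_cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def fit_one(x, y):
    """Least squares intercept and slope of y on a single predictor."""
    slope = centered_cross(x, y) / centered_cross(x, x)
    return mean(y) - slope * mean(x), slope


def fit_two(x1, x2, y):
    """Least squares intercept and two slopes, from the 2x2 normal
    equations written in terms of centered sums."""
    s11, s22 = centered_cross(x1, x1), centered_cross(x2, x2)
    s12 = centered_cross(x1, x2)
    s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


# Hypothetical predictors: high school GPA and SAT score (in hundreds).
hs_gpa = [2.0, 2.4, 2.8, 3.2, 3.6, 4.0]
sat = [9, 11, 10, 13, 12, 15]

# Hypothetical target, constructed with a negative SAT coefficient.
college_gpa = [1.0 + 0.8 * h - 0.05 * s for h, s in zip(hs_gpa, sat)]

_, sat_marginal = fit_one(sat, college_gpa)
_, hs_conditional, sat_conditional = fit_two(hs_gpa, sat, college_gpa)

print("marginal SAT slope (SAT alone):        ", sat_marginal)
print("conditional SAT slope (HS GPA fixed):  ", sat_conditional)
print("conditional HS GPA slope (SAT fixed):  ", hs_conditional)
```

In this constructed example the regression on SAT alone gives a positive slope, because students with higher SAT scores also tend to have higher high school GPAs, while the coefficient for SAT in the two-predictor model is negative. Neither sign is wrong; the two coefficients answer different questions.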
Another common use of multiple regression that depends on this conditional interpretation of the coefficients is to explicitly include “control” variables in a model in order to try to account for their effect statistically. This is particularly important in observational data (data that are not the result of a designed experiment), since in that