Handbook of Regression Analysis With Applications in R. Samprit Chatterjee

effects of other variables cannot be ignored as a result of random assignment in the experiment. For observational data it is not possible to physically intervene in the experiment to “hold other variables fixed,” but the multiple regression framework effectively allows this to be done statistically.

      Having said this, we must recognize that in many situations, it is impossible from a practical point of view to change one predictor while holding all else fixed. Thus, while we would like to interpret a coefficient as accounting for the presence of other predictors in a physical sense, it is important (when dealing with observational data in particular) to remember that linear regression is at best only an approximation to the actual underlying random process.

      1.3.2 MEASURING THE STRENGTH OF THE REGRESSION RELATIONSHIP

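The least squares estimates possess an important property:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$

That is, the variability in the target variable (the corrected total sum of squares, on the left) splits into the variability left over after the regression (the residual sum of squares) and the variability accounted for by the regression (the regression sum of squares). This suggests $R^2$ as a measure of the strength of the regression relationship, where

$$R^2 = \frac{\sum_{i} (\hat{y}_i - \bar{y})^2}{\sum_{i} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}.$$

The $R^2$ value (also called the coefficient of determination) estimates the population proportion of variability in $y$ accounted for by the best linear combination of the predictors. Values closer to $1$ indicate a good deal of predictive power of the predictors for the target variable, while values closer to $0$ indicate little predictive power. An equivalent representation of $R^2$ is

$$R^2 = \mathrm{corr}(y_i, \hat{y}_i)^2,$$

where

$$\mathrm{corr}(y_i, \hat{y}_i) = \frac{\sum_{i} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i} (y_i - \bar{y})^2 \, \sum_{i} (\hat{y}_i - \bar{\hat{y}})^2}}$$

is the sample correlation coefficient between $y$ and $\hat{y}$ (this correlation is called the multiple correlation coefficient). That is, $R^2$ is a direct measure of how similar the observed and fitted target values are.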
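The two representations of the coefficient of determination can be checked numerically. The following is a minimal sketch in Python (standard library only), assuming a single predictor fitted by least squares with an intercept. The data values and all variable names are made up for illustration and do not come from the book.

```python
import math

# Hypothetical data for illustration only (not from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least squares estimates for one predictor plus an intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# The three sums of squares in the decomposition.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
reg_ss = sum((fi - y_bar) ** 2 for fi in fitted)

# R^2 as the proportion of variability accounted for by the regression.
r2_from_ss = reg_ss / total_ss

# R^2 as the squared sample correlation between observed and fitted values.
f_bar = sum(fitted) / n
num = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, fitted))
den = math.sqrt(total_ss * sum((fi - f_bar) ** 2 for fi in fitted))
r2_from_corr = (num / den) ** 2

print("total SS:", total_ss, "residual SS + regression SS:", resid_ss + reg_ss)
print("R^2 (sums of squares):", r2_from_ss)
print("R^2 (squared correlation):", r2_from_corr)
```

Up to floating-point rounding, the total sum of squares equals the sum of the other two, and both versions of the coefficient of determination agree.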

It can be shown that $R^2$ is biased upwards as an estimate of the population proportion of variability accounted for by the regression. The adjusted $R^2$ corrects this bias, and equals

$$R_a^2 = R^2 - \frac{p}{n - p - 1}\,(1 - R^2).$$

Unless $p$ is large relative to $n - p - 1$ (that is, unless the number of predictors is large relative to the sample size), $R^2$ and $R_a^2$ will be close to each other, and the choice of which to use is a minor concern. What is perhaps more interesting is the nature of $R_a^2$ as providing an explicit tradeoff between the strength of the fit (the first term, with larger $R^2$ corresponding to stronger fit and larger $R_a^2$) and the complexity of the model (the second term, with larger $p$ corresponding to more complexity and smaller $R_a^2$). This tradeoff of fidelity to the data versus simplicity will be important in the discussion of model selection in Section 2.3.1.
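The fit-versus-complexity tradeoff can be made concrete with a short calculation. The sketch below assumes the adjusted coefficient of determination takes the form $R_a^2 = R^2 - \frac{p}{n-p-1}(1 - R^2)$, with $n$ observations and $p$ predictors. The function name `adjusted_r2` and the numeric inputs are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: the fit term R^2 minus a complexity penalty.

    r2: ordinary R^2, n: sample size, p: number of predictors.
    Assumes the form R^2 - p / (n - p - 1) * (1 - R^2).
    """
    df = n - p - 1
    if df <= 0:
        raise ValueError("need n > p + 1 for the adjustment to be defined")
    return r2 - (p / df) * (1.0 - r2)


# Hypothetical inputs: the same R^2 under two sample-size regimes.
r2 = 0.80
few_predictors = adjusted_r2(r2, n=1000, p=3)   # p small relative to n
many_predictors = adjusted_r2(r2, n=12, p=8)    # p large relative to n

print("p small relative to n:", few_predictors)
print("p large relative to n:", many_predictors)
```

With few predictors relative to the sample size, the adjustment barely moves the value. With many predictors relative to the sample size, the penalty term dominates and the adjusted value drops sharply.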

The only parameter left unaccounted for in the estimation scheme is the variance of the errors $\sigma^2$. An unbiased estimate is provided by the residual mean square,

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}.$$
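As a minimal sketch, the residual mean square can be computed as the residual sum of squares divided by $n - p - 1$, where $p$ is the number of predictors. The function name `residual_mean_square` and the observed and fitted values below are hypothetical. The fitted values are stand-ins rather than the output of an actual least squares fit.

```python
import math


def residual_mean_square(observed, fitted, p):
    """Residual sum of squares divided by n - p - 1 (p = number of predictors)."""
    n = len(observed)
    if len(fitted) != n:
        raise ValueError("observed and fitted must have the same length")
    df = n - p - 1
    if df <= 0:
        raise ValueError("need n > p + 1")
    rss = sum((yi - fi) ** 2 for yi, fi in zip(observed, fitted))
    return rss / df


# Hypothetical observed values and stand-in fitted values (one predictor).
observed = [10.2, 11.8, 14.1, 15.9, 18.3, 19.7]
fitted = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

sigma2_hat = residual_mean_square(observed, fitted, p=1)
sigma_hat = math.sqrt(sigma2_hat)  # on the same scale as the target variable

print("estimated error variance:", sigma2_hat)
print("estimated error standard deviation:", sigma_hat)
```

The square root is usually easier to interpret than the variance estimate itself, since it is in the units of the target variable.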

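Does knowing $x_1, \ldots, x_p$ really say anything of value about $y$? This isn't a question that can be answered completely statistically; it requires knowledge and understanding of the data and the underlying random process (that is, it requires context). Recall that the model assumes that the errors are normally distributed with standard deviation $\sigma$. This means that, roughly speaking, $95\%$ of the time an observed $y$ value falls within $\pm 2\sigma$ of the expected response

$$E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

$E(y)$ can be estimated for any given set of $x$ values using

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p,$$

while the square root of the residual mean square provides an estimate of $\sigma$ that can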
