Handbook of Regression Analysis With Applications in R. Samprit Chatterjee


      A diagnostic to determine this in general is the variance inflation factor ($VIF_j$) for each predicting variable, which is defined as

$$ VIF_j = \frac{1}{1 - R_j^2}, $$

      where $R_j^2$ is the $R^2$ of the regression of the variable $x_j$ on the other predicting variables. $VIF_j$ gives the proportional increase in the variance of $\hat{\beta}_j$ compared to what it would have been if the predicting variables had been uncorrelated. There are no formal cutoffs as to what constitutes a large $VIF_j$, but collinearity is generally not a problem if the observed $VIF_j$ satisfies

$$ VIF_j < \max\!\left(10,\; \frac{1}{1 - R^2}\right), $$

      where $R^2$ is the coefficient of determination of the fitted model itself.
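Although the book's examples are in R, the definition above translates directly into code. The following NumPy sketch (function and variable names are illustrative, not from the text) computes each $VIF_j$ by regressing predictor $j$ on the remaining predictors:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 is
    the R^2 from regressing column j of X on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # design matrix: intercept plus all predictors except column j
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to x1 and x2
print(vif(np.column_stack([x1, x2, x3])))
```

Here the first two predictors are deliberately near-collinear, so their VIFs are far above any reasonable cutoff, while the VIF for the unrelated predictor stays near 1.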

      

      2.3.1 MODEL SELECTION

      We saw in Section 2.2.1 that hypothesis tests can be used to compare models. Unfortunately, there are several reasons why such tests are not adequate for the task of choosing among a set of candidate models for the appropriate model to use.

      In addition to the effects of correlated predictors on $t$‐tests noted earlier, partial $F$‐tests only can compare models that are nested (that is, where one is a special case of the other). Comparing a model based on $\{x_1, x_3\}$ to one based on $\{x_2, x_4\}$, for example, is clearly important, but is impossible using these testing methods.
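As a sketch of what a partial $F$‐test involves (and why it requires nesting), the statistic can be computed from the residual sums of squares of the reduced and full models. The NumPy code below is illustrative, not the book's own R code, and the predictor names are hypothetical:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with intercept; also
    returns the number of fitted coefficients."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r), Z.shape[1]

def partial_f(X_red, X_full, y):
    """Partial F-statistic for a reduced model nested within a full model."""
    rss_r, k_r = rss(X_red, y)
    rss_f, k_f = rss(X_full, y)
    n = len(y)
    return ((rss_r - rss_f) / (k_f - k_r)) / (rss_f / (n - k_f))

rng = np.random.default_rng(1)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)   # x3 is irrelevant

# H0: the coefficients of x2 and x3 are zero, given x1 is in the model
F = partial_f(x1[:, None], np.column_stack([x1, x2, x3]), y)
print(F)
```

The comparison works only because $\{x_1\}$ is a subset of $\{x_1, x_2, x_3\}$; for two non-nested predictor sets no such decomposition of the residual sum of squares exists.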

      Even ignoring these issues, hypothesis tests don't necessarily address the question a data analyst is most interested in. With a large enough sample, almost any estimated slope will be significantly different from zero, but that doesn't mean that the predictor provides additional useful predictive power. Similarly, in small samples, important effects might not be statistically significant at typical levels simply because of insufficient data. That is, there is a clear distinction between statistical significance and practical importance.
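The large-sample point is easy to verify by simulation: with a slope far too small to matter practically, a huge sample still produces an enormous $t$-statistic even though the model explains almost none of the variability in the response. A hedged NumPy sketch (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)   # slope 0.01: practically negligible

# OLS slope, its standard error, and the resulting t-statistic
xc = x - x.mean()
beta = xc @ (y - y.mean()) / (xc @ xc)
resid = y - y.mean() - beta * xc
se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
t = beta / se
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(beta, t, r2)
```

The slope is statistically significant at any conventional level, yet the $R^2$ is essentially zero: statistical significance without practical importance.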

      In this section we discuss a strategy for determining a “best” model (or more correctly, a set of “best” models) among a larger class of candidate models, using objective measures designed to reflect a predictive point of view. As a first step, it is good to explicitly identify what should not be done. In recent years, it has become commonplace for databases to be constructed with hundreds (or thousands) of variables and hundreds of thousands (or millions) of observations. It is tempting to avoid issues related to choosing the potential set of candidate models by considering all of the variables as potential predictors in a regression model, limited only by available computing power. This would be a mistake. If too large a set of possible predictors is considered, it is very likely that variables will be identified as important just due to random chance. Since they do not reflect real relationships in the population, models based on them will predict poorly in the future, and interpretations of slope coefficients will just be mistaken explanations of what is actually random behavior. This sort of overfitting is known as “data dredging” and is among the most serious dangers when analyzing data.
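The data-dredging danger is also easy to reproduce in simulation: when a response that is pure noise is screened against many unrelated candidate predictors, roughly 5% of them will look "significant" at the 0.05 level purely by chance. A hypothetical NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)   # pure noise: no predictor is truly related to y

def t_stat(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    beta = xc @ yc / (xc @ xc)
    resid = yc - beta * xc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    return beta / se

# count predictors whose marginal slope is "significant" at the 0.05 level
hits = sum(abs(t_stat(X[:, j], y)) > 1.96 for j in range(p))
print(hits)
```

Any model built on the predictors flagged this way would reflect random behavior rather than real structure, and would predict poorly on new data.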

      What
