Handbook of Regression Analysis With Applications in R. Samprit Chatterjee
This ratio describes by how much the variances of the estimated slope coefficients are inflated due to observed collinearity relative to when the predictors are uncorrelated. It is clear that when the correlation is high, the variability (and hence the instability) of the estimated slopes can increase dramatically.
A diagnostic to determine this in general is the variance inflation factor ($VIF_j$) for each predictor $x_j$,
$$
VIF_j = \frac{1}{1 - R_j^2},
$$
where $R_j^2$ is the $R^2$ of the regression of the variable $x_j$ on the other predictors. That is, $VIF_j$ gives the factor by which the variance of $\hat{\beta}_j$ is inflated relative to what it would have been had the predictors been uncorrelated.
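As a minimal illustrative sketch (not from the book), the variance inflation factors can be computed directly from their definition for a small simulated data set with two strongly correlated predictors; all variable names here are invented for the example:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # built to be highly correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing
# x_j on the other predictors
vif1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)
round(c(vif1, vif2), 2)
```

With only two predictors the two $R_j^2$ values both equal the squared sample correlation, so the two VIFs coincide; with a correlation near 0.9 each is roughly $1/(1 - 0.81) \approx 5$, well above the value of 1 that would hold for uncorrelated predictors.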
2.3 Methodology
2.3.1 MODEL SELECTION
We saw in Section 2.2.1 that hypothesis tests can be used to compare models. Unfortunately, there are several reasons why such tests are not adequate for the task of choosing among a set of candidate models for the appropriate model to use.
In addition to the effects of correlated predictors on individual coefficient tests noted in the previous section, hypothesis tests can only compare models that are nested (where one model is a special case of the other), so they cannot be used to choose between two candidate models when neither is a restricted version of the other.
Even ignoring these issues, hypothesis tests don't necessarily address the question a data analyst is most interested in. With a large enough sample, almost any estimated slope will be significantly different from zero, but that doesn't mean that the predictor provides additional useful predictive power. Similarly, in small samples, important effects might not be statistically significant at typical levels simply because of insufficient data. That is, there is a clear distinction between statistical significance and practical importance.
In this section we discuss a strategy for determining a “best” model (or more correctly, a set of “best” models) among a larger class of candidate models, using objective measures designed to reflect a predictive point of view. As a first step, it is good to explicitly identify what should not be done. In recent years, it has become commonplace for databases to be constructed with hundreds (or thousands) of variables and hundreds of thousands (or millions) of observations. It is tempting to avoid issues related to choosing the potential set of candidate models by considering all of the variables as potential predictors in a regression model, limited only by available computing power. This would be a mistake. If too large a set of possible predictors is considered, it is very likely that variables will be identified as important just due to random chance. Since they do not reflect real relationships in the population, models based on them will predict poorly in the future, and interpretations of slope coefficients will just be mistaken explanations of what is actually random behavior. This sort of overfitting is known as “data dredging” and is among the most serious dangers when analyzing data.
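The danger of data dredging can be made concrete with a short simulation (an illustrative sketch, not an example from the book): even when a response is generated completely independently of every candidate predictor, screening a large enough pool of predictors will typically flag a few as "significant" purely by chance.

```r
set.seed(2)
n <- 100
p <- 50
X <- as.data.frame(matrix(rnorm(n * p), n, p))  # 50 pure-noise predictors
y <- rnorm(n)                                   # response unrelated to all of them

fit  <- lm(y ~ ., data = X)
pval <- summary(fit)$coefficients[-1, 4]  # t-test p-values, intercept dropped

# At the 0.05 level we expect about 0.05 * 50 = 2.5 spurious "discoveries"
sum(pval < 0.05)
```

Any predictors flagged this way reflect nothing real in the population, so a model built on them would predict poorly on new data, exactly the overfitting the text warns against.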
The set of possible models should ideally be chosen before seeing any data based on as thorough an understanding of the underlying random process as possible. Potential predictors should be justifiable on theoretical grounds if at all possible. This is by necessity at least somewhat subjective, but good basic principles exist. Potential models to consider should be based on the scientific literature and previous relevant experiments. In particular, if a model simply doesn't “make sense,” it shouldn't be considered among the possible candidates. That does not mean that modifications and extensions of models that are suggested by the analysis should be ignored (indeed, this is the subject of the next three chapters), but an attempt to keep models grounded in what is already understood about the underlying process is always a good idea.
What