Applied Univariate, Bivariate, and Multivariate Statistics Using Python. Daniel J. Denis

estimators (in terms of selecting optimal values for them) should be in order to maximize or minimize a function. For example, in the case of a simple linear regression, if we take a model of the type y_i = α + βx_i + ε_i and fit it to some data, the model “learns” from the data what the best values are for a and b, which are estimators for α and β, respectively. Given that we are using ordinary least-squares as our method of estimation, the regression model “learns” what a and b should be such that the sum of squared errors is kept to a minimum value (if you don’t know what all this means, no worries, you’ll learn it in Chapter 7). The point here for this discussion is that the “learning” or “training” consists simply of selecting scalars for a and b such that the sum of squared errors is minimized. Once that occurs, the model is said to have learned or been “trained” from the data. This is, at its most essential and rudimentary level, what statistical learning actually means in many (not all) contexts. If we subject that model to new data after that, thus “sharpening” its scalars, the model “updates” what its estimators should be in order to continue optimizing a function. Note that this more or less parallels the idea of human learning, in that the model (or “you”) is “learning from experience” as a new experience is incorporated into knowledge. For example, a worker learns how to maximize his or her potential in a job through trial and error, otherwise known as “experience.” If one day his or her boss corrects him or her, that new “data” is incorporated into the learning mechanism. If on another day the individual is reinforced for doing something right, that is also incorporated into the learning mechanism. Of course, we cannot see the scalars or estimators (they are largely metaphorical in this case), but you get the idea. Learning “optimizes” some function through exposure to new experience. 
In classical learning theory in psychology, for instance, the rat in a Skinner box learns that if he presses the lever, he will receive a pellet of food. If he doesn’t press the lever, he doesn’t receive food. The rat is optimizing the function (it’s in his little brain, and it’s metaphorical; we can’t see it) that will allow him to distinguish which response gets the food. This is learning! When the rat is “trained” enough, he starts making predictions nearly perfectly with very few errors. So it is with the statistical model; it does an increasingly good job at “getting it right” as it is trained on more and more data (i.e. more “experience”). It also “learns” from what it did wrong, just as the rat learns that if he doesn’t press the lever, he doesn’t eat.
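
To make the idea of “learning” a and b concrete, here is a minimal sketch in Python using NumPy. The data are simulated, and the “true” values of α and β, the sample sizes, and the helper names fit_ols and sse are assumptions chosen purely for illustration; they do not come from the book. The sketch uses the standard closed-form least-squares formulas for the slope and intercept.

```python
import numpy as np

# Hypothetical population values, assumed for illustration only.
rng = np.random.default_rng(0)
alpha_true, beta_true = 2.0, 0.5

# Simulate 50 observations from y_i = alpha + beta * x_i + e_i.
x = rng.uniform(0, 10, size=50)
y = alpha_true + beta_true * x + rng.normal(0, 1, size=50)

def fit_ols(x, y):
    """Return (a, b), the least-squares estimates of intercept and slope."""
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, x, y):
    """Sum of squared errors for the line a + b * x."""
    return np.sum((y - (a + b * x)) ** 2)

# "Training": choose the scalars a and b that minimize the SSE.
a, b = fit_ols(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse(a, b, x, y):.3f}")

# Nudging either scalar away from its estimate can only increase the SSE.
print(sse(a, b, x, y) <= sse(a + 0.1, b, x, y))
print(sse(a, b, x, y) <= sse(a, b - 0.05, x, y))

# "Updating": when new observations arrive, refit on all the data;
# the estimates shift to keep the SSE at its minimum.
x_more = rng.uniform(0, 10, size=50)
y_more = alpha_true + beta_true * x_more + rng.normal(0, 1, size=50)
a2, b2 = fit_ols(np.concatenate([x, x_more]), np.concatenate([y, y_more]))
print(f"updated a = {a2:.3f}, updated b = {b2:.3f}")
```

Refitting on the combined data is only one simple way to “update” a model; it is used here because it stays within ordinary least-squares.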

      Now, in the spirit of statistical learning and “training,” validating a model has become equally emphasized, in the sense that after a model is trained on one set of data, it should be applied to a similar set of data to estimate the error rate on that new set. But what does this mean? How can we understand this idea? Easily! Here are some easy examples of where this occurs:

       The pilot learns in the simulator or test flights and then his or her knowledge is “validated” on a new flight. The pilot was “trained” in landing in a thunderstorm yesterday and now that knowledge (model) will be evaluated in a new flight on a new storm.

       Rafael Nadal, tennis player, learns from his previous match how not to make errors when returning the ball. That learning is evaluated on new data, which is a new tennis match.

       A student in a statistics class learns from the first test how to adjust his or her study strategies. That knowledge is validated on test 2 to see how much was learned.
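
In statistical terms, each of these examples amounts to fitting a model on one data set and then measuring its error on a second data set from the same population. The following sketch illustrates that idea under stated assumptions: the population (intercept 2.0, slope 0.5, unit-variance noise), the sample sizes, and the helper names simulate, fit_ols, and mse are all hypothetical and are not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Draw n observations from a hypothetical population (assumed values)."""
    x = rng.uniform(0, 10, size=n)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)
    return x, y

def fit_ols(x, y):
    """Return (a, b), the least-squares estimates of intercept and slope."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

def mse(a, b, x, y):
    """Mean squared error of the line a + b * x on the data (x, y)."""
    return np.mean((y - (a + b * x)) ** 2)

# Train on one sample ("yesterday's storm").
x_train, y_train = simulate(30)
a, b = fit_ols(x_train, y_train)

# Validate on a new sample from the same population ("today's storm").
x_new, y_new = simulate(30)

train_mse = mse(a, b, x_train, y_train)
new_mse = mse(a, b, x_new, y_new)
print(f"training MSE = {train_mse:.3f}, new-data MSE = {new_mse:.3f}")
```

The error on the new sample is the more honest estimate of how the model will perform in the future, because none of those observations were used to choose a and b.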

      In this book, while it can be said that we do “train” models by fitting them, we do not cross-validate them on new data. Since the book is essentially an introduction and primer, we do not take that additional step. However, you should know that such a step is often a good one to take if you have the data at your disposal to make cross-validation doable. In many cases, scientists may not have such cross-validation data available to them, at least not yet, and “splitting the sample” into a training set and a test set may not be doable because the sample is too small. That does not necessarily mean testing cannot be done. It can be, on a new data set that is assumed to be drawn from the same population as the original sample. Techniques for cross-validation also exist that reduce the need to collect very large validation samples (e.g. see James et al., 2013). Further, to use one of our previous metaphors, validating the pilot’s skill may be delayed until a new storm is available; it does not necessarily have to be done today. Hence, in general, when you fit a model, you should always have it in mind to validate that model on new data, that is, data that was not used in training the model. Why is this last point important? Quite simply because if the pilot tests his or her skills on the same storm in which he or she was trained, it is hardly a test at all. The pilot already knows the intricacies and details of that particular storm, so it is not really a test of new skills; it is more akin to a test of how well he or she remembers how to deal with that specific storm and, returning to our statistical discussion, of how well the model capitalizes on chance factors. This is why, if you are to cross-validate a model, it should be done on new “test” data, never the original training data. 
If you do not cross-validate the model, you can generally expect its fit on the training data to be optimistic, such that the model will appear to fit “better” than it actually would on new data. This is the primary reason why cross-validation of a model is strongly encouraged.
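
When a separate validation sample is not available, one common alternative is k-fold cross-validation, in which each observation is predicted by a model fit without it. The sketch below is an illustration under assumptions, not the book's method: the simulated data, the choice of k = 5, and the helper names fit_ols, mse, and k_fold_mse are all hypothetical. It compares the error on the training data with the cross-validated error for the same simple regression.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: y = 2 + 0.5x + noise (values assumed for illustration).
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

def fit_ols(x, y):
    """Return (a, b), the least-squares estimates of intercept and slope."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

def mse(a, b, x, y):
    """Mean squared error of the line a + b * x on the data (x, y)."""
    return np.mean((y - (a + b * x)) ** 2)

def k_fold_mse(x, y, k=5):
    """Estimate prediction error by k-fold cross-validation."""
    idx = rng.permutation(len(x))      # shuffle the observation indices
    folds = np.array_split(idx, k)     # k roughly equal held-out folds
    squared_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)        # everything not held out
        a, b = fit_ols(x[train], y[train])     # fit without the fold
        squared_errors.append((y[fold] - (a + b * x[fold])) ** 2)
    return np.mean(np.concatenate(squared_errors))

# Error on the same data used for fitting (the optimistic figure).
a, b = fit_ols(x, y)
train_mse = mse(a, b, x, y)

# Error when every prediction is made for data the model has not seen.
cv_mse = k_fold_mse(x, y, k=5)

print(f"training MSE = {train_mse:.3f}, 5-fold CV MSE = {cv_mse:.3f}")
```

For a least-squares fit such as this one, the cross-validated error is never smaller than the training error, which is the optimism described above expressed in numbers.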
