Data Science For Dummies. Lillian Pierson


Logistic regression

      Logistic regression is a machine learning method you can use to estimate values for a categorical target variable based on your selected features. Your target variable's values should be numeric codes that describe the target’s class — or category. One cool aspect of logistic regression is that, in addition to predicting the class of each observation in your target variable, it reports the probability behind each of its estimates. Though logistic regression is like linear regression, its requirements are simpler, in that:

       There doesn't need to be a linear relationship between the features and target variable.

       Residuals don’t have to be normally distributed.

       Predictive features aren’t required to have a normal distribution.

      When deciding whether logistic regression is a good choice for you, consider the following limitations:

       Missing values should be treated or removed.

       Your target variable must be binary or ordinal. Binary classification assigns a 1 for “yes” and a 0 for “no.”

       Predictive features should be independent of each other.

      Logistic regression requires a greater number of observations than linear regression to produce a reliable result. The rule of thumb is that you should have at least 50 observations per predictive feature if you expect to generate reliable results.
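To make the idea concrete, here's a minimal from-scratch sketch of binary logistic regression trained by gradient descent. The toy data, learning rate, and helper names are illustrative assumptions, not the book's code — in practice you'd reach for a library such as scikit-learn:

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit binary logistic regression by stochastic gradient descent.

    X: list of feature lists; y: list of 0/1 class labels.
    (Hyperparameters here are arbitrary choices for the toy data.)"""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            for j in range(n_features):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return w, b

def predict_proba(X, w, b):
    """Return the estimated probability of class 1 for each observation."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

# Toy binary target: label is 1 when the single feature is large
X = [[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
probs = predict_proba(X, w, b)
labels = [1 if p >= 0.5 else 0 for p in probs]
```

Note how the model produces both a class label and the probability behind it — the property the text highlights.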

      

Predicting survivors on the Titanic is the classic practice problem for newcomers learning logistic regression. You can try it yourself, and see lots of worked examples, on Kaggle (www.kaggle.com/c/titanic).

      Ordinary least squares (OLS) regression methods

      Ordinary least squares (OLS) is a statistical method that fits a linear regression line to a dataset. With OLS, you do this by squaring the vertical distance values that describe the distances between the data points and the best-fit line, adding up those squared distances, and then adjusting the placement of the best-fit line so that the summed squared distance value is minimized. Use OLS if you want to construct a function that’s a close approximation to your data.
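For a single feature, the line that minimizes the sum of squared vertical distances has a well-known closed form. Here's a minimal sketch (the sample data is made up for illustration):

```python
def ols_fit(x, y):
    """Simple OLS: return the slope and intercept of the line that
    minimizes the sum of squared vertical distances to the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares solution for one feature
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]  # roughly y = 2x, with noise
slope, intercept = ols_fit(x, y)
```

The fitted slope lands near 2, a close approximation to the noisy data — but, as the note below stresses, only an approximation.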

      

As always, don’t expect the actual value to be identical to the value predicted by the regression. Values predicted by the regression are simply estimates that are most similar to the actual values in the model.

      OLS is particularly useful for fitting a regression line to models containing more than one independent variable. In this way, you can use OLS to estimate the target from dataset features.

      

When using OLS regression methods to fit a regression line that has more than one independent variable, two or more of the variables may be interrelated. When two or more independent variables are strongly correlated with each other, this is called multicollinearity. Multicollinearity tends to adversely affect the reliability of the variables as predictors when they’re examined apart from one another. Luckily, however, multicollinearity doesn’t decrease the overall predictive reliability of the model when it’s considered collectively.
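One quick way to spot multicollinearity before fitting is to check pairwise correlations between your features. This sketch uses Pearson correlation on made-up housing-style data (the feature names and values are illustrative assumptions):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two feature columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical features: room count tracks square footage closely,
# while building age is unrelated to either
sq_ft = [800, 1200, 1500, 2000, 2400]
rooms = [2, 3, 4, 5, 6]
age = [40, 5, 22, 13, 31]

r_collinear = pearson(sq_ft, rooms)  # near 1 -> multicollinearity warning
r_ok = pearson(sq_ft, age)           # near 0 -> safe to keep both
```

A correlation near ±1 flags a pair of features whose individual coefficients you shouldn't interpret in isolation; more thorough checks (such as variance inflation factors) build on the same idea.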

      Many statistical and machine learning approaches assume that your data has no outliers. Outlier removal is an important part of preparing your data for analysis. In this section, you see a variety of methods you can use to discover outliers in your data.

      Analyzing extreme values

      Outliers are data points with values that are significantly different from the majority of data points comprising a variable. It’s important to find and remove outliers because, left untreated, they skew variable distribution, make variance appear falsely high, and cause a misrepresentation of intervariable correlations.

      Outliers fall into the following three categories:

       Point: Point outliers are data points with anomalous values compared to the normal range of values in a feature.

       Contextual: Contextual outliers are data points that are anomalous only within a specific context. To illustrate, if you’re inspecting weather station data from January in Orlando, Florida, and you see a temperature reading of 23 degrees F, this would be quite anomalous because the average temperature there is 70 degrees F in January. But consider if you were looking at data from January at a weather station in Anchorage, Alaska — a temperature reading of 23 degrees F in this context isn’t anomalous at all.

       Collective: These outliers appear near one another, all having similar values that are anomalous compared to the majority of values in the feature.

      You can detect outliers using either a univariate or multivariate approach, as spelled out in the next two sections.

      Detecting outliers with univariate analysis

      Univariate outlier detection is where you look at features in your dataset and inspect them individually for anomalous values. You can choose from two simple methods for doing this:

       Tukey outlier labeling

       Tukey boxplotting

      Tukey boxplotting is an exploratory data analysis technique that’s useful for visualizing the distribution of a numeric variable by summarizing it with quartiles. As you might guess, the Tukey boxplot was named after its inventor, John Tukey, an American mathematician who did most of his work back in the 1960s and ’70s. Tukey outlier labeling refers to labeling data points that lie beyond the minimum and maximum extremes of a box plot as outliers.

      

Here’s a good rule of thumb for labeling outliers: compute

      a = Q1 – 1.5*IQR

      and

      b = Q3 + 1.5*IQR

      where Q1 and Q3 are the first and third quartiles and IQR = Q3 – Q1 is the interquartile range. If a data point’s value falls below a or above b, label it an outlier.
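These fences are easy to compute by hand. A minimal sketch, assuming a median-of-halves quartile convention (conventions differ slightly between tools, and the data here is made up):

```python
def quartiles(values):
    """Q1 and Q3 via the median-of-halves convention (one of several
    common quartile conventions; results can differ slightly by tool)."""
    s = sorted(values)
    n = len(s)
    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2
    lower_half = s[:n // 2]
    upper_half = s[(n + 1) // 2:]
    return median(lower_half), median(upper_half)

def tukey_outliers(values):
    """Label values beyond the Tukey fences a and b as outliers."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    a = q1 - 1.5 * iqr  # lower fence
    b = q3 + 1.5 * iqr  # upper fence
    return [v for v in values if v < a or v > b]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
flagged = tukey_outliers(data)  # the extreme reading gets flagged
```

Only the value far beyond the upper fence is labeled; the rest of the distribution is left alone.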
