Data Science For Dummies. Lillian Pierson
Credit: Python for Data Science Essential Training Part 1, LinkedIn.com
FIGURE 4-7: Spotting outliers with a Tukey boxplot.
Detecting outliers with multivariate analysis
Sometimes outliers show up only within combinations of data points from disparate variables. These outliers wreak havoc on machine learning algorithms, so it’s important to detect and remove them. You can use multivariate analysis of outliers to do this. A multivariate approach to outlier detection involves considering two or more variables at a time and inspecting them together for outliers. You can use one of several methods, including:
A scatter-plot matrix
Boxplotting
Density-based spatial clustering of applications with noise (DBSCAN), as discussed in Chapter 5
Principal component analysis (PCA, as shown in Figure 4-8)
Credit: Python for Data Science Essential Training Part 2, LinkedIn.com
FIGURE 4-8: Using PCA to spot outliers.
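To see why looking at variables together matters, here is a minimal sketch of the PCA approach using NumPy. The data is invented for illustration, and the cutoff of 3 standard deviations is an assumption rather than a rule from this book. The planted point looks ordinary on each variable separately but breaks the relationship between them.

```python
import numpy as np

# Hypothetical data: two strongly correlated variables, plus one observation
# whose x and y values are each unremarkable but whose combination is unusual.
x = np.linspace(-3, 3, 50)
y = x + 0.1 * np.sin(5 * x)
data = np.column_stack([x, y])
data = np.vstack([data, [1.5, -1.5]])  # planted multivariate outlier (row 50)

# Univariate check: z-score each variable on its own.
univariate_z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
flagged_univariate = (univariate_z > 3).any(axis=1)

# Multivariate check: PCA via SVD, then z-score the principal component scores.
centered = data - data.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
scores = centered @ components.T
score_z = np.abs(scores / scores.std(axis=0))
flagged_pca = (score_z > 3).any(axis=1)

print("Flagged one variable at a time:", np.flatnonzero(flagged_univariate))
print("Flagged on principal components:", np.flatnonzero(flagged_pca))
```

Only the principal component view flags the planted row, because the second component measures how far a point strays from the main x-y relationship.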
Introducing Time Series Analysis
A time series is just a collection of data on attribute values over time. Time series analysis is performed to predict future instances of the measure based on the past observational data. To forecast or predict future values from data in your dataset, use time series techniques.
Identifying patterns in time series
Time series exhibit specific patterns. Take a look at Figure 4-9 to gain a better understanding of what these patterns are all about. Constant time series remain at roughly the same level over time but are subject to some random error. In contrast, trended series show a stable linear movement up or down. Whether constant or trended, time series may also exhibit seasonality: predictable, cyclical fluctuations that recur at the same time each year. As an example of seasonal time series, consider how many businesses show increased sales during the holiday season.
FIGURE 4-9: A comparison of patterns exhibited by time series.
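The three patterns can be sketched with synthetic data. In this hedged example, the level (100), slope (2 per month), 12-month sine wave, and noise scale are all made-up values, and the straight-line fit plus monthly averaging is just one simple way to pull the patterns apart.

```python
import numpy as np

rng = np.random.default_rng(42)
months = np.arange(48)                   # four years of monthly observations
noise = rng.normal(scale=1.0, size=48)   # the random error every series carries

constant = 100.0 + noise                 # constant level plus random error
trended = 100.0 + 2.0 * months + noise   # stable linear movement upward
seasonal = trended + 10.0 * np.sin(2 * np.pi * months / 12)  # yearly cycle on top

# Estimate the trend with a straight-line fit.
slope, intercept = np.polyfit(months, seasonal, deg=1)
detrended = seasonal - (slope * months + intercept)

# Average the detrended values by calendar month to estimate the seasonal pattern.
seasonal_profile = detrended.reshape(4, 12).mean(axis=0)
swing = seasonal_profile.max() - seasonal_profile.min()

print(f"Estimated trend: {slope:.2f} units per month")
print(f"Estimated seasonal swing: {swing:.1f} units")
```

The estimates land near, but not exactly on, the values used to build the series, which is what you would expect when random error is present.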
If you’re including seasonality in your model, incorporate it at the quarterly, monthly, or even biannual period, wherever it’s appropriate. Time series may also show nonstationary processes: unpredictable cyclical behavior that isn’t related to seasonality and that results from economic or industry-wide conditions instead. Because their statistical properties, such as the mean and variance, shift over time, nonstationary processes can’t be forecast reliably as they are. You must transform nonstationary data to stationary data before moving forward with an evaluation.
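The text doesn't say which transformation to use. Differencing is one common choice, sketched here on invented data: subtracting each value from the one before it removes a linear trend, and subtracting the value from 12 months earlier removes a yearly pattern.

```python
import numpy as np

t = np.arange(24)  # two years of monthly observations (hypothetical)

# A series whose mean keeps drifting upward, so it is not stationary.
trend_only = 50.0 + 3.0 * t
first_diff = np.diff(trend_only)  # value minus previous value

# The same trend with a 12-month seasonal pattern added.
trend_and_season = trend_only + 5.0 * np.sin(2 * np.pi * t / 12)
seasonal_diff = trend_and_season[12:] - trend_and_season[:-12]  # value minus value a year earlier

print("First differences of the trended series:", first_diff[:5])
print("Seasonal differences:", np.round(seasonal_diff[:5], 6))
```

Both differenced series hover around a fixed level (3 per month and 36 per year, respectively), which is the kind of stable behavior that forecasting models assume.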
Take a look at the solid lines shown earlier, in Figure 4-9. These represent the mathematical models used to forecast points in the time series. The mathematical models shown represent good, precise forecasts because they’re a close fit to the actual data. The actual data contains some random error, thus making it impossible to forecast perfectly.
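A "close fit" can be put into numbers. As a small sketch with invented values, the mean absolute error (MAE) and root mean squared error (RMSE) summarize how far a forecast sits from the actual data. These two metrics are a common choice, not something this chapter prescribes.

```python
import numpy as np

# Hypothetical actual observations and a constant-level forecast of 100.
actual = np.array([102.0, 98.5, 101.2, 99.7, 100.9])
forecast = np.full(5, 100.0)

errors = actual - forecast
mae = np.mean(np.abs(errors))         # average size of the misses
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large misses more heavily

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Neither number reaches zero here, because the random error in the actual data can't be forecast.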
For help getting started with time series within the context of the R programming language, be sure to visit the companion website to this book (http://businessgrowth.ai/), where you’ll find a free training and coding demonstration of time series data visualization in R.
Modeling univariate time series data
Similar to how multivariate analysis is the analysis of relationships between multiple variables, univariate analysis is the quantitative analysis of only one variable at a time. When you model univariate time series, you’re modeling time series changes that represent changes in a single variable over time.
Autoregressive moving average (ARMA) is a class of forecasting methods that you can use to predict future values from current and historical data. As its name implies, the family of ARMA models combines two techniques: autoregression, which assumes that previous observations are good predictors of future values and regresses the series on its own past values to forecast them, and moving averages, which measure the level of the constant time series and then update the forecast model if any changes are detected. If you’re looking for a simple model or a model that will work for only a small dataset, the ARMA model isn’t a good fit for your needs. An alternative in this case might be to just stick with simple linear regression. In Figure 4-10, you can see that the model forecast data and the actual data are a close fit.
To use the ARMA model for reliable results, you need to have at least 50 observations.
FIGURE 4-10: An example of an ARMA forecast model.
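To make the two halves of ARMA concrete, here is a rough sketch that simulates an ARMA(1,1) series and then estimates its coefficients. It uses the two-stage Hannan-Rissanen regression, chosen because it needs only NumPy. That choice, the function name fit_arma11, and the simulated coefficients are assumptions for illustration, not the method behind Figure 4-10. In practice you would more likely reach for a statistics library.

```python
import numpy as np

def fit_arma11(y, long_ar_order=10):
    """Estimate ARMA(1,1) coefficients with a two-stage Hannan-Rissanen regression."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    m = long_ar_order

    # Stage 1: a long autoregression approximates the unobserved random errors.
    lags = np.column_stack([y[m - k : len(y) - k] for k in range(1, m + 1)])
    target = y[m:]
    ar_coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    resid = target - lags @ ar_coefs  # resid[i] estimates the error at time m + i

    # Stage 2: regress each value on the previous value (autoregressive part)
    # and on the previous estimated error (moving average part).
    design = np.column_stack([target[:-1], resid[:-1]])
    (phi, theta), *_ = np.linalg.lstsq(design, target[1:], rcond=None)
    return phi, theta

# Simulate a hypothetical ARMA(1,1) series with known coefficients.
rng = np.random.default_rng(0)
n, true_phi, true_theta = 2000, 0.6, 0.3
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = true_phi * series[t - 1] + noise[t] + true_theta * noise[t - 1]

phi_hat, theta_hat = fit_arma11(series)
print(f"Estimated autoregressive coefficient: {phi_hat:.2f} (simulated with {true_phi})")
print(f"Estimated moving average coefficient: {theta_hat:.2f} (simulated with {true_theta})")
```

The simulation uses far more than 50 observations so that the estimates settle near the simulated values. With short series, expect the estimates to wander considerably.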