Data Analytics in Bioinformatics. Group of authors


in Refs. [51–53]. Regression analysis on the heart disease dataset yields the statistics presented in Table 1.1.

      Where,

       Multiple R (Correlation Coefficient): It measures the strength of the linear relationship between two variables, here the age and cholesterol of a person. This value always lies between −1 and +1. The obtained value, 0.972834634, indicates a strong linear relationship between age and cholesterol level.

       R2: It is the coefficient of determination, i.e. the goodness of fit. The obtained value is 0.946407225, which indicates that about 95% of the variation in the dependent variable of the heart disease dataset is explained by the regression model.

       Adjusted R2: This is a version of R2 adjusted for the number of predictors in the model. It increases when a new term improves the model more than would be expected by chance, and decreases otherwise. The obtained value, 0.945430663, is only slightly lower than R2, so the adjustment for the number of predictors changes the assessment of the fit very little.

       Standard Error: It measures the precision of the regression model, i.e. how well the data have been approximated; the smaller the number, the more accurate the results. It is expressed in the units of the dependent variable. The value obtained is 12.7814549, so the observed values typically fall about 12.8 units from the fitted values.

      Table 1.1 Regression statistics.

| Regression Statistics | Value |
|---|---|
| Multiple R | 0.972834634 |
| R Square | 0.946407225 |
| Adjusted R Square | 0.945430663 |
| Standard Error | 12.7814549 |
| Observations | 1,025 |
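
The quantities in Table 1.1 can be illustrated with a minimal Python sketch. It assumes a simple least-squares fit with one predictor and an intercept, uses the textbook definitions of each statistic, and runs on synthetic data rather than the heart disease dataset. The function name `regression_statistics` and the variables `age` and `chol` are hypothetical, and the tool used to produce Table 1.1 may apply different conventions, so the sketch is not expected to reproduce the reported numbers.

```python
import math
import random


def regression_statistics(x, y):
    """Summary statistics for a least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    p = 1  # number of predictors (assumption: simple regression)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    r_square = 1.0 - ss_res / ss_tot
    return {
        # With a single predictor, Multiple R is the absolute correlation
        "multiple_r": math.sqrt(r_square),
        "r_square": r_square,
        # Penalise R Square for the number of predictors
        "adjusted_r_square": 1.0 - (1.0 - r_square) * (n - 1) / (n - p - 1),
        # Typical size of a residual, in the units of y
        "standard_error": math.sqrt(ss_res / (n - p - 1)),
        "observations": n,
    }


# Synthetic illustration only (not the heart disease dataset)
random.seed(0)
age = [random.uniform(30, 80) for _ in range(200)]
chol = [150 + 1.5 * a + random.gauss(0, 10) for a in age]
stats = regression_statistics(age, chol)
for name, value in stats.items():
    print(name, value)
```

Under these definitions, Multiple R is the square root of R Square, which is consistent with the values in Table 1.1 (0.972834634 squared is approximately 0.946407225).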

       1.4.1 Logistic Regression

      Logistic Regression is a statistical model used to estimate the probability of a class when the dependent variable is binary, i.e. Yes or No. It indicates whether an observation belongs to the Yes category or the No category. For example, the result of an event may be Win or Loss, Pass or Fail, Accept or Not-Accept, etc. Mathematically, the outcome is represented by an indicator variable that takes the values 0 and 1. It is different from the Linear Regression technique, as depicted in Ref. [54]. Because logistic regression is important in real-life classification problems, as depicted in Refs. [55, 56], fields such as Medical Sciences, Social Sciences, and ML use this model in their various areas of operation.
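
As a hedged illustration of the idea above, the following sketch fits a one-predictor logistic model by batch gradient descent on synthetic data. The helper names (`sigmoid`, `fit_logistic`, `predict`), the learning rate, the number of epochs, and the 0.5 decision threshold are all assumptions made for this example; it is not the procedure used by the authors.

```python
import math
import random


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    # Branch on the sign of z so that math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus 0/1 label
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


def predict(b0, b1, x, threshold=0.5):
    """Return 1 (Yes) or 0 (No) by thresholding the predicted probability."""
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0


# Synthetic data: the label 1 becomes more likely as x grows
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [1 if random.random() < sigmoid(2.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
print("class for x = 3:", predict(b0, b1, 3.0))
print("class for x = -3:", predict(b0, b1, -3.0))
```

The model outputs a probability, and the 0/1 class is obtained only after thresholding it; moving the threshold trades false positives against false negatives.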

      Logistic Regression is performed on the heart disease dataset [41]. The Receiver Operating Characteristic (ROC) curve is computed, with the true positive rate plotted on the y-axis and the false positive rate plotted on the x-axis. After performing the logistic regression in Python (Google Colab), the outcome is presented in Figure 1.11 and Table 1.2. Figure 1.11 shows the ROC curve and Table 1.2 reports the Area under the ROC Curve (AUC).
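
The construction of the ROC curve and its area can be sketched in plain Python as follows. This is a minimal illustration on four made-up scores, not the computation behind Figure 1.11; the function names `roc_points` and `auc` are hypothetical, and the sketch assumes the labels contain at least one positive and one negative case.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Lower the threshold step by step, from the highest score to the lowest
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    previous_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so ties share one threshold
        if previous_score is not None and score != previous_score:
            points.append((fp / negatives, tp / positives))
        if label == 1:
            tp += 1
        else:
            fp += 1
        previous_score = score
    points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)  # 0.75
```

Each point corresponds to one classification threshold; a classifier that ranks every positive case above every negative case yields an area of 1.0, while random ranking yields about 0.5.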

      The AUC obtained on the training data is 0.8374022, whereas on the test data it is 0.9409523 (Table 1.2). According to the index given with Table 1.2, the former is rated excellent and the latter outstanding. An AUC above 0.9 means that the model ranks a randomly chosen positive case above a randomly chosen negative case more than 90% of the time. In the next section, the difference between Linear and Logistic Regression is discussed.

      Figure 1.11 ROC curve for logistic regression.

      Table 1.2 AUC: Logistic regression.

| Parameter | Data | Value | Result |
|---|---|---|---|
| Area under the ROC Curve (AUC) | Training Data | 0.8374022 | Excellent |
| Area under the ROC Curve (AUC) | Test Data | 0.9409523 | Outstanding |

Index: 0.5: No discrimination; 0.6–0.8: Acceptable; 0.8–0.9: Excellent; >0.9: Outstanding
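
The index under Table 1.2 can be expressed as a small lookup function. This is a sketch only: the function name `auc_label` is hypothetical, and the treatment of boundary values (0.8 and 0.9 are assigned to "Excellent") and of values the index does not mention (below 0.5, or between 0.5 and 0.6) is an assumption, since the index does not specify them.

```python
def auc_label(auc):
    """Map an AUC value to the qualitative rating of the index under Table 1.2."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc > 0.9:
        return "Outstanding"
    if auc >= 0.8:  # assumption: 0.8 and 0.9 themselves count as Excellent
        return "Excellent"
    if auc >= 0.6:
        return "Acceptable"
    if auc == 0.5:
        return "No discrimination"
    # Assumption: the index gives no rating for the remaining values
    return "Not covered by the index"


print(auc_label(0.8374022))  # Excellent
print(auc_label(0.9409523))  # Outstanding
```

Applied to the two values in Table 1.2, the function returns the same ratings as the Result column.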

       1.4.2 Difference between Linear & Logistic Regression

      Linear and logistic regression are two common types of regression used for prediction. The result of the prediction is represented with the help of numeric variables. The differences between linear and logistic regression are summarized in Table 1.3 for easy understanding.

      Linear regression models the data with a straight line, whereas logistic regression models the probability of a binary event, with the log-odds of the event expressed as a linear function of the independent variables. A few other types of regression analysis have been described by different researchers and are listed below.
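
The contrast between the two models can be written compactly for a single predictor. The symbols used here ($x$ for the independent variable, $\hat{y}$ for the predicted value, $p$ for the probability of the event, and $\beta_0$, $\beta_1$ for the coefficients) are notation introduced for this illustration and do not come from the text.

```latex
\begin{align}
\text{Linear regression:}\quad & \hat{y} = \beta_0 + \beta_1 x \\
\text{Logistic regression:}\quad & \log\frac{p}{1-p} = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{align}
```

In the linear model the output $\hat{y}$ can take any real value, whereas in the logistic model $p$ always lies strictly between 0 and 1 and follows an S-shaped curve in $x$.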

| S. No. | Parameter | Linear regression | Logistic regression |
|---|---|---|---|
| 1 | Purpose | Used for solving regression problems. | Used for solving classification problems. |
| 2 | Variables Involved | Continuous Variables | Categorical Variables |
| 3 | Objective | Finding of best-fit-line and predicting the output. | Finding of s-curve and classifying the samples. |
| 4 | Output | Continuous Variables such as age, price, etc. | Categorical Values such as 0 & 1, Yes & No. |
| 5 | Collinearity | There may be collinearity between independent attributes. | There should not be collinearity between independent attributes. |
| 6 | Relationship | The relationship between a dependent variable and the independent variable must be linear. | The relationship between a dependent variable and an independent variable |
