Data Analytics in Bioinformatics. Group of authors
Polynomial Regression: It is used for curvilinear data [57–58].
Stepwise Regression: It builds a predictive model by adding or removing predictor variables step by step [59–60].
Ridge Regression: Used for multiple regression data, typically when the predictors are highly correlated [61–62].
Lasso Regression: Used for variable selection and regularization [63–64].
Elastic Net Regression: Combines the penalties of the lasso and ridge methods [65].
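The difference between the last three methods lies in the penalty added to the regression loss. The following is a minimal sketch of those penalties, assuming the simple additive form of the elastic net; the function names, the coefficient values, and the penalty strengths are illustrative and not taken from the cited references.

```python
def lasso_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)


def ridge_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam_l1, lam_l2):
    """Elastic net penalty: the lasso and ridge penalties added together."""
    return lasso_penalty(weights, lam_l1) + ridge_penalty(weights, lam_l2)


# Hypothetical coefficients of a fitted linear model (assumption).
coefficients = [0.5, -2.0, 0.0, 1.5]

print(lasso_penalty(coefficients, lam=0.1))
print(ridge_penalty(coefficients, lam=0.1))
print(elastic_net_penalty(coefficients, lam_l1=0.1, lam_l2=0.1))
```

Libraries often parameterize the elastic net differently (for example with a single strength and a mixing ratio), so the two separate strengths used here are only one possible convention.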
1.5 Random Forest
Random Forest was first proposed by Tin Kam Ho [66]. It is a supervised ensemble learning method that solves both regression and classification problems. As a bagging algorithm, it works by averaging the results of many decision trees, which reduces overfitting [67–71]. It is a flexible, ready-to-use machine learning algorithm. When used for regression, Random Forests are known as Regression Forests [72]. The method can cope with missing values, but at the cost of greater complexity and a longer training period. It is called random for two specific reasons:
When building trees, a random sample of the training data set is drawn.
When splitting nodes, a random subset of features is considered.
The functioning of random forests is illustrated in Figure 1.12.
In the above figure there are five trees, each one representing a disease: blue represents liver disease, orange represents heart disease, green represents stomach disease, and yellow represents lung disease. Because orange is the majority color, orange (heart disease) is the winner.
Figure 1.12 Random forest.
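The majority vote described for Figure 1.12 can be sketched in a few lines. This is a minimal illustration, assuming each tree returns a single class label; the vote list below is hypothetical and does not reproduce the figure.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    If two classes tie, the one that appears first in the list wins.
    """
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical votes from five trees (assumption, for illustration only).
votes = ["heart", "liver", "heart", "lung", "heart"]
print(majority_vote(votes))  # prints "heart"
```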
This concept is known as the wisdom of the crowd, as discussed in Ref. [73]. The method is implemented with the help of two concepts, which are listed below:
Bagging: Decision trees are very sensitive to the data on which they are trained. A small change in the data can have a large effect on the model, and the structure of the tree can change completely. Random forests take advantage of this by allowing each tree to sample the dataset randomly with replacement, which results in different trees. This is called bagging or bootstrap aggregation [74–75].
Random Feature Selection: Normally, when a node is split, every possible feature is considered and the one that produces the most separation is chosen. In a random forest, by contrast, only a random subset of features is considered at each split. This allows more variation in the model and results in greater diversification [76].
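The two sources of randomness can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the implementation used by the authors: the helper names, the feature names, and the ten-row stand-in dataset are all hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features to consider at one node split."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training records (assumption)
features = ["age", "sex", "chol", "trestbps", "thalach"]  # hypothetical names

# In a real forest the feature subset is redrawn at every split;
# one draw per tree is shown here only to keep the sketch short.
for tree_index in range(3):
    sample = bootstrap_sample(rows, rng)
    candidates = random_feature_subset(features, 2, rng)
    print(tree_index, sorted(sample), candidates)
```

Because sampling is done with replacement, some records usually appear more than once in a sample and others not at all, which is what makes the individual trees differ.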
Random Forest was also applied to the heart disease dataset. The key is a low correlation between the models. The Area under the ROC Curve (AUC) of Random Forest, computed in Python (Google Colab), is shown in Table 1.4 and Figure 1.13.
Table 1.4 reports the area under the receiver operating characteristic curve (AUC).
AUC measures the degree of separability. The value obtained on the training data is 1.0000000, and the value obtained on the test data is also 1.0000000; both fall in the outstanding range of the AUC index. The result indicates that the model performs outstandingly on the heart disease dataset.
Table 1.4 AUC: Random forest.
Parameter | Data | Value | Result |
The area under the ROC Curve (AUC) | Training Data | 1.0000000 | Outstanding |
The area under the ROC Curve (AUC) | Test Data | 1.0000000 | Outstanding |
Index: 0.5: No Discriminant, 0.6–0.8: Can be considered accepted, 0.8–0.9: Excellent, >0.9: Outstanding
Figure 1.13 ROC curve for random forest.
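To make the AUC values in Table 1.4 concrete, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. It is a minimal illustration, assuming binary labels coded as 0 and 1; the labels and scores are made up and are not the heart disease data.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores (assumption, for illustration only).
labels = [0, 0, 1, 1]
print(auc_score(labels, [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
print(auc_score(labels, [0.1, 0.2, 0.8, 0.9]))   # prints 1.0
```

An AUC of 1.0, as in the second call, means that every positive example is scored above every negative example.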
1.6 K-Nearest Neighbor
K-Nearest Neighbor (KNN) is a supervised classification algorithm and hence needs labeled data for training [77, 78]. In this approach, the value of K is supplied by the user. It can be used for both classification and regression, but the attributes must be known. The KNN algorithm assigns a new data point to a class according to its k closest data points.
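The classification step can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a simple majority vote among the k nearest neighbors; the function name and the toy data points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only and keep the k closest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _dist, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-dimensional data with two classes (assumption).
train_points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_labels = [0, 0, 1, 1]

print(knn_predict(train_points, train_labels, (1.2, 1.5), k=3))  # prints 0
print(knn_predict(train_points, train_labels, (5.5, 5.0), k=3))  # prints 1
```

An odd value of k avoids tied votes in a two-class problem, which is one reason the choice of K is left to the user.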
The Area under the ROC Curve (AUC) has also been used on the heart disease dataset. It is the most basic tool for judging a classifier’s performance in medical decision making [79–81]. The ROC curve is a graphical plot for judging the diagnostic ability of a binary classifier. The ROC curve generated for KNN on the heart disease dataset [41] is presented below in Figure 1.14.
In Figure 1.14, the true positive rate (probability of detection) is plotted on the y-axis and the false positive rate (probability of false alarm) is plotted on the x-axis. The false positive rate is the proportion of units with a known negative condition for which the predicted condition is positive.
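The two rates that make up a ROC curve can be computed by sweeping a decision threshold over the classifier scores. The sketch below is a minimal illustration, assuming binary labels coded as 0 and 1 with at least one example of each class; the labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score used as a threshold,
    starting from (0.0, 0.0) where nothing is predicted positive.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels and scores (assumption, for illustration only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for fpr, tpr in roc_points(labels, scores):
    print(fpr, tpr)
```

Plotting the second value of each pair against the first gives the ROC curve, and the area beneath that curve is the AUC.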
The Area under the ROC Curve (AUC) of K-nearest neighbor, computed on the heart disease dataset [41] in Python (Google Colab), is shown below in Table 1.5.
Figure 1.14 ROC curve for k-nearest neighbor.
Table 1.5 AUC: K-nearest neighbor.
Parameter | Data | Value | Result |
The area under the ROC Curve (AUC) | Training Data | 1.0000000 | Outstanding |