Bioinformatics and Medical Applications. Group of authors
Figure 1.1 Heatmap of input attributes.
Figures 1.2, 1.3, 1.4, and 1.5 display the distribution of some of the input values such as age, gender, presence of cardiovascular disease, and cholesterol type.
1.3.2 Machine Learning Algorithm
After the data were analyzed, they were split into a training set (80%) and a testing set (20%). This split is necessary to assess the ability of the model to generalize to new data. Several classifier models were tested; they are explained as follows.
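The 80/20 split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `train_test_split`, the `seed` parameter, and the stand-in `records` list are assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, test) lists (hypothetical helper)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for 100 patient records; real rows would hold the input attributes.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that reflects only the ordering of the original file.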
Figure 1.2 Age distribution.
Figure 1.3 Presence of cardiovascular disease.
Figure 1.4 Cholesterol type distribution.
Figure 1.5 Gender distribution.
1.3.3 Decision Tree
Decision trees are powerful and widely used tools for classification and prediction. A decision tree is a tree-based classifier in which each internal node represents a test on one attribute, each leaf gives the value of the target attribute, each edge represents one outcome of the split on an attribute, and each path from the root to a leaf is a conjunction of tests that forms the final decision.
The current implementation offers two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance). Gini impurity is the probability of misclassifying a new sample if it were labeled at random according to the distribution of class labels in the dataset. It reaches its lower bound of 0 when the data contain only one class. The Gini index is defined by the formula
\[ \mathrm{Gini} = 1 - \sum_{j} p_j^{2} \]
Entropy is defined as
\[ \mathrm{Entropy} = -\sum_{j} p_j \log_2 p_j \]
where p_j is the proportion of samples that belong to class j for a specific node.
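As a rough illustration, both impurity measures can be computed directly from the class labels at a node. The sketch below assumes the standard definitions (Gini as one minus the sum of squared class proportions, entropy as the negative sum of p log2 p); the function names and the toy label lists are hypothetical, and the label list is assumed to be non-empty.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Proportion of samples in each class at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits; p * log2(1 / p) equals -p * log2(p)."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))

pure = [1, 1, 1, 1]    # only one class present
mixed = [0, 0, 1, 1]   # two classes in equal proportion
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are 0 for the pure node; for the evenly mixed two-class node the Gini impurity is 0.5 and the entropy is 1 bit.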
Gini impurity and entropy are used as selection criteria for decision trees. They help determine a good split point for the root and decision nodes of classification trees (regression trees use variance instead). A decision tree splits on the feature and split point that give the highest information gain (IG) under the chosen criterion (Gini or entropy). Information gain is the decrease in impurity after a dataset is split on an attribute. Some of the benefits of decision trees are as follows:
• It requires less effort for data preparation during preprocessing.
• It does not require standardization and data scaling.
• It is intuitive and simple to explain.
However, it has some disadvantages too, as follows:
• Minor changes in the data can cause major structural changes leading to instability.
• Its calculations can sometimes be more complex than those of other algorithms.
• It usually takes more time to train.
• Training is comparatively expensive because of the complexity and time involved.
• It is not well suited to regression and to predicting continuous values.
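The split selection described in this section, choosing the split point with the highest information gain, can be sketched as follows. This is a simplified single-feature example under stated assumptions: entropy is used as the criterion, candidate thresholds are the observed feature values, and the `ages` and `disease` lists are made-up data rather than values from the dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def best_threshold(values, labels):
    """Try each split 'value <= t' and keep the one with the highest gain."""
    best_gain, best_t = 0.0, None
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

ages = [35, 42, 50, 58, 63, 70]   # hypothetical feature values
disease = [0, 0, 0, 1, 1, 1]      # hypothetical class labels
threshold, gain = best_threshold(ages, disease)
print(threshold, gain)
```

On this toy data the split `age <= 50` separates the two classes completely, so the gain equals the full parent entropy of 1 bit. A real tree repeats this search over every feature and recurses into each child node.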
1.3.4 Random Forest
The Random Forest, as its name implies, consists of a large number of individual decision trees that work together as an ensemble. The main idea behind a random forest is the wisdom of crowds: a large number of relatively uncorrelated trees operating as a committee will outperform any of the individual constituent models. Random Forest can be customized by tuning parameters such as the split criterion, the depth of the trees, and the maximum and minimum leaf sizes. It is a supervised machine learning algorithm used for both classification and regression. It uses bagging and feature randomness when building each individual tree, in order to create an uncorrelated forest whose prediction is more accurate than that of any individual tree. The mathematical description of the model is as follows:
1. Let D = {(x1, y1), …, (xn, yn)} be the dataset used for training.
2. Let w = {w1(x), w2(x), …, wK(x)} be an ensemble of weak classifiers.
3. If every wk is a decision tree, then the parameters of the tree are described as θk = (θk1, θk2, …, θkp).
4. The output of each decision tree is a classifier wk(x) = w(x|θk).
5. Hence, the final classification is f(x) = majority vote of {wk(x)}.
Figure 1.6 gives a pictorial representation of the working of random forest.
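The steps above can be illustrated with a deliberately small sketch. To stay short it uses one-level trees (decision stumps) as the weak classifiers wk, each trained on a bootstrap sample with one randomly chosen feature, and it combines them by majority vote. All function names, the parameter tuple `theta`, and the toy data are assumptions made for this example; a full random forest grows deeper trees and samples features at every split.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature):
    """Return theta = (feature, threshold, left_label, right_label)
    with the fewest training errors for the given feature."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, (feature, t, left_label, right_label))
    return best[1]

def predict_stump(theta, x):
    """Weak classifier w(x | theta)."""
    feature, t, left_label, right_label = theta
    return left_label if x[feature] <= t else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bagging: bootstrap sample
        feature = rng.randrange(len(X[0]))         # feature randomness
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Final classification f(x): majority vote over all weak classifiers."""
    return majority([predict_stump(theta, x) for theta in forest])

# Hypothetical rows: [age, cholesterol type]; label 1 = disease present.
X = [[35, 1], [40, 1], [45, 1], [50, 1], [60, 2], [65, 3], [70, 2], [75, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [30, 1]), predict_forest(forest, [80, 3]))
```

Each tree sees a slightly different resample and feature, so individual stumps can disagree near the class boundary; the vote smooths out those disagreements.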
Some of the advantages of Random Forest algorithm are as follows:
• Reduces the overfitting problem.
• Solves both classification and regression problems.
• Handles missing values.