1.4 Proposed Method
1.4.1 Experiment and Analysis
Our proposed method, a Naive Bayes multi-model decision-making system, uses a majority-voting ensemble that combines Naive Bayes, Decision Tree, and Random Forest for analytics on a database of heart disease patients, and it attains an accuracy that outperforms any of the individual methods. Additionally, it uses K-means in combination with the above methods to further increase the accuracy.
The data come from the Kaggle cardiovascular disease dataset, which contains 12 attributes. The presence or absence of cardiovascular disease is recorded in the binary target column, where 0 indicates absence and 1 indicates presence. There are a total of 70,000 records with attributes for age, height, weight, gender, systolic and diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, and physical activity.
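For concreteness, the following is a minimal sketch of loading this dataset and producing the 70:30 split used below. The file name cardio_train.csv, the semicolon separator, and the id/cardio column names follow the public Kaggle release and should be treated as assumptions rather than the authors' exact setup.

# Minimal sketch: load the Kaggle cardiovascular disease dataset and
# prepare the 70:30 train/test split used in the experiments.
# File name, separator, and column names are assumptions from the
# public Kaggle release; adjust to the actual file if they differ.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cardio_train.csv", sep=";")  # Kaggle file is ;-separated

X = df.drop(columns=["id", "cardio"])  # the 11 predictor attributes
y = df["cardio"]                       # 0 = disease absent, 1 = present

# 70:30 split, stratified so both classes keep their proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)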
The data are split into training and testing sets in a 70:30 ratio. During training and testing, we tried various combinations to see their effect on prediction accuracy. We also took the data in chunks of 1,000, 5,000, 10,000, 50,000, and 70,000 records, respectively, and observed the change in patterns. The combinations we evaluated are listed below.
• NB: Only the Naive Bayes algorithm is applied.
• DT: Only the Decision Tree algorithm is applied.
• RF: Only the Random Forest algorithm is applied.
• Serial: Naive Bayes, followed by Random Forest, followed by Decision Tree (in increasing order of individual accuracy); one plausible reading of this cascade is sketched after this list.
• Parallel: All three algorithms are applied in parallel and majority voting is used.
• Prob 60 SP: If the probability calculated by Naive Bayes is greater than 60%, the serial method is applied; otherwise, the parallel method is applied.
• PLS: Parallel is applied first, then serial is applied to the misclassified records.
• SKmeans: Serial combined with K-means.
• PKmeans: Parallel combined with K-means.
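The chapter does not spell out how a record moves through the serial chain, so the following is only a sketch of one plausible reading: each record is accepted by the first model that predicts it with sufficient confidence and deferred to the next model otherwise, with the last model deciding everything that remains. The 0.6 threshold echoes the Prob 60 SP rule; serial_predict and all other names here are ours, and X_train/y_train come from the split sketched earlier.

# Hedged sketch of the "Serial" configuration: NB -> RF -> DT.
# Assumption: a record deferred by one model (low confidence) is passed
# on to the next model; this is one plausible reading, not necessarily
# the authors' exact procedure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def serial_predict(models, X, threshold=0.6):
    """Cascade over fitted models: keep a prediction once its class
    probability reaches `threshold`, otherwise defer to the next model;
    the last model decides every record still undecided."""
    preds = np.full(len(X), -1)
    undecided = np.ones(len(X), dtype=bool)
    for i, model in enumerate(models):
        proba = model.predict_proba(X[undecided])
        labels = model.classes_[proba.argmax(axis=1)]
        confident = proba.max(axis=1) >= threshold
        if i == len(models) - 1:       # last model decides everything left
            confident[:] = True
        idx = np.flatnonzero(undecided)
        preds[idx[confident]] = labels[confident]
        undecided[idx[confident]] = False
        if not undecided.any():
            break
    return preds

Xtr, Xte = X_train.to_numpy(), X_test.to_numpy()
# NB -> RF -> DT, in the order given in the text
chain = [GaussianNB(),
         RandomForestClassifier(n_estimators=100, random_state=0),
         DecisionTreeClassifier(random_state=0)]
for m in chain:
    m.fit(Xtr, y_train)
y_serial = serial_predict(chain, Xte)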
From this analysis, we found the PKmeans method to be the most efficient. Although serial combined with K-means achieves the best accuracy on training data, it is not feasible for real data, where the target column is not present. No single algorithm can be relied on to classify all the records correctly; hence, we use a more suitable ensemble method, which exploits the wisdom of the crowd. Specifically, we use majority voting, which collects the crisp class labels predicted by the different models and selects the class with the most votes.
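As a concrete illustration of this hard-voting scheme, the sketch below combines the three base classifiers with scikit-learn's VotingClassifier. The hyperparameters are illustrative defaults rather than the chapter's tuned settings, and X_train/y_train come from the split above.

# Minimal sketch of the "Parallel" majority-voting ensemble (hard voting):
# each fitted model casts one crisp class label per record and the
# majority class wins.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="hard",   # vote on crisp class labels, majority wins
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", accuracy_score(y_test, ensemble.predict(X_test)))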
Our goal is to achieve the best possible accuracy, surpassing that of the individual methods. Figures 1.8 to 1.11 show the confusion matrices for Naive Bayes, Random Forest, and Decision Tree individually, as well as their ROC curves.
Figure 1.8 NB confusion matrix.
Figure 1.9 RF confusion matrix.
Figure 1.10 DT confusion matrix.
Figure 1.11 ROC curve analysis.
1.4.2 Method
We observed that by applying a majority-voting ensemble to Decision Tree, Random Forest, Naive Bayes, and K-means, we could achieve an accuracy of 91.56%. To improve the accuracy further, we propose the following algorithm. The design of the proposed method is given in Figure 1.12.
Figure 1.12 Proposed architecture.
Algorithm 1.1 Probabilistic optimization.
initialization
d ← dataset
a1 ← Naive_Bayes_output ← ApplyNaiveBayes(d)
a2 ← Decision_tree_output ← ApplyDecisionTree(d)
a3 ← Random_forest_output ← ApplyRandomForest(d)
a4 ← K_Means_output ← ApplyKmeans(d)
winner(0, 1) ← Voting(a1, a2, a3, a4)
op ← winner_of_max_count(0, 1)
if op ≠ desired_output then
    calculate the probability of each column value given output 0 or 1
end
for each value in column c_i do
    count ← c_i / 2
    for k ← 1 to count do
        add the probabilities and find the column whose probability matches best
    end
end
t_i ← number of columns selected
w_i ← weightage of the selected columns
α_i ← input data with the weightage w_i appended
find the mean squared error against the training data and select the parameter with the lowest MSE
compute the Euclidean distance d(p, q) = √(Σ_i (p_i − q_i)²) and predict the class at minimum distance
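Since the pseudocode leaves several steps underspecified, the sketch below implements only the parts that can be reconstructed with reasonable confidence: the four-way majority vote (with K-means clusters mapped to class labels by majority training label, an assumption on our part) and a Euclidean nearest-centroid rule, used here as the tie-breaker for split 2-2 votes. The column-probability weighting steps (t_i, w_i, α_i) are omitted because the chapter's description is too terse to reconstruct faithfully; all identifiers are illustrative.

# Hedged sketch of Algorithm 1.1's decision stage; not the authors' code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

Xtr, Xte = X_train.to_numpy(), X_test.to_numpy()
ytr = y_train.to_numpy()

# Fit the three supervised voters (a1..a3 in the pseudocode).
voters = [GaussianNB(),
          DecisionTreeClassifier(random_state=0),
          RandomForestClassifier(n_estimators=100, random_state=0)]
for m in voters:
    m.fit(Xtr, ytr)

# K-means (a4) is unsupervised: map each of its two clusters to the
# majority training label inside that cluster (assumption on our part).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xtr)
cluster_to_label = {c: np.bincount(ytr[km.labels_ == c]).argmax()
                    for c in (0, 1)}
km_votes = np.array([cluster_to_label[c] for c in km.predict(Xte)])

# Majority vote over the four outputs.
votes = np.column_stack([m.predict(Xte) for m in voters] + [km_votes])
ones = votes.sum(axis=1)               # number of votes for class 1 (0..4)
pred = (ones > 2).astype(int)          # 3 or 4 votes -> class 1

# Tie-break for 2-2 votes: Euclidean distance to the per-class centroids,
# our reading of the algorithm's final minimum-distance step.
centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
tied = ones == 2
dists = np.linalg.norm(Xte[tied, None, :] - centroids[None, :, :], axis=2)
pred[tied] = dists.argmin(axis=1)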