Data Analytics in Bioinformatics. Group of authors
5. Feature detection, classification and sequencing
6. Identification and analysis of signals generated from regulatory sites
7. Protein structure prediction from different sequences
8. Expression of genetic and genomic data
9. Monitoring the treatment of patients based on DNA sequences.
3.3.6 Broadly Used Supervised Machine Learning Techniques
Apart from ANN, many supervised machine learning algorithms, such as Support Vector Machine (SVM) [22], Logistic Regression [23], Decision Tree [24], K-Nearest Neighbors (KNN) [25] and Random Forest [26], are widely used in bioinformatics and obtain high classification accuracy. These models, together with the most popular artificial neural network architectures used in the literature, are discussed further.
3.4 Literature Review
Over the years, artificial neural networks have been widely used in gene expression data processing because of their ability to identify complicated relationships between attributes in large data sets. ANNs have achieved great success owing to their potential to manage the complexity and nonlinearity of biological datasets. Gene expression analysis has aimed to define more specific biological aspects in order to enhance patient risk stratification and to secure the highest benefit and least toxicity from a specific treatment.
The Wisconsin Prognostic Breast Cancer (WPBC) dataset was collected by Samundeeswari et al. [27] to perform an experiment using an ANN model. The ANN was used to predict the status of patients at a particular endpoint and the time of disease occurrence. The dataset consisted of 35 features and 194 instances. A feedforward neural network with two hidden layers of 20 neurons each was used, and the entire experiment was carried out in the Matlab environment. The model was trained with the backpropagation technique, and the sigmoid activation function was used for the hidden and output nodes. In this study the neural network performed well, with 96% specificity and 97.68% accuracy.
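The architecture described above can be sketched as a plain forward pass. This is a minimal illustration in Python, not the authors' Matlab implementation: the weights are random and untrained, the single output node is an assumption (the source does not state the output size), and the input vector is synthetic. The names `init_layer` and `forward` are hypothetical helpers.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate one input vector through every layer, applying sigmoid each time."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

rng = random.Random(0)
# 35 input features, two hidden layers of 20 neurons, one output node (assumed).
sizes = [35, 20, 20, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

x = [rng.random() for _ in range(sizes[0])]  # synthetic stand-in for one patient record
y = forward(x, layers)
print(y)
```

Training with backpropagation would adjust `weights` and `biases` from labelled examples; only the untrained forward computation is shown here.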
An ANN model was used by Narayanan et al. [28] to identify positive and negative genes related to cancer from a large dataset. A dataset of 74 patients with 7,129 gene expressions was collected. Of the 74, 31 samples were normal bone marrow cases and the rest of the patients were diagnosed with multiple myeloma. Different experiments were carried out using a single-layer ANN model. The authors concluded that the need for hidden layers in a network depends on the complexity of the gene expression dataset. For gene expression analysis, a single-layer neural network was very useful for the sake of simplicity and generalizability, and could solve many complex problems with a suitable architectural modification.
Hu et al. [29] designed a model to classify bladder cancer cells into six tumor classes using 467 images. The accuracy of the model was estimated using both supervised and unsupervised learning methods. In supervised learning, an MLP with one hidden layer was trained with the backpropagation algorithm to classify the cells and reduce the error rate. In unsupervised learning, fuzzy and non-fuzzy c-means clustering methods were implemented. Different activation functions, such as Gaussian, sigmoid and sinusoid, were studied for different network configurations. Using all the available data and 5 different features, the neural network classifier was able to capture the information about cancer cells and obtained a 96.9% classification rate, whereas fuzzy c-means obtained only 76.50%.
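For readers unfamiliar with fuzzy c-means, the following is a minimal sketch of the standard algorithm in Python, assuming Euclidean distance and a fuzzifier of m = 2. It is not the implementation used in [29]; the toy points, the initial centers and the helper names `memberships_for` and `fuzzy_c_means` are illustrative assumptions.

```python
import math

def memberships_for(points, centers, m, eps=1e-9):
    """Degree of membership of every point in every cluster (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    table = []
    for p in points:
        # Clamp distances so a point sitting exactly on a center does not divide by zero.
        dists = [max(math.dist(p, c), eps) for c in centers]
        table.append([1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists])
    return table

def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(n_iter):
        u = memberships_for(points, centers, m)
        for i in range(len(centers)):
            # Each center becomes the mean of all points, weighted by membership ** m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            centers[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return centers, memberships_for(points, centers, m)

# Toy data: two well-separated groups in two dimensions (synthetic, for illustration only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, u = fuzzy_c_means(points, centers=[(0.0, 0.0), (6.0, 6.0)])
for p, row in zip(points, u):
    print(p, [round(v, 3) for v in row])
```

Unlike non-fuzzy c-means, every point keeps a graded membership in each cluster; a hard assignment can be read off by taking the cluster with the largest membership.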
In 2003, Won et al. [30] classified a leukemia dataset consisting of 72 samples from two classes: acute lymphoblastic leukemia and acute myeloid leukemia. Each sample had 7,129 gene expression levels, which formed the input to the model. The model was trained using 38 samples and the rest of the samples were used for testing. The researchers used a 3-layered MLP for data classification with 8 hidden nodes and 2 output nodes. The results showed that the ANN performed well, with an accuracy of 97%.
Thein et al. [31] used a breast cancer medical dataset with 699 instances and 10 attributes plus one class attribute. The dataset was made available by the University of Wisconsin Hospital, Madison. Attributes 1 to 9 were used as the input features of the model. Each instance belonged to one of two possible classes: benign or malignant. According to the class distribution, 458 instances were benign and 241 were malignant. The dataset was classified using a multilayer neural network (MLP) with the backpropagation technique and achieved an accuracy of 99.97%. The authors concluded that ANN has the greatest tolerance of noisy data and a great ability to classify untrained data patterns.
Peterson et al. [32] analyzed a DNA microarray cancer data set by comparing different machine learning algorithms, namely ANN, logistic regression, linear discriminant analysis, SVM and k-nearest neighbor, for survival analysis of patients. One of the main findings was that ANN depends on the statistical significance of the features; even so, despite the large sample size, ANN outperformed all other classifiers by achieving the greatest area under the curve.
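The area under the ROC curve used for such comparisons can be computed directly from classifier scores. The sketch below, in Python, uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the labels and scores are made up for illustration and are not taken from [32].

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))

# Synthetic example: 3 positives and 3 negatives, one positive ranked below one negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs are ordered correctly -> 0.889
```

A value of 1.0 means every positive is ranked above every negative, and 0.5 corresponds to random ranking.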
In 2020, Soto et al. [33] worked with the 11_Tumors database, a well-recognized database of cancer-related gene microarrays, to estimate the likelihood of different types of cancer. This database had 12,533 gene expression values for 174 samples and 11 categories of cancer. Because the dataset had a large number of features, the dimensionality reduction technique PCA was used to reduce the number of features from 12,533 to 113. Classification was done using a multilayer feedforward network (MLP) with the softsign activation function, consisting of three hidden layers with 100 neurons in each layer. The output layer had 11 neurons with the sigmoid activation function. The classification model was evaluated using 10-fold cross-validation and achieved an accuracy of 97.14%.
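Two ingredients of this setup are easy to show in isolation: the softsign activation, softsign(x) = x / (1 + |x|), and the partitioning of 174 samples into 10 cross-validation folds. The Python sketch below is illustrative only; the shuffling seed and the helper name `k_fold_indices` are assumptions, and the PCA step and the network training of [33] are not reproduced.

```python
import random

def softsign(x):
    """Softsign activation: a smooth squashing function with range (-1, 1)."""
    return x / (1.0 + abs(x))

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

folds = list(k_fold_indices(174, k=10))  # 174 samples, as in the 11_Tumors database
sizes = [len(test) for _, test in folds]
print(sizes)
print(softsign(1.0), softsign(-3.0))
```

In a full evaluation, the model would be trained on each `train` list and scored on the matching `test` list, and the ten scores would be averaged. Since 174 is not divisible by 10, four folds hold 18 samples and six hold 17.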
Wei et al. [34] collected 56 cDNA microarray tumor samples from 49 neuroblastoma patients to predict the survival rate of high-risk patients. To remove poor-quality data, a principal component analysis algorithm was used, and a total of 42,578 features were reduced to 37,920 data points. Each sample was analysed using a powerful ANN-based predictor model. Despite this high complexity, 88% accuracy was achieved. An ANN-based gene minimization strategy was also implemented by the authors in a separate analysis of 19 genes. In this analysis, high-risk patients were further divided into subgroups based on their survival status. The derived subset of 19 genes correctly classified 98% of the patients. The authors concluded that the ANN-based approach has a significant ability to predict survival rate, which would allow personalized therapies for patients according to their gene expression profiles.
Cangelosi et al. [35] developed a robust ANN classification model to predict the outcome of neuroblastoma patients with a minimized error rate by defining a gene expression signature (NB-hypo) that measures the hypoxic status of 100 neuroblastoma tumor gene expression profiles. The ANN-MLP was applied to build a hypoxia predictor from the 62 probe sets of the signature, with potential clinical application in evaluating the adverse effect of tumor hypoxia on the progression of the disease. The results showed that the MLP achieved similar or better performance compared with SVM, Naïve Bayes and logistic regression models. The ANN proved to be a competitive tool for predicting a ‘poor’ or ‘good’ patient outcome when analyzing complex gene expression data.
Nayeem et al. [36] designed a classifier using three datasets collected from the UCI Machine Learning Repository for the diagnosis of heart disease, liver disorder and lung cancer. The proposed network showed an accuracy above 80% on each dataset. An MLP with the gradient descent optimization algorithm was used to minimize the error, and the Levenberg–Marquardt algorithm was used to avoid the curve-fitting problem.
The feedforward–backpropagation algorithm achieved a good performance even when the number of features