Data Analytics in Bioinformatics. Group of authors
A genome is the complete set of an organism's chromosomes, which consist of DNA. Genome sequencing is a way of mapping out, or ordering, DNA so that the sequences can be organized, processed and interpreted, which in turn requires improvements in sequencing strategies. Every DNA sequencing effort faces challenges in searching for sequence patterns and in designing, analyzing and interpreting the data.
In gene finding and genome annotation: Gene finding refers to the prediction of nucleotide sequence features, such as introns and exons, in DNA sequence segments, whereas genome annotation is the process of locating the coding regions of genes in a sequence so that the protein sequence can be analyzed [8]. It also involves the study of repetitive DNA within the genome, which arises from copies of the same or nearly the same sequence.
In sequence comparison: Sequence comparison is the process of comparing two or more sequences. The large number of sequences available in genomic databases requires proper categorization of DNA and protein sequences. Sequence comparison helps to assign a hypothetical structure and function to a sequence, which supports its identification, design and interpretation [8].
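As an illustration of pairwise sequence comparison, the sketch below scores a global alignment of two sequences with the Needleman-Wunsch dynamic programming recurrence. This technique is chosen here only as an example and is not named in the text above; the function name and the scoring values (match, mismatch, gap) are assumptions made for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score of sequences a and b.

    The scoring values are illustrative assumptions, not recommended defaults.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned against a gap
            left = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))  # identical sequences
    print(global_alignment_score("ACGT", "AGT"))   # one deletion
```

A higher score indicates greater similarity under the chosen scoring scheme. Only two rows of the score table are kept in memory, which is enough when the score is needed but the alignment itself is not.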
Analysis of DNA sequencing data is an important task because it helps to detect individual genes that are associated with a disease. When a disease affects an individual, the individual's proteins or genes are altered, which changes the gene sequence. Detecting these genes is therefore very important for finding a cure for the disease. Traditional methods of gene detection were based on trial and error. Advances in data mining and machine learning, such as Neural Networks (NN), now allow a more precise study of genes and their sequences and simplify the task [9]. Many machine learning algorithms are used to classify normal and abnormal genes with high accuracy.
The solution to the above problems involves the following steps:
Collecting biological data
Building a computational model
Analyzing and solving the problems of the computational model
Testing the computational algorithm
Evaluating the performance of the model.
3.2 Biological Datasets
Bioinformatics deals with various biological datasets collected at different levels of omics data, such as:
Genomic Sequence data
Protein Sequence data
Microarray data
Structure data (Structure of RNA and protein)
Chemical data
Disease data.
Based on the type of data, biological databases can be divided into two categories:
a. Primary Database
These databases are archival in nature because they are built from experimental results submitted directly by researchers. They are populated with protein sequences, nucleotide sequences, macromolecular structures, etc. [10]. Example: Protein Data Bank (PDB), GenBank, DNA Data Bank of Japan (DDBJ), Gene Expression Omnibus (GEO).
b. Secondary Database
These databases are either manually created or derived from the analysis of results in primary databases, giving more structured records for easy retrieval of data [10]. Example: Swiss-Prot (a protein sequence database maintained by the Swiss Institute of Bioinformatics, Switzerland, and the European Bioinformatics Institute), UniProt Knowledgebase.
3.3 Building Computational Model
Building a computational model involves studying the different behaviors of a complex system to gain new insights and deepen our understanding. In this section we discuss some prerequisites for building the computational model.
3.3.1 Data Pre-Processing and its Necessity
Data collected from a database must go through several processing steps, because the data in databases are often raw, noisy, incomplete or inconsistent. For these reasons the data cannot be used directly for mining, as doing so may produce unsatisfactory results. To improve the classification result, pre-processing is carried out as an essential step before the data are mined. It usually includes methods such as data cleaning, data integration, data transformation and dimensionality reduction [11]. Data pre-processing significantly improves the quality of the data and the performance of the classification model, and reduces the time required for the actual mining.
We now address some of the problems that need to be solved to achieve a better classification result. Data cleaning deals with noisy data, missing data, duplicate data, etc. in the database so that classification can proceed smoothly. Noisy data are unnecessary information in the dataset that is meaningless and cannot be interpreted by machines; such data can also be called corrupt data. Their presence can affect the data preparation process. Noisy data can be smoothed using binning, clustering or regression methods, or can simply be deleted from large datasets, depending on the amount of noise present [11, 12]. Another major problem in biological data is missing values. In complex biological datasets this issue greatly affects the accuracy of the model, so various imputation techniques have to be used to handle missing values [12]. Data duplication is an ongoing data quality problem reported in diverse domains, including health care, business and molecular biology [13]. The presence of duplicate data leads to inconsistency and redundancy, which has several consequences for classification. This issue can be handled by detecting and eliminating duplicate records.
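The three cleaning tasks described above (removing duplicates, imputing missing values and smoothing noise by binning) can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: records are lists of numbers, missing values are marked with `None`, mean imputation stands in for the "various imputation techniques", and all function names are hypothetical.

```python
from statistics import mean


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values.

    Assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    fills = [mean(v for v in col if v is not None) for col in columns]
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, then replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


if __name__ == "__main__":
    raw = [[1.0, 2.0], [1.0, 2.0], [3.0, None], [5.0, 6.0]]
    print(impute_mean(drop_duplicates(raw)))
    print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

Duplicates are removed before imputation here so that repeated records do not bias the column means; whether that order is appropriate depends on the dataset.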
Biological data may contain thousands of features because they depend heavily on comparing the behavior of various biological units. These data often contain a large amount of irrelevant information, which affects classification accuracy and machine learning efficiency [14]. Dimensionality reduction techniques focus on reducing the number of input features, which helps to reduce computation time and redundant data. Dimensionality can be reduced using different methods such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), feature selection, etc.
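One very simple form of feature selection is to drop features whose values barely change across samples, since they carry little information for telling classes apart. The sketch below illustrates that idea only; it is not LDA or PCA, and the function names and the threshold value are assumptions made for the example.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def keep_columns(rows, indices):
    """Project every record onto the selected column indices."""
    return [[row[j] for j in indices] for row in rows]


if __name__ == "__main__":
    # Three samples with three features; the first feature is constant
    # and the third is nearly constant.
    samples = [[1.0, 5.0, 0.1],
               [1.0, 7.0, 0.1],
               [1.0, 9.0, 0.2]]
    kept = select_by_variance(samples, threshold=0.01)
    print(kept)
    print(keep_columns(samples, kept))
```

A variance filter ignores the class labels, so in practice it is usually only a first pass before a supervised method.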
3.3.2 Biological Data Classification
Databases are a rich source of hidden information that can be extracted for intelligent decision making. Classification is the process of organizing data into categories by determining a class for each element in the database. Data are grouped into classes based on a training dataset. The classification process helps to analyze data and to build models that describe data classes and predict future trends.
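To make the idea of learning classes from a training dataset concrete, the sketch below uses a nearest-centroid classifier: each class is summarized by the mean of its training samples, and a new sample is assigned to the class with the closest mean. This classifier is chosen only for brevity and is not a method named in the text; the function names, the toy two-feature data and the labels "normal" and "tumor" are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Compute the per-class mean vector (centroid) from labeled training samples."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in grouped.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted class matches the given label."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


if __name__ == "__main__":
    train_x = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
    train_y = ["normal", "normal", "tumor", "tumor"]
    model = fit_centroids(train_x, train_y)
    print(predict(model, [1.0, 0.9]))
    print(accuracy(model, [[5.1, 4.9], [0.9, 1.1]], ["tumor", "normal"]))
```

Evaluating accuracy on samples that were not used for training, as in the last line, corresponds to the testing and evaluation steps of a model-building workflow.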
The availability of large amounts of microarray data has created new scope for classification methods. For example, the classification of DNA microarrays contributes significantly to diagnosis and prognosis in many clinical practices [15].
The classification of gene expression data also addresses a fundamental problem in many diseases. Classification of tumor types has achieved great success in cancer diagnosis and precision drug development. However, earlier cancer classification studies were clinically based and lacked adequate analytical capability. Today, thousands of gene expressions can be monitored simultaneously because of advances in classification algorithms.