Data Analytics in Bioinformatics. Группа авторов

Чтение книги онлайн.

Читать онлайн книгу Data Analytics in Bioinformatics - Группа авторов страница 23

Data Analytics in Bioinformatics - Группа авторов

Скачать книгу

Hence, the clusters are now labeled based on their similarities [12].

      During the microarray analysis the data gathered from microarray images are used to construct matrices (refer to Table 2.1) in which rows hold genes and columns hold different conditions viz. different tissues, clinical condition, different biological processes. Expression level data is maintained in each cell of the matrix (refer to Table 2.1) which is uniquely numbered using every gene in every other sample.

      Table 2.1 Gene expression data matrix representation.

Sample 1 Sample 2 Sample 3 Sample 4 Sample m
Gene1 C11 C12 C13 C13 C1m
Gene2 C21 C22 C23 C24 C2m
Gene3 C31 C32 C33 C34 C3m
Gene4 C41 C42 C43 C44 C4m
Gene n Cn1 Cn2 Cn3 Cn4 Cnm

      These gene expression matrices are summed up into a database as shown in Figure 2.2 which acts as a repository for all the genetic information like gene regulation, genetics functionality of diseases, drug discovery based upon gene structures, response to drug projection and so on. These databases with all the genomic information when searched for particular instance of genes, it retrieves all the relevant information with high similarities using these clustering algorithms [32].

      The sequence of actions for Clustering of Gene Expression (GE), include

      Figure 2.2 Example matrix of gene expression (10 genes in a row and 2 samples in columns) [29–31].

       A matrix which hold the data of GE structure which consists of size, dimension, number, origin, etc.

       Features are extracted to identify the similarities (feature extraction). The input data is reduced without loss of information making it more suitable to application is called feature selection or feature reduction. The feature reduction can be achieved using few common algorithms for example Principal Component Analysis [11].

       Based upon similarities or features, all the genes which express similar expression are made into one cluster and which are distinct are grouped into another cluster. The degree of similarity or the proximity levels are computed using certain distance metrics applied on gene expression data provided. Euclidean distance, Manhattan distance, Mahalanobis distance, etc. are few metrics applied to quantify the level of similarity in order to form a cluster.

       2.3.2 Clustering Algorithms

      The genetic data (genes) are closely connected. Based upon the dataset appropriate clustering algorithms need to be chosen. These clustering approaches also reveals about the relationship between the clusters based upon the similarity. This section explains various partition algorithms. Figure 2.3 summarizes various clustering based partitions those can be used in Bioinformatics.

Schematic illustration of the partition clustering algorithms.

      Figure 2.3 Partition clustering algorithms.

       2.3.3.1 k-Means Clustering

      This algorithm partitions the dataset of n objects into k clusters in the feature space. The value of k (number of clusters) is fixed beforehand. Initially the algorithm randomly assumes k number of cluster centers (cluster mean) in the feature space. The distance (computed using Euclidean distance) of each gene sample is computed from all these cluster means. The sample is then grouped under the cluster that has minimum distance from the sample. The same process is followed for all the samples. Once all the samples are grouped under one of these clusters, each cluster mean is computed again considering the entire samples in that cluster and the cluster mean is updated. The above steps are iterated until the algorithm forms k stable clusters.

      The drawback of such algorithm is that determining the clusters (k) prior to implementation reduces the effectiveness of cluster formation and is prone to noisy data in gene expression values [13].

       2.3.3.2 Cluster Center Initialization Algorithm (CCIA)

      In traditional k-means algorithm the centroids of clusters are randomly selected initially, yielding low quality clusters. As an attempt in producing high quality clustering CCIA is introduced. This algorithm can be referred as an enhancement to k-means algorithm in which centroids (k) are accurately selected which are distant from the outliers. This selection process is done by computing mean and standard deviation on the data elements. CCIA now examines most nearest set of elements from the dataset (D) and fabricates into a new dataset (X) after discarding the outliers. Later those set of elements are eliminated from the dataset (D). This process is iterated until the dataset (X) equals with (k).

      Drawback—This algorithm is incapable in dealing with closely connected clusters and cannot deal with high dimensional data [14, 15].

       2.3.3.3 Intelligent Kernel k-Mean (IKKM)

      Generally gene expression data is nonlinearly separable and the gene data is represented in a matrix format using linear kernel function as shown in Figure 2.2. IKKM is applied on the kernel matrix in which clusters need not be predefined.

      Drawback—Cannot

Скачать книгу