Data Mining and Machine Learning Applications. Group of authors

includes clustering and classification problems. Let us discuss each of them in detail [1–6].

       Classification

      Classification is a data mining task in which data are modeled and separated into classes; that is, given objects are assigned to one of a set of categories. A training set of labeled examples is identified first, and the model learned from it is then applied to new observations. The task therefore has two phases: the learning (training) phase and the classification of the given objects. For example, a bank manager may wish to classify the loans taken out by customers into categories such as risky and less risky, based on factors such as trustworthiness. To carry out the classification, one or more classifiers are used: rules are applied, the classifier is trained, and the given data are assigned to the desired classes. The following classification algorithms can be used in data mining:

       Logistic regression

       Naïve Bayes

       K-nearest neighbors

       Decision tree

       Random forest

       Support vector machine.
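
      The two phases described above (learning, then classification) can be sketched with a small k-nearest-neighbors classifier. This is only an illustrative sketch: the loan figures, the two features (income and existing debt), and the function names train and classify are assumptions made for the example, not data or code from the book.

```python
import math
from collections import Counter

# Hypothetical training set (an assumption for illustration):
# (income in thousands, existing debt in thousands) -> risk label.
TRAINING_SET = [
    ((85.0, 5.0), "less risky"),
    ((90.0, 10.0), "less risky"),
    ((75.0, 8.0), "less risky"),
    ((30.0, 25.0), "risky"),
    ((25.0, 30.0), "risky"),
    ((35.0, 28.0), "risky"),
]


def train(samples):
    """Learning phase: k-nearest neighbors simply stores the labeled samples."""
    return list(samples)


def classify(model, features, k=3):
    """Classification phase: majority vote among the k nearest stored samples."""
    by_distance = sorted(model, key=lambda item: math.dist(item[0], features))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


model = train(TRAINING_SET)
print(classify(model, (80.0, 7.0)))    # a new, unlabeled loan
print(classify(model, (28.0, 27.0)))   # another new loan
```

      Any of the algorithms listed above could take the place of k-nearest neighbors here; it was chosen only because it fits in a few lines.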

       Clustering

      Clustering is the grouping of objects based on similarity. A threshold is applied, and an object is added to a cluster when it satisfies that cluster's criterion. The technique is helpful in various applications, such as:

       Market basket analysis

       Pattern recognition

       Image processing

       Financial analysis.

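      Clustering is categorized as unsupervised learning: no class labels are given, and each object is instead compared against a threshold (a predefined value). Cluster quality can be described at two levels, intra-cluster (how similar the objects within one cluster are) and inter-cluster (how different the clusters are from one another).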
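
      The threshold idea can be sketched as follows. The scheme shown (each point joins the first cluster whose centroid lies within the threshold, otherwise it starts a new cluster) is one simple way to realize the description and is an assumption of this example, as are the function name threshold_clustering and the sample points.

```python
import math


def threshold_clustering(points, threshold):
    """Put each point in the first cluster whose centroid is within
    `threshold`; if no cluster qualifies, start a new one."""
    clusters = []  # each cluster is a list of points
    for point in points:
        for cluster in clusters:
            # Centroid = per-coordinate mean of the points already in the cluster.
            centroid = [sum(axis) / len(cluster) for axis in zip(*cluster)]
            if math.dist(point, centroid) <= threshold:
                cluster.append(point)
                break
        else:
            # No existing cluster satisfied the criterion.
            clusters.append([point])
    return clusters


# Hypothetical two-dimensional samples.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9), (0.9, 1.1)]
clusters = threshold_clustering(points, threshold=1.0)
for index, cluster in enumerate(clusters):
    print(index, cluster)
```

      Note that no labels are supplied anywhere, which is what makes the procedure unsupervised; the result depends on the chosen threshold and on the order in which the points arrive.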

       Types of Clustering

      Clustering groups elements based on similarity, and it is an unsupervised learning technique. One can apply partition clustering, also known as non-hierarchical clustering, to divide the data/records/values into ‘k’ groups/clusters. This is an iterative process that runs until the last element has been processed. Users can also use a support vector machine (SVM) model, in which ‘n’ features are identified in the initial phase and then processed to identify the relevant results; note that SVM is usually applied as a supervised classifier rather than as a clustering method.

       ◦ K-means: The K-means clustering algorithm can be used to train on the samples. Training here means computing the distance between each sample and the cluster centers (the Euclidean distance metric can be used) and assigning the sample to the nearest cluster. In some initialization schemes, a sample that lies farther from the centers already selected is more likely to be picked as a new center point. K-means stores ‘k’ centroids, which define the clusters to be formed. An object/value belongs to a specific cluster if it is closer to that cluster’s centroid than to any other centroid.

       ◦ Hierarchical: This is one of the popular algorithms used in data mining and machine learning. The idea is to find the two clusters that are closest to each other and merge them into a single cluster, repeating the process until all clusters have been merged or the desired number of clusters remains. Hierarchical clustering is categorized into bottom-up and top-down approaches, known as agglomerative and divisive, respectively; the merging procedure just described is the agglomerative one. The result can be viewed as a nesting of clusters that together form a tree of merged clusters.

       ◦ Fuzzy: Clusters are treated as fuzzy sets, and objects are allocated to these clusters. The method is unsupervised and, as its name suggests, each point is given a degree of membership (often read as a probability) in several clusters instead of belonging to a single cluster. It is also known as soft clustering. One of its popular applications is pattern recognition. Its primary objective is to minimize an objective function, so the number of iterations may grow; if ‘n’ iterations are needed, the time complexity of the algorithm increases accordingly.
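
      Of the three types above, K-means is the easiest to sketch in code. The version below is a minimal sketch of the standard assign-then-update iteration with Euclidean distance; the initial centroids are passed in by the caller, and the function name k_means and the sample points are assumptions made for the example.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Plain k-means with Euclidean distance.

    `centroids` holds the k initial center points, supplied by the caller.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)

        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical samples forming two visible groups, with k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 9.0)])
print(centroids)
print(clusters)
```

      Each point ends up in exactly one cluster here (hard clustering); a fuzzy variant would instead keep a membership degree for every point-cluster pair.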

      

       Subject-oriented—designed for a specific subject/s

       Integrated—integrates different data from multiple sources.

       Non-volatile—data once stored remains stable and does not change over time.

       Time-variant—it looks at change over time.

      One can compare data warehouse and OLTP as follows:

       Decision trees: A decision tree is a tree-like structure that helps identify the possible outcomes/results/consequences of a decision. It is usually used in decision support systems and can be applied to both classification and prediction. Leaf nodes represent the outcomes/results, as shown in Figure 1.4. Classification/prediction starts from the root node and proceeds down the branches until a leaf node is reached. Its benefit is that good predictions can be obtained without heavy computation [1–6].

      If the ‘n’ nodes of the tree (root node, internal nodes, and leaf nodes) are arranged in an ordered manner, the desired option can be found in less time, because only the nodes along one root-to-leaf path need to be visited.
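
      The root-to-leaf traversal can be sketched as below. The tree itself, its attributes (income, debt), its thresholds, and the function name predict are hypothetical and serve only to illustrate the walk; they are not taken from Figure 1.4.

```python
# Hypothetical loan-screening tree: internal nodes (dicts) test one attribute,
# leaf nodes (plain strings) hold the outcome.
TREE = {
    "attribute": "income",
    "threshold": 50.0,
    "below": {
        "attribute": "debt",
        "threshold": 20.0,
        "below": "less risky",
        "at_or_above": "risky",
    },
    "at_or_above": "less risky",
}


def predict(node, record):
    """Walk from the root to a leaf, taking one branch per internal node."""
    path = []
    while isinstance(node, dict):
        value = record[node["attribute"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        path.append((node["attribute"], branch))
        node = node[branch]
    return node, path


outcome, path = predict(TREE, {"income": 30.0, "debt": 25.0})
print(outcome, path)
```

      Only the nodes on the returned path are examined, which is why a prediction costs far less than inspecting every node of the tree.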

      Figure 1.4 Decision Tree.

       Genetic algorithms (GAs): GAs help in finding possible solutions to a given optimization problem and in improving on them. The solutions found can be categorized as optimal or near-optimal. A GA may involve ‘n’ rounds of computation, each refining the previous candidates, and is hence known as an evolutionary approach to finding a good solution. For NP-hard problems, GAs have been shown to find usable near-optimal solutions. The concept is borrowed from biology, i.e., chromosomes, genes, and populations. These terms can be described in the computations as follows:

       Chromosome: one possible solution

       Population: a set (subset) of all possible solutions

       Gene: one element of the chromosome
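
      The three terms can be made concrete with a toy GA. This is a hedged sketch, not the book's algorithm: the objective (count the genes equal to 1), the selection, crossover, and mutation choices, and all parameter values are assumptions picked to keep the example short.

```python
import random


def fitness(chromosome):
    """Toy objective: the number of genes that equal 1."""
    return sum(chromosome)


def evolve(gene_count=20, population_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # Population: a set of candidate solutions.
    # Chromosome: one candidate, stored as a list of genes (0 or 1).
    population = [[rng.randint(0, 1) for _ in range(gene_count)]
                  for _ in range(population_size)]
    initial_best = max(fitness(c) for c in population)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_generation = population[:2]  # elitism: keep the two best unchanged
        while len(next_generation) < population_size:
            # Selection: pick two parents from the fitter half.
            parent_a, parent_b = rng.sample(population[:population_size // 2], 2)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randint(1, gene_count - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            next_generation.append(child)
        population = next_generation

    best = max(population, key=fitness)
    return best, initial_best


best, initial_best = evolve()
print(fitness(best), "vs. initial best", initial_best)
```

      Because the two fittest chromosomes are carried over unchanged, the best fitness never decreases from one generation to the next; whether the final chromosome is optimal or only near-optimal depends on the parameters and the seed.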

      GAs
