Artificial Intelligence and Data Mining Approaches in Security Frameworks. Group of Authors
a) Decision Trees
One of the most popular machine learning techniques is Quinlan’s decision tree. The tree is constructed from decision nodes and leaf nodes following a divide-and-conquer strategy (Rathore et al., 2013). Each decision node tests a condition on an attribute of the input data, and each outcome of the test corresponds to a branch leading to a child node; a leaf node represents the final decision. Given a training dataset T with a set of n classes {C1, C2, ..., Cn}: when T contains cases belonging to a single class, the node is treated as a leaf of that class, and T is also treated as a leaf if it is empty, with no cases. Otherwise, a test with k outcomes partitions T into k subsets {T1, T2, ..., Tk}, where Tj holds the cases producing the j-th outcome. The process is repeated recursively over each Tj, where 1 <= j <= k, until every subset belongs to a single class. While constructing the decision tree, the best attribute must be chosen at each decision node. The C4.5 decision tree adopts the gain-ratio criterion: the attribute providing the maximum information gain is chosen, while the bias toward attributes with many outcomes is reduced. The built tree is then used to classify test data whose features match those of the training data. Classification starts from the root node, where the node’s test is applied; on the basis of the result, the branch leading to the corresponding child is followed. The process is repeated recursively until the child is a leaf, and the class of that leaf is assigned to the test case.
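As an illustration of the gain-ratio criterion described above, the following Python sketch computes entropy, information gain and gain ratio for a toy dataset. The attribute values, labels and function names are illustrative assumptions, not taken from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5-style gain ratio for splitting on one attribute."""
    n = len(labels)
    # Partition the labels by the attribute's value (one subset Tj per outcome).
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # Split info penalises attributes with many outcomes (the bias C4.5 reduces).
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical data: [weather, time-of-day] attributes, "attack?" class label.
rows = [["rain", "day"], ["rain", "night"], ["sun", "day"], ["sun", "night"]]
labels = ["yes", "yes", "no", "no"]
print(gain_ratio(rows, labels, 0))  # weather separates the classes perfectly
print(gain_ratio(rows, labels, 1))  # time-of-day carries no information
```

Here attribute 0 yields a gain ratio of 1.0 (a perfect split) while attribute 1 yields 0.0, so C4.5 would place the weather test at the root.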
b) Genetic Algorithms (GA)
This machine learning approach solves problems by imitating biological evolution. A Genetic Algorithm (GA) optimises a population of candidate solutions. Genetic operators, i.e., selection, crossover and mutation, act on data structures modelled as chromosomes (Fu et al., 2006). At the beginning, a population of chromosomes is generated at random; this population covers possible solutions to the problem and is regarded as the set of candidate solutions. Distinct positions on a chromosome, called “genes”, can be encoded as numbers, characters or bits. A fitness function evaluates the goodness of each chromosome with respect to the desired solution. The crossover operator simulates natural reproduction, the mutation operator simulates mutation of the species, and the selection operator chooses the fittest chromosomes (Manek et al., 2016). Figure 2.2 represents a genetic algorithm and its operations. The following three factors must be considered before using a genetic algorithm to solve a problem:
Figure 2.2 Flowchart of genetic algorithm.
1 Fitness function
2 Individuals representation
3 Genetic algorithms parameters
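The evolutionary loop described above can be sketched as follows on the toy “OneMax” problem (maximise the number of 1-bits in a chromosome). The fitness function, population size, mutation rate and generation count are illustrative assumptions, not values from the text.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Factor 1: fitness function — here, simply the number of 1-genes.
def fitness(chrom):
    return sum(chrom)

def select(pop):
    """Selection operator: tournament of two, the fitter chromosome survives."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Crossover operator: single-point recombination simulating reproduction."""
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    """Mutation operator: flip each gene (bit) with a small probability."""
    return [1 - g if random.random() < rate else g for g in chrom]

# Factor 2: individual representation — chromosomes as 16-bit strings.
# Factor 3: GA parameters — population of 30, evolved for 40 generations.
pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(30)]
for _ in range(40):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

best = max(pop, key=fitness)
print(fitness(best))
```

Each generation replaces the population with mutated offspring of tournament winners, so the average fitness tends to climb toward the all-ones chromosome.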
A genetic-algorithm-based method can be used to design an artificial immune system. Using this method, Bin et al. proposed an approach for smartphone malware detection (Wu et al., 2015), in which static and dynamic signatures of malware were extracted to obtain malicious scores for the tested samples.
c) Random Forest
Random forest is a classification algorithm that uses a collection of tree-structured classifiers. A winner class is chosen on the basis of the votes cast by the individual trees of the forest. Each tree is constructed from data drawn at random from the training dataset; the available data are therefore divided into a training set, which comprises the major portion of the dataset, and a test set, which holds the minor portion. The following steps are required for tree construction:
1 A sample of N cases is selected at random from the original dataset; this sample is the training set used for growing the tree.
2 Out of the M input variables, m variables are selected at random. The value of m is held constant while the forest is grown.
3 Each tree in the forest is grown to the largest extent possible; no trimming or pruning of the tree is performed.
4 All classification trees are combined to form the random forest. Random forest mitigates the problem of overfitting on large datasets and can train and test quickly on complex data. It is also referred to as an operational data mining technique.
Because of this design, each classification tree casts a vote for a class, and the solution class is the one that receives the maximum number of votes.
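A minimal sketch of steps 1–4 and the voting scheme is given below. For brevity the “trees” are one-level stumps on a single randomly chosen variable (m = 1) rather than full unpruned trees, and the two-feature dataset and class names are hypothetical.

```python
import random
from collections import Counter

random.seed(1)

def bootstrap_sample(data, labels):
    """Step 1: draw N cases at random (with replacement) from the training set."""
    idx = [random.randrange(len(data)) for _ in data]
    return [data[i] for i in idx], [labels[i] for i in idx]

def train_stump(data, labels, m=1):
    """Steps 2-3 (simplified): pick m random variables and grow an unpruned
    one-level 'tree' that splits at the mean of the chosen variable."""
    feat = random.sample(range(len(data[0])), m)[0]
    threshold = sum(row[feat] for row in data) / len(data)
    left = [l for row, l in zip(data, labels) if row[feat] <= threshold]
    right = [l for row, l in zip(data, labels) if row[feat] > threshold]
    left_cls = Counter(left).most_common(1)[0][0] if left else labels[0]
    right_cls = Counter(right).most_common(1)[0][0] if right else labels[0]
    return feat, threshold, left_cls, right_cls

def predict(tree, row):
    feat, threshold, left_cls, right_cls = tree
    return left_cls if row[feat] <= threshold else right_cls

def forest_predict(forest, row):
    """Step 4: every tree casts a vote; the class with the most votes wins."""
    votes = Counter(predict(t, row) for t in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: both features correlate with the class.
data = [[0.1, 1.0], [0.2, 2.0], [0.9, 9.0], [0.8, 8.0]]
labels = ["benign", "benign", "malicious", "malicious"]
forest = [train_stump(*bootstrap_sample(data, labels)) for _ in range(25)]
print(forest_predict(forest, [0.15, 1.5]))  # classify a new sample
```

Because each tree sees a different bootstrap sample and a different random variable, the individual votes differ, and the majority vote smooths out the errors of any single tree.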
d) Association-rule mining
Association-rule mining is used to find interesting relationships among a set of attributes in datasets (Dwork et al., 2006). An association rule expresses an inter-relationship within a dataset. It is very helpful for making strategic decisions about activities such as shelf management, promotional pricing, and many more (Jackson et al., 2007). Earlier, a data analyst was tasked with discovering patterns or association rules in the dataset given to him (Rathore, 2017). It is now possible to perform sophisticated analysis of extremely large datasets in a cost-effective manner (Tseng et al., 2016), but this may pose a data security risk (Beaver et al., 2009) for the data possessor, because the data miner can mine sensitive information (Bhargava et al., 2017). Nowadays, association rule mining is extensively used for pattern discovery in knowledge data discovery (KDD). The association rule mining (ARM) problem is solved by navigating the items in a database with various algorithms, according to the user’s requirements (Patel et al., 2014). ARM algorithms can be broadly classified into DFS (Depth First Search) and BFS (Breadth First Search) approaches on the basis of how they traverse the search space (Stanley, 2013). Each of these two approaches is further divided into intersecting and counting methods, on the basis of how itemsets and their support values are determined. Apriori, Apriori-TID and Apriori-DIC are BFS-based counting algorithms, whereas the Partition algorithm is a BFS-based intersecting algorithm. The Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) algorithm combines the intersecting strategy with DFS, while the FP-Growth algorithm combines DFS with counting (Yeung, Ding, 2003), (Bloedorn et al., 2003). These algorithms can be specifically optimised for speed (Barrantes et al., 2001), (Reddy et al., 2011).
Breadth First Search (BFS) with Counting Occurrences: The most eminent algorithm in this group is the Apriori algorithm. It exploits the downward-closure property of itemsets by pruning candidates that have an infrequent subset before their support is counted. Two important parameters measured when evaluating association rules are support and confidence. In BFS, the desired optimisation is possible because the support values of all subsets of a candidate are known in advance. The main drawback of this approach is the increased computational complexity when rules are extracted from a large database. The Fast Distributed Mining (FDM) algorithm (Lee et al., 1999) is an improved, distributed and unsecured version of the Apriori algorithm. Advancements in data mining techniques such as these enable organizations to use data more competently.
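A minimal sketch of the level-wise (BFS) counting strategy with downward-closure pruning might look as follows; the transaction database of hypothetical security alerts, the minimum-support threshold and the function names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical transaction database (e.g., alert types seen per session).
transactions = [
    {"scan", "login_fail", "exfil"},
    {"scan", "login_fail"},
    {"scan", "exfil"},
    {"login_fail"},
]

def support(itemset):
    """Support: fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(items, min_support=0.5):
    """Level-wise (BFS) candidate generation with downward-closure pruning:
    a candidate is counted only if all of its subsets are already frequent."""
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        level = [c for c in level if support(c) >= min_support]
        frequent += level
        # Next level: unions of frequent sets that are exactly one item larger.
        next_level = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune candidates with an infrequent subset BEFORE counting support.
        level = [c for c in next_level
                 if all(frozenset(s) in frequent
                        for s in combinations(c, len(c) - 1))]
    return frequent

freq = apriori({"scan", "login_fail", "exfil"})
# Confidence of the rule {scan} -> {exfil}: support(both) / support(antecedent).
conf = support(frozenset({"scan", "exfil"})) / support(frozenset({"scan"}))
print(sorted(map(sorted, freq)), round(conf, 2))
```

On this toy database {login_fail, exfil} falls below the 0.5 support threshold, so the three-item candidate containing it is pruned without ever being counted, which is exactly the downward-closure optimisation described above.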