Biomedical Data Mining for Information Retrieval. Group of authors
One can train classifiers and classify new compounds using either a single parameter or a combination of parameters: the activity (active/non-active), drug-likeness, pharmacodynamics, pharmacokinetics, or toxicity profiles of known compounds [91]. Nowadays, many open-source as well as commercial applications are available for predicting the skin sensitisation, hepatotoxicity, or carcinogenicity of compounds [101]. Apart from this, several expert systems are in use for assessing the toxicity of unknown compounds using knowledge-base information [102, 103]. These are artificial-intelligence-enabled systems that use human knowledge (or intelligence) to reason about problems and make predictions. They can make qualitative judgements based on the qualitative, quantitative, statistical and other evidence provided to them as input. For instance, DEREK and StAR use knowledge-based information to derive new rules that better describe the relationship between chemical structures and their toxicity [102]. DEREK takes a data-driven approach: it predicts the toxicity of the novel compounds given in the training dataset and compares the predictions against biological assay results to refine its prediction rules. Toxtree is an open-source platform for detecting the toxicity potential of chemicals. It uses a classification model based on the Decision Tree (DT) machine learning algorithm to estimate toxicity, fed with toxicological data derived from the structural information of the chemicals [104].
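The rule-based decision-tree idea behind tools such as Toxtree can be illustrated with a minimal sketch. The descriptor names, thresholds, and concern labels below are invented for illustration; they are not Toxtree's actual rules.

```python
# Minimal sketch of a rule-based decision tree for toxicity triage.
# Descriptors, thresholds, and labels are hypothetical, not Toxtree's rules.

def classify_toxicity(mol):
    """Walk a hand-coded decision tree over structural descriptors."""
    if mol["has_nitro_group"]:       # structural alert at the root of the tree
        return "high concern"
    if mol["log_p"] > 5.0:           # highly lipophilic compounds
        return "moderate concern"
    if mol["mol_weight"] > 1000:     # very large molecules
        return "moderate concern"
    return "low concern"             # no rule fired

compound = {"has_nitro_group": False, "log_p": 2.1, "mol_weight": 312.0}
print(classify_toxicity(compound))   # low concern
```

A real system would learn such splits from toxicological data rather than hard-code them, but the classification step is the same tree walk.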
Besides expert systems, there are also other automated prediction methods such as Bayesian methods, Neural Networks, and Support Vector Machines. Bayesian Inference Networks (BIN) are among the key methods that allow a straightforward representation of the uncertainties involved in medical domains such as diagnosis, treatment selection, prognosis prediction and compound screening [105]. Nowadays, doctors are using these BIN models in prognosis and diagnosis. The use of BIN models in ligand-based virtual screening demonstrates their successful application in the field of drug discovery. A comparative study examined the efficiency of three models for screening large libraries of drug compounds based on structural similarity information: Tanimoto Coefficient networks (TAN), conventional BINs, and BINs with a Reweighting Factor (BINRF) [106]. All three models used the MDL Drug Data Report (MDDR) database for both training and testing. Ligand-based virtual screening with the BINRF model not only significantly improved the search strategy, it also identified active molecules with less structural similarity than the TAN- and BIN-based approaches found. Thus, this is an era of integrative approaches to achieve higher accuracy in drug or drug-target prediction.
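The Tanimoto coefficient that underlies similarity-based baselines such as TAN can be sketched for binary fingerprints, represented here as sets of on-bit indices. The fingerprints below are toy data, not real molecular fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits:
    shared bits divided by distinct bits across both fingerprints."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Toy fingerprints: indices of the bits set to 1
query  = {1, 4, 7, 9}
ligand = {1, 4, 8, 9, 12}
print(round(tanimoto(query, ligand), 3))  # 3 shared of 6 distinct bits -> 0.5
```

Ranking a library by this score against a known active is the essence of ligand-based similarity screening; the BIN and BINRF models go further by weighting the contribution of each fingerprint feature probabilistically.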
Bayesian ANalysis to determine Drug Interaction Targets (BANDIT) uses a Bayesian approach to integrate varied data types in an unbiased manner, and it provides a platform that allows the integration of newly available data types [107]. BANDIT has the potential to expedite the drug development process, as it spans the entire drug search space, from new target identification and validation to clinical candidate development and drug repurposing.
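The general idea of combining heterogeneous evidence streams under a naive-independence Bayesian model can be sketched as follows. The likelihood ratios and prior below are invented for illustration; this is not BANDIT's actual model, only the shared principle of multiplying independent evidence into a posterior.

```python
from math import prod

def posterior_odds(prior_odds, likelihood_ratios):
    """Combine independent evidence streams by multiplying their likelihood
    ratios into the prior odds (naive-independence Bayesian update)."""
    return prior_odds * prod(likelihood_ratios)

def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)

# Hypothetical evidence that a compound shares a target with a known drug:
# growth-inhibition correlation, structural similarity, side-effect overlap.
lrs = [4.0, 2.5, 1.8]       # invented likelihood ratios, one per data type
prior = 0.01 / 0.99         # invented prior odds of sharing a target
post = posterior_odds(prior, lrs)
print(round(odds_to_prob(post), 3))  # -> 0.154
```

Because each data type contributes a separate ratio, a newly available data type can be integrated simply by appending its likelihood ratio to the list, which mirrors the extensibility described above.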
The Support Vector Machine (SVM) is a supervised machine learning technique often used in knowledge-based drug design [108]. Selecting an appropriate kernel function and optimal parameters is the most challenging part of the modelling, as both are problem-dependent. Later, a more specific kernel function was designed that can control the complexity of subtrees through parameter adjustments. An SVM model integrated with this newly designed kernel function successfully classified and cross-validated small molecules with anti-cancer properties [109]. Graph-kernel-based learning algorithms are widely used in SVMs, and they can directly utilise graph information to classify compounds. Graph-kernel-based SVMs are employed to classify diverse compounds, to predict their biological activity, and to rank them in screening assays. Artificial neural networks (ANNs), deep learning algorithms that mimic the human neural system, also have applications in the drug discovery process. The robustness of the SVM and ANN algorithms was compared in terms of their ability to classify drug versus non-drug compounds [110]. The results favour the SVM, which classified the compounds with higher accuracy and robustness than the ANN.
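A minimal drug/non-drug SVM classifier in the spirit of that comparison can be sketched with scikit-learn. The two descriptors and the tiny training set are invented; a real study would use many descriptors (or a graph kernel operating directly on molecular graphs) and thousands of compounds.

```python
from sklearn.svm import SVC

# Invented toy descriptors: [logP, molecular weight / 100]; label 1 = drug-like
X = [[1.2, 3.1], [2.0, 3.5], [0.8, 2.9],   # drug-like region
     [6.5, 9.0], [7.1, 8.4], [5.9, 9.6]]   # non-drug-like region
y = [1, 1, 1, 0, 0, 0]

# RBF kernel on descriptor vectors; a graph kernel would replace this to
# work on the molecular graph directly.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[1.5, 3.0], [6.8, 9.2]]))   # one point near each cluster
```

The kernel choice is exactly the problem-dependent decision the text describes: swapping `kernel="rbf"` for a custom graph kernel changes what "similarity between compounds" means to the classifier.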
Other machine learning algorithms, such as decision trees, random forests, logistic regression and recursive partitioning, have also been successfully applied to classify compounds using the relationship between their chemical structures and toxicity profiles [111]. Comparative studies of ML algorithms show that non-linear, ensemble-based classification algorithms are more successful at classifying compounds using ADMET properties. Random forest algorithms can also be used for ligand pose prediction, for finding receptor–ligand interactions, and for predicting the efficiency of docking simulations [112]. Nowadays, Deep Learning (DL) methods are achieving remarkable success across pharmaceutical research, from biological-image analysis and de novo molecule design to ligand–receptor interaction and biological activity prediction [113]. Continuous improvements in machine learning and deep learning algorithms will therefore help achieve the desired results with higher prediction accuracy in the drug design field.
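An ensemble classifier over ADMET-style properties can be sketched the same way with a random forest. The descriptor names and values are invented, and `random_state` is fixed only to make the ensemble reproducible.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented ADMET-style descriptors: [solubility, permeability, clearance]
X = [[0.9, 0.8, 0.2], [0.8, 0.7, 0.3], [0.7, 0.9, 0.1],   # favourable profile
     [0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.3, 0.2, 0.9]]   # poor profile
y = [1, 1, 1, 0, 0, 0]

# An ensemble of 50 decision trees; the majority vote across trees is the
# non-linear ensemble behaviour the comparative studies describe.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
print(forest.predict([[0.85, 0.75, 0.25]]))   # near the favourable cluster
```

Each tree sees a bootstrap sample of the compounds and a random subset of descriptors, which is what makes the ensemble more robust than any single decision tree on noisy ADMET data.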
Multiple descriptors represent molecular data in terms of structural and physicochemical features, and these descriptors account for the diverse bioactivity of compounds [114]. Apart from descriptor-based bioactivity prediction, substructure mining is also an established technique in the field of drug discovery. Substructure mining is a data-driven approach that uses a combination of algorithms to detect the most frequently occurring substructures in a large set of known ligands [115]. There are two ways to use substructure mining. The first uses a predefined list of candidate scaffolds: the mining algorithm identifies and extracts all the candidate scaffolds present in the known compounds of a given database. The second adaptively learns the substructures from the compounds themselves. Both approaches can extract all the significant 2D substructures from a chemical database [116]. Substructure mining approaches have helped establish a common consensus among medicinal chemists, who increasingly treat chemical compounds as collections of their sub-structural parts. Applying the approach to establish structure–activity relationships builds further confidence that the biological properties of molecules depend on their structural properties.
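The frequency-counting idea behind substructure mining can be sketched on SMILES strings, using fixed-length substrings as a crude stand-in for real 2D substructures. This is a toy illustration of support counting, not a chemically meaningful miner; real tools match subgraphs of the molecular graph.

```python
from collections import Counter

def frequent_fragments(smiles_list, length=2, min_support=2):
    """Count fixed-length SMILES substrings and keep those present in at
    least min_support molecules (support = number of molecules containing
    the fragment, not the total number of occurrences)."""
    support = Counter()
    for smi in smiles_list:
        # a set, so each molecule contributes at most once per fragment
        fragments = {smi[i:i + length] for i in range(len(smi) - length + 1)}
        support.update(fragments)
    return {frag: n for frag, n in support.items() if n >= min_support}

ligands = ["CCO", "CCN", "CCC"]     # toy ligand set
print(frequent_fragments(ligands))  # {'CC': 3}
```

The `min_support` threshold is the same parameter that frequent subgraph mining algorithms use to prune the search: only fragments common to enough known ligands survive as candidate scaffolds.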
Over time, several substructure mining algorithms have been developed to accommodate the needs of an ever-changing drug discovery process [117]. The subgraph mining approach is unique in that, compared to other approaches, it is free from arbitrary assumptions. In other words, current subgraph mining techniques can retrieve all frequently occurring subgraphs from a given database of chemical compounds in significantly less time, given a minimum support threshold [118]. Furthermore, as described above, these techniques enable us to find the most significant subgraphs out of all possible subgraphs. Soon, the use of artificial-intelligence-based techniques in medicinal chemistry will become more complex, owing to the increasing availability of huge repositories of chemical, biological, genetic, and structural data. Running complex algorithms on ever-increasing data volumes in the search for new, safer and more effective drug candidates is driving the use of quantum computing and high-performance computing. In summary, we believe that these techniques will become a much more significant part of drug discovery endeavours within a very short time.
2.8 Conclusions
AI and ML have emerged as powerful tools for structure prediction, but these techniques rely to a great extent on collections of phenotype data rather than genomic data, which may be a disadvantage. Genome researchers have learned that much of the variation between individuals results from a number of discrete, single-base changes in the human genome, known as single nucleotide polymorphisms (SNPs), which affect the phenotype. Application of ML to SNP data can be done