      Feature extraction from product or service review documents typically involves several steps: data pre-processing, document indexing, dimension reduction, model training, testing, and evaluation. A labeled collection of documents is used to train the model; the learned model is then used to identify concept instances in new, unlabeled documents. Document indexing is the most critical and complex task in text analysis: it determines the set of key features that represent a document and strengthens the relevance between a word (or feature) and the document. It must be effective, since it determines both the storage space required and the query processing time for the document collection.
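      The steps above can be sketched end to end in Python with scikit-learn (an illustrative choice; the chapter does not prescribe a library). The corpus and labels below are hypothetical stand-ins for a labeled review collection, and the pipeline stages mirror the steps just listed: indexing, dimension reduction, model training, and prediction on unseen documents.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled review snippets (stand-ins for a real training corpus).
docs = ["battery life is great", "screen cracked after a week",
        "fast delivery, works as advertised", "stopped charging, poor quality"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    ("indexing", TfidfVectorizer(stop_words="english")),  # document indexing
    ("reduction", TruncatedSVD(n_components=2)),          # dimension reduction
    ("model", LogisticRegression()),                      # model training
])
pipeline.fit(docs, labels)  # learn from the labeled set

# The learned model labels concept instances in new, unseen documents.
print(pipeline.predict(["charger failed within days"]))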

[Figure: Schematic illustration of term weighting schemes for feature extraction.]

      IR models such as the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), together with topic modeling and clustering techniques, are used for term weighting during feature extraction from text documents. The following subsections describe the rationale behind each of these feature extraction techniques.

      1.4.1 Vector Space Model
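      In the VSM, each document is represented as a vector of term weights (commonly TF-IDF), and the relevance between a query and a document is scored with a measure such as cosine similarity. A minimal sketch, assuming scikit-learn and a toy review corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the staff was kind and helpful",
        "helpful staff and clean rooms",
        "the food was served cold"]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(docs)  # one term-weight vector per document

# Score each document against a query in the same term space.
query = vectorizer.transform(["helpful staff"])
print(cosine_similarity(query, doc_term))  # higher score = more relevant document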

      1.4.2 Latent Semantic Indexing (LSI)

[Figure: Schematic illustration of synonymy and polysemy issues in English.]

[Figure: Schematic illustration of the approximated term-document (TD) matrix by SVD.]

      LSI indexes words in a low-dimensional representation derived from word co-occurrence. Capturing the association of terms with documents, i.e., the semantic structure, improves the relevance of results for queries [56]. A value of “k” in the low hundreds gives good precision and recall. LSI has its own disadvantages, however, such as higher computation time and negative values in the approximated TD matrix.
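      A minimal LSI sketch, assuming scikit-learn: truncated SVD factorizes the TF-IDF term-document matrix into “k” latent dimensions (kept tiny here only because the corpus is tiny). Note the negative entries in the result, one of the drawbacks mentioned above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["heart attack symptoms and treatment",
        "myocardial infarction treatment options",
        "hospital parking and visiting hours",
        "parking fees at the city hospital"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# k = 2 latent dimensions here; in practice k in the low hundreds works well.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsi.fit_transform(tfidf)  # documents in the latent semantic space
print(doc_vectors)  # entries may be negative, unlike raw term weights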

      1.4.3 Clustering Techniques

      Clustering methods identify groups of similar data in a data set. The K-Means algorithm, a centroid model, is an iterative clustering algorithm that assigns every data point to its closest centroid. Prior knowledge of the data set is important, as the algorithm takes the number of clusters as input: it partitions the “n” data points into “k” clusters such that each point belongs to the cluster with the nearest mean. Many variations of K-Means exist, such as using the Euclidean distance between the centroid and the data point, fuzzy C-Means clustering, and so on. Like LDA, K-Means is an unsupervised learning algorithm in which the user supplies the required number of clusters. The key difference is that K-Means produces “k” disjoint clusters, whereas LDA assigns a document to a mixture of topics; problems such as synonymy and polysemy are therefore better resolved with LDA than with K-Means.
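      A minimal K-Means sketch over TF-IDF vectors, assuming scikit-learn and a hypothetical corpus; note that the number of clusters must be supplied up front and that each document receives exactly one cluster label:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["great battery life", "battery drains too quickly",
        "excellent camera quality", "camera photos are blurry"]

tfidf = TfidfVectorizer().fit_transform(docs)

# k is an input; points are assigned to the nearest centroid iteratively.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(tfidf))  # one disjoint cluster label per document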

      1.4.4 Topic Modeling
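      As the previous subsection notes, LDA contrasts with K-Means in that each document receives a mixture of topics rather than one disjoint label. A minimal sketch, assuming scikit-learn and a hypothetical corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["nurse doctor hospital ward rounds",
        "doctor prescribed medicine and dosage",
        "invoice billing and insurance claim",
        "insurance copay billing dispute"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)  # LDA uses raw counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mixture = lda.fit_transform(counts)

# Each row is a distribution over the two topics, not a single cluster label.
print(topic_mixture)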
