Semantic Web for Effective Healthcare Systems. Group of authors

where topics are a kind of feature. It is a probabilistic language model of the topics in documents. Each document may contain a mixture of different topics, and each topic may account for many occurrences of related words across documents. Figure 1.6 shows the framework of the LDA model for topic (or feature) categorization of text documents.

Schematic illustration of LDA framework.

      For example, Word 2 is categorized under two different topics, say “topic 1” and “topic 2.” The context of this word varies and is determined by the words that co-occur with it. Thus Word 2 in the context of “topic 1” is more relevant to “Doc 1,” and the same word in the context of “topic 2” is more relevant to “Doc 2.” Identifying latent concepts in this way improves the accuracy of feature categorization.
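      The effect of context on a shared word can be illustrated with a toy calculation. The sketch below is only an assumption-laden illustration: the probabilities, the names `p_word_given_topic` and `p_topic_given_doc`, and the helper functions are made up for this example and do not come from the book. It applies Bayes' rule, P(topic | word, doc) ∝ P(word | topic) · P(topic | doc).

```python
# Toy illustration with made-up numbers (not taken from the book).
# P(word | topic) for "Word 2", which is plausible under both topics.
p_word_given_topic = {"topic1": 0.30, "topic2": 0.25}

# P(topic | document): Doc 1 is mostly about topic 1, Doc 2 mostly about topic 2.
p_topic_given_doc = {
    "doc1": {"topic1": 0.9, "topic2": 0.1},
    "doc2": {"topic1": 0.2, "topic2": 0.8},
}


def topic_posterior(doc):
    """Return P(topic | Word 2, doc) using Bayes' rule."""
    unnormalized = {
        topic: p_word_given_topic[topic] * p_topic_given_doc[doc][topic]
        for topic in p_word_given_topic
    }
    total = sum(unnormalized.values())
    return {topic: value / total for topic, value in unnormalized.items()}


def most_likely_topic(doc):
    """Return the topic with the highest posterior for Word 2 in this document."""
    posterior = topic_posterior(doc)
    return max(posterior, key=posterior.get)
```

      With these hypothetical numbers, the same word is assigned to a different topic in each document, because the document's own topic mixture dominates the decision.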

      For the LDA model, the number of topics K has to be fixed a priori. It assumes the following generative process for a document w = (w1, …, wN) of a corpus D containing N words from a vocabulary of V different terms, where wi ∈ {1, …, V} for all i ∈ {1, …, N}. LDA consists of the following steps [12]:

      (1) For each topic k, draw a distribution over words Φ(k) ~ Dir(β).

      (2) For each document d:
          (a) Draw a vector of topic proportions θ(d) ~ Dir(α).
          (b) For each word i:
              (i) Draw a topic assignment zd,i ~ Mult(θ(d)), zd,i ∈ {1, …, K}.
              (ii) Draw a word wd,i ~ Mult(Φ(zd,i)), wd,i ∈ {1, …, V}.
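      The generative steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `generate_corpus`, its parameters, and the default hyper-parameter values are assumptions made for this example. It follows the roles of α and β as defined in the text, with α as the prior on per-document topic proportions and β as the prior on per-topic word distributions, and it uses symmetric Dirichlet priors and a fixed document length for simplicity.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Sample a synthetic corpus from the LDA generative process (sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: for each topic k, draw a word distribution phi[k] ~ Dir(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Step 2(a): for each document d, draw topic proportions theta[d] ~ Dir(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    docs = []
    for d in range(n_docs):
        # Step 2(b)(i): draw a topic assignment z for every word position.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Step 2(b)(ii): draw each word from the word distribution of its topic.
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(words)
    return phi, theta, docs
```

      A call such as `generate_corpus(5, 20, 3, 10)` returns a 3 × 10 matrix of topic-word probabilities, a 5 × 3 matrix of document-topic proportions, and five documents of 20 word indices each.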

      where α is a Dirichlet prior on the per-document topic distribution, and β is a Dirichlet prior on the per-topic word distribution. Let θtd be the probability of topic t for document d, let zd,i be the topic assignment of the i-th word in document d, and let Φtw be the probability of word w in topic t. The probability of generating word w in document d is:
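      Under the definitions just given, this probability takes the standard LDA mixture form, summing over the K topics. The expression below is written from those symbol definitions (θtd, Φtw, K) rather than copied from the original typeset equation:

```latex
P(w \mid d) \;=\; \sum_{t=1}^{K} \Phi_{tw}\,\theta_{td}
```

      Each term weights the probability of word w under topic t by how strongly topic t is represented in document d.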

       D—number of documents

       N—number of words or terms

       K—number of topics

       α—a Dirichlet prior on the per-document topic distribution

       β—a Dirichlet prior on the per-topic word distribution

       θtd—probability of topic t for document d

       Φtw—probability of word w in topic t

       zd,i—topic assignment of term “i”

       wd,i—word assignment of term “i”

       C—correlation between the terms

Schematic illustration of plate notation of CFSLDA model.

      LDA associates documents with a set of topics, where each topic is a set of words. Under the LDA model, the next word is generated by first selecting a random topic from the set of topics T and then choosing a random word from that topic's distribution over the vocabulary W. The hidden variables θ and Φ are determined by fitting the LDA model to a corpus of documents. The CFSLDA model uses Gibbs sampling to perform topic modeling of text documents. Given values for the Gibbs settings (b, n, iter), the LDA hyper-parameters (α, β, and K), and the term-document (TD) matrix M, a Gibbs sampler produces n random observations from the inferred posterior distribution of θ and Φ [60].
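      To make the fitting step concrete, the following is a minimal sketch of a collapsed Gibbs sampler for plain LDA, a commonly used variant. It is an assumption that this resembles the sampler used for CFSLDA; the source does not give those details. The function name `lda_gibbs`, its parameters, and the defaults are hypothetical. For brevity it takes documents as lists of word indices instead of a TD matrix, and it returns one point estimate of θ and Φ from the final state instead of handling the burn-in and multiple-sample settings (b, n, iter).

```python
import numpy as np


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.5, beta=0.1,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch); docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    ndk = np.zeros((n_docs, n_topics))       # words in doc d assigned to topic k
    nkw = np.zeros((n_topics, vocab_size))   # times word w is assigned to topic k
    nk = np.zeros(n_topics)                  # total words assigned to topic k

    # Random initial topic assignment for every word position.
    z = []
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # Conditional distribution over topics for this word.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1

    # Point estimates of the hidden variables from the final counts.
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (nkw + beta) / (nk[:, None] + vocab_size * beta)
    return theta, phi
```

      For instance, `lda_gibbs([[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1]], n_topics=2, vocab_size=4)` returns a 3 × 2 matrix θ of document-topic proportions and a 2 × 4 matrix Φ of topic-word probabilities, each with rows summing to 1.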

      Ontology development includes various approaches, such as Formal Concept Analysis (FCA) and Ontology Learning. FCA applies a user-driven, step-by-step methodology for creating domain models, whereas Ontology Learning refers to the task of automatically creating a domain ontology by extracting concepts and relations from the given data set [27]. This chapter
