Informatics and Machine Learning. Stephen Winters-Hilt



      3.4.1 Markov Chains

      A Markov chain is a sequence of random variables S1, S2, S3, … with the Markov property of limited memory, where a first-order Markov assumption on the probability of observing a sequence "s1 s2 s3 s4 … sn" is:

$$P(S_1 = s_1, \ldots, S_n = s_n) = P(S_1 = s_1)\,P(S_2 = s_2 \mid S_1 = s_1) \cdots P(S_n = s_n \mid S_{n-1} = s_{n-1})$$

      In the Markov chain model, the states are also the observables. For an HMM we generalize to the case where the states are no longer directly observable (but still first-order Markov), and for each state, say S1, we have a statistical linkage to a random variable, O1, that has an observable base emission, with the standard (0th-order) Markov assumption on prior emissions.

      The key "short-term memory" property of a Markov chain is: P(x_i | x_{i−1}, …, x_1) = P(x_i | x_{i−1}) = a_{i−1,i}, where the a_{i−1,i} are sometimes referred to as "transition probabilities", and we have: P(x) = P(x_L, x_{L−1}, …, x_1) = P(x_1) ∏_{i=2…L} a_{i−1,i}. If we denote C_y for the count of symbol y, C_{xy} for the count of the adjacent pair xy, T_y for the total count of strings of length one, and T_{xy} for the total count of strings of length two, then: a_{i−1,i} = P(x | y) = P(x, y)/P(y) = [C_{xy}/T_{xy}]/[C_y/T_y]. Note: since T_{xy} + 1 = T_y for a single long training block (a sequence of length L contains L symbols and L − 1 adjacent pairs), T_{xy} ≅ T_y, so a_{i−1,i} ≅ C_{xy}/C_y = C_{xy}/∑_x C_{xy}; thus the counts C_{xy} are complete information for determining the transition probabilities.
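      The count-based estimate above can be illustrated directly: the transition probabilities follow from the pair counts C_{xy} taken from one long training sequence. A minimal sketch (the function name `transition_probs` is my own, not from the text):

```python
from collections import Counter, defaultdict

def transition_probs(seq):
    # C_xy: counts of each adjacent pair (prev, next) in one long training block
    pair_counts = Counter(zip(seq, seq[1:]))
    # C_y: every symbol except the last starts exactly one pair
    prev_totals = Counter(seq[:-1])
    a = defaultdict(dict)
    for (prev, nxt), c in pair_counts.items():
        a[prev][nxt] = c / prev_totals[prev]  # a = C_xy / sum_x C_xy
    return a

probs = transition_probs("ACGTACGTAACG")
# each row of the transition table sums to 1, e.g. probs['A'] = {'C': 0.75, 'A': 0.25}
```

Note that only the pair counts are needed, matching the observation that the C_{xy} are complete information for the chain.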

      For prokaryotic gene prediction much of the problem with obtaining high‐confidence training data can be circumvented by using a bootstrap gene‐prediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs are usually long genes, and compact in that motif searches upstream usually range over hundreds of bases rather than thousands (as in human).

      The conditional probability on the four preceding bases, estimated from counts on a single training sequence of length L, is:

$$\begin{aligned}
P(x_0 \mid x_{-1}, x_{-2}, x_{-3}, x_{-4}) &= \frac{P(x_0, x_{-1}, x_{-2}, x_{-3}, x_{-4})}{P(x_{-1}, x_{-2}, x_{-3}, x_{-4})}\\
&= \frac{\mathrm{Counts}(x_0, x_{-1}, x_{-2}, x_{-3}, x_{-4})\,/\,\mathrm{TotalCounts}(\mathrm{length}\ 5)}{\mathrm{Counts}(x_{-1}, x_{-2}, x_{-3}, x_{-4})\,/\,\mathrm{TotalCounts}(\mathrm{length}\ 4)}\\
&= \frac{\mathrm{Counts}(x_0, x_{-1}, x_{-2}, x_{-3}, x_{-4})}{\mathrm{Counts}(x_{-1}, x_{-2}, x_{-3}, x_{-4})}\left[\frac{L-3}{L-4}\right]\\
&\cong \frac{\mathrm{Counts}(x_0, x_{-1}, x_{-2}, x_{-3}, x_{-4})}{\mathrm{Counts}(x_{-1}, x_{-2}, x_{-3}, x_{-4})}
\end{aligned}$$

      since TotalCounts(length 5) = L − 4 and TotalCounts(length 4) = L − 3, and their ratio approaches 1 for long sequences.

[Figure: Schematic illustration of the hash interpolated Markov model (hIMM) and the gap/hash interpolated Markov model.]
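      The count-ratio estimate above can be coded directly; a minimal sketch (the helper name `markov5_prob` and its argument layout are my own choices, not from the text):

```python
from collections import Counter

def markov5_prob(seq, context, x0):
    # P(x0 | 4-base context) ~= Counts(context + x0) / Counts(context),
    # dropping the (L - 3)/(L - 4) edge factor, which approaches 1 for long sequences
    c5 = Counter(seq[i:i + 5] for i in range(len(seq) - 4))  # all 5-mer counts
    c4 = Counter(seq[i:i + 4] for i in range(len(seq) - 3))  # all 4-mer counts
    if c4[context] == 0:
        return 0.0  # unseen context: no estimate from counts alone
    return c5[context + x0] / c4[context]

p = markov5_prob("A" * 10, "AAAA", "A")  # 6 five-mers / 7 four-mers = 6/7
```

The unseen-context fallback is exactly where interpolated Markov models step in, backing off to a lower order when the higher-order counts are too sparse.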

      Gap Interpolated Markov Model (gIMM): like the IMM with its count cutoff, but when going to higher order in the interpolation there is no constraint to contiguous sequence elements, i.e. "gaps" are allowed. The choice of gap size when going to the next higher order is resolved by evaluating the mutual information: when going to third order in the Markov context, P(x0 | x−5, x−2, x−1) is chosen over P(x0 | x−3, x−2, x−1) if

$$\mathrm{MI}(\{x_0; x_{-1}; x_{-2}\},\, x_{-5}) > \mathrm{MI}(\{x_0; x_{-1}; x_{-2}\},\, x_{-3})$$
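      The gap-selection criterion can be estimated from pairwise counts. A minimal sketch computing the empirical mutual information (in bits) between x0 and the symbol `gap` positions back; the helper `positional_mi` is hypothetical, and pairwise MI is a simplification of the grouped-variable comparison above:

```python
from collections import Counter
from math import log2

def positional_mi(seq, gap):
    # MI = sum_{a,b} p(a, b) * log2[ p(a, b) / (p(a) p(b)) ],
    # where (a, b) = (symbol `gap` positions back, current symbol)
    pairs = [(seq[i - gap], seq[i]) for i in range(gap, len(seq))]
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)  # marginal counts of the lagged symbol
    pb = Counter(b for _, b in pairs)  # marginal counts of the current symbol
    mi = 0.0
    for (a, b), c in joint.items():
        pab = c / n
        mi += pab * log2(pab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# a period-2 sequence carries 1 bit of MI at lag 2 and none in a constant sequence
mi_strong = positional_mi("AB" * 50, 2)
```

Comparing such MI estimates at candidate lags (e.g. 5 versus 3) then decides which position enters the higher-order gIMM context.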