Lecture 6. Sequence motif models and counting
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
• Sequence motifs
  • Biological motivations
  • Representations
  • k-mer counting
• Introduction to statistical modeling
  • Motivating examples
  • Generative and discriminative models
  • Classification and regression
  • Example: Naïve Bayes classifier
Part 1 Sequence Motifs
Sequence motifs
• Many biological activities are facilitated by particular sequence patterns
• The restriction enzyme EcoRI recognizes the DNA pattern GAATTC and cuts the DNA as follows:
  5'-G     AATTC-3'
  3'-CTTAA     G-5'
• The human protein GATA3 binds DNA at regions that exhibit the pattern AGTAAGA, where the G at position 6 can also be A, and the A at position 7 can also be G or C
Sequence motifs
• In general, small recurrent patterns on biological sequences with particular functions are called sequence motifs
• We need models to represent the motifs, usually learned from some example occurrences. Goals:
  • The models should not miss true occurrences (i.e., have a low false negative rate) and should not match false occurrences (i.e., have a low false positive rate)
  • The models should take uncertainty into account
  • The models should be as simple as possible, for the sake of:
    • Computability
    • Interpretability
    • Generality
Motif representations
• Suppose we have the following sequences known to be bound by a protein:
  CACAAAC
  CACAAAT
  CGCAAAC
  CACAAAC
• Consensus sequence: CACAAAC
  • Problem: information loss
• Degenerate sequence in IUPAC (International Union of Pure and Applied Chemistry) code (see http://www.bio-soft.net/sms/iupac.html): CRCAAAY
Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
Motif representations
• Suppose we have the following aligned TFBS sequences:
  CACAAAAC
  CACAAA_T
  CGCAAAAC
  CACAAA_C
• Regular expression (see http://en.wikipedia.org/wiki/Regular_expression for syntax):
  • E.g., C[AG]CA{3,4}[CT]
• Are there other possible regular expressions?
Motif representations
• Position weight matrix (PWM): a table giving the probability of each nucleotide at each position of the motif, estimated from aligned example sequences such as:
  ATGGCATG
  AGGGTGCG
  ATCGCATG
  TTGCCACG
  ATGGTATT
  ATTGCACG
  AGGGCGTT
  ATGACATG
  ATGGCATG
  ACTGGATG
• Pseudo-counts: add a small number to each count, to alleviate problems due to small sample size
Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
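Since the matrix itself does not survive in this text, here is a minimal sketch of how such a PWM could be computed from the ten sequences above; the function name and the pseudo-count value 1.0 are our own choices for illustration:

```python
from collections import Counter

def build_pwm(seqs, pseudo=1.0, alphabet="ACGT"):
    """Estimate a position weight matrix from aligned sequences,
    adding a pseudo-count to every cell before normalizing."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)           # raw counts at position i
        total = len(seqs) + pseudo * len(alphabet)     # denominator with pseudo-counts
        pwm.append({x: (counts[x] + pseudo) / total for x in alphabet})
    return pwm

seqs = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
        "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
pwm = build_pwm(seqs)
print(pwm[0])  # probabilities of A, C, G, T at position 1
```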
Motif representations
• Sequence logo
  • Nucleotide with the highest probability on top
  • Total height of the nucleotides at the i-th position is commonly defined as hi = 2 − (Hi + en), where Hi = −Σx pi,x log2 pi,x is the entropy at position i and en = 3/(2n·ln 2) is a small-sample correction
  • pi,x: probability of character x at position i
  • n: number of sequences
  • Height of nucleotide x at position i = pi,x · hi
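A minimal sketch of this height calculation, assuming the standard information-content formula given above (the function name is ours):

```python
import math

def logo_heights(column_probs, n):
    """Height of each nucleotide in one sequence-logo column.
    column_probs: dict mapping nucleotide -> probability at this position.
    n: number of sequences the probabilities were estimated from."""
    entropy = -sum(p * math.log2(p) for p in column_probs.values() if p > 0)
    e_n = 3 / (2 * n * math.log(2))   # small-sample correction
    h = 2 - (entropy + e_n)           # total column height in bits
    return {x: p * h for x, p in column_probs.items()}

print(logo_heights({"A": 0.9, "T": 0.1, "G": 0.0, "C": 0.0}, n=10))
```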
Using a motif
• Consensus sequence: predict "Yes" if a sequence matches the consensus sequence; "No" otherwise
• Regular expression: predict "Yes" if a sequence can be generated by the regular expression; "No" otherwise
• Position weight matrix: compute a matching score for a sequence, and consider a sequence more likely to belong to the class if it has a higher score
PWM matching score
• Suppose the PWM of the binding sites of a protein is as follows (matrix not reproduced here)
• For the sequence ATGGGGTG, the likelihood is 0.9 × 0.7 × 0.7 × 0.8 × 0.1 × 0.2 × 0.7 × 0.8 = 0.00395136
• Compute the odds against the background probabilities of the four nucleotides: 0.00395136 / (pA · pG^5 · pT^2)
• Usually take log2 of the odds as the final score
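A minimal sketch of this log-odds scoring scheme. The uniform background of 0.25 per nucleotide and the toy PWM are assumptions for illustration, not the lecture's matrix:

```python
import math

def log_odds_score(seq, pwm, background=None):
    """Log2 odds of a sequence under a PWM versus a background model.
    pwm: list of dicts, one per position, mapping nucleotide -> probability."""
    if background is None:
        background = {x: 0.25 for x in "ACGT"}  # assumed uniform background
    return sum(math.log2(pwm[i][x] / background[x]) for i, x in enumerate(seq))

# Toy 3-position PWM for illustration only
toy_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
           {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
print(log_odds_score("ATG", toy_pwm))  # positive score: better than background
```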
k-mers
• Another way to represent sequence motifs: k-mers
• Training examples:
  ACCGCT
  TACCGG
  TTACCA
  AACCTG
• One vague way to summarize: "This motif is AC- and CC-rich"
k-mers
• Considerations:
  • Value of k
    • Too small: captures only local patterns
    • Too large: too restrictive, and too many possible k-mers (computationally difficult)
  • Allowing wildcards or not
    • g-gapped k-mer: among the g+k positions, only k of them are considered and the remaining g positions are ignored (here "gapped" means unspecified positions in the pattern, i.e., wildcards; it does not mean indels)
  • Representation and final use of the k-mers
Problem to study here
• Using g-gapped k-mer counts as features, compute the similarity of two sequences as their inner product
• Example (k=2, g=1)
  • Full set of g-gapped k-mers (* is the wildcard character, which can match any nucleotide):
    *AA, *AC, *AG, ..., *TT
    A*A, A*C, A*G, ..., T*T
    AA*, AC*, AG*, ..., TT*
  • Number of possible g-gapped k-mers = C(k+g, k) × 4^k = C(3, 2) × 4^2 = 48
Problem to study here
• Example (k=2, g=1) (cont'd)
  • Sequence s1 = ACCGCT
  • Sequence s2 = TACCGG
  • Similarity between s1 and s2, summed over all 48 g-gapped k-mers: sim(s1, s2) = 0×0 + 0×1 + 0×0 + ... + 1×1 + 1×1 + ... = 8 (see Excel file, also try to verify by yourself)
• These similarity values can help separate sequences that belong to a class from those that do not
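A minimal sketch of this computation (the function names are ours, not from the lecture); the dictionary counts double as the sparse representation used later:

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq, k=2, g=1):
    """Count all g-gapped k-mers of a sequence: slide a (k+g)-window
    and, for each choice of g gap positions, replace them with '*'."""
    w = k + g
    counts = Counter()
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        for gaps in combinations(range(w), g):
            pattern = "".join("*" if j in gaps else window[j] for j in range(w))
            counts[pattern] += 1
    return counts

def sim(s1, s2, k=2, g=1):
    """Inner product of the two gapped k-mer count vectors."""
    c1, c2 = gapped_kmer_counts(s1, k, g), gapped_kmer_counts(s2, k, g)
    return sum(c1[p] * c2[p] for p in c1 if p in c2)

print(sim("ACCGCT", "TACCGG"))  # 8, matching the slide
```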
Time complexity analysis
• For two sequences each with n characters, using the brute-force way of calculation:
  • Filling each count table takes (n−g−k+1) × C(g+k, k) = 3n−6 additions when k=2 and g=1
    • Linear w.r.t. sequence length
    • C(g+k, k) can be large when g is large
  • Computing the inner product takes C(k+g, k) × 4^k = 48 multiplications, followed by 47 additions
    • Exponential w.r.t. k
Speeding up the calculations
• Ideas:
  • The exponential time complexity can be avoided only if sim(s1, s2) can be computed without filling in the two whole tables
  • When k is large, the tables contain many zeroes that can be ignored
Using the ideas
• Example (k=2, g=1) (cont'd)
  • Sequence s1 = ACCGCT
    • New representation: {*CC:1, *CG:1, *CT:1, *GC:1, A*C:1, C*C:1, C*G:1, G*T:1, AC*:1, CC*:1, CG*:1, GC*:1}
  • Sequence s2 = TACCGG
    • New representation: {*AC:1, *CC:1, *CG:1, *GG:1, A*C:1, C*G:2, T*C:1, AC*:1, CC*:1, CG*:1, TA*:1}
  • Looking for common g-gapped k-mers and multiplying the corresponding counts: sim(s1, s2) = 1 (due to *CC) + 1 (*CG) + 1 (A*C) + 2 (C*G) + 1 (AC*) + 1 (CC*) + 1 (CG*) = 8
Time complexity analysis
• Suppose the new representations are produced with the help of hash tables; the final calculation then involves a linear scan of the two lists, each with at most (n−g−k+1) × C(g+k, k) entries
  • (6−1−2+1) × C(3, 2) = 12 entries when n=6, k=2, g=1
• Can be slow when g and k are large
  • For example, with k=6 and g=8, C(g+k, k) = C(14, 6) = 3003
Speeding up further
• Another idea: some g-gapped k-mers are related, and their corresponding calculations can be grouped
• For example, s1[3-5] = CGC and s2[4-6] = CGG
  • g-gapped k-mers involved:
    • s1[3-5]: {*GC, C*C, CG*}
    • s2[4-6]: {*GG, C*G, CG*}
  • Similarity between s1 and s2 due to these sub-sequences: 1 (due to CG*)
Speeding up further
• Given two length-(k+g) sub-sequences from s1 and s2 (e.g., CGC and CGG), how much do they contribute to sim(s1, s2)?
• Important observation: the answer depends only on their number of mismatches
  • In this case, there is one mismatch between CGC and CGG, and the corresponding contribution to the similarity between s1 and s2 is 1
  • In the same way, since s1[2-4] = CCG and s2[4-6] = CGG also have one mismatch, their contribution is also 1
Computing the contribution
• For any two length-(k+g) sub-sequences s1[i1-j1] and s2[i2-j2] with m mismatches:
  • There are in total C(k+g, k) ways to generate g-gapped k-mers from each of them, by choosing k non-gapped positions
  • For a particular choice of the k positions, if they do not involve any of the mismatch positions, their contribution to sim(s1, s2) is 1
  • Otherwise, their contribution is 0
  • Therefore, their total contribution to sim(s1, s2) is the number of ways to choose the k positions such that none of them is a mismatch position
  • The total number of ways is C(k+g−m, k) if k+g−m ≥ k (i.e., g ≥ m), and 0 otherwise
Computing the contribution
• A bigger example: suppose k=2, g=2
  • s1[2-5] = CCGC
  • s2[3-6] = CCGG
• Previous way of calculating their contribution to sim(s1, s2):
  • g-gapped k-mers of s1[2-5]: {**GC, *C*C, *CG*, C**C, C*G*, CC**}
  • g-gapped k-mers of s2[3-6]: {**GG, *C*G, *CG*, C**G, C*G*, CC**}
  • Contribution (number of common g-gapped k-mers): 3
• New way of calculating their contribution to sim(s1, s2):
  • Number of mismatches between s1[2-5] and s2[3-6]: 1
  • Contribution: C(k+g−m, k) = C(3, 2) = 3
Complete algorithm
• Extract all (k+g)-mers from s1 and s2
• For each pair of (k+g)-mers taken from s1 and s2 respectively, compute their contribution to sim(s1, s2)
• Sum all these contributions to get the final value of sim(s1, s2); a worked example and a code sketch follow
Complete example
• Back to k=2, g=1
  • Sequence s1 = ACCGCT
  • Sequence s2 = TACCGG
• Extract all 3-mers
  • s1: {ACC, CCG, CGC, GCT}
  • s2: {TAC, ACC, CCG, CGG}
• For each pair of 3-mers, compute their contribution to sim(s1, s2)
• Number of mismatches:
        TAC  ACC  CCG  CGG
  ACC    2    0    2    3
  CCG    3    2    0    1
  CGC    2    2    2    1
  GCT    3    2    2    3
• Contributions to sim(s1, s2), i.e., C(3−m, 2) if m ≤ 1 and 0 otherwise:
        TAC  ACC  CCG  CGG
  ACC    0    3    0    0
  CCG    0    0    3    1
  CGC    0    0    0    1
  GCT    0    0    0    0
• Therefore, sim(s1, s2) = 3 + 3 + 1 + 1 = 8
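A minimal sketch of the complete mismatch-based algorithm (function names are ours; comb is Python's built-in binomial coefficient from the math module):

```python
from math import comb

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def sim_by_mismatch(s1, s2, k=2, g=1):
    """Gapped k-mer similarity via pairwise (k+g)-mer mismatch counts."""
    w = k + g
    w1 = [s1[i:i + w] for i in range(len(s1) - w + 1)]
    w2 = [s2[i:i + w] for i in range(len(s2) - w + 1)]
    total = 0
    for a in w1:
        for b in w2:
            m = mismatches(a, b)
            if m <= g:                   # otherwise the contribution is 0
                total += comb(w - m, k)  # C(k+g-m, k)
    return total

print(sim_by_mismatch("ACCGCT", "TACCGG"))  # 8, as in the tables above
```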
Time complexity analysis
• For each length-n sequence, there are n−k−g+1 sub-sequences of length k+g
• Therefore, there are (n−k−g+1)^2 pairs of (k+g)-mers from the two sequences
• For each pair, the number of mismatches can be computed by scanning the two (k+g)-mers once
  • Can speed up using bitwise XOR operations
• The total amount of time required is O((k+g)(n−k−g+1)^2)
  • Quadratic in n, but only linear in k+g
Speeding up even further?
• It is possible to avoid considering all (k+g)-mer pairs from the two sequences, and consider only those with at most g mismatches
• We won't go into the details here (see further readings)
Image credit: Ghandi et al., PLOS Computational Biology 10(7):e1003711, 2014
Part 2 Introduction to Statistical Modeling
Statistical modeling
• We have studied many biological concepts in this course
  • Genes, exons, introns, ...
• We want to provide a description of a concept by means of some observable features
• Sometimes it can be (more or less) an exact rule:
  • The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
• In most cases it is not exact:
  • If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
  • If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
• Reasons for the descriptions to be inexact:
  • Incomplete information
    • What mutations on BRCA1/BRCA2? Any mutations on other genes?
  • Exceptions
    • "If one has fever, he/she has the flu" – not everyone with the flu has fever, and not all fever is due to the flu
  • Intrinsic randomness
Features known, concept unsure
• In many cases, we are interested in the situation where the features are observed but whether a concept is true is unknown
  • We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
  • We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
  • We know a subject is having fever, but we do not know whether he/she has a flu infection or not
Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
  • X: a set of features
    • In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
  • Y: the target concept
    • In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
• A model is a function that predicts values of Y based on observed values of X and parameters θ
Parameters
• Some details of a statistical model are provided by its parameters, θ
• Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
  • That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1−q1
  • Different people are assumed to be statistically independent
• Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
• Finally, the probability for a person to have flu is p
• Then the whole set of parameters is θ = {p, q1, q2}
Basic probabilities
• Pr(X)Pr(Y|X) = Pr(X and Y)
  • If there is a 20% chance of rain tomorrow, and whenever it rains there is a 60% chance that the temperature will drop, then there is a 0.2 × 0.6 = 0.12 chance that tomorrow it will both rain and have a temperature drop
  • Capital letters mean it is true for all values of X and Y
  • Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability: Pr(X) = Σy Pr(X and Y=y) = Σy Pr(X|Y=y)Pr(Y=y), where the summation considers all possible values of Y
  • If there is
    • A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
    • A 0.08 chance that it will both rain and not have a temperature drop tomorrow
  • Then there is a 0.12 + 0.08 = 0.2 chance that it will rain tomorrow
• Bayes' rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
  • Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
  • Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0
A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
  • 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
  • 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
  • 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from θ and X:
  • Probability that someone has fever: Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0) = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0) = (0.7)(0.2) + (0.1)(1−0.2) = 0.22
  • Probability that someone has flu, given that he/she has fever: Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1)/Pr(X=1) = (0.7)(0.2) / 0.22 ≈ 0.64
  • Probability that someone does not have flu, given that he/she has fever: Pr(Y=0|X=1) = 1 − Pr(Y=1|X=1) ≈ 0.36
  • Probability that someone has flu, given that he/she does not have fever: Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0) = [1 − Pr(X=1|Y=1)]Pr(Y=1) / [1 − Pr(X=1)] = (1 − 0.7)(0.2) / (1 − 0.22) ≈ 0.08
  • Probability that someone does not have flu, given that he/she does not have fever: Pr(Y=0|X=0) = 1 − Pr(Y=1|X=0) ≈ 0.92
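A minimal sketch that reproduces these numbers (the variable names are ours):

```python
# Parameters of the flu/fever model from the slide
p_fever_given_flu = 0.7     # Pr(X=1|Y=1)
p_fever_given_no_flu = 0.1  # Pr(X=1|Y=0)
p_flu = 0.2                 # Pr(Y=1)

# Law of total probability: Pr(X=1)
p_fever = p_fever_given_flu * p_flu + p_fever_given_no_flu * (1 - p_flu)

# Bayes' rule: Pr(Y=1|X=1) and Pr(Y=1|X=0)
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever
p_flu_given_no_fever = (1 - p_fever_given_flu) * p_flu / (1 - p_fever)

print(p_fever)               # 0.22
print(p_flu_given_fever)     # ~0.64
print(p_flu_given_no_fever)  # ~0.08
```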
Statistical estimation
• Questions we can ask:
  • Given a model, what is the likelihood of the observation?
    • Pr(X|Y,θ) – on the previous page, θ was omitted for simplicity
    • If a person has flu, how likely would he/she have fever?
  • Given an observation, what is the probability that a concept is true?
    • Pr(Y|X,θ)
    • If a person has fever, what is the probability that he/she has flu?
  • Given some observations, what is the likelihood of a parameter value?
    • Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
    • Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
• Questions we can ask (cont'd):
  • Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values maximize the data likelihood?
    • argmaxθ Pr(X|θ), or argmaxθ Pr(X,Y|θ)
  • Prediction of concept: Given a model and an observation, what is the concept most likely to be true?
    • argmaxy Pr(Y=y|X,θ)
Generative vs. discriminative modeling
• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
  • Because we can use the model to generate data
  • Example: Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly, without providing information about Pr(X,Y), it is called a discriminative model
  • Example: Logistic regression
Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
    • Y=1: the subject will develop breast cancer
    • Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
    • Y: lifespan of the subject
Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule: Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X)
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means something is true for all values. For example, all the following are true according to Bayes' rule:
  • Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
  • Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
  • Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
  • Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
Terminology
• Pr(Y) is called the prior probability
  • E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
  • Can be considered the prior guess that the concept is true before seeing any evidence
• Pr(X|Y) is called the likelihood
  • E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
• Pr(Y|X) is called the posterior probability
  • E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
Generalizations
• In general, the above is true even if:
  • X involves a set of features X = {X(1), X(2), ..., X(m)} instead of a single feature
    • Example: predict whether one has flu after knowing whether he/she has fever, headache and runny nose
  • X can take on more than 2 values, or even continuous values
    • In the latter case, Pr(X) is the probability density of X
    • Examples:
      • Predict whether a person has flu after knowing the number of times he/she has coughed today
      • Predict whether a person has flu after knowing his/her body temperature
Parameter estimation
• Let's consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
• How many parameters are there in the model?
  • We need to know the prior probabilities, Pr(Y)
    • Two parameters: Pr(Y=1), Pr(Y=0)
    • Since Pr(Y=1) = 1 − Pr(Y=0), only one independent parameter
  • We need to know the likelihoods, Pr(X|Y)
    • Suppose we have m binary features: fever, headache, runny nose, ...
    • 2^(m+1) parameters for all X and Y value combinations
    • 2(2^m − 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
  • Total: 2(2^m − 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
  • Very large, given the exponential number of parameters
List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having runny nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is (within each group, one parameter is determined by the others and is thus not independent):
  • Pr(Y=0), Pr(Y=1)
  • Pr(X(1)=0, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=0)
  • Pr(X(1)=0, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=1)
Why is having many parameters a problem?
• Statistically, we will need a lot of data to accurately estimate the values of the parameters
  • Imagine that we need to estimate the 15 independent parameters on the last page with data about only 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
• One way to reduce the number of parameters is to assume conditional independence: if X(1) and X(2) are two features, then
  Pr(X(1), X(2)|Y) = Pr(X(1)|Y,X(2)) Pr(X(2)|Y) [standard probability]
                   = Pr(X(1)|Y) Pr(X(2)|Y) [conditional independence assumption]
  • E.g., the probability for a flu patient to have fever is independent of whether he/she has a runny nose
• Important: this does not imply unconditional independence, i.e., Pr(X(1)) and Pr(X(2)) are not assumed independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1))Pr(X(2))
  • Without knowing whether a person has flu, having fever and having a runny nose are definitely correlated
Conditional independence and Naïve Bayes
• Number of parameters after making the conditional independence assumption:
  • 2 prior probabilities, Pr(Y=0) and Pr(Y=1)
    • Only 1 independent parameter, as Pr(Y=1) = 1 − Pr(Y=0)
  • 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
    • Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 − Pr(X(j)=0|Y=y) for all possible values of j and y
• Total: 2m+1 independent parameters, which is much smaller than 2(2^m − 1) + 1!
• The resulting model is usually called a Naïve Bayes model
Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated in this way:
  Pr(Y=y) = [Σi 𝕀(Yi = y)] / n, where 𝕀 is the indicator function, with 𝕀(true) = 1 and 𝕀(false) = 0
  • That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated in this way:
  Pr(X(j)=x|Y=y) = [Σi 𝕀(Xi(j) = x and Yi = y)] / [Σi 𝕀(Yi = y)]
  • That is, the fraction of class-y examples having value x at feature X(j)
• To avoid zeros, we can add pseudo-counts:
  Pr(X(j)=x|Y=y) = [Σi 𝕀(Xi(j) = x and Yi = y) + c] / [Σi 𝕀(Yi = y) + 2c], where c has a small value (one c is added for each of the two possible values of X(j), so the probabilities still sum to one)
Example
• Suppose we have the training data shown on the right (table not reproduced here)
• How many parameters does the Naïve Bayes model have?
  • With m=2 features: 2m+1 = 5 independent parameters
• Estimated parameter values using the formulas on the last page:
  • Pr(Y=1) = 3/8
  • Pr(X(1)=1|Y=1) = 2/3
  • Pr(X(1)=1|Y=0) = 2/5
  • Pr(X(2)=1|Y=1) = 1/3
  • Pr(X(2)=1|Y=0) = 1/5
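A minimal sketch of this estimation. Since the original table is not available in this text, the 8 training examples below are hypothetical rows constructed to be consistent with the estimates above:

```python
def estimate_naive_bayes(examples, m):
    """Maximum likelihood estimates for a binary Naive Bayes model.
    examples: list of (x, y) pairs, with x a tuple of m binary features."""
    n = len(examples)
    prior = sum(y for _, y in examples) / n  # Pr(Y=1)
    likelihood = {}                          # (j, y) -> Pr(X(j)=1|Y=y)
    for j in range(m):
        for y in (0, 1):
            n_y = sum(1 for _, yi in examples if yi == y)
            n_jy = sum(1 for xi, yi in examples if yi == y and xi[j] == 1)
            likelihood[(j, y)] = n_jy / n_y
    return prior, likelihood

# Hypothetical training data consistent with the slide's estimates
data = [((1, 1), 1), ((1, 0), 1), ((0, 0), 1),
        ((1, 0), 0), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0), ((0, 0), 0)]
prior, lik = estimate_naive_bayes(data, m=2)
print(prior)        # 0.375 = 3/8
print(lik[(0, 1)])  # 2/3: Pr(X(1)=1|Y=1)
print(lik[(1, 0)])  # 0.2 = 1/5: Pr(X(2)=1|Y=0)
```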
Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators – the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
• Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
• If among the flu patients in the training set, n1 have fever and n0 do not, this likelihood can be expressed as L(q) = q^n1 (1−q)^n0
  • That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1−q in the product
• Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since logarithm is an increasing function (a > b ⟺ ln a > ln b)
• This value can be found by differentiating the log likelihood and equating it to zero:
  d/dq [n1 ln q + n0 ln(1−q)] = n1/q − n0/(1−q) = 0, which gives q = n1/(n1+n0)
• The formula for estimating the prior probabilities Pr(Y) can be similarly derived