In this session, we will explore the complexities of genomic comparisons, focusing on chromosomes, sequence comparison techniques, and alignment metrics. We'll cover essential topics like the human genome structure, the importance of double-helix/base pairing, and the functionality of genes. Engage in discussions, quizzes, and challenges, while addressing critical questions about the lengths and functions of proteins and genes. Learn about whole-genome comparison metrics and methods used to analyze genetic similarities among various organisms.
Recap • Don’t forget to • pick a paper and • Email me • See the schedule to see what’s taken • http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html
Agenda • Questions for you (10 minutes) • Overview (40 minutes) • chromosomes • sequence comparison • string matching • alignment • Quiz (25+ minutes)
Questions for you • List two different functions performed by genes. • What is the length of the human genome? • Why is the double-helix/base-pairing so important?
Questions for you • Protein sequences are composed of a chain of what? • How many different amino acids are found in proteins? • Proteins always form in a helix shape (True or False)?
Questions that would stump Dr. B. • What is the lower limit on the length of a functional protein? • 10-20 • 40-50 • 60-70 • 100 • What is the upper limit on the length of proteins found in cells? • 100’s • 1000’s • 1000000’s
Questions that would stump Dr. B. • What is the average length of a human gene? • 300 • 3000 • 30,000 • Approximately how many genes are in the human genome? • 400 • 4000 • 40,000 • 400,000 • 4,000,000
Remember this picture? • [Figure: DNA strand with a backbone of alternating Sugar and Acid units carrying the bases A C A A T T T G]
Chromosomes • DNA molecule and associated proteins • The 3,000,000,000-nucleotide human genome is divided among • 22 pairs of autosomes and • 1 pair of sex chromosomes • Together the 23 pairs of chromosomes carry all the hereditary information of an organism.
DNA Sequence Comparison • Overview • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Whole Genome Comparison • Problem: Exactly how similar are two different genomes? • Given a set of genomes • which two are most similar • which two are least similar
Whole Genome Comparison • Ranking a set of genomes based on similarity gives us clues about • heredity • evolution • Similarity Rank • G2, G5: 0.99 • G3, G1: 0.97 • G4, G5: 0.91 • G4, G2: 0.90 • G4, G1: 0.80 • G4, G3: 0.78 • [Figure: diagram of genomes G1, G2, G3, G4, G5]
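A ranking like the one on this slide can be produced by scoring every pair of genomes with some metric and sorting the scores. The sketch below is only an illustration: the three short sequences are made up, and `fraction_matching` is a hypothetical placeholder metric (position-by-position agreement), not the metric used in the course.

```python
from itertools import combinations


def fraction_matching(x, y):
    """Placeholder metric (an assumption): fraction of positions that agree."""
    longest = max(len(x), len(y))
    same = sum(1 for p, q in zip(x, y) if p == q)
    return same / longest if longest else 1.0


def rank_pairs(genomes, similarity):
    """Score every pair of genomes and return them most similar first."""
    scored = [
        (a, b, similarity(genomes[a], genomes[b]))
        for a, b in combinations(sorted(genomes), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Made-up toy "genomes", far shorter than anything real.
genomes = {"G1": "ACGTACGT", "G2": "ACGTACGA", "G3": "TTGTACCA"}
ranking = rank_pairs(genomes, fraction_matching)
for a, b, score in ranking:
    print(a, b, round(score, 3))
```

The first entry of `ranking` is the most similar pair and the last entry is the least similar pair, which are the two questions posed by the whole genome comparison problem. Any other metric can be passed in place of `fraction_matching`.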
Whole Genome Comparison • Solution: Design a metric that quantifies similarity • something you can measure or • something you can compute • that accurately quantifies similarity
Whole Genome Comparison • But what does it really mean for two genomes to be similar? • Obviously, if two genomes exactly match then they are similar • But, what’s more important • rough, overall similarity, or • exact, local similarity • A picture will explain
Whole Genome Comparison • Exact matching genomes GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA
Whole Genome Comparison • Rough overall similarity • GCTTACTTAGACAAGTCGCTGATCATGCTATGCA • GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA • 2 mismatched pairs • 4 unmatched nucleotides
Whole Genome Comparison • Exact local similarities TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
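One simple way to make "exact local similarity" concrete is to look for the longest stretch of symbols that appears unchanged in both sequences. The sketch below uses a standard dynamic-programming longest-common-substring routine; it is a stand-in for illustration, and the function name is an assumption, not something from the slides.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive symbols found in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous symbol of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# The two sequences from the slide above.
s1 = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
s2 = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"
shared = longest_common_substring(s1, s2)
print(shared, len(shared))
```

The two sequences differ at both ends, so a start-to-end comparison would score them poorly, yet they share a long exact block in the middle.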
Whole Genome Comparison • The first metric: Edit Distance • The number of edit operations needed to make the two sequences equal • Edit Distance was previously used in • Spell checkers • Approximate database searching
Edit Distance • 3 edit operations • delete a symbol • insert a symbol • modify a symbol • modify = delete + insert • modify counts as two edit operations
Edit Distance • What is the edit distance between these two sequences? • Note: edit distance implies the minimum number of basic edit operations needed to make the strings equal • ERICWASABIGNERD • ERICSTILLISANERD • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions)
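The minimum number of edits can be computed with dynamic programming. The sketch below assumes the cost model from the slides: only single-symbol inserts and deletes are basic operations, so a modify costs two. The function name is an assumption made for illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-symbol inserts and deletes that turn a into b.

    A modify is not a separate operation here: it is a delete plus an insert,
    so it counts as two edits.
    """
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1])  # matching symbols cost nothing
            else:
                # delete x from a, or insert y into a
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(edit_distance("ERICWASABIGNERD", "ERICSTILLISANERD"))  # prints 11
```

The result of 11 agrees with the 5 deletions from one string plus the 6 deletions from the other.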
Edit Distance • ERICWASABIGNERD (15 symbols) • ERICSTILLISANERD (16 symbols) • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions) • Metrics • Matches 10 / Smaller Sequence 15 ≈ 67% • (Symbols 31 – Edits 11) / Symbols 31 ≈ 65%
Edit Distance • There are problems with edit distance • It doesn’t properly reward exact local similarity • which is often a true sign of biological similarity • Similar organisms often share a lot of similar genes • But may have a few genes that don’t match at all • Biologists need a metric that can reflect this type of situation
Edit Distance • Another problem • Two organisms might have almost identical DNA • Except one has extra segments • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 – Edits 50) / Symbols 250 = 80%
Edit Distance • How is it possible that two metrics based on the same principle (edit distance) could produce such different results? • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 – Edits 50) / Symbols 250 = 80%
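The gap between the two metrics can be reproduced on a small made-up case. The sketch below assumes the second metric is the fraction of symbols left unedited, (symbols - edits) / symbols, which is the reading that reproduces the percentages on the slides. The helper names, the choice of core sequence, and the 50-symbol extra segment are all illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (the number of matches)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_metrics(a, b):
    """Return (matches, edits, match_ratio, unedited_ratio) for non-empty a, b."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    matches = lcs_length(a, b)
    symbols = len(a) + len(b)
    edits = symbols - 2 * matches  # every unmatched symbol is one insert/delete
    match_ratio = matches / min(len(a), len(b))
    unedited_ratio = (symbols - edits) / symbols
    return matches, edits, match_ratio, unedited_ratio


# Hypothetical case: identical DNA except that one organism has an extra segment.
core = "GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA"
extended = core + "T" * 50  # made-up 50-symbol extra segment
print(similarity_metrics(core, extended))
print(similarity_metrics("ERICWASABIGNERD", "ERICSTILLISANERD"))
```

The first metric divides by the smaller sequence only, so the extra segment never enters it. The second divides by all symbols, so every extra symbol counts against the score.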
Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Gene Search • Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism. • They want to know • Is this segment a gene? • Advantage: Genes are similar across different organisms. • Two organisms that perform the exact same function are likely to have nearly identical genes for it.
Gene Search • Solution: • Take your newly sequenced segment • And search all the previously sequenced genomes. • Find segments (in other genomes) that highly match your segment. • Advantage: • Other genomes are marked-up • Segments that are known to be genes are labeled • If your segment matches a known gene then BAM! • You’ve found a gene in a previously un-sequenced organism.
Gene Search • Obviously, you want to search for a segment that is highly similar to your target segment. • However, this type of comparison is completely different than whole genome comparison • What is the fundamental difference?
Gene Search vs. Whole Genome Comparison • Whole genome comparison considers sequences in their entirety • Two sequences • Beginning to End
Gene Search vs. Whole Genome Comparison • Gene search doesn’t consider the entire search sequence when evaluating similarity • Two sequences • Target (the segment you sequenced) • Search Sequence (possibly a genome)
Gene Search • You want to find a sub-segment of the search sequence that highly matches the target sequence. • The entire search sequence is analyzed • But in evaluating similarity, we don’t need to consider the search sequence in its entirety • Looking for localized similarity
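The idea of scanning the whole search sequence while scoring only one sub-segment at a time can be sketched with a sliding window. This is a deliberately simplified illustration: it assumes the match has the same length as the target and counts substitutions only (no inserts or deletes), which real gene search does not assume. The function name and the toy sequences are hypothetical.

```python
def best_window(target, search):
    """Return (start, mismatches) for the window of search closest to target.

    Simplifying assumption: the window has exactly len(target) symbols and
    only mismatched positions are counted.
    """
    if not target or len(target) > len(search):
        raise ValueError("target must be non-empty and no longer than search")
    best_start, best_mismatches = 0, len(target) + 1
    for start in range(len(search) - len(target) + 1):
        window = search[start:start + len(target)]
        mismatches = sum(1 for p, q in zip(target, window) if p != q)
        if mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches


# Toy example: the target appears at position 3 with one changed symbol.
print(best_window("GATTACA", "CCCGATTGCACCC"))  # prints (3, 1)
```

The symbols of the search sequence outside the best window have no effect on the score, which is the fundamental difference from whole genome comparison.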
Gene Search • How do you even know that your newly sequenced segment is a gene? • Perhaps only part of it is a gene and the rest is junk.
Gene Search • Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence. • Writing an algorithm to find such matches is hard
Gene Search • Writing such algorithms requires coordination between • Biologists • Who have some clues about true biological similarity • And Computer Scientists • Who have some clues about what problems can be solved efficiently and reliably.
Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Next Class • Motif discovery (computer science perspective) • Alignment (the technique used to measure similarity) • Global alignment • Local alignment • Scoring matrices
Homework • Pick a paper! Email me. • Read pages 159-172