Bioinformatics & Genomics

Bioinformatics & Genomics Chemistry 160 / 260 Spring 2001 Chris Lee Boyer Hall (MBI) 601

Course Goals • Understand conceptual foundations of genomics and bioinformatics, via examples of what has been done; • Apply these principles towards inventing new kinds of bioinformatics; • Understand the role of bioinformatics in analyzing and interpreting genomics data, and how biology questions map to computational problems; • Get acquainted with existing data and tools.

Course Overview • Lecture Syllabus and Structure • Reading • Homework • Discussions • Hands-on bioinformatics activities • Grading

A Recipe for Genomics • What is genomics? • What is bioinformatics? • How does research in this new field differ from previous research (e.g. molecular biology)? • What are the necessary ingredients for success in this new science?

Definitions • Genomics: genome-scale experimental analysis of biological systems • systematic, comprehensive • automated, high-throughput • Bioinformatics: the analysis and interpretation of genomics data • computational by necessity • find the information content in the data • probabilistic, data-driven

What is Genomics? G = (MB)^n Genomics is any (molecular) biology experiment taken to the whole genome scale. • Ideally in a single experiment. • E.g. genome sequencing. • E.g. DNA microarray analysis of gene expression. • E.g. mass spectrometry protein mixture analyses: quantity, phosphorylation, etc.

From ~Nothing to ~Everything • Molecular Biology: “one gene one protein”, “one gene one lab”… “one gene one Ph.D.”… • Genomics: “one genome, one Ph.D.”?? E.g. Sorel Fitzgibbon, P. aerophilum

From ~Nothing to ~Everything • Molecular Biology: "one gene one protein", "one gene one lab", "one gene one Ph.D."… • Genomics: "one genome, one Ph.D"?? • Molecular Biology = cloning new genes. What do you do when all the genes are cloned, sequenced, and sitting in Genbank? • E.g. 2 years ago, 20% of human coding region sequences were available; This year ~ 80-90%.

How Sequencing went from Gene to Genome • Conventional “manual” sequencing • all you needed was P32, a pipetman, and a gel, and you could sequence a gene--several kb in a few weeks. • “Automated” sequencing technology (1996) • 4 color fluorescence in a single lane. • 144 kb/day raw reads per machine

Genomics Foundation: High Throughput Technology • Automation: any human step is a bottleneck. • Multiplexing & parallelization. • Miniaturization. • Read-out speed, sensitivity. • “GMP” Q/A, reproducibility, “production line” mindset.

Laser Dye Based Sequencing

Four-Color Sequencing

Automated Lane Tracking

Automated Trace Analysis

Automated Base Calling

The New Face of Biology?

DNA Prep Automation

DNA Prep Automated “Assembly Line”

XY-arm Robot w/ 96 well Plates

Microarray Plating Automation

The Human Genome Project • 1953-91: <2000 human genes identified • Ten Years to do >3 GBases, 30,000 genes • Controversy: “Nobody would know what to do with this data”. “junk DNA”? • Big Science: Impact on “Little Science” model of NIH funded research? • 90% is available as “draft” sequence (Feb. 2001, separately from NIH & Celera).

EST Sequencing • Purify poly-A mRNA, make cDNA, sequence from 3’ or 5’ end. • Quickly zeroes in on the tiny fraction of the genome of most interest to biologists: the coding regions of genes. • Rapid efforts to “stake a claim” on novel genes: Venter’s EST patenting at NIH; Incyte’s LifeSeq Database. • Should an EST sequence be patentable?

Towards a Human Gene Index • >2 M ESTs ~400 bp each. “dbEST” • 80,000 Unigene clusters including many singletons; • 10,000 correspond to known genes • 50-70% of positionally cloned genes match an EST in dbEST (1998). • 18,000 genes mapped via STSs, ~ 200 kb spacing on genome.

Complete Genome Sequences • 1995: shotgun sequencing of H. influenzae, 1.8 Mb; M. genitalium 0.6 Mb. • 1996: S. cerevisiae, 13 Mb. • 1998: C. elegans, 100 Mb. • 2000: Drosophila (120 Mb), human (3 Gb). • 2001: mouse; >100 complete genome sequences, mostly microbial.

The Yeast Genome: an Overview • before: 1000 genes defined over decades • after: >6000 genes, 56% novel, 39% mysterious. • Classification of functions: “annotation” • 50% of human disease genes have some similarity to a yeast ORF • At least 10 large duplications from one chromosome to another (7 - 170 kb each)

The Continuing Saga of the Human Genome • the hard problem: covering, assembling the genome from many pieces; repetitive regions. • minimum tiling, library requirements • YACs, BACs, maps • mapping efforts, not sequencing, rate-limiting • “walking” via map markers, vs. BAC-end sequence tagged connectors. • 150,000+ human genes vs 32,000 (latest estimate)

Sequencing from a Map

In Genomics every question is really an information problem • In molecular biology, experiments are small and designed to test a specific hypothesis clearly and directly. • In genomics, experiments are massive and not designed for a single hypothesis. • Every biology question about genomics data corresponds to a computer science problem: how to find the desired pattern in a dataset.

Anti-climax: How to turn Data into Discoveries?!? atcgtacgtacgtagctatgcatgctagctagtcattctctactcaccacagtgctacgtactgttggacatcgtatagtatttatcgatctatgtcagtactttaggtagaacgatgtgattctacctatgttggtatatcgat... When you get massive data, what do you do with it? What does it mean? Problem: no human being could ever look at all this data. So the patterns must be discovered computationally.

Example: Microarray analysis of cell-cycle regulated genes • All dividing cells follow a cycle with distinct stages of growth (G1, G2), DNA synthesis (S), and mitosis (M). • Biologists have identified many genes that function in a specific stage, and other genes that turn those genes on at the right time. • Why not analyze this regulation for all genes over the whole cell cycle?

Tracking the Yeast Cell Cycle

Synchronizing the Cycle

Expression Patterns for Known Genes

Computational Challenges • Cluster genes by expression pattern over the course of the cell cycle. • Identify groups of genes that are co-expressed, co-regulated. • Identify regulatory elements in common to the promoters of these genes, that make them be expressed at the same time.
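The first challenge above, clustering genes by expression pattern, can be sketched as follows. This is only a minimal illustration, not the method used in any particular study: it assumes Pearson correlation as the similarity measure and groups genes whose profiles correlate above a cutoff (single-linkage at a fixed threshold). The gene names, expression values, and the `threshold` parameter are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile has no pattern to correlate
    return cov / (sx * sy)

def cluster_by_correlation(profiles, threshold=0.9):
    """Group genes into connected components of the graph whose edges
    join gene pairs with correlation >= threshold."""
    genes = sorted(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())

# Hypothetical expression levels (arbitrary units) at six time points.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 2.0, 1.0, 0.5],
    "geneB": [1.1, 2.2, 3.9, 2.1, 0.9, 0.6],
    "geneC": [4.0, 2.0, 1.0, 2.0, 4.0, 5.0],
    "geneD": [3.8, 2.1, 0.9, 2.2, 4.1, 4.9],
}
clusters = cluster_by_correlation(profiles)
```

Here geneA and geneB peak together mid-series while geneC and geneD dip, so the sketch separates them into two co-expressed groups. A fixed cutoff is a crude choice; on real, noisy data the threshold and the linkage rule would need to be justified statistically.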

Automated Discovery of Cell Cycle Regulatory Elements
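One simple way to look for regulatory elements shared by a co-expressed cluster is to ask which short sequence words (k-mers) occur in many of the cluster's promoters but in few promoters elsewhere. The sketch below is an assumption about how such a search could be set up, not the procedure behind this slide: the promoter sequences are made up, the planted hexamer is a toy, and the pseudocount-smoothed enrichment ratio is an illustrative score rather than a proper significance test.

```python
from collections import Counter

def kmers_present(seq, k):
    """Set of distinct k-mers occurring in one promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def promoter_counts(promoters, k):
    """For each k-mer, count how many promoters contain it at least once."""
    counts = Counter()
    for seq in promoters:
        counts.update(kmers_present(seq.upper(), k))
    return counts

def enriched_kmers(cluster, background, k=6, pseudocount=1.0):
    """Rank k-mers by (fraction of cluster promoters containing the k-mer)
    divided by (fraction of background promoters containing it).
    The pseudocount avoids division by zero for unseen k-mers."""
    fg = promoter_counts(cluster, k)
    bg = promoter_counts(background, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = (n + pseudocount) / (len(cluster) + pseudocount)
        bg_frac = (bg[kmer] + pseudocount) / (len(background) + pseudocount)
        scores[kmer] = fg_frac / bg_frac
    # Highest ratio first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical promoter fragments; the cluster shares a planted hexamer.
cluster = ["TTACGCGTAA", "GGACGCGTCC", "CAACGCGTTG"]
background = ["TTTTAAAACC", "GGGGCCCCAA", "ATATATATAT", "CAGTCAGTCA"]
ranking = enriched_kmers(cluster, background)
top_kmer, top_score = ranking[0]
```

In this toy input the shared hexamer ranks first because it appears in every cluster promoter and in no background promoter. A real analysis would also need a measure of significance for each candidate, in line with the later slides on statistical evidence.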

Solving the Information Problem • Modeling the problem: choosing what to include, and how to describe them. • Relating this to known information problems. • Algorithms for solution. • Complexity: amount of time & memory the algorithm requires.

Genomics Requires Statistical Measures of Evidence • Evaluate competing hypotheses under uncertainty--automatically? • based on statistical tendencies, not “proofs” • false positives, false negatives • the need for cross-validation • the need for experimental validation • best role: experiment interpretation and planning
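The false-positive and false-negative bookkeeping mentioned above can be made concrete with a small sketch. It assumes a set of computational predictions, a set of experimentally validated genes, and the full set of genes tested; the function name, gene identifiers, and sets are hypothetical.

```python
def evaluate_predictions(predicted, validated, universe):
    """Tally true/false positives and negatives for a set of predictions
    against an experimentally validated set, within a gene universe."""
    predicted, validated, universe = set(predicted), set(validated), set(universe)
    tp = len(predicted & validated)   # predicted and confirmed
    fp = len(predicted - validated)   # predicted but not confirmed
    fn = len(validated - predicted)   # confirmed but missed
    tn = len(universe - predicted - validated)
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Hypothetical example: ten genes, four predicted, three validated.
universe = [f"g{i}" for i in range(1, 11)]
predicted = ["g1", "g2", "g3", "g4"]
validated = ["g1", "g2", "g5"]
report = evaluate_predictions(predicted, validated, universe)
```

Sensitivity falls as false negatives accumulate and specificity falls as false positives accumulate, so reporting both shows the trade-off a prediction method makes.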

Completeness Changes Everything • In molecular biology cleverness is finding a way to answer a question definitively by ignoring 99.99% of genes. You can’t see them, so the experiment must exclude them. • In genomics cleverness is discovering what becomes possible when you can see everything. We have to switch our deepest assumptions.

Example: Ortholog Prediction • Orthologs: two genes related by speciation events alone. “the same gene in two species”, typically, same function. • Paralogs: two genes related by at least one gene-duplication & divergence event. Homology: an ortholog or a paralog? • Experimentally very hard to answer.

Gene Evolution through Speciation vs. Duplication [Figure: gene tree contrasting speciation, which gives orthologs, with gene duplication, which gives paralogs]

Using Multiple Genomes for Ortholog Cross-Validation • Within a given genome, the ortholog should be more similar than a paralog (same time of divergence, but divergent functional pressure on the paralog). • Complete genome: if there’s an ortholog, you’ll find it! • Multiple genomes: quickly sort out lack-of-ortholog, multiple paralogs, gene-loss problems that could lead you astray.

Cluster of Orthologous Genes Orthologs should be reciprocal best hits. Tatusov, Koonin & Lipman, Science 278, 631 (1997)

Reciprocal Best Hits Indicate Orthologs [Figure: legend distinguishing reciprocal best hits from non-reciprocal best hits]
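The reciprocal best-hit criterion can be sketched in a few lines. This is a minimal illustration under stated assumptions: similarity scores between every gene pair are taken as already computed (for example by a sequence comparison program), higher scores mean greater similarity, and the gene names and scores are hypothetical.

```python
def best_hits(scores):
    """scores maps (query, subject) -> similarity score.
    Return a dict mapping each query to its highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in genome B
    and a is b's best hit in genome A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores: genome A has genes a1, a2; genome B has b1, b2.
a_vs_b = {("a1", "b1"): 950, ("a1", "b2"): 400,
          ("a2", "b1"): 500, ("a2", "b2"): 300}
b_vs_a = {("b1", "a1"): 940, ("b1", "a2"): 510,
          ("b2", "a1"): 410, ("b2", "a2"): 290}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a2's best hit is b1, but b1's best hit is a1, so only (a1, b1) is reported; a2 to b1 is a non-reciprocal best hit of the kind a paralog would produce.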

Multiple Genomes Screen Out Errors • True orthologs should give a consistent, reciprocal best-hit pattern as more genomes are added. • The chance of missing a true ortholog, or predicting an incorrect “ortholog”, decreases exponentially as more genomes are added.
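A rough way to see why the error rate falls exponentially, under the simplifying assumption (not stated on the slide) that each added genome independently lets a spurious ortholog assignment pass the consistency check with the same probability p:

```latex
\[
  P(\text{spurious ortholog survives } n \text{ genomes}) = p^{\,n},
  \qquad \text{e.g. } p = 0.1,\; n = 3 \;\Rightarrow\; p^{\,n} = 10^{-3}.
\]
```

Real genomes are related by a phylogeny, so the checks are not fully independent and the actual decrease will be slower than this idealized bound suggests.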

Simple Ortholog Cluster (found in three genomes) KatG YKR066c sll1987 Catalase peroxidase in E. coli, yeast, Synechocystis, from Tatusov, Koonin & Lipman, Science 278, 631 (1997) http://www.sciencemag.org/cgi/content/full/278/5338/631

From Reductionism to Systems Analysis • Mol. Biol.: dissect a complex phenomenon into its smallest pieces; characterize each. • Very hard to put the pieces back together again: Given AB, AC: A+B+C = ? • Genomics: The cell as test-tube. Able to see A+B+C (+D+E+…) working together. • Study how all the components work together as a system. Study system behavior.

From Hypothesis-Driven to Data-Driven Science • Mol. Biol.: can’t see 99.99% of genes, so use black-box logic based on controls: keep everything the same except for one small change. Isolate a specific cause-effect. • In reality you rarely have the perfect control. • Hypothesis driven: can only see what you look for: a few genes, a few controls. • Interpretable: ask a YES-NO question.

From Hypothesis-Driven to Data-Driven Science • Genomics: measure all genes at once. • Don’t have to assume a hypothesis as basis for designing the experiment. • Objective: let the data speak for themselves. • Reality: vast amounts of data, very complex, hard to interpret. “System Science” or just “Stupid Science”?

Stupid Science: Data-driven Science Done Wrong • No hypothesis. • Assumptions: alternative models not explicitly enumerated, weighed. • Statistical basis of model either neglected or only implicit (and therefore poor). • No cross-validation: just one form of evidence. • Greedy algorithms, sensitive to noise. • Measures of significance weak or absent, both computationally and experimentally.

Data-driven Science Done Right • Multiple competing hypotheses. • Alternative models explicitly included, computed, to eliminate assumptions. • Statistical models clear, well-justified. • Multiple, independent types of evidence. • Robust algorithms w/ well demonstrated convergence to global optimum. • Rigorous posterior probability calculated for all possible models of the data. Priors derived from data. False +/- measured.
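The last point, computing a posterior probability for every competing model, is Bayes' rule applied across an explicit list of models. The sketch below assumes the prior and the likelihood of the data under each model are already known; the model names and numbers are hypothetical and chosen only to show the arithmetic.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule over an explicit set of competing models:
    P(model | data) = P(data | model) * P(model) / sum over all models."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("data has zero probability under every model")
    return {m: joint[m] / evidence for m in joint}

# Hypothetical priors and likelihoods for three competing explanations
# of an observed sequence similarity.
priors = {"ortholog": 0.2, "paralog": 0.3, "unrelated": 0.5}
likelihoods = {"ortholog": 0.6, "paralog": 0.2, "unrelated": 0.01}
post = posteriors(priors, likelihoods)
best_model = max(post, key=post.get)
```

Because every model is enumerated and normalized together, the result shows not only which model is most probable but also how much probability remains on the alternatives, which is what distinguishes this from picking a single best guess.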
