Evolution and Genome Analysis with Santa Cruz Genome Browser

Evolution and the Santa Cruz Genome Browser. Jim Kent and the Genome Bioinformatics Group, University of California Santa Cruz; Pennsylvania State University

Typical Gene Level View: Sialic Acid Binding/Ig-like Lectin 7

Known Gene Details Page

PDB Ribbon Diagram: 4 clicks away by the wonder of the World Wide Web

Hox A Cluster, Many Tracks

Track Controls are Now Grouped

Packed mode saves space, makes labels easier to find.

Squished mode is ideal for ESTs and mouse/human homology

Squished mode is ideal for ESTs and mouse/human homology. ESTs hint at a smaller version of exon 2.

Publication Quality Output

Comparative Genomics

Chaining Alignments
• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.
• Local alignments tend to break at transposon insertions, inversions, duplications, etc.
• Global alignments tend to force non-homologous bases to align.
• Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments. Protease Regulatory Subunit 3.

Affine penalties are too harsh for long gaps. Log count of gaps vs. size of gaps in mouse/human alignment, correlated with sizes of transposon relics. Affine gap scores model the red/blue plots as straight lines.

Gaps are needed in both sequences in the general case of pairwise alignment; otherwise non-homologous bases can be forced to pair.

2-D histogram of observed gaps. The horizontal axis is gaps in human, the vertical axis is gaps in mouse. The logarithms of the counts of gaps in bins of 10 (left) and bins of 500 (right) are plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.

Before and After Chaining

Chaining Algorithm
• Input: blocks of gapless alignments from blastz.
• Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$
• Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
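The recurrence above can be illustrated with a small sketch. This is not the KD-tree implementation the slide describes: it is a plain O(N²) dynamic program, and the `Block` fields, the `gap_cost` function and its constants are hypothetical choices made only for this example. It also assumes that a block may start a new chain with score match(B_i), which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A gapless alignment block (hypothetical layout for this sketch)."""
    t_start: int   # start in the target sequence (e.g. human)
    q_start: int   # start in the query sequence (e.g. mouse)
    size: int      # length of the gapless block
    match: float   # match score of the block


def gap_cost(prev, cur):
    """Illustrative gap cost: per-base charge plus a surcharge for a double-sided gap."""
    dt = cur.t_start - (prev.t_start + prev.size)
    dq = cur.q_start - (prev.q_start + prev.size)
    cost = 0.1 * (dt + dq)
    if dt > 0 and dq > 0:
        cost += 2.0
    return cost


def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain, using the quadratic recurrence."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not blocks:
        return 0.0, []
    score = [b.match for b in blocks]   # assumed base case: a block alone starts a chain
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # prev must end before cur begins in both sequences
            if prev.t_start + prev.size > cur.t_start:
                continue
            if prev.q_start + prev.size > cur.q_start:
                continue
            cand = score[j] + cur.match - gap_cost(prev, cur)
            if cand > score[i]:
                score[i], back[i] = cand, j
    end = max(range(len(blocks)), key=score.__getitem__)
    chain, k = [], end
    while k is not None:
        chain.append(blocks[k])
        k = back[k]
    chain.reverse()
    return score[end], chain
```

With hundreds of thousands of blocks the inner loop over every earlier block is too slow, which is the reason the slide gives for restricting the search with a KD-tree.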

Netting Alignments
• Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions.
• Net finds the best mouse match for each human region.
• Highest-scoring chains are used first.
• Lower-scoring chains fill in gaps within chains, inducing a natural hierarchy.
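The highest-score-first idea can be sketched as a greedy assignment of human (target) intervals to chains. This is a simplified, flat illustration and not the actual netting program: it does not track the nesting levels of the hierarchy, and the input layout (a dict mapping a chain name to a score and a list of half-open target intervals) is an assumption made for the example.

```python
def subtract(interval, covered):
    """Return the parts of (start, end) not overlapped by `covered` (sorted, disjoint)."""
    start, end = interval
    pieces = []
    for c_start, c_end in covered:
        if c_end <= start:
            continue
        if c_start >= end:
            break
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def flat_net(chains):
    """Assign each target base to the highest-scoring chain that aligns it.

    chains: {name: (score, [(start, end), ...])}, blocks within a chain disjoint.
    Returns a sorted list of (start, end, name) triples.
    """
    covered = []   # sorted, disjoint intervals already claimed
    owned = []
    for name, (score, blocks) in sorted(chains.items(), key=lambda kv: -kv[1][0]):
        for block in blocks:
            for start, end in subtract(block, covered):
                owned.append((start, end, name))
        covered = sorted((s, e) for s, e, _ in owned)
    return sorted(owned)
```

In this sketch a lower-scoring chain only keeps the bases that fall into gaps left by better chains, which is the behaviour the last bullet describes.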

Net Focuses on Ortholog

Net highlights rearrangements. A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudogenes.

Useful in finding pseudogenes. Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. The domain structure of the resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics. Number of rearrangements of a given type per megabase.

A Rearrangement Hot Spot. Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000-base region is between two very long chains on chromosome 7.

Rat Genome year of the rat - 2008

Rat/Mouse/Human Genome-Wide Multiz Alignments Available. Eye lens protein gamma crystallin a. The upstream region (on the right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.

Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.

Zoom to Base Level Detail near translation start of tubulin 8

Zoom to Base Level: intron consensus sequence visible.

Zoom to Base Level: possible alt-splice, not consensus and not conserved.

Tiling the genome in Microarrays. New genes on 21 and 22?

Cross-hybridization at Work Zoomed in on right side:

200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters
>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc
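The header lines of these records carry their own metadata as `key=value` pairs, so a short parser can recover the coordinates and strand. The sketch below is an illustration only: `parse_header` is a hypothetical helper, not part of any browser tool, and it assumes the `range` field is 1-based and inclusive, which is consistent with each record spanning the 200 bases named in the slide title.

```python
import re


def parse_header(header):
    """Split a '>name range=chrom:start-end key=value ...' header into a dict."""
    fields = header.lstrip(">").split()
    info = {"name": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        info[key] = value
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", info["range"])
    if match is None:
        raise ValueError(f"unrecognized range field: {info['range']!r}")
    chrom, start, end = match.groups()
    info["chrom"] = chrom
    info["start"] = int(start)
    info["end"] = int(end)
    # Assumption: coordinates are 1-based and inclusive.
    info["length"] = info["end"] - info["start"] + 1
    return info


example = (">hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 "
           "5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none")
record = parse_header(example)
print(record["chrom"], record["start"], record["end"], record["strand"], record["length"])
```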

Acknowledgements
Individuals: Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler, Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins, LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats. Everyone else!
Institutions: NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.

THE END

A Cautionary Note
• Infant digestive systems are very permeable and take up antibodies.
• ~10% of infants are allergic to cow’s-milk-based formula.
• These infants get soy/corn-based formula.
• As we engineer plants, let’s be careful what we put in infant formula.

New Algorithms and Data
• ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements.
• Rat genome is browsable and used in rat/mouse/human multiple alignments.
• Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.

Ideal Gap Penalties
• Would allow gaps in both sequences at once.
• Would penalize long gaps less than affine gap scores.
• Still would be quick to compute.
• We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
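A gap cost of the kind described in the last bullet might look like the sketch below. Every number here (the breakpoints, their costs, the double-gap surcharge and the affine parameters) is invented for illustration and is not taken from the talk; the point is only the shape of the function, which grows more slowly than an affine cost as gaps get long.

```python
from bisect import bisect_right

# Hypothetical breakpoints: total gap size -> cost at that size.
SIZES = [1, 10, 100, 1000, 10000]
COSTS = [4.0, 20.0, 60.0, 100.0, 140.0]
BOTH_PENALTY = 50.0   # hypothetical surcharge when both sequences have a gap


def piecewise_linear(size):
    """Linearly interpolate the cost between breakpoints (last slope continues beyond)."""
    if size <= SIZES[0]:
        return COSTS[0]
    k = min(bisect_right(SIZES, size) - 1, len(SIZES) - 2)
    frac = (size - SIZES[k]) / (SIZES[k + 1] - SIZES[k])
    return COSTS[k] + frac * (COSTS[k + 1] - COSTS[k])


def gap_cost(dt, dq):
    """Cost of a gap of dt bases in one sequence and dq bases in the other."""
    if dt == 0 and dq == 0:
        return 0.0
    cost = piecewise_linear(dt + dq)
    if dt > 0 and dq > 0:
        cost += BOTH_PENALTY
    return cost


def affine_cost(size, gap_open=2.0, gap_extend=2.0):
    """Classic affine cost, shown only for comparison."""
    return gap_open + gap_extend * size


for size in (1, 10, 100, 1000):
    print(size, gap_cost(size, 0), affine_cost(size))
```

Looking up a breakpoint table keeps the cost quick to compute, in line with the third bullet.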
