Aligning Transcribed Sequences to the Human and Mouse Genomes

Yongchang Gan, Jonathan Crabtree, Chris Stoeckert • Computational Biology and Informatics Laboratory (CBIL) • Center for Bioinformatics • University of Pennsylvania

The Transcribed Sequences • dbEST expressed sequence tags (ESTs) • ~4 million human • ~2.5 million mouse • Highly variable quality • GenBank mRNAs and RefSeqs • Many are “full length”, high quality • Includes RIKEN cDNAs • Did not include GenBank HTC division

DoTS: Database of Transcribed Sequences • Cluster ESTs & mRNAs by similarity • Assemble the clusters with CAP4 • Goal is to produce one sequence per transcript • Annotate resulting consensus seqs. • Predict protein sequences • Run BLAST searches • Predict GO function • Link to RH maps, gene trap cell lines, expression data, MGI, GeneCards, etc. • Results at http://www.allgenes.org

A Sample DoTS Assembly

DoTS “Singletons” • Sequences that do not assemble with anything else in the database • Singletons are usually ESTs • Represent either 5’ or 3’ end of a gene

The Genomes: Human • Recent events • June 2000: “working drafts” announced • Feb. 2001: first analyses published • Feb. 2002: UCSC exits assembly business • Current public draft sequence • July, 2002: NCBI Build #30 • June 28, 2002 freeze of GenBank data • 87% finished seq., est. 94-97% coverage

The Genomes: Mouse • Recent events (public sequence) • Late 2000: shotgun sequencing begun • Late 2001: first assemblies created • April 2002: Arachne chosen over Phusion • Current public draft sequence • April, 2002: MGSCv3 • February, 2002 freeze of ~7X shotgun • Estimated 90-95% coverage

Aligning transcripts with DNA • [Figure: a transcribed sequence (e.g., mRNA) with 5’ UTR, CDS, and 3’ UTR, aligned to the genome (i.e., DNA), where it splits into exon 1, exon 2, and exon 3. Labeled “*** DRAMATIZATION ***”.]

What are the goals? • Find genes & delineate their boundaries • Investigate alternative splicing • Validate DoTS assemblies • Gain insight into sources of error • Assess whether anything is gained by assembling ESTs before aligning them

Potential “unsplicing” tools • BLAST • Good general-purpose local alignment tool • But not well-suited to this specific task • Special-purpose alignment tools • e.g., est2genome (Birney, Durbin), est_genome (Mott), sim4 (Florea et al.) • Perform well, but are very slow

Unsplicing: a first attempt • BLAST-sim4 heuristic algorithm • Employs a two-step approach • BLASTN - find candidate locations • sim4 – perform precise alignments • Much faster than sim4 alone • But still slow for whole-genome analysis • Similar in spirit to Spidey (Wheelan et al.), post-processes BLASTN results

Unsplicing: BLAT • BLAT: BLAST-Like Alignment Tool • Written by Jim Kent at UCSC • Indexes target db, not query sequence • Takes advantage of additional constraints • Adjusts exon boundaries using splice signals • Attempts to locate small exons • 500x speedup with no loss of sensitivity
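The "indexes target db, not query sequence" point can be illustrated with a toy seed lookup. This is only a minimal sketch of the general idea (non-overlapping k-mers of the target are indexed, overlapping k-mers of the query are looked up), not BLAT's actual implementation; the function names `build_target_index` and `find_seed_hits` and the tiny value of `k` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_target_index(target: str, k: int) -> Dict[str, List[int]]:
    """Index the NON-overlapping k-mers of the target (genome) sequence."""
    index: Dict[str, List[int]] = defaultdict(list)
    # Step by k so that each target base belongs to at most one indexed k-mer.
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query: str, index: Dict[str, List[int]], k: int) -> List[Tuple[int, int]]:
    """Look up every OVERLAPPING k-mer of the query; return (query_pos, target_pos) hits."""
    hits: List[Tuple[int, int]] = []
    for q_pos in range(len(query) - k + 1):
        # .get() avoids creating empty entries in the defaultdict.
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            hits.append((q_pos, t_pos))
    return hits
```

Because the index is built once per genome, the same index can serve every transcript query; in a real tool the seed hits would then be extended and chained into exon-sized blocks.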

Overview of alignment process • BLAT RefSeq mRNAs + DoTS sequences against respective genomes • Load alignments into database • Compute summary information • Including alignment “quality” • Merge selected alignments into “genes” • Eliminates redundancy in DoTS • Provides estimate of total gene number

BLAT Alignments: first step • Default parameters, repeats masked • All with >=10% of query loaded into db • Summary information computed • e.g., max_query_gap, max_target_gap • polyA tails detected, 3’ and 5’ (!) • Alignment quality
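The summary values named on this slide can be sketched from an alignment's gapless blocks. The code below is a hedged illustration, not the DoTS loading code: it assumes each alignment is a sorted list of (query start, target start, length) blocks, and it assumes the 10% threshold refers to the fraction of query bases covered by blocks. The `Alignment` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Alignment:
    q_size: int                          # full length of the query (transcript)
    blocks: List[Tuple[int, int, int]]   # (q_start, t_start, length), sorted by q_start


def aligned_fraction(aln: Alignment) -> float:
    """Fraction of query bases covered by aligned blocks (assumed meaning of '% of query')."""
    return sum(length for _, _, length in aln.blocks) / aln.q_size


def max_gaps(aln: Alignment) -> Tuple[int, int]:
    """Return (max_query_gap, max_target_gap) between consecutive blocks."""
    max_q = max_t = 0
    for (q0, t0, n0), (q1, t1, _) in zip(aln.blocks, aln.blocks[1:]):
        max_q = max(max_q, q1 - (q0 + n0))   # unaligned query bases between blocks
        max_t = max(max_t, t1 - (t0 + n0))   # genomic distance between blocks (intron-like)
    return max_q, max_t


def should_load(aln: Alignment, min_fraction: float = 0.10) -> bool:
    """Keep only alignments covering at least min_fraction of the query."""
    return aligned_fraction(aln) >= min_fraction
```

A large target gap with a small query gap is the signature of an intron, which is why both values are worth recording separately.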

Alignment Quality • This results in many alignments • How to identify those that represent the actual location(s) of each transcript? • Assuming that: • The transcribed sequence is real • The corresponding genomic sequence(s) is/are accurate and complete • Use a heuristic approach

Defining Alignment Quality • (1) “Very good” • >= 95% average sequence identity • max_query_gap <= 5 bp • Both ends are consistent: • no more than 10 bp mismatch unless polyA • polyA rule cannot be used on both ends
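The “very good” rule can be written as a small predicate. This is a sketch under stated assumptions: the end "mismatch" is taken to mean the number of unaligned query bases at that end, and "polyA rule cannot be used on both ends" is read as allowing at most one end to be excused by a polyA tail. The function name and argument names are hypothetical.

```python
def is_very_good(identity_pct: float,
                 max_query_gap: int,
                 end5_unaligned: int,
                 end3_unaligned: int,
                 polya_at_5: bool,
                 polya_at_3: bool,
                 min_identity: float = 95.0,
                 gap_limit: int = 5,
                 end_limit: int = 10) -> bool:
    """Heuristic 'very good' test, using the thresholds given on the slide."""
    if identity_pct < min_identity or max_query_gap > gap_limit:
        return False

    excused_ends = 0
    for unaligned, has_polya in ((end5_unaligned, polya_at_5),
                                 (end3_unaligned, polya_at_3)):
        if unaligned > end_limit:
            if not has_polya:
                return False      # inconsistent end with no polyA explanation
            excused_ends += 1     # end is inconsistent but explained by polyA

    # Assumed reading: polyA may excuse one end, never both.
    return excused_ends <= 1
```

For example, an alignment with 98% identity whose 3’ end has 30 unaligned bases of polyA passes, while the same alignment with unexplained overhang at either end does not.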

Control experiment #1 • Compared: • “Very good” RefSeq alignments to hChr22/mChr5 • mRNA alignments in UCSC annotation database • False positives (FP): ~0; false negatives (FN): ~18% and ~35% • (2) “Very good, but with gaps” • Same as “very good”, but mismatches are allowed if there is a sufficiently large genomic sequence gap (within 10X the mismatch length at the ends). • New false negative rates: ~15% and ~13%

Control experiment #2 • RefSeqs that had “very good” alignments alone, but not when assembled with other sequences: • hChr22: 98/255 (38%) • mChr5: 109/271 (40%) • Mostly due to problems at ends of DoTS seqs. • (3) “Good” • Same as “very good w/ gaps” but allow: • max_query_gap <= 15 bp (vs. 5 bp) • Up to 50 bp of mismatch at each end (vs. 10 bp) • Reduces to 25/255 (~10%) and 33/271 (~12%)
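The percentages on this slide follow directly from the counts. The short check below recomputes them; the dictionary layout and the `pct` helper are illustrative only, and the counts are copied from the slide.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage of RefSeqs lacking a qualifying alignment once assembled."""
    return 100.0 * numerator / denominator


# (failing RefSeqs, total RefSeqs) under each quality definition, from the slide.
counts = {
    "hChr22": {"very good": (98, 255), "good": (25, 255)},
    "mChr5": {"very good": (109, 271), "good": (33, 271)},
}

rates = {
    chrom: {label: pct(n, d) for label, (n, d) in by_class.items()}
    for chrom, by_class in counts.items()
}

for chrom, by_class in rates.items():
    for label, value in by_class.items():
        print(f"{chrom} {label}: {value:.1f}%")
```

Relaxing the definition from “very good” to “good” removes roughly three quarters of the discrepancies on both chromosomes.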

Alignment statistics: human • hDoTS (08/02) vs. human genome (NCBI 30) • Total DoTS sequences: 859,545 (~230,000) • Alignments loaded: 5,544,300 / 8,975,529

Alignment statistics: mouse • mDoTS (07/02) vs mouse genome (MGSCv3) • Total DoTS sequences: 579,906 (~129,000) • Alignments loaded: 3,208,572/4,663,903

Merging adjacent/overlapping alignments into “genes” • Select BLAT alignments • Parameters: min. quality, min_target_gap • Merge overlapping alignments • Merge nearby alignments where an assembly in each has an EST from a common clone • Parameter: max distance (500 kb) • Merge nearby alignments • Parameter: max distance (75 bp) • Only merge alignments on the same strand • Identify genes with an intron of at least 15 bp
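The merging rules can be sketched as a repeated pairwise merge. This is a simplified illustration rather than the actual CBIL procedure: it assumes half-open genomic intervals, represents clone membership as a set of clone identifiers per alignment, merges until no further merge is possible, and leaves out the quality selection and the final intron test. The `Gene` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Gene:
    chrom: str
    strand: str
    start: int                      # half-open interval [start, end)
    end: int
    clones: Set[str] = field(default_factory=set)


def can_merge(g: Gene, h: Gene, max_dist: int, max_clone_dist: int) -> bool:
    """Same strand, and overlapping/nearby, or clone-linked within a larger distance."""
    if (g.chrom, g.strand) != (h.chrom, h.strand):
        return False
    gap = max(g.start, h.start) - min(g.end, h.end)   # <= 0 when intervals overlap
    if gap <= max_dist:
        return True
    return gap <= max_clone_dist and bool(g.clones & h.clones)


def merge_into_genes(alignments: List[Gene],
                     max_dist: int = 75,
                     max_clone_dist: int = 500_000) -> List[Gene]:
    """Merge alignments into gene-level intervals until nothing more can be merged."""
    genes = [Gene(a.chrom, a.strand, a.start, a.end, set(a.clones)) for a in alignments]
    merged = True
    while merged:
        merged = False
        for i in range(len(genes)):
            for j in range(i + 1, len(genes)):
                if can_merge(genes[i], genes[j], max_dist, max_clone_dist):
                    g, h = genes[i], genes[j]
                    g.start, g.end = min(g.start, h.start), max(g.end, h.end)
                    g.clones |= h.clones
                    del genes[j]
                    merged = True
                    break
            if merged:
                break               # restart the scan after every merge
    return sorted(genes, key=lambda g: (g.chrom, g.strand, g.start))
```

The quadratic rescan keeps the sketch short; a production version would sort by position and sweep once per strand. Repeating until no merge applies matters because a clone-linked merge can widen a gene so that it newly overlaps an alignment lying between the two linked ones.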

Algorithm Calibration • Human chr22q (~34 Mb) as test case • Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes) • Focus on DiGeorge Critical Region • DGCR6 to ZNF74 (~1.6 Mb) • Contains 24-33 genes based on literature (Sanger: 44 genes with 33 known) • Note: used DoTS 02/02 release vs. Golden Path 12/01 release, and old BlatAlignment table (limited quality classes).

Results - human

Results - mouse

Known problems/issues • Incorrectly oriented DoTS assemblies • Distinguishing single-exon genes from genomic contaminants, antisense and/or functional non-coding RNAs • A large number of ESTs have no alignments at all (above the 10% threshold) • Currently investigating why this is so…

Current and future work • Detailed assessment of results in 14Mb of mouse chr. 5 (CBIL + Bucan lab.) • Augment alignments with other sequence signals (Hatzigeorgiou lab.) • Incorporate alignments into DoTS build process from the outset

Acknowledgements • BLAT Alignments/Gene Merging • Yongchang Gan (see poster!) • Database of Transcribed Sequences (DoTS) • Brian Brunk, Steve Fischer, Deborah Pinney • Mouse Chr. 5 annotation project • Joan Mazzarelli • Maja Bucan lab. • Artemis Hatzigeorgiou lab. • Chris Stoeckert (PI, CBIL)

Is EST assembly still relevant? • Not every organism has a genome project • EST sequencing is still a relatively cheap way to survey a transcriptome • Though array-based approaches are also very powerful, if the sequence is known • Not every EST will necessarily align to the draft genome; may want to cluster the rest • Annotation component of DoTS is useful, regardless of the assembly method
