Aligning Transcribed Sequences to the Human and Mouse Genomes

Presentation Transcript


  1. Aligning Transcribed Sequences to the Human and Mouse Genomes Yongchang Gan, Jonathan Crabtree, Chris Stoeckert Computational Biology and Informatics Laboratory (CBIL) Center for Bioinformatics University of Pennsylvania

  2. The Transcribed Sequences
     • dbEST expressed sequence tags (ESTs)
       • ~4 million human
       • ~2.5 million mouse
       • Highly variable quality
     • GenBank mRNAs and RefSeqs
       • Many are “full length”, high quality
       • Includes RIKEN cDNAs
       • Did not include GenBank HTC division

  3. DoTS: Database of Transcribed Sequences
     • Cluster ESTs & mRNAs by similarity
     • Assemble the clusters with CAP4
       • Goal is to produce one sequence per transcript
     • Annotate resulting consensus seqs.
       • Predict protein sequences
       • Run BLAST searches
       • Predict GO function
       • Link to RH maps, gene trap cell lines, expression data, MGI, GeneCards, etc.
     • Results at http://www.allgenes.org

  4. A Sample DoTS Assembly

  5. DoTS “Singletons” • Sequences that do not assemble with anything else in the database • Singletons are usually ESTs • Represent either 5’ or 3’ end of a gene

  6. The Genomes: Human
     • Recent events
       • June 2000: “working drafts” announced
       • Feb. 2001: first analyses published
       • Feb. 2002: UCSC exits assembly business
     • Current public draft sequence
       • July 2002: NCBI Build #30
       • June 28, 2002 freeze of GenBank data
       • 87% finished seq., est. 94-97% coverage

  7. The Genomes: Mouse
     • Recent events (public sequence)
       • Late 2000: shotgun sequencing begun
       • Late 2001: first assemblies created
       • April 2002: Arachne chosen over Phusion
     • Current public draft sequence
       • April 2002: MGSCv3
       • February 2002 freeze of ~7X shotgun
       • Estimated 90-95% coverage

  8. Aligning transcripts with DNA [Figure: a transcribed sequence (e.g., mRNA) with 5’ UTR, CDS, and 3’ UTR regions, shown against the genome (i.e., DNA)]

  9. Aligning transcripts with DNA [Figure: the same transcribed sequence (5’ UTR, CDS, 3’ UTR) aligned to the genome as exon 1, exon 2, and exon 3; labeled “*** DRAMATIZATION ***”]

  10. What are the goals? • Find genes & delineate their boundaries • Investigate alternative splicing • Validate DoTS assemblies • Gain insight into sources of error • Assess whether anything is gained by assembling ESTs before aligning them

  11. Potential “unsplicing” tools • BLAST • Good general-purpose local alignment tool • But not well-suited to this specific task • Special-purpose alignment tools • e.g., est2genome (Birney, Durbin), est_genome (Mott), sim4 (Florea et al.) • Perform well, but are very slow

  12. Unsplicing: a first attempt
      • BLAST-sim4 heuristic algorithm
        • Employs a two-step approach
          • BLASTN: find candidate locations
          • sim4: perform precise alignments
        • Much faster than sim4 alone
        • But still slow for whole-genome analysis
      • Similar in spirit to Spidey (Wheelan et al.), which post-processes BLASTN results

  13. Unsplicing: BLAT • BLAT: BLAST-Like Alignment Tool • Written by Jim Kent at UCSC • Indexes target db, not query sequence • Takes advantage of additional constraints • Adjusts exon boundaries using splice signals • Attempts to locate small exons • 500x speedup with no loss of sensitivity
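The "index the target, not the query" idea on the BLAT slide can be illustrated with a toy k-mer index. This is only a minimal sketch and not BLAT's actual implementation: the function names `build_index` and `find_seed_hits`, the tiny sequences, and the choice of k = 4 are all assumptions made for illustration. The target is indexed once using non-overlapping k-mers, and every overlapping k-mer of each query is then looked up in that index.

```python
from collections import defaultdict


def build_index(target, k):
    """Map each non-overlapping k-mer of the target to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query, index, k):
    """Return (query_pos, target_pos) pairs where a query k-mer is in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # .get avoids adding empty entries to the defaultdict on a miss
        for tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, tpos))
    return hits


# Toy data (hypothetical): the query is an exact substring of the target.
target = "ACGTACGTTTGACCA"
index = build_index(target, 4)
hits = find_seed_hits("GTTTGACC", index, 4)
print(hits)
```

Because the index is built once per genome, the per-query cost is a series of dictionary lookups; a real tool would then extend and chain such seed hits into full alignments.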

  14. Overview of alignment process
      • BLAT RefSeq mRNAs + DoTS sequences against respective genomes
      • Load alignments into database
      • Compute summary information
        • Including alignment “quality”
      • Merge selected alignments into “genes”
        • Eliminates redundancy in DoTS
        • Provides estimate of total gene number

  15. BLAT Alignments: first step • Default parameters, repeats masked • All with >=10% of query loaded into db • Summary information computed • e.g., max_query_gap, max_target_gap • polyA tails detected, 3’ and 5’ (!) • Alignment quality
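The summary values named on this slide could be derived from the aligned blocks that BLAT reports in its PSL output (block sizes plus query and target start coordinates). The sketch below is one possible reading, not the authors' code: `summarize_blocks` and `passes_load_threshold` are hypothetical names, and it assumes that "gap" means the unaligned distance between consecutive blocks and that the 10% threshold applies to the fraction of query bases covered by blocks.

```python
def summarize_blocks(query_size, block_sizes, q_starts, t_starts):
    """Summarize one alignment given as parallel lists of ungapped blocks."""
    max_query_gap = 0
    max_target_gap = 0
    for i in range(len(block_sizes) - 1):
        # Distance between the end of block i and the start of block i + 1
        q_gap = q_starts[i + 1] - (q_starts[i] + block_sizes[i])
        t_gap = t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        max_query_gap = max(max_query_gap, q_gap)
        max_target_gap = max(max_target_gap, t_gap)
    return {
        "fraction_aligned": sum(block_sizes) / query_size,
        "max_query_gap": max_query_gap,
        "max_target_gap": max_target_gap,
    }


def passes_load_threshold(summary, min_fraction=0.10):
    """Assumed reading of the slide: keep alignments covering >= 10% of the query."""
    return summary["fraction_aligned"] >= min_fraction


# Hypothetical three-block alignment of a 500 bp query
summary = summarize_blocks(
    query_size=500,
    block_sizes=[100, 150, 200],
    q_starts=[10, 110, 263],
    t_starts=[1000, 3100, 7250],
)
print(summary, passes_load_threshold(summary))
```

In this example the large target gaps would correspond to introns, while the small query gap is a few unaligned transcript bases between two blocks.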

  16. Alignment Quality • This results in many alignments • How to identify those that represent the actual location(s) of each transcript? • Assuming that: • The transcribed sequence is real • The corresponding genomic sequence(s) is/are accurate and complete • Use a heuristic approach

  17. Defining Alignment Quality
      • (1) “Very good”
        • >= 95% average sequence identity
        • max_query_gap <= 5 bp
        • Both ends are consistent:
          • no more than 10 bp mismatch unless polyA
          • polyA rule cannot be used on both ends

  18. Control experiment #1
      • Compared:
        • “Very good” RefSeq alignments to hChr22/mChr5
        • mRNA alignments in UCSC annotation database
      • FP: ~0; FN: ~18% and ~35%
      • (2) “Very good, but with gaps”
        • Same as “very good”, but mismatches are allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length at the ends)
        • New false negative rates: ~15% and 13%

  19. Control experiment #2
      • RefSeqs that had “very good” alignments alone, but not when assembled with other sequences:
        • hChr22: 98/255 (38%)
        • mChr5: 109/271 (40%)
        • Mostly due to problems at ends of DoTS seqs.
      • (3) “Good”
        • Same as “very good w/ gaps” but allow:
          • max_query_gap <= 15 bp (vs. 5 bp)
          • Up to 50 bp of mismatch at each end (vs. 10 bp)
        • Reduces to 25/255 (~10%) and 33/271 (~12%)
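The three quality classes defined across the last three slides can be read as a tiered rule set. The following is a rough sketch of that reading, not the authors' implementation. The `End` class, its field names, and the fallback label "other" are hypothetical. In particular, the test for a genomic sequence gap "within 10X mismatch length" is reduced to a flag that the caller is assumed to have computed already.

```python
from dataclasses import dataclass


@dataclass
class End:
    """One end of the query in an alignment (hypothetical representation)."""
    unaligned: int                 # bp of mismatch/unaligned sequence at this end
    poly_a: bool = False           # unaligned part looks like a polyA tail
    near_genome_gap: bool = False  # assumed precomputed: genomic gap close enough


def ends_consistent(five, three, max_unaligned, allow_genome_gap):
    def ok(end):
        if end.unaligned <= max_unaligned:
            return True
        return allow_genome_gap and end.near_genome_gap

    ok5, ok3 = ok(five), ok(three)
    if ok5 and ok3:
        return True
    # The polyA rule may excuse one end, but never both.
    return (ok5 and three.poly_a) or (ok3 and five.poly_a)


def classify(identity, max_query_gap, five, three):
    """Return the best quality class that the alignment satisfies."""
    if identity < 95.0:
        return "other"
    if max_query_gap <= 5:
        if ends_consistent(five, three, 10, allow_genome_gap=False):
            return "very good"
        if ends_consistent(five, three, 10, allow_genome_gap=True):
            return "very good, with gaps"
    if max_query_gap <= 15 and ends_consistent(five, three, 50, allow_genome_gap=True):
        return "good"
    return "other"


print(classify(98.0, 3, End(2), End(40, poly_a=True)))
print(classify(97.0, 4, End(0), End(25, near_genome_gap=True)))
print(classify(96.0, 12, End(5), End(45)))
```

Checking the strictest class first means each alignment receives the best label it qualifies for, which matches the way the slides relax one class into the next.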

  20. Alignment statistics: human • hDoTS (08/02) vs. human genome (NCBI 30) • Total DoTS sequences: 859,545 (~230,000) • Alignments loaded: 5,544,300 / 8,975,529

  21. Alignment statistics: mouse • mDoTS (07/02) vs mouse genome (MGSCv3) • Total DoTS sequences: 579,906 (~129,000) • Alignments loaded: 3,208,572/4,663,903

  22. Merging adjacent/overlapping alignments into “genes”
      • Select BLAT alignments
        • Parameters: min. quality, min_target_gap
      • Merge overlapping alignments
      • Merge nearby alignments where an assembly in each has an EST from a common clone
        • Parameter: max distance (500 kb)
      • Merge nearby alignments
        • Parameter: max distance (75 bp)
      • Only merge alignments on the same strand
      • Identify genes with an intron of at least 15bp
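The overlap and proximity merging steps on this slide resemble a standard interval merge. The sketch below covers only those two steps, under stated assumptions: alignments are hypothetical `(chrom, strand, start, end)` tuples with half-open coordinates, `merge_alignments` is an illustrative name, and the clone-based linking step (500 kb) and the intron filter are left out.

```python
def merge_alignments(alignments, max_distance=75):
    """Merge same-strand alignments that overlap or lie within max_distance bp."""
    genes = []
    # Sorting groups alignments by chromosome and strand, then orders by start.
    for chrom, strand, start, end in sorted(alignments):
        last = genes[-1] if genes else None
        if (last is not None and last[0] == chrom and last[1] == strand
                and start - last[3] <= max_distance):
            # Overlapping or nearby: extend the current merged region.
            last[3] = max(last[3], end)
        else:
            genes.append([chrom, strand, start, end])
    return [tuple(gene) for gene in genes]


# Hypothetical alignments on one chromosome
alignments = [
    ("chr22", "+", 100, 500),
    ("chr22", "+", 450, 900),    # overlaps the first
    ("chr22", "+", 960, 1200),   # 60 bp downstream, within 75 bp
    ("chr22", "-", 980, 1500),   # opposite strand, kept separate
    ("chr22", "+", 5000, 5600),  # too far away to merge
]
genes = merge_alignments(alignments)
print(genes)
```

Sorting first lets a single pass compare each alignment with only the most recent merged region, so the merge itself is linear after the sort.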

  23. Algorithm Calibration
      • Human chr22q (~34Mb) as test case
        • Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes)
      • Focus on DiGeorge Critical Region
        • DGCR6 to ZNF74 (~1.6Mb)
        • Contains 24-33 genes based on literature (Sanger: 44 genes with 33 known)
      * Used DoTS 02/02 release vs Golden Path 12/01 release, and old BlatAlignment table (limited quality classes).

  24. Results - human

  25. Results - mouse

  26. Known problems/issues • Incorrectly oriented DoTS assemblies • Distinguishing single-exon genes from genomic contaminants, antisense and/or functional non-coding RNAs • A large number of ESTs have no alignments at all [above 10% threshold] • Currently investigating why this is so…

  27. Current and future work • Detailed assessment of results in 14Mb of mouse chr. 5 (CBIL + Bucan lab.) • Augment alignments with other sequence signals (Hatzigeorgiou lab.) • Incorporate alignments into DoTS build process from the outset

  28. Acknowledgements • BLAT Alignments/Gene Merging • Yongchang Gan (see poster!) • Database of Transcribed Sequences (DoTS) • Brian Brunk, Steve Fischer, Deborah Pinney • Mouse Chr. 5 annotation project • Joan Mazzarelli • Maja Bucan lab. • Artemis Hatzigeorgiou lab. • Chris Stoeckert (PI, CBIL)

  29. Is EST assembly still relevant? • Not every organism has a genome project • EST sequencing is still a relatively cheap way to survey a transcriptome • Though array-based approaches are also very powerful, if the sequence is known • Not every EST will necessarily align to the draft genome; may want to cluster the rest • Annotation component of DoTS is useful, regardless of the assembly method
