
From DoTS Assemblies to Genes via Genomic Alignment



Presentation Transcript


  1. From DoTS Assemblies to Genes via Genomic Alignment
     • BLAT consensus sequences vs genomic
     • Load alignments with 10% cutoff into GUS
     • Compute alignment “quality”:
       • 1 = Very good
       • 2 = Very good with gaps
       • 3 = Good
       • 4 = Not so good
     • Merge selected alignments into “genes”

  2. BLATAlignmentQuality
     • Very good (formerly “consistent”)
       • >= 95% identity (average)
       • max_query_gap <= 5
       • both ends consistent
         • no more than 10bp mismatch unless polyA
         • not polyA on both ends

  3. BLATAlignmentQuality II
     • Very good with gaps
       • same as very good, but internal and end mismatches allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length for ends)
     • Good
       • same as very good, but with max_query_gap <= 15 (allow large internal gaps if there is a sufficiently large genomic sequence gap), and inconsistent ends allowed if unaligned_bases <= 50
     • Not so good
       • everything else
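
The quality rules on slides 2 and 3 can be read as a small decision function. The sketch below is one possible reading, not the GUS implementation: the `Alignment` record, its field names, and the precomputed `genomic_gap_explains_mismatches` flag are all hypothetical, and the "sufficiently large genomic sequence gap" test is reduced to that single flag.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one BLAT alignment (field names are assumptions)."""
    pct_identity: float           # average percent identity
    max_query_gap: int            # largest internal gap in the query, in bp
    mismatch_5p: int = 0          # unaligned bases at the 5' end
    mismatch_3p: int = 0          # unaligned bases at the 3' end
    polya_5p: bool = False        # 5' unaligned stretch is polyA
    polya_3p: bool = False        # 3' unaligned stretch is polyA
    genomic_gap_explains_mismatches: bool = False  # assumed precomputed flag
    unaligned_bases: int = 0      # total unaligned bases at the ends


def ends_consistent(a: Alignment) -> bool:
    """Each end has <= 10 bp mismatch unless polyA; polyA on both ends is rejected."""
    ok_5p = a.mismatch_5p <= 10 or a.polya_5p
    ok_3p = a.mismatch_3p <= 10 or a.polya_3p
    return ok_5p and ok_3p and not (a.polya_5p and a.polya_3p)


def alignment_quality(a: Alignment) -> int:
    """Return 1 (very good), 2 (very good with gaps), 3 (good) or 4 (not so good)."""
    if a.pct_identity < 95:
        return 4
    ends_ok = ends_consistent(a)
    if a.max_query_gap <= 5 and ends_ok:
        return 1
    if a.genomic_gap_explains_mismatches:
        return 2
    if a.max_query_gap <= 15 and (ends_ok or a.unaligned_bases <= 50):
        return 3
    return 4
```

For example, `alignment_quality(Alignment(97.0, 3))` falls in class 1, while the same alignment with `max_query_gap=12` drops to class 3 under this reading.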

  4. “Gene” creation algorithm
     • Select BLAT alignments
       • Parameters: min quality, genomic region
     • Merge overlapping alignments
     • Merge nearby alignments with at least one EST sequence in each assembly from common clone
       • Parameter: max distance (default 20kb)
     • Merge nearby alignments
       • Parameter: max distance (default 20bp)
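
The three merge steps above can be sketched as a single pass over alignments sorted by genomic start. This is an illustrative simplification under stated assumptions, not the original code: `Span`, `Gene`, and `merge_into_genes` are hypothetical names, all spans are assumed to lie on one chromosome and strand with half-open coordinates, and each span is compared only with the most recently opened gene, so a clone link that skips over an unrelated intervening gene is not detected.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Genomic extent of one selected alignment (hypothetical record)."""
    start: int
    end: int
    clones: frozenset = frozenset()   # clone IDs of the ESTs in the assembly


@dataclass
class Gene:
    start: int
    end: int
    clones: set = field(default_factory=set)
    members: list = field(default_factory=list)


def merge_into_genes(spans, clone_dist=20_000, proximity_dist=None):
    """Greedy merge: by overlap, by shared clone within clone_dist, by proximity.

    Pass None for clone_dist or proximity_dist to switch that step off.
    """
    genes = []
    for s in sorted(spans, key=lambda s: (s.start, s.end)):
        if genes:
            g = genes[-1]
            gap = s.start - g.end            # negative means overlap
            overlaps = gap < 0
            by_clone = (clone_dist is not None and gap <= clone_dist
                        and bool(g.clones & set(s.clones)))
            by_proximity = proximity_dist is not None and gap <= proximity_dist
            if overlaps or by_clone or by_proximity:
                g.end = max(g.end, s.end)
                g.clones |= set(s.clones)
                g.members.append(s)
                continue
        genes.append(Gene(s.start, s.end, set(s.clones), [s]))
    return genes
```

The defaults here mirror the test-case settings reported on the later slides (merge by clone at 20kb, merge by proximity off); passing `proximity_dist=20` would enable the third step with the slide's stated default.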

  5. Human Chromosome 22
     • As test case to calibrate algorithm
     • December 2001 Golden Path release (NCBI build 28?)
     • Human DoTS February 2002 release (820965 consensus sequences)

     SQL> select count(*)
            from blatalignment b, virtualsequence v
           where b.target_na_sequence_id = v.na_sequence_id
             and v.external_db_id = 4792
             and v.chromosome = '22'
             and b.target_external_db_id = 4792
             and b.query_table_id = 56
             and b.query_taxon_id = 8;

     COUNT(*) = 129619

  6. Focus on DiGeorge Critical Region
     • DGCR6 to ZNF74 (~ 1.6Mb)
     • Contains 24-44 genes based on literature (including latest Sanger annotation)
     • Number of genes by our algorithm: 47
       • Input alignments: very good, multispan
       • Merge by overlap: on
       • Merge by clone: 20kb (default)
       • Merge by proximity: off

  7. Choosing parameters

     # DiGeorge Chromosome Region (DGCR6 - ZNF74, 1.6Mb)
     # CBIL Gene Param*    Num CBIL*   Num Sanger*   Num Overlap*   Avg %overlap*
     qf=4, am, cm=10k      27/50       26/44         28             88.7 vs 71.3
     qf=4, am, cm=20k      24/47       26/44         27             81.4 vs 75.5
     qf=4, am, cm=50k      20/39       25/44         26             63.8 vs 77.6
     qf=6, am, cm=10k      26/69       29/44         30             77.7 vs 75.9
     qf=6, am, cm=20k      25/66       28/44         31             69.8 vs 80.5
     qf=6, am, cm=50k      17/54       24/44         25             53.0 vs 87.4

     # Chr22 (Chr22q ~34M)
     # CBIL Gene Param*    Num CBIL*   Num Sanger*   Num Overlap*   Avg %overlap*
     qf=4, am, cm=20k      335/737     352/829       383            70.7 vs 72.4
     qf=6, am, cm=20k      327/1074    377/829       399            64.9 vs 81.0

     * qf: quality filter for choosing genomic alignments of the desired quality for gene boundary definition. 4: consistent and multi-span, 6: ok and multi-span
     * am: alignment overlap mediated merge of DoTS assemblies
     * cm: clone information mediated merge of DoTS assemblies within specified distance
     * i/j: j is total number of genes, and i is number of genes with overlap
     * only overlaps of at least 5% of the genomic length of both genes are counted
     * Avg %overlap: (1) same as above; (2) first number is w.r.t. CBIL gene, second Sanger.
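
The columns of the table above can be illustrated with a small comparison routine. This is a sketch under assumptions, not the script that produced the numbers: genes are taken to be half-open `(start, end)` intervals on one chromosome, `compare_gene_sets` is a hypothetical name, and the slides do not say how "Avg %overlap" was averaged, so a plain mean over counted gene pairs is assumed here.

```python
def compare_gene_sets(cbil, sanger, min_frac=0.05):
    """Count overlaps between two gene sets given as (start, end) tuples.

    An overlap is counted only if it covers at least min_frac of the
    genomic length of BOTH genes (the 5% rule in the table footnotes).
    """
    cbil_hit, sanger_hit = set(), set()
    pct_cbil, pct_sanger = [], []
    for i, (cs, ce) in enumerate(cbil):
        for j, (ss, se) in enumerate(sanger):
            ov = min(ce, se) - max(cs, ss)     # overlap length in bp
            if ov <= 0:
                continue
            frac_c = ov / (ce - cs)            # fraction of the CBIL gene
            frac_s = ov / (se - ss)            # fraction of the Sanger gene
            if frac_c >= min_frac and frac_s >= min_frac:
                cbil_hit.add(i)
                sanger_hit.add(j)
                pct_cbil.append(100.0 * frac_c)
                pct_sanger.append(100.0 * frac_s)
    n = len(pct_cbil)
    avg = (sum(pct_cbil) / n, sum(pct_sanger) / n) if n else (0.0, 0.0)
    return {
        "num_cbil": (len(cbil_hit), len(cbil)),        # i/j as in the table
        "num_sanger": (len(sanger_hit), len(sanger)),
        "num_overlap": n,                              # counted gene pairs
        "avg_pct_overlap": avg,                        # (w.r.t. CBIL, w.r.t. Sanger)
    }
```

Because one gene can overlap several genes in the other set, the number of counted pairs can exceed either per-set count, which is consistent with rows such as 24/47, 26/44, 27.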

  8. Mouse Chromosome 5
     • February 2002 Golden Path release (MIT Arachne build 3?)
     • Mouse DoTS January 7, 2002 release (537403 consensus sequences)
     • ENSEMBL/PHUSION assembly:
       • Known Ensembl Genes: 826
       • Novel Ensembl Genes: 448
       • Length: 151006098 bp

  9. Focus on Mouse Chr5 proximal
     • Telomere to Clock (1-83,965,868)
     • UCSC RefSeqs: 178
     • Number of genes by our algorithm: 449
       • Input alignments: very good, multispan
       • Merge by overlap: on
       • Merge by clone: 20kb (default)
       • Merge by proximity: off

  10. In progress
      • Revised BLATAlignment table
      • Alignment of new releases of Human DoTS (Mouse already done)
      • Alignments against Celera scaffolds
      • Redo gene merge with new alignments: all good and above
