NCBI Molecular Biology Resources

Presentation Transcript

  1. NCBI Molecular Biology Resources: A Field Guide, Part 2 (January 12, 2007)

  2. Web Access • Text: Entrez • Sequence: BLAST • Structure: VAST

  3. Genomic Biology

  4. Eukaryotic Projects

  5. Mammals with Genomes

  6. Links to Genomic Sequences • Dog WGS project • Dog chromosomes • Armadillo WGS project

  7. Canis familiaris Genome Project

  8. Whole Genome Shotgun Projects • Traditional GenBank divisions; keyword: WGS • 12-character accessions (e.g., AABC00000000) • 424 projects • 299 taxa: 203 bacteria, 91 eukaryotes, 40 fungi, 34 animals
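The accession format on this slide can be checked mechanically. A minimal sketch, assuming the layout is a 4-letter project prefix, a 2-digit assembly version, and a 6-digit contig number; the function names are hypothetical, and treating a zero contig number as the project master record is an assumption based on the AABC00000000 example:

```python
import re

# Assumed layout: 4 letters (project), 2 digits (version), 6 digits (contig).
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Return (prefix, version, contig_number), or None if the text does not fit."""
    match = WGS_ACCESSION.match(accession)
    if match is None:
        return None
    prefix, version, contig = match.groups()
    return prefix, int(version), int(contig)


def is_master_record(accession):
    """Assumption: a contig number of zero marks the project-level record."""
    parsed = parse_wgs_accession(accession)
    return parsed is not None and parsed[2] == 0


print(parse_wgs_accession("AABC00000000"))
print(is_master_record("AABC00000000"))
```

A RefSeq accession such as NC_006599 does not fit this pattern, so the parser returns None for it.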

  9. From WGS to RefSeq?

  10. Chromosome 17 Assembly: WGS contigs (NW) separated by gaps

  11. A WGS Contig (NW) [diagram labels: WGS, NW, NC]

  12. Let’s Do a Search! Search Gene for thyroid peroxidase (TPO): tpo[sym] AND canis familiaris[organism] (use [gene/protein name] if [sym] doesn’t work). Only 1 record! Why not start all your NCBI searches this way?
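The same fielded query can also be sent programmatically. A minimal sketch that only builds the request URL, assuming the NCBI E-utilities ESearch endpoint is used in place of the web form shown on the slide; the helper name gene_search_url is hypothetical, and no network request is made:

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (an assumption: the slide itself uses the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism):
    """Build an ESearch URL for a gene symbol restricted to one organism."""
    term = f"{symbol}[sym] AND {organism}[organism]"
    return ESEARCH_URL + "?" + urlencode({"db": "gene", "term": term})


url = gene_search_url("tpo", "canis familiaris")
print(url)
```

Restricting by both symbol and organism is what narrows the result to a single Gene record in the slide's example.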

  13. Gene Annotation on a Chromosome (NC): thyroid peroxidase (TPO); exons, mRNA, CDS, protein

  14. Two Views of the Assembly: NC_006599, Map Viewer, NW, WGS

  15. Links from Gene: HomoloGene

  16. Beyond Gene: If your organism is not in Gene… • WGS sequences in Entrez Nucleotide (wgs[prop]); remember the armadillo! • UniGene: gene-based clusters of cDNAs and ESTs • Trace Archive

  17. What is UniGene? A gene-oriented view of sequence entries • MegaBLAST-based automated sequence clustering • Now informed by genome hits • Nonredundant set of gene-oriented clusters • Each cluster represents a unique gene • Information on tissue types and map locations • Includes known genes and uncharacterized ESTs • Useful for gene discovery and selection of mapping reagents

  18. Organisms in UniGene, Top Ten (in rank order): Human, Mouse, Rat, Rice, Zebrafish, Wheat, Pig, Frog (X. tropicalis), Corn, Chicken

  19. Finding UniGene Clusters: by link (human TPO); by Entrez search

  20. UniGene Cluster for TPO

  21. Trace Archive: Find the link under Hot Spots on the Home Page. New query builder!

  22. Short-tailed opossum traces

  23. Viewing Simple Genomes: All are RefSeq NC records in Entrez Genome • Full chromosomal sequences are provided • Genes are annotated • The annotation can be shown graphically and linked to sequence records

  24. NC_000913 in Entrez Genome Case sensitive! mutL

  25. NC_000913 in Entrez Genome Links to Entrez Gene and Protein

  26. Viewing Complex Genomes: NCBI Map Viewer • Map Viewer Home Page: shows all supported organisms; provides links to genomic BLAST • Genome Overview Page: provides links to individual chromosomes; shows hits on a genome graphically • Chromosome Viewing Page: allows interactive views of annotation details; provides numerous maps unique to each genome

  27. Map Viewer Home Page

  28. Genome Overview Page: Species-specific help! Search the maps. Genomic BLAST.

  29. Chromosome Viewing Page [screen labels: Zooming controls; Map summary; Add or remove maps; Master map with exploded content; Genes; UniGene; Contigs; Ideogram]

  30. Map Summary: Build 35, Build 36. TPO’s contig!

  31. Map Content Map content varies greatly by species! • Sequence Maps • Core assembly • Annotation evidence • Clones & Markers • Polymorphisms • Links & Features • Genetic Maps • Cytogenetic maps • Linkage maps • Radiation hybrid maps Assembly Contig Component Transcript Gene

  32. View the Assembly near TPO

  33. Assembly of Chr. 2 NT_022221 1255072 3507187

  34. Assembly of Chromosome 2 • Links to Tools and Data • Links to Entrez Gene • Links to Entrez Nucleotide

  35. Searching with Sequences: Why do we need similarity searching? • To identify and annotate sequences with incomplete (or no) annotations (GenBank) or incorrect annotations • To assemble genomes • To explore evolutionary relationships by finding homologous molecules and developing phylogenetic trees • NOTE: Similar sequences may NOT have similar function!

  36. Basic Local Alignment Search Tool • Widely used similarity search tool • Heuristic approach based on the Smith-Waterman algorithm • Finds best local alignments • Provides statistical significance • All combinations of query and database (DNA/protein): DNA vs DNA (blastn); DNA translation vs protein (blastx); protein vs protein (blastp); protein vs DNA translation (tblastn); DNA translation vs DNA translation (tblastx) • Web, standalone, and network clients

  37. Global vs Local Alignment [diagram: Seq 1 and Seq 2 drawn twice, once as a global alignment and once as a local alignment]

  38. Global vs. Local Alignment [match lines and position rulers did not survive extraction; sequence rows only]
     Local alignment (BLASTp):
     Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
     Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
     Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
     Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
     Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
     Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
     Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
     Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332
     Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
     Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
     Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
     Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
     Human: 424 LDAAMRPSFLQLREQLEHI 443
     Worm:  472 SDPDKRPTFETLQWKLEDL 492
     Global alignment (Align program, Lipman and Pearson), first and last segments:
     human  M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
     worm   MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
     human  REQLEHI--------KTHELHL
     worm   QWKLEDLFNLDSSEYKEASINF
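The difference between the two alignment modes shows up directly in the dynamic-programming recurrence. A minimal sketch, assuming simple match/mismatch scores and a linear gap cost (not the scoring actually used by Align or BLAST); the function name and the toy sequences are hypothetical:

```python
def best_alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score by dynamic programming with linear gap costs.

    local=False: global alignment (Needleman-Wunsch style), every residue is used.
    local=True:  local alignment (Smith-Waterman style), best-scoring sub-alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode charges for leading gaps; local mode starts free at 0.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two toy sequences that share only the block ACGT.
print(best_alignment_score("CCCCACGT", "ACGTGGGG", local=True))
print(best_alignment_score("CCCCACGT", "ACGTGGGG", local=False))
```

The local score reflects only the shared block, while the global score is dragged down by the unrelated flanks, which is the effect the human/worm comparison illustrates.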

  39. Nucleotide Words
     Query: GTACTGGACATGGACCCTACAGGAA
     Word size = 11: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ...
     Make a lookup table of words
     Minimum word size = 7; blastn default = 11; megablast default = 28
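Building the word list and lookup table is a sliding-window operation. A minimal sketch, assuming a plain dictionary from word to query offsets (BLAST's real lookup structure is more compact); the function names are hypothetical:

```python
from collections import defaultdict


def words(sequence, size):
    """All overlapping words of the given size, in query order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def lookup_table(sequence, size):
    """Map each word to the list of 0-based offsets where it starts in the query."""
    table = defaultdict(list)
    for offset, word in enumerate(words(sequence, size)):
        table[word].append(offset)
    return dict(table)


query = "GTACTGGACATGGACCCTACAGGAA"
table = lookup_table(query, 11)
print(words(query, 11)[:3])
print(table["GTACTGGACAT"])
```

A 25-base query yields 15 overlapping 11-base words; scanning a database then reduces to dictionary lookups on each database word.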

  40. Protein Words
     Query: GTQITVEDLFYNIATRRKALKN
     Word size = 3: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
     Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
     Word size can be 2 or 3 (default = 3)
     Make a lookup table of words
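Neighborhood words are ranked by scoring them against the query word with the substitution matrix. A minimal sketch, assuming only the handful of BLOSUM62 entries needed for the ITV example (values as listed on the BLOSUM62 slide); the slide does not state the score threshold, so none is applied here, and the function names are hypothetical:

```python
# Subset of BLOSUM62 covering just the residues in this example.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1,
    ("T", "T"): 5, ("T", "S"): 1,
    ("V", "V"): 4,
}


def pair_score(x, y):
    """Symmetric lookup: the matrix stores each unordered pair once."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def word_score(query_word, candidate):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(pair_score(x, y) for x, y in zip(query_word, candidate))


def protein_words(sequence, size=3):
    """All overlapping words of the given size, in query order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


print(protein_words("GTQITVEDLFYNIATRRKALKN")[:4])
for candidate in ("ITV", "LTV", "MTV", "ISV", "LSV"):
    print(candidate, word_score("ITV", candidate))
```

Words scoring at or above the program's threshold are added to the lookup table alongside the exact query word.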

  41. Initial Matches and Extensions
     Nucleotide BLAST requires one exact match (exact word match):
       ATCGCCATGCTTAATTGGGCTT
       <--- CATGCTTAATT --->
     Protein BLAST requires two neighboring matches within 40 aa (neighborhood words):
       GTQITVEDLFYNI
       <---- SEIYYN ---->

  42. An alignment that BLAST can’t find
       1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
         || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
       1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
      61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
         | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
      61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
     121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
         |||| || ||||| ||  ||    |  | ||||  || |||
     121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

  43. Nucleotide vs. Protein BLAST: Comparing ADSS from H. sapiens and A. thaliana
               aac cgg gtg acg gtg gtg ctc ggt gcg cag tgg ggc gac gaa ggc
     Human:     N   R   V   T   V   V   L   G   A   Q   W   G   D   E   G
                +   +   V   +       V   L   G       Q   W   G   D   E   G
     A.th.:     S   Q   V   S   G   V   L   G   C   Q   W   G   D   E   G
               agt caa gta tct ggt gta ctc ggt tgc caa tgg gga gat gaa ggt
     BLASTp finds three matching words. BLASTn finds no match, because there are no 7 bp words. Protein searches are generally more sensitive than nucleotide searches.
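The nucleotide claim on this slide can be checked by intersecting the word sets of the two DNA fragments. A minimal sketch using the two sequences from the slide; the function name shared_words is hypothetical, and only exact words are considered:

```python
def shared_words(seq_a, seq_b, size):
    """Exact words of the given size that occur in both sequences (any position)."""
    def kmers(sequence):
        return {sequence[i:i + size] for i in range(len(sequence) - size + 1)}
    return kmers(seq_a) & kmers(seq_b)


human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
a_thaliana = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

for size in (6, 7, 11):
    print(size, sorted(shared_words(human, a_thaliana, size)))
```

With these fragments the longest shared exact word is 6 bases, so neither the minimum word size of 7 nor the blastn default of 11 can seed an alignment.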

  44. Translated BLAST (N = nucleotide, P = protein)
     Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
     Program   Query   Database
     blastx    N       P
     tblastn   P       N
     tblastx   N       N
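Translated searches compare conceptual translations of a nucleotide sequence in all six reading frames. A minimal sketch, assuming the standard genetic code and a frame-labelling scheme (+1 to +3, -1 to -3) chosen for illustration; the function names are hypothetical, and the test sequence is the human ADSS fragment from the previous slide:

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate from the given 0-based offset; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translations of the three forward and three reverse-strand frames."""
    reverse = reverse_complement(dna)
    frames = {f"+{f + 1}": translate(dna, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse, f) for f in range(3)})
    return frames


human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
for label, protein in six_frames(human).items():
    print(label, protein)
```

Frame +1 reproduces the protein line shown for the human sequence; the other five frames are what blastx or tblastx would also consider.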

  45. An Alignment BLAST Can Make. Solution: compare protein sequences; BLASTX
     BLAST 2 Sequences (blastx) output:
     Score = 290 bits (741), Expect = 7e-77
     Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
     Frame = +3

  46. Scoring Systems - Nucleotides
     Identity matrix:
            A   G   C   T
        A  +1  -3  -3  -3
        G  -3  +1  -3  -3
        C  -3  -3  +1  -3
        T  -3  -3  -3  +1
     CAGGTAGCAAGCTTGCATGTCA
     || ||||||||||||  |||||
     CACGTAGCAAGCTTG-GTGTCA
     raw score = 19 - 9 = 10
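The raw score on this slide is a sum over alignment columns. A minimal sketch, assuming the gapped column is charged the same -3 as a mismatch, which is what reproduces the slide's 19 - 9 arithmetic (BLAST's actual gap costs are set separately); the function name is hypothetical:

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores for two aligned rows of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap  # assumption: gap column scored like a mismatch
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))
```

The alignment has 19 identities, 2 mismatches, and 1 gapped column, giving 19 - 6 - 3 = 10.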

  47. Scoring Systems - Proteins
     • Position Independent Matrices
       • PAM Matrices (Percent Accepted Mutation): derived from observation, small dataset of alignments; implicit model of evolution; all calculated from PAM1; PAM250 widely used
       • BLOSUM Matrices (BLOck SUbstitution Matrices): derived from observation, large dataset of highly conserved blocks; each matrix derived separately from blocks with a defined percent identity cutoff; BLOSUM62 is the default matrix for BLAST
     • Position Specific Score Matrices (PSSMs): PSI- and RPS-BLAST

  48. BLOSUM62
     A    4
     R   -1   5
     N   -2   0   6
     D   -2  -2   1   6
     C    0  -3  -3  -3   9
     Q   -1   1   0   0  -3   5
     E   -1   0   0   2  -4   2   5
     G    0  -2   0  -1  -3  -2  -2   6
     H   -2   0   1  -1  -3   0   0  -2   8
     I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
     L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
     K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
     M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
     F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
     P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
     S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
     T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
     W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
     Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
     V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
     X    0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
     Common amino acids have low weights; rare amino acids have high weights. Negative for less likely substitutions; positive for more likely substitutions.

  49. Gapped Alignments • Gapping provides more biologically realistic alignments • Statistical behavior is not completely understood for gapped alignments • Gapped BLAST parameters must be found by simulations for each matrix • Affine gap costs: a gap of length k receives the score -(a + bk), where a = gap open penalty and b = gap extend penalty • A gap of length 1 receives the score -(a + b)
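The affine formula can be tabulated for a few gap lengths. A minimal sketch, assuming a = 11 and b = 1, the gap costs commonly paired with BLOSUM62 in BLAST (the slide gives only the formula); the function name is hypothetical:

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of one gap of the given length under affine costs: -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Assumed parameters: a = 11 (open), b = 1 (extend).
for k in (1, 2, 5, 10):
    print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

Opening a gap is expensive but lengthening it is cheap, so one long gap scores better than several short ones covering the same number of residues.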

  50. Scores: Simply add the scores for each pair of aligned residues
                  V    D    S    -    C    Y
                  V    E    T    L    C    F   Total
     BLOSUM62    +4   +2   +1  -12   +9   +3       7
     PAM30       +7   +2    0  -10  +10   +2      11
     Different matrices produce different scores!
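The two totals can be reproduced by summing column scores. A minimal sketch, assuming only the pair and gap scores printed on this slide (not the full matrices); the dictionary and function names are hypothetical:

```python
# Per-pair scores copied from the slide; only the pairs in this alignment.
PAIR_SCORES = {
    "BLOSUM62": {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3},
    "PAM30": {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2},
}
# Score of the single length-1 gap under each matrix, as shown on the slide.
GAP_SCORES = {"BLOSUM62": -12, "PAM30": -10}


def score_alignment(top, bottom, matrix):
    """Add the score of every aligned column ('-' marks a gap)."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP_SCORES[matrix]
        else:
            total += PAIR_SCORES[matrix][(x, y)]
    return total


for matrix in ("BLOSUM62", "PAM30"):
    print(matrix, score_alignment("VDS-CY", "VETLCF", matrix))
```

The same alignment scores 7 under BLOSUM62 and 11 under PAM30, so raw scores are only comparable within one scoring system.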