NCBI Molecular Biology Resources

Part 2. Nov 6, 2001.

jana

Presentation Transcript


  1. NCBI Molecular Biology Resources part 2 Nov 6, 2001

  2. WWW BLAST

  3. Web BLAST

  4. Protein Databases
       nr         Non-redundant GenBank CDS translations + PDB + SwissProt + SPupdate + PIR
                  567,860 sequences; 178,533,065 letters
       swissprot  Non-redundant SwissProt sequences
                  88,934 sequences; 32,001,993 letters
       pdb        PDB protein sequences
                  22,726 sequences; 5,068,254 letters

  5. Nucleotide Databases
       nr (nt)    GenBank + EMBL + DDBJ + PDB sequences
                  759,631 sequences; 2,714,918,430 letters
       dbest      Expressed Sequence Tags (EST Division)
                  7,309,361 sequences; 3,100,444,103 letters
       htgs       High-Throughput Genome Sequences (HTG Division)
                  84,374 sequences; 4,355,661,355 letters

  6. Protein BLAST Page
       Identifier or sequence:
         >Mutated in Colon Cancer
         IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILER
         VQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSS
         DKVYAHQMVRTDSREQKLDAFLQPLSKPLSS
       Protein database: swissprot

  7. BLAST Formatting Page

  8. BLAST Output: Graphic mouse over

  9. BLAST Output: Descriptions
       Callouts: Sorted by E value; Link to record in Entrez; Taxonomy Reports (TaxBLAST); top hit E value 2 x 10^-68; E value cut-off = 10

                                                                          Score  E
       Sequences producing significant alignments:                       (bits)  Value
       sp|P40692|MLH1_HUMAN MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH REPAIR...   255  2e-68
       sp|P38920|MLH1_YEAST MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH REPAIR...    67  2e-11
       sp|P44494|MUTL_HAEIN DNA MISMATCH REPAIR PROTEIN MUTL                  52  6e-07
       sp|P23367|MUTL_ECOLI DNA MISMATCH REPAIR PROTEIN MUTL                  45  8e-05
       sp|P14161|MUTL_SALTY DNA MISMATCH REPAIR PROTEIN MUTL                  43  2e-04
       sp|P49850|MUTL_BACSU DNA MISMATCH REPAIR PROTEIN MUTL                  38  0.006
       sp|P14160|HEXB_STRPN DNA MISMATCH REPAIR PROTEIN HEXB                  38  0.010
       sp|P70754|MUTL_AQUPY DNA MISMATCH REPAIR PROTEIN MUTL                  36  0.031
       sp|P54280|PMS1_SCHPO DNA MISMATCH REPAIR PROTEIN PMS1                  35  0.069
       sp|O67518|MUTL_AQUAE DNA MISMATCH REPAIR PROTEIN MUTL                  35  0.069
       sp|P54278|PMS2_HUMAN PMS1 PROTEIN HOMOLOG 2 (DNA MISMATCH REPAIR...    33  0.20
       sp|P54279|PMS2_MOUSE PMS1 PROTEIN HOMOLOG 2 (DNA MISMATCH REPAIR...    33  0.27
       sp|P54277|PMS1_HUMAN PMS1 PROTEIN HOMOLOG 1 (DNA MISMATCH REPAIR...    31  1.0
       sp|P02239|LGB1_LUPLU LEGHEMOGLOBIN I                                   28  6.8
       sp|P14242|PMS1_YEAST DNA MISMATCH REPAIR PROTEIN PMS1                  28  9.0
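The relationship between the Score (bits) and E Value columns on slide 9 can be sketched with the standard approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database size in letters. The snippet below is only an illustration under stated assumptions: the query length of 131 residues is counted from the sequence on slide 6, the database size is the swissprot figure from slide 4, and the function name is hypothetical. BLAST itself uses edge-corrected effective lengths and unrounded scores, so the values agree with the slide in order of magnitude only.

```python
def expected_hits(bit_score: float, query_len: int, db_letters: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths, not BLAST's effective lengths, so results are rough.
    """
    return query_len * db_letters * 2.0 ** (-bit_score)


QUERY_LEN = 131                  # assumption: residues in the slide-6 query
SWISSPROT_LETTERS = 32_001_993   # swissprot size quoted on slide 4

# Bit scores taken from the slide-9 hit list (best hit first).
for bits in (255, 67, 45, 28):
    e_value = expected_hits(bits, QUERY_LEN, SWISSPROT_LETTERS)
    print(f"{bits:>4} bits -> E ~ {e_value:.1e}")
```

Each additional bit of score halves the expected number of chance hits, which is why the list sorted by E value is also sorted by descending score.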

  10. TaxBLAST: Taxonomy Reports
       Homo sapiens (human) [mammals] taxid 9606
         sp|P40692|MLH1_HUMAN MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH ...  1459  0.0
         sp|P54278|PMS2_HUMAN PMS1 PROTEIN HOMOLOG 2 (DNA MISMATCH ...   168  3e-41
         sp|P54277|PMS1_HUMAN PMS1 PROTEIN HOMOLOG 1 (DNA MISMATCH ...   132  2e-30
       Saccharomyces cerevisiae (baker's yeast) [fungi] taxid 4932
         sp|P38920|MLH1_YEAST MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH ...   487  5e-137
         sp|P14242|PMS1_YEAST DNA MISMATCH REPAIR PROTEIN PMS1           152  2e-36
       Escherichia coli [enterobacteria] taxid 562
         sp|P23367|MUTL_ECOLI DNA MISMATCH REPAIR PROTEIN MUTL           208  3e-53
       Haemophilus influenzae [g-proteobacteria] taxid 727
         sp|P44494|MUTL_HAEIN DNA MISMATCH REPAIR PROTEIN MUTL           208  4e-53
       Salmonella typhimurium [enterobacteria] taxid 602
         sp|P14161|MUTL_SALTY DNA MISMATCH REPAIR PROTEIN MUTL           200  9e-51
       Streptococcus pneumoniae [low GC Gram+] taxid 1313
         sp|P14160|HEXB_STRPN DNA MISMATCH REPAIR PROTEIN HEXB           189  1e-47
       Bacillus subtilis [low GC Gram+] taxid 1423
         sp|P49850|MUTL_BACSU DNA MISMATCH REPAIR PROTEIN MUTL           187  7e-47
       Rickettsia prowazekii [a-proteobacteria] taxid 782
         sp|Q9ZC88|MUTL_RICPR DNA MISMATCH REPAIR PROTEIN MUTL           178  3e-44

  11. BLAST Output: Alignments

       >sp|P40692|MLH1_HUMAN MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH REPAIR PROTEIN MLH1)
        Length = 756
        Score = 255 bits (645), Expect = 2e-68
        Identities = 126/140 (90%), Positives = 126/140 (90%)

       <alignment edited for brevity>

       Query: 61  RMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDAFLQPL 120
                  RMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDAFLQPL
       Sbjct: 341 RMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPL 400

       Callout: low complexity sequence filtered (the X residues in the query)

       >sp|P23367|MUTL_ECOLI DNA MISMATCH REPAIR PROTEIN MUTL
        Length = 615
        Score = 44.5 bits (103), Expect = 8e-05
        Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)

       Query: 4   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 54
                  L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L  +QQ +E+ L
       Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
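The Identities and Gaps figures reported for the E. coli MutL alignment on slide 11 can be recomputed directly from the two aligned strings. The following is a minimal sketch, not BLAST's own code: the function name is hypothetical, and percentages are truncated to whole numbers because that reproduces the figures shown on the slide. Positives are left out because they depend on a scoring matrix.

```python
def alignment_stats(query: str, sbjct: str) -> dict:
    """Count identities and gap columns in a pairwise alignment.

    Both strings must be the same length, with '-' marking gaps.
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")
    length = len(query)
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, sbjct) if "-" in (q, s))
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        # Integer division truncates, matching the percentages on the slide.
        "identity_pct": 100 * identities // length,
        "gap_pct": 100 * gaps // length,
    }


# Aligned strings copied from the MUTL_ECOLI hit on slide 11.
QUERY = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL"
SBJCT = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"

print(alignment_stats(QUERY, SBJCT))
```

For this pair the counts come out as 25 identities and 8 gap columns over 59 aligned positions, in line with "Identities = 25/59 (42%)" and "Gaps = 8/59 (13%)".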

  12. Results from nr
                                                                      Score  E
       Sequences producing significant alignments:                   (bits)  Value
       gb|AAA85687.1| (U17857) hMLH1 gene product [Homo sapiens]        238  2e-62
       gb|AAA17374.1| (U07418) human homolog of E. coli mutL gene ...   238  2e-62
       ref|NP_000240.1| mutL homolog 1 >gi|730028|sp|P40692|MLH1_H...   238  2e-62
       gb|AAB38506.1| (U80054) mismatch repair protein [Rattus nor...   217  4e-56
       gb|AAF64514.1|AF250844_1 (AF250844) MutL homolog 1 protein ...   217  4e-56
       gb|AAF59117.1| (AE003838) Mlh1 gene product [Drosophila mel...   129  1e-29
       gb|AAC19117.1| (AF068257) mutL homolog [Drosophila melanoga...   129  1e-29
       emb|CAA10163.1| (AJ012747) MLH1 protein [Arabidopsis thalia...    84  4e-16
       emb|CAB66448.1| (AL136536) putative DNA mismatch repair pro...    73  1e-12
       ref|NP_013890.1| MutL homolog, forms a complex with Pms1p a...    72  2e-12
       gb|AAA16835.1| (U07187) Mlh1p [Saccharomyces cerevisiae]          71  4e-12
       sp|P44494|MUTL_HAEIN DNA MISMATCH REPAIR PROTEIN MUTL >gi|1...    55  2e-07
       gb|AAB09596.1| (U71053) DNA mismatch repair protein [Thermo...    49  1e-05
       pir||H72427 DNA mismatch repair protein - Thermotoga mariti...    49  1e-05

       >ref|NP_000240.1| mutL homolog 1
        sp|P40692|MLH1_HUMAN MUTL PROTEIN HOMOLOG 1 (DNA MISMATCH REPAIR PROTEIN MLH1)
        pir||S43085 DNA mismatch repair protein MLH1 - human
        gb|AAC50285.1| (U07343) hMLH1 [Homo sapiens]
        gb|AAA82079.1| (U40978) DNA mismatch repair protein homolog [Homo sapiens]
        prf||2007430A DNA mismatch repair protein [Homo sapiens]
        Length = 756
        Score = 238 bits (601), Expect = 2e-62
        Identities = 117/131 (89%), Positives = 117/131 (89%)

       Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
                  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
       Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

  13. tblastn Results Against ESTs

       gb|N32729|N32729 yx75d09.r1 Homo sapiens cDNA clone 267569 5' similar to SW:MLH1_HUMAN P40692 MUTL PROTEIN HOMOLOG 1 ;.
        Length = 537

        Score = 221 bits (557), Expect(2) = 4e-60
        Identities = 120/146 (82%), Positives = 122/146 (83%), Gaps = 3/146 (2%)
        Frame = +3

       Query: 384 VRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDISSGRARQQDEEMLELPAPAEVAAK 443
                  VRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDISSGRARQQDEEMLELPAPAEVAAK
       Sbjct: 3   VRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDISSGRARQQDEEMLELPAPAEVAAK 182

       Query: 444 NQSLEGDTTKGTSEMSEKRGPTSSNPRKRHRXXXXXXXXXXXXRKEMTAACTPRRRIINL 503
                  NQSLEGDTTKGTSEMSEKRGPTSSNPRKRHR            RKEMTAACTPRRRIINL
       Sbjct: 183 NQSLEGDTTKGTSEMSEKRGPTSSNPRKRHREDSDVEMVEDDSRKEMTAACTPRRRIINL 362

       Query: 504 TSVLSL-QEEINEQG--HEVLREMLHNHS 529
                  T    + QEEIN  G    +    LHNHS
       Sbjct: 363 T*CFGVSQEEIN*AGXMRVLPGRXLHNHS 449

        Score = 35.2 bits (79), Expect(2) = 4e-60
        Identities = 14/23 (60%), Positives = 16/23 (68%)
        Frame = +1

       Query: 533 CVNPQWALAQHQTKLYLLNTTKL 555
                  C +P WAL QH T+  L NTTKL
       Sbjct: 463 CESPSWALEQHPTQFXLFNTTKL 531
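The Frame values on slide 13 follow from the subject coordinates: tblastn translates the EST in each reading frame, and a forward-strand frame is fixed by where the first codon starts. Below is a small sketch of that arithmetic, assuming 1-based nucleotide coordinates on the plus strand only; the helper names are hypothetical and reverse-strand frames are not handled.

```python
def forward_frame(start_nt: int) -> int:
    """Reading frame (1, 2 or 3) of a plus-strand segment starting at start_nt (1-based)."""
    if start_nt < 1:
        raise ValueError("coordinates are 1-based")
    return (start_nt - 1) % 3 + 1


def codons_spanned(start_nt: int, end_nt: int) -> int:
    """Number of whole codons between start_nt and end_nt, inclusive."""
    span = end_nt - start_nt + 1
    if span <= 0 or span % 3 != 0:
        raise ValueError("segment is not a whole number of codons")
    return span // 3


# Subject (EST) coordinates read from the two alignments on slide 13.
for start, end in [(3, 182), (463, 531)]:
    print(f"nt {start}-{end}: frame +{forward_frame(start)}, "
          f"{codons_spanned(start, end)} codons")
```

The first segment starts at nucleotide 3 and is therefore in frame +3, while the second starts at nucleotide 463 and falls in frame +1. A shift of frame between two high-scoring segments from the same single-pass EST is the kind of result a sequencing error in the read would produce.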

  14. Results against PDB - Finding a model template
       Sequences producing significant alignments:                     (bits)  Value
       pdb|1B62|A Chain A, Mutl Complexed With Adp                         45  1e-05
       pdb|1BKN|A Chain A, Crystal Structure Of An N-Terminal 40kd..       45  1e-05
       pdb|1B63|A Chain A, Mutl Complexed With Adpnp                       43  4e-05
       pdb|2GDM| Leghemoglobin (Oxy) >gi|999936|pdb|1GDJ| Leg..            27  2.0

  15. Cn3D BLAST Alignment Alignment by BLAST 2 Sequences

  16. PSI-BLAST: Confirming relationships of purine nucleotide metabolism proteins

  17. PSI BLAST query
       >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
       MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
       VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
       EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
       RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
       VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

  18. PSI RESULTS: Initial BLAST Run

  19. First PSSM Search: Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

  20. Third PSSM Search: Convergence. Just below threshold, another nucleotide metabolism enzyme

  21. PHI BLAST query and pattern
       >gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
       MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE
       LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV
       IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI
       LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI
       ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
       Pattern: [GA]xxxxGK[ST]
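PHI-BLAST seeds its search on occurrences of a user-supplied pattern in the query. To see where the slide-21 pattern [GA]xxxxGK[ST] falls in CED-4, the pattern can be turned into a regular expression and run against the sequence. This is only an illustration using Python's `re` module, not PHI-BLAST's own matcher; the converter handles just the compact notation used on the slide (bracketed residue sets and lowercase `x` for any residue), and the fragment is a short excerpt of the third sequence line.

```python
import re


def compact_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Convert the slide's compact notation to a regex: 'x' means any residue."""
    return re.compile(pattern.replace("x", "."))


def find_pattern(pattern: str, sequence: str) -> list:
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = compact_pattern_to_regex(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


PATTERN = "[GA]xxxxGK[ST]"
# Excerpt of the CED4_CAEEL sequence from slide 21 (not the full protein).
FRAGMENT = "FLFLHGRAGSGKSVIASQALSK"

for start, text in find_pattern(PATTERN, FRAGMENT):
    print(f"match {text} at fragment position {start + 1}")
```

In this excerpt the pattern matches once, at GRAGSGKS. PHI-BLAST then reports only database hits that contain the same pattern and also align significantly around it.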

  22. Conserved Domain Search
       Query: Drosophila Corkscrew CDS from genome
       >gi|7290263|gb|AAF45724.1| CG3954 gene product [alt 2] [Drosop
       MSSRRWFHPTISGIEAEKLLQEQGFDGSFLARLSSSNPGAFTLSVRRGNEVTHIKIQNNGDF
       FDLYGGEKFATLPELVQYYMENGELKEKNGQAIELKQPLICAEPTTERWFHGNLSGKEAEKL
       ILERGKNGSFLVRESQSKPGDFVLSVRTDDKVTHVMIRWQDKKYDVGGGESFGTLSELIDHY
       KRNPMVETCGTVVHLRQPFNATRITAAGINARVEQLVKGGFWEEFESLQQDSRDTFSRNEGY
       KQENRLKNRYRNILPYDHTRVKLLDVEHSVAGAEYINANYIRLPTDGDLYNMSSSSESLNSS
       VPSCPACTAAQTQRNCSNCQLQNKTCVQCAVKSAILPYSNCATCSRKSDSLSKHKRSESSAS
       SSPSSGSGSGPGSSGTSGVSSVNGPGTPTNLTSGTAGCLVGLLKRHSNDSSGAVSISMAERE

  23. CDD Results
                                                                                Score  E
       Sequences producing significant alignments:                             (bits)  value
       gnl|Pfam|pfam00102 Y_phosphatase, Protein-tyrosine phosphatase             236  3e-63
       gnl|Pfam|pfam00102 Y_phosphatase, Protein-tyrosine phosphatase            55.4  1e-08
       gnl|Smart|DSPc Dual specificity phosphatase, catalytic domain              236  3e-63
       gnl|Smart|DSPc Dual specificity phosphatase, catalytic domain             70.2  4e-13
       gnl|Smart|PTPc Protein tyrosine phosphatase, catalytic domain              102  9e-23
       gnl|Smart|PTPc_DSPc Protein tyrosine phosphatase, catalytic domain, un...  102  9e-23
       gnl|Smart|SH2 Src homology 2 domains; Src homology 2 domains bi...        88.2  1e-18
       gnl|Smart|SH2 Src homology 2 domains; Src homology 2 domains bind         76.9  4e-15
       gnl|Pfam|pfam00017 SH2, Src homology domain 2                             78.0  2e-15
       gnl|Pfam|pfam00017 SH2, Src homology domain 2                             71.8  1e-13

  24. Specialized BLAST Pages
       • Microbial Genomes
       • Human Genome
       • Trace Archive

  25. Microbial Genomes BLAST
       >APE0122
       MVGVFGRLSRHVWVKRWYSILWAPWRMKYIKQAGSREGCVFCEAPSMGDDAKAYNSGHIMVTPYRH
       VAELEDLTMDEIVEMAKLVRASVKALKRVYAPHGFNIGVNVPRWRGDSNFMLTVGGTKVIPESLED
       TFKKLKPAVEEEARKEGV

  26. Hits to Unfinished Genome

  27. Human Genome BLAST
       >gi|11877232|emb|AJ289857.1|HSA289857 Homo sapiens mRNA for adracalin (ADRACALA
       AATCTAGCCCGGGAACCGAGTTGCGGGAGTGCGGTCTGTGCCGTTCCGGCCAGGAGTTTGCCGACTGCAG
       ACGTCCTGCGAACCGGCAAGATGTGCTCTCTGGGGTTGTTCCCTCCTCCACCGCCTCGGGGTCAAGTCAC
       CCTATATGAGCACAATAACGAGCTGGTGACGGGCAGTAGCTATGAGAGCCCGCCCCCCGACTTCCGGGGC
       CAGTGGATCAATCTTCCTGTCCTACAACTGACAAAGGATCCCCTAAAGACCCCTGGAAGGCTGGACCATG
       GCACAAGAACTGCCTTCATCCATCACCGGGAGCAAGTGTGGAAGAGATGCATCAACATTTGGCGTGATGT
       GGGCCTTTTTGGGGTGCTAAATGAAATTGCAAACTCAGAAGAAGAGGTGTTTGAGTGGGTGAAGACGGCA
       TCCGGCTGGGCCCTGGCACTCTGTCGATGGGCCTCTTCCCTCCATGGGTCCCTGTTCCCCCATCTGTCTC

  28. Genomic Context of BLAST Hits

  29. Shotgun Reads

  30. Trace Archive BLAST - MEGABLAST
       >gi|563511|emb|X81593.1|MMFKHN M.musculus mRNA for winged
       CAGACGGTCGGAGCTCCTGGCCCCCCAGACCCAGGCCCCCACGCCGACCTGCTTCAC
       TTCTTCGAGGCCAGGACTGGGTGATGGTGTCGCTACTCCCTCCGCAGTCTGACGTCA
       CACCCGACTGGAGGGCGAACCCCAAGGGGACCTCATGCAGGCTCCGGGCCTCCCAGA
       CAGAACAAGCATGCTAACTTCAGCTGCTCGTCGTTTGTGCCTGACGGCCCTCCAGAG

  31. Whole Genome Shotgun

  32. NCBI Genomic Resources
       • Microbial Genomes
       • The Draft Human Genome

  33. Microbial Genomes in GenBank (Sept. 26, 2001)
       Viruses     >650
       Archaea     11
       Bacteria    50
       Eukaryotae  1

  34. Bacterial Genomes

  35. M. tuberculosis Complete Genome

  36. Coding Regions

  37. Genome Annotations

  38. M. tuberculosis vs. E. coli COGs

  39. Complex Genomes in GenBank
       • Caenorhabditis elegans
       • Drosophila melanogaster
       • Homo sapiens
       • Arabidopsis thaliana

  40. The Human Genome: The NCBI annotation effort

  41. The Draft Human Genome

  42. Human Genome Resources
       • LocusLink: a central resource
       • Human Genome BLAST
       • Human Maps
       • UniGene: Expressed Sequences

  43. What Data is Available?
       • NCBI assembled annotated genomic contigs
       • Genome project data
       • Other primary data
       • Reference sequences - mRNA, proteins, transcripts
       • Genome Scan gene models
       • Mapped variation data
       • Integrated maps - RH, genetic, cytogenetic, and sequence
       • Clustered and mapped expressed sequences
       • Links to outside data sources

  44. How to access it?
       Type of query         Resource
       Sequence similarity   Human genome BLAST
       Gene name             LocusLink
       Map location          Map Viewer
       Database ID           UniGene

  45. LocusLink
       • A single query interface to …
       • Sequences
         - RefSeqs
         - GenBank
       • Maps – the Human Genome Map
         - RH
         - Cytogenetic
         - Assembled Genomic Sequence
       • Genome annotations
       • Entrez links
       Screenshot labels: inositolpolyphosphate 1 phosphatase; UniGene; HomoloGene; PubMed; Map Viewer; OMIM; RefSeq; GenBank Accessions; dbSNP; Full report; Available for Hs (human), Mm (mouse), Rn (rat), Dr (zebrafish), Dm (fruit fly)

  46. What is UniGene? A gene-oriented view of sequence entries
       • MegaBlast based automated sequence clustering
       • Nonredundant set of gene oriented clusters
       • Each cluster a unique gene
       • Information on tissue types and map locations
       • Includes well-characterized genes and novel ESTs
       • Useful for gene discovery and selection of mapping reagents
       http://www.ncbi.nlm.nih.gov/UniGene/

  47. EST hits INPP1 mRNA INPP1 mRNA

  48. Hs UniGene Statistics (UniGene Build 140, Sept 17th, 2001)
              67,109  mRNAs + gene CDSs
           1,145,547  EST, 3' reads
           1,088,566  EST, 5' reads
         +   631,105  EST, other/unknown
           ----------
           2,932,237  total sequences in clusters

       Final Number of Clusters (sets)
       ===============================
              96,479  sets total
              20,200  sets contain at least one known gene
              95,289  sets contain at least one EST
              19,010  sets contain both genes and ESTs

       Callout: 80% uncharacterized transcripts
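The cluster counts on slide 48 are related by inclusion-exclusion: clusters containing a known gene plus clusters containing an EST, minus clusters containing both, give the total number of sets. The sketch below checks that arithmetic and derives the "uncharacterized" share. It assumes every cluster contains at least one gene or one EST, and it reads "uncharacterized" as "no known gene in the cluster"; that reading is an interpretation of the slide, not a definition taken from it.

```python
# Cluster counts from slide 48 (UniGene Build 140).
WITH_KNOWN_GENE = 20_200
WITH_EST = 95_289
WITH_BOTH = 19_010


def total_clusters(with_gene: int, with_est: int, with_both: int) -> int:
    """Inclusion-exclusion: |gene or EST| = |gene| + |EST| - |gene and EST|."""
    return with_gene + with_est - with_both


def fraction_without_gene(with_gene: int, total: int) -> float:
    """Share of clusters that contain no known gene (EST-only clusters)."""
    return (total - with_gene) / total


total = total_clusters(WITH_KNOWN_GENE, WITH_EST, WITH_BOTH)
share = fraction_without_gene(WITH_KNOWN_GENE, total)
print(f"total clusters: {total:,}")
print(f"clusters without a known gene: {share:.1%}")
```

The total reproduces the 96,479 sets quoted on the slide, and the EST-only share comes to roughly 79%, which the slide rounds to 80%.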

  49. UniGene Collections (Sept 26, 2001)
                                                Sequences  Clusters
       Animals
         Homo sapiens          human            2,932,237    96,479
         Mus musculus          mouse            1,825,043    89,242
         Rattus norvegicus     rat                298,003    59,265
         Danio rerio           zebrafish           56,938    10,642
         Bos taurus            cow                 87,310     7,367
         Xenopus laevis        frog                58,133    11,984
       Plants
         Arabidopsis thaliana  thale cress        131,068    25,997
         Oryza sativa          rice                47,841    12,836
         Triticum aestivum     wheat               31,826     2,744
         Hordeum vulgare       barley              34,812     4,041
         Zea mays              maize (corn)        69,231     7,161

  50. Cluster Hs.32309: Links and Homology