Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management
Molecular Biology Databases • Excellent means of storing a vast amount of information in a central, sharable location • Biological databases are designed specifically for storing, searching, and retrieving biological data • Keyword Searches • Cross-Referencing • 3D capabilities
Database Categories • Categories • Nucleotide Sequence Databases • Gene Databases • Genome Databases • Protein Sequence Databases • Structure Databases • Metabolic and Signaling Pathways • Human Genes and Diseases • Microarray Data and other Expression Databases • … • Each contains specific information • Each is interrelated
National Center for Biotechnology Information (NCBI) • Created in 1988 as part of the National Library of Medicine • Establish public databases • Perform research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information • Databases • Sequence, such as GenBank, RefSeq, dbSNP • Literature, such as PubMed, OMIM • Tools • Entrez, BLAST, Cn3D, etc.
Let’s Check out NCBI • http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar
GenBank http://www.ncbi.nlm.nih.gov/Genbank/
GenBank • Nucleotide-only sequence database • GenBank Data • Direct submissions of individual records (BankIt, Sequin) • Batch submissions via email (EST, GSS, STS) • ftp accounts established for sequencing centers • Data shared nightly among three collaborating databases: • GenBank • DNA Data Bank of Japan (DDBJ) • European Molecular Biology Laboratory Database (EMBL)
GenBank Release 175.0 • ftp://ftp.ncbi.nih.gov/genbank/ • Full release every two months • Incremental and cumulative updates daily • Release 175.0 (12/15/2009) • 112,910,950 Sequences • 110,118,557,163 Bases
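To put the release figures in perspective, the short sketch below divides the base count by the sequence count to estimate the average record length (roughly 975 bases). The two totals are the ones quoted on the slide; the function name is made up for illustration.

```python
# Totals quoted on the slide for GenBank Release 175.0.
sequences = 112_910_950
bases = 110_118_557_163


def mean_sequence_length(total_bases: int, total_sequences: int) -> float:
    """Average number of bases per record (hypothetical helper)."""
    return total_bases / total_sequences


average = mean_sequence_length(bases, sequences)
print(f"Average sequence length: {average:.0f} bases")
```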
GenBank Record (Sequence)
ORIGIN
1 aaaaagagaaactgttgggagaggaatcgtatctccatatttcttctttcagccccaatc
61 caagggttgtagctggaactttccatcagttcttcctttctttttcctctctaagccttt
121 gccttgctctgtcacagtgaagtcagccagagcagggctgttaaactctgtgaaatttgt
181 cataagggtgtcaggtatttcttactggcttccaaagaaacatagataaagaaatctttc
241 ctgtggcttcccttggcaggctgcattcagaaggtctctcagttgaagaaagagcttgga
301 ggacaacagcacaacaggagagtaaaagatgccccagggctgaggcctccgctcaggcag
361 ccgcatctggggtcaatcatactcaccttgcccgggccatgctccagcaaaatcaagctg
421 ttttcttttgaaagttcaaactcatcaagattatgctgctcactcttatcattctgttgc
481 cagtagtttcaaaatttagttttgttagtctctcagcaccgcagcactggagctgtcctg
541 aaggtactctcgcaggaaatgggaattctacttgtgtgggtcctgcacccttcttaattt
601 tctcccatggaaatagtatctttaggattgacacagaaggaaccaattatgagcaattgg
661 tggtggatgctggtgtctcagtgatcatggattttcattataatgagaaaagaatctatt
721 gggtggatttagaaagacaacttttgcaaagagtttttctgaatgggtcaaggcaagaga
781 gagtatgtaatatagagaaaaatgtttctggaatggcaataaattggataaatgaagaag
841 ttatttggtcaaatcaacaggaaggaatcattacagtaacagatatgaaaggaaataatt
901 cccacattcttttaagtgctttaaaatatcctgcaaatgtagcagttgatccagtagaaa
961 ggtttatattttggtcttcagaggtggctggaagcctttatagagcagatctcgatggtg
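In the ORIGIN block, each run of bases is preceded by the position of its first base (1, 61, 121, ...). A minimal sketch of how such a block could be turned back into one continuous sequence is shown below. The function names and the sample string are illustrative assumptions, not part of any NCBI tool, and the parser assumes the text contains only position numbers and the letters a, c, g, t, n after the ORIGIN keyword.

```python
import re


def parse_origin(text: str) -> str:
    """Join the base runs that follow ORIGIN, dropping position numbers."""
    # Everything after the ORIGIN keyword; if the keyword is absent,
    # the whole text is scanned (assumption: it then holds only bases).
    body = text.split("ORIGIN", 1)[-1]
    # Keep runs of nucleotide letters; digits and whitespace are skipped.
    return "".join(re.findall(r"[acgtn]+", body.lower()))


def gc_content(sequence: str) -> float:
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sum(base in "gc" for base in sequence) / len(sequence)


# Hypothetical sample: the first 30 bases of the record shown above.
sample = "ORIGIN 1 aaaaagagaa actgttggga 21 gaggaatcgt"
sequence = parse_origin(sample)
print(len(sequence), f"{gc_content(sequence):.2f}")
```

Because the parser ignores whitespace and numbers, it should work whether the record is laid out one row per line or flattened onto a single line.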
RefSeq • Database of reference sequences • http://www.ncbi.nlm.nih.gov/RefSeq/ • Curated • Many experimentally validated • Some partially validated via ESTs • Some computationally predicted • Non-redundant; one record for each gene, or each splice variant, from each organism represented
Accession Numbers • DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data • RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. • RefSeq identifiers include the following formats: • Complete chromosome NC_###### • Genomic contig NT_###### • mRNA (DNA format) NM_###### • Protein NP_######
Accession Numbers: More Examples • AC_123456 (Genomic): Alternate complete genomic • AP_123456 (Protein): Protein products; alternate • NG_123456 (Genomic): Incomplete genomic regions • NR_123456 (RNA): Non-coding transcripts • NW_123456 (Genomic): Genomic assemblies • NZ_ABCD12345678 (Genomic): Whole genome shotgun data • XM_123456 (mRNA): Transcript products • XP_123456 (Protein): Protein products • XR_123456 (RNA): Transcript products • YP_123456 (Protein): Protein products • ZP_12345678 (Protein): Protein products
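Since the two-letter prefix of a RefSeq accession indicates what kind of record it is, a small lookup can classify an accession by its prefix. The sketch below uses only the prefixes listed on the two accession-number slides. The function name, the regular expression, and the optional ".version" suffix it tolerates are assumptions made for illustration.

```python
import re
from typing import Optional

# Prefix-to-record-type table copied from the slides.
REFSEQ_PREFIXES = {
    "NC": "Complete chromosome",
    "NT": "Genomic contig",
    "NM": "mRNA",
    "NP": "Protein",
    "AC": "Genomic",
    "AP": "Protein",
    "NG": "Genomic",
    "NR": "RNA",
    "NW": "Genomic",
    "NZ": "Genomic",
    "XM": "mRNA",
    "XP": "Protein",
    "XR": "RNA",
    "YP": "Protein",
    "ZP": "Protein",
}

# Two-letter prefix, underscore, optional letters (as in NZ_ABCD12345678),
# digits, and an optional ".version" suffix (assumed, not shown on the slide).
ACCESSION_RE = re.compile(r"^([A-Z]{2})_[A-Z]{0,4}\d+(?:\.\d+)?$")


def classify_accession(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, or None."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for example in ("NM_123456", "NZ_ABCD12345678", "not-an-accession"):
    print(example, "->", classify_accession(example))
```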
EST • Expressed Sequence Tags database (dbEST) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms • http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=
EST • mRNA: Represents genomic regions actively transcribed in the cell • cDNA (complementary DNA) • Copy of mRNA made using the mRNA as a template • Sequence is complementary to mRNA • EST: Expressed Sequence Tag (a short sub-sequence of a cDNA sequence) • Partial cDNA sequence • Can be 5’ or 3’ • Typical size: 200-500 bp • Represents mRNA actively transcribed in the cell • Used to identify • Genes; alternative splicing; etc.
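To make "complementary to mRNA" concrete, the sketch below derives the first-strand cDNA for a short mRNA. It assumes both sequences are written 5' to 3', so the cDNA is the reverse complement of the mRNA with T in place of U. The function name and the example sequence are made up for illustration.

```python
# RNA base -> complementary DNA base.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the cDNA (5'->3') complementary to an mRNA given 5'->3'."""
    # Walk the mRNA from its 3' end so the result reads 5'->3'.
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```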
Access to dbEST Data • EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez • The nucleotide sequences may be searched using the BLAST server • The TBLASTN program, which takes an amino acid query sequence and compares it with six-frame translations of dbEST DNA sequences, is particularly useful • EST sequences are also available as a flat file in FASTA format by anonymous ftp in the /repository/dbEST directory at ftp.ncbi.nih.gov
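The "six-frame translation" idea can be sketched in a few lines: a DNA sequence is translated in three reading frames on the forward strand and three on the reverse complement, so a protein query can be compared against every possible reading of the DNA. The code below is only an illustration of that idea using the standard genetic code. It is not how TBLASTN itself is implemented, and all function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string containing A, C, G, T."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
for label, protein in frames.items():
    print(label, protein)
```

A protein query would then be compared against each of the six strings. A real search tool uses scored alignment rather than exact matching.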
UniGene • www.ncbi.nlm.nih.gov/UniGene • Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene) • In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included • UniGene may be of use as a resource for gene discovery • UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis
Numbers of UniGene Entries • Bos taurus (cow) 42,843 • Canis lupus familiaris (dog) 27,853 • Equus caballus (horse) 8,348 • Homo sapiens (human) 123,396 • Mus musculus (mouse) 78,289 • Ovis aries (sheep) 18,814 • Rattus norvegicus (Norway rat) 63,434 • Sus scrofa (pig) 51,576 • Danio rerio (zebrafish) 51,481
UniGene • UniGene is a useful tool to look up information about expressed genes • UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression
Now… Let’s give these databases a closer look with a lab