
Introducing Bioinformatics Databases

Introducing Bioinformatics Databases. Tan Tin Wee/Victor Tong/Susan Moore Dept of Biochemistry NUS Mohammad Asif Khan Perdana University Graduate School of Medicine. Sources of Biological Knowledge. Past: textbooks, monographs, books, journals.


Presentation Transcript


  1. Introducing Bioinformatics Databases Tan Tin Wee/Victor Tong/Susan Moore Dept of Biochemistry NUS Mohammad Asif Khan Perdana University Graduate School of Medicine

  2. Sources of Biological Knowledge Past: textbooks, monographs, books, journals. Today: online accessible databases, keyword searchable, e.g. via Google. Every class of biological molecule has at least a few databases associated with it. Every area of biology, biotechnology, medicine and life science research has some kind of database associated with it. You must be aware of and familiar with the MAJOR databases, and be able to discover NEW databases and master them as and when they appear.

  3. Biological knowledge today! • STORED digitally: Almost all critical biological data, information and knowledge is currently stored in computers • ACCESSIBLE globally: All current critical biological knowledge is publicly accessible via the Internet • SHARED extensively: Most research data is exchanged via the Internet today; if not publicly and freely, then among international collaborators • PUBLISHED online: Most scientific journals are now published with a digital version accessible online, either free (open access) or for a subscription fee paid by the individual or the institution. 10 years ago, this was not so. There has been tremendous change.

  4. UNSTOPPABLE DATA GROWTH [Figure: Growth of GenBank DNA sequence (2005 – 2009): >100,000,000 sequences; exponential increase; next-gen sequencing technologies. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html] [Figure: Growth of PDB protein and macromolecular structures, driven by various structural genomics initiatives such as the Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI) and JCSG (http://www.jcsg.org/). Source: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100]

  5. RELENTLESS INCREASE IN DATABASES Michael Y. Galperin and Guy R. Cochrane (2009) Nucl. Acids Res. 37:D1-D4. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009 (doi:10.1093/nar/gkn942) http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1 • A lot of data • A lot of databases • What do they mean? Most of the data begin to make sense when they are integrated, but many plans to integrate these databases have failed.

  6. Biological Databases – examples and general considerations • Biological databases – what they are; purpose • Some general considerations • Sample databases

  7. Biological databases Many (but not all) definitions of “database” include: • Storage of data on a computer in an organized way • Provision for searching and data extraction. • By these definitions web pages, books, journal articles, text files, and spreadsheet files cannot be considered as databases Purposes of biological databases: • To disseminate biological data and information • To provide biological data in computer-readable form • To allow analysis of biological data

  8. But first…a few terms • Database Record: “A collection of related data, arranged in fields and treated as a unit. The data for each [item] in a database make up a record.” (www.d.umn.edu/lib/reference/skills/vocab.html) • Field: “the part of a record reserved for a particular type of data…” (www.amberton.edu/VL_terms.htm)

  9. Example from the “Grocery Shopping Database”: a record is made up of fields (Date, Item, Store, Price), each holding a field value. A different view of the first “record”: • Date: 18/08/2006 • Item: White bread • Store: Dover Provision • Price: $1.29
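
The record and field vocabulary can also be sketched in code. This is only an illustration: the `GroceryRecord` type, the `search` helper and the second record are hypothetical and do not come from the slides. It shows one record per object, one field per attribute, and a simple keyword search across field values.

```python
from dataclasses import dataclass, asdict


@dataclass
class GroceryRecord:
    # Each attribute is a field; one instance is one record.
    date: str
    item: str
    store: str
    price: float


def search(records, keyword):
    """Return the records in which any field value contains the keyword."""
    keyword = keyword.lower()
    return [
        r for r in records
        if any(keyword in str(value).lower() for value in asdict(r).values())
    ]


# The first record is the one shown on the slide; the second is invented.
records = [
    GroceryRecord(date="18/08/2006", item="White bread",
                  store="Dover Provision", price=1.29),
    GroceryRecord(date="19/08/2006", item="Milk",
                  store="Hypothetical Mart", price=2.50),
]

bread_hits = search(records, "bread")
```

Storing data in an organized way and providing for search are the two properties that slide 7 uses to distinguish a database from a plain text file.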

  10. Some features of Biological Databases • Data/information… • Stored in records according to some predetermined structure/format • +/- evidence • +/- unique identifiers • +/- additional annotation • +/- DB Xrefs (cross references)

  11. Authoritative and Reliable • Most biological databases are from authoritative and reliable sources, however… • Not all Websites and Databases are reliable. • Not all data and information stored in authoritative and reliable websites or databases are accurate or correct, or up-to-date • Nevertheless, most of them are useful and instructive • Many of them contain valuable information and knowledge Identification of authority and Evaluation of reliability – very important Every serious scientist must be critical of the information they read, whether online or not.

  12. Discoverability • Most publications, books and courses include online references – Web address (URL), e.g. http://www.pdb.org/ for protein structural data • Most useful resources are also listed and taught in courses, or spread by word of mouth. • Most databases are searchable by appropriate keywords, and their authority can be judged from their web addresses, the institutions behind the databases or the authors’ reputation • Most databases provide full details of their content and how to use them.

  13. NAR Database Categories List From: http://nar.oxfordjournals.org

  14. TABLE OF NAR DATABASES ISSUE http://en.wikipedia.org/wiki/Biological_database http://www.oxfordjournals.org/nar/database/c/ • Nucleotide Sequence Databases • RNA sequence databases • Protein sequence databases • Structure Databases • Genomics Databases (non-vertebrate) • Metabolic and Signaling Pathways • Human and other Vertebrate Genomes • Human Genes and Diseases • Microarray Data and other Gene Expression Databases • Proteomics Resources • Other Molecular Biology Databases • Organelle databases • Plant databases • Immunological databases • Bibliographic databases

  15. Database of Biological Databases • Alphabetical order: http://www.oxfordjournals.org/nar/database/a/ • Category: http://www3.oup.co.uk/nar/database/cap/

  16. Information flow in Biology • Human Genome Project – DNA sequence • Microarray – RNA expression and levels • Proteomics – protein expression and concentration in cells • Structural proteomics or genomics – protein structure (and function) • Functional genomics – protein function

  17. Examples of Major Bioinformatics Resources • Browsing databases • NCBI Entrez http://www.ncbi.nlm.nih.gov/sites/gquery • EBI Ensembl http://www.ensembl.org/index.html • Retrieving sequences • SRS - Sequence Retrieval System http://srs.ebi.ac.uk/ • ExPASy – Expert Protein Analysis System – Proteomics server • http://au.expasy.org/

  18. Bibliographic Information • PubMed and Medline • Recent National Institutes of Health USA policy • Google Scholar • Web of Science and Science Citation Index • Online journals • SuperTier Top Journals – Nature, Science, Cell, PNAS, etc. • Open access journals • Public Library of Science PLoS • Biomed Central

  19. Literature – PubMed • Citations and abstracts for articles from approx. 5000 (not all!) biomedical journals • Text searching to identify citations of interest • Links to full-text articles (free or otherwise) • More than 16,000,000 records* (* As of Dec 29, 2005. PubMed News. http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=PubMedNews)

  20. Literature – PubMed Search: p53 cancer. Each result shows: • Authors • Article Title • Bibliographic Information (Journal name, date, volume, issue, page numbers) • PMID: Unique ID for this record
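
A text search like the one on this slide can also be expressed as a URL. The sketch below assumes NCBI's E-utilities ESearch endpoint, which the slides do not mention; the function name `pubmed_search_url` is hypothetical. It only builds the query URL and does not contact the server, since real requests are subject to NCBI's usage policies.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: NCBI E-utilities ESearch (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL asking for up to `retmax` PMIDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = pubmed_search_url("p53 cancer")

# Decode the query string again to confirm what would be sent.
query = parse_qs(urlsplit(url).query)
```

The response to such a request lists PMIDs, the unique record identifiers described above.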

  21. AbstractPlus view - PubMed

  22. STORING YOUR OWN BIBLIOGRAPHIC INFORMATION • Online: Wizfolio http://www.wizfolio.com • Software: ENDNOTE or REFMAN

  23. Genetic and Genomic Databases • From sequencing of specific genes or genomic sequencing of entire genomes • Data are prepared, annotated and stored in databases • GenBank, NCBI • DDBJ, NIG • EBI/EMBL • Making Deposits http://www.ncbi.nlm.nih.gov/Genbank/update.html • BankIt • Sequin

  24. Nucleic Acid Databases Include: • GenBank, DDBJ, EMBL: archives of primary data; they exchange data amongst themselves • RefSeq: summary/integration of primary data

  25. GenBank • Data from: • Individual laboratories • Sequencing centres • Any organism • Individual records may be incomplete or inaccurate • Eg: sequencing errors • Eg: incomplete sequences NCBI Handbook

  26. Searching Entrez Nucleotide for human p53

  27. p53 GenBank record: GI 48094186

  28. p53 GenBank record: HEADER • Identifiers, Version, Definition Line • Organismal Source • Data sources

  29. p53 GenBank record: FEATURES • Cross-References to Other DBs • Protein product

  30. p53 GenBank record: SEQUENCE
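
The three parts of a GenBank record shown on slides 28 to 30 correspond to sections of the GenBank flat-file format: header lines, a block starting with `FEATURES`, and the sequence after `ORIGIN`, ending with `//`. The following is a minimal sketch of splitting one record into those parts. The function name `split_genbank_record` and the toy record are hypothetical (this is not the p53 record), and a maintained parser such as Biopython's would be the usual choice for real work.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into header, features and sequence."""
    header, features, sequence = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = features           # feature table starts here
        elif line.startswith("ORIGIN"):
            current = sequence           # sequence lines follow
            continue                     # the ORIGIN line itself has no bases
        current.append(line)
    # Sequence lines look like "        1 atgcatgcat gc"; keep letters only.
    bases = "".join(ch for ln in sequence for ch in ln if ch.isalpha())
    return {"header": header, "features": features, "sequence": bases}


# Invented toy record, for illustration only.
TOY_RECORD = """\
LOCUS       TOY0001                 12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atgcatgcat gc
//
"""

parsed = split_genbank_record(TOY_RECORD)
```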

  31. The linked protein record: GenBank → GenPept

  32. Links from p53 GenPept record Available links vary from one record to another

  33. With so many records how do we know which one to work with? They may: • Come from different source databases • eg DDBJ, GenBank, EMBL (nucleotide) • Have the same or different sequence information • Single changes in nucleotides/amino acids • Incomplete sequence • Have variable extra annotation • Eg: Signal peptide; domains; DB XRefs etc

  34. The RefSeq Project • Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.” (http://www.ncbi.nlm.nih.gov/RefSeq/index.html) • Info from: • Predictions from genomic sequence • Analysis of GenBank Records • Collaborating databases

  35. RefSeq:

  36. Example: p53 RefSeq mRNA record

  37. Example: p53 RefSeq mRNA record

  38. p53 RefSeq mRNA features

  39. p53 RefSeq mRNA features continued

  40. p53 RefSeq mRNA features continued

  41. p53 RefSeq mRNA features include… • Links: • GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation • OMIM – Online Mendelian Inheritance in Man – disease information • CDD – conserved protein domain • HGNC – official nomenclature for human genes • HPRD – Human Protein Reference Database • CDS (CoDing Sequence) • Gene Ontology terms applied to the protein • Nucleotide sequence range of translated product • Translation – the protein sequence • Link to RefSeq Protein record • Other features – sequence ranges refer to the nucleotide • Nuclear Localization Signal • Polyadenylation site etc

  42. p53 RefSeq Protein

  43. p53 RefSeq Protein continued

  44. p53 RefSeq Protein continued Sequence ranges in features refer to the amino acid sequence

  45. Interpreting RefSeq identifiers Genomic DNA: • NC_123456 - complete genome, complete chromosome, complete plasmid • NG_123456 - genomic region • NT_123456 - genomic contig mRNA: • NM_123456 Protein: • NP_123456 Gene and protein models from genome annotation projects: • XM_123456 - mRNA • XR_123456 - RNA (non-coding transcripts) • XP_123456 - protein
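
The prefix rules on this slide lend themselves to a small lookup table. The sketch below covers only the eight prefixes listed on the slide; the names `REFSEQ_PREFIXES` and `describe_refseq` are hypothetical, and accessions with any other prefix simply return `None`.

```python
# Prefix meanings as listed on the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (genome annotation)",
    "XR": "model non-coding RNA (genome annotation)",
    "XP": "model protein (genome annotation)",
}


def describe_refseq(accession):
    """Return the molecule type implied by a RefSeq accession prefix, or None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in the RefSeq style shown here
    return REFSEQ_PREFIXES.get(prefix.upper())


kind = describe_refseq("NM_123456")
```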

  46. RefSeq status (arrow on slide: Most confident → Least confident) • Validated • Reviewed • Provisional --------------- • Predicted • Model • Inferred • Genome Annotation

  47. Protein Database – Swiss-Prot SWISS-PROT: A curated database of protein sequences • Trained biologists extract and analyze relevant evidence from scientific publications • Post-translational modifications, sequence variations, functions, etc. • TrEMBL = Translated EMBL • UniProtKB = Swiss-Prot + TrEMBL

  49. Structures: PDB • Three-dimensional structures of biomolecules Image: Eric Martz RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm (Accessed Aug 16, 2006)
