
Please use Linux today if possible!

Introduction to Molecular Biology Databases. Alinda Nagy & Hedi Hegyi, PhD @ Institute of Enzymology, Budapest. The BioSapiens Permanent School of Bioinformatics, Budapest, Sept 4-8, 2006.


Presentation Transcript


  1. Please use Linux today if possible!

  2. Introduction to Molecular Biology Databases. Alinda Nagy & Hedi Hegyi, PhD @ Institute of Enzymology, Budapest. The BioSapiens Permanent School of Bioinformatics, Budapest, Sept 4-8, 2006

  3. Databases

  4. What is a database?
     • A database is a structured collection of information (an organized array of information).
     • A database consists of basic objects called records or entries.
     • Each record consists of fields, which hold defined data related to that record.
     • For example, a protein database would typically have proteins as records and protein properties as fields (e.g. name, length, sequence, taxonomical origin).
     Noam Kaplan
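
The record/field idea on this slide can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord` and the entry itself are made up, and a real protein database defines many more fields.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One database entry (record); each attribute is one field."""
    name: str
    length: int
    sequence: str
    organism: str  # taxonomical origin


# Hypothetical entry, invented purely for illustration.
record = ProteinRecord(
    name="example protein",
    length=8,
    sequence="MKTAYIAK",
    organism="Homo sapiens",
)

# The field names are the same for every record in the database.
field_names = [f.name for f in fields(record)]
print(field_names)
```

Because every record shares the same predetermined fields, later steps such as searching and summarizing can address a field by name.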

  5. What is a database?
     • A database is searchable (index) -> table of contents
     • A database is updated periodically (release) -> new edition
     • A database is cross-referenced (hyperlinks) -> links with other databases
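
As a rough sketch of what "searchable (index)" means, the snippet below builds a lookup table from one field to the positions of the matching records, much like a table of contents. The records, the accession numbers and the helper `build_index` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records with made-up accession numbers.
records = [
    {"accession": "X0001", "name": "protein A", "organism": "Homo sapiens"},
    {"accession": "X0002", "name": "protein B", "organism": "Mus musculus"},
    {"accession": "X0003", "name": "protein C", "organism": "Homo sapiens"},
]


def build_index(records, field):
    """Map each value of `field` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, rec in enumerate(records):
        index[rec[field]].append(position)
    return dict(index)


organism_index = build_index(records, "organism")

# A lookup now goes through the index instead of scanning every record.
human = [records[i]["accession"] for i in organism_index.get("Homo sapiens", [])]
print(human)
```

Real database systems use more elaborate index structures, but the principle of looking up a key instead of reading every entry is the same.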

  6. Why Databases?
     • The purpose of databases is not merely to collect and organize data, but mainly to allow advanced data retrieval.
     • A query is a method to retrieve information from the database.
     • The organization of each record into predetermined fields allows us to use queries on fields.
     • Example: Find all human proteins that are enzymes and have a length of 1000-1200 aa.
     Noam Kaplan
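
The example query on this slide can be written as a query on fields. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table layout, the column names and the four rows are assumptions made for illustration, not the schema of any real protein database.

```python
import sqlite3

# In-memory database with a hypothetical, simplified protein table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (name TEXT, organism TEXT, is_enzyme INTEGER, length INTEGER)"
)
rows = [
    ("enzyme A", "Homo sapiens", 1, 1100),
    ("enzyme B", "Homo sapiens", 1, 350),
    ("structural C", "Homo sapiens", 0, 1150),
    ("enzyme D", "Mus musculus", 1, 1050),
]
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", rows)

# "Find all human proteins that are enzymes and have a length of 1000-1200 aa."
hits = conn.execute(
    "SELECT name FROM protein "
    "WHERE organism = ? AND is_enzyme = 1 AND length BETWEEN ? AND ? "
    "ORDER BY name",
    ("Homo sapiens", 1000, 1200),
).fetchall()
hit_names = [name for (name,) in hits]
print(hit_names)
conn.close()
```

Each condition in the `WHERE` clause tests one field, which is only possible because every record has the same predetermined fields.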

  7. Databases on the Internet
     • Biological databases often have a web interface, which allows the user to send queries to the database.
     • Some databases can be accessed by different web servers, each offering a different interface.
     Diagram: User -> (request) -> Web server -> (query) -> Database server; Database server -> (result) -> Web server -> (web page) -> User
     Noam Kaplan

  8. Databases on the Internet
     Layers (diagram): Information system, Query system, Storage system, Data
     Francis Ouellette

  9. Databases on the Internet
     Layers (diagram): Information system, Query system, Storage system, Data
     Data, for example: GenBank flat file; PDB file; interaction record; title of a book; book
     Francis Ouellette

  10. Databases on the Internet
     Layers (diagram): Information system, Query system, Storage system, Data
     Storage system, for example: boxes; Oracle; MySQL; PC binary files; Unix text files; bookshelves
     Francis Ouellette

  11. Databases on the Internet
     Layers (diagram): Information system, Query system, Storage system, Data
     Query system, for example: a list you look at; a catalogue; indexed files; SQL; grep
     Francis Ouellette

  12. Databases on the Internet
     Layers (diagram): Information system, Query system, Storage system, Data
     Information system, for example: the UBC library; Google; Entrez (NCBI); SRS (Sequence Retrieval System)
     Francis Ouellette

  13. Database download
     • Nearly all biological databases are available for download as simple text files.
     • A local version of the database removes limitations on how you process the data.
     • Processing data in files requires some minimal computer-programming skills.
     • Perl is an easy programming language that can be used for extraction and analysis of data from files.
     Noam Kaplan
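
To give a feel for the kind of file processing meant here, the following sketch reads sequences in FASTA format, a common plain-text format for sequence downloads. The slide recommends Perl; Python is used here only as an illustration of the same idea. The function name `read_fasta` and the two example entries are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-entry file content, invented for illustration.
example = """>seq1 hypothetical entry
MKTAYIAK
QRQISFVK
>seq2 hypothetical entry
MEPPEG
""".splitlines()

# Extract one property (the length) per entry, keyed by the first word of the header.
lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(example)}
print(lengths)
```

For a downloaded file, the same function can be given an open file handle instead of the example lines, since a file handle also iterates line by line.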

  14. Tour of the major molecular biology databases
     • There is a tremendous amount of information about biomolecules in publicly available databases.
     • Today, we will just look at some of the main databases and what kind of information they contain.
     • Exercises will give you a little practice at browsing databases.

  15. List of molecular biology databases

  16. List of molecular biology databases
     • Nucleic Acids Research publishes an annual database issue. The 2006 update of the online Molecular Biology Database Collection includes 858 databases.
     • http://www3.oup.co.uk/nar/database/c/

  17. Large Growth in the Number of Biological Databases (figure; source: NAR Database Issue)

  18. Molecular biology data types
     Organisms; genome maps
     Figure: mouse chromosome X from the Mouse Genome Informatics project, http://www.informatics.jax.org/
     Lei Liu

  19. Molecular biology data types
     Organisms; genome maps; DNA sequences; RNA sequences
     Example sequence: ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
     Lei Liu

  20. Molecular biology data types
     Organisms; genome maps; DNA sequences; RNA sequences; protein sequences
     Example sequence: ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
     Lei Liu

  21. Molecular biology data types
     Organisms; genome maps; DNA sequences; RNA structures; RNA sequences; protein sequences; protein structures
     Figure: PDB entry 1CIS (P. Osmark, P. Sorensen, F.M. Poulsen)
     Lei Liu

  22. Molecular biology data types
     Organisms; genome maps; DNA motifs; DNA sequences; RNA expression; RNA structures; RNA sequences; protein sequences; protein structures; protein motifs
     Lei Liu

  23. Types of molecular biology databases
     14 main NAR categories:
     • Nucleotide Sequence
     • RNA sequence
     • Protein sequence
     • Structure
     • Genomics (non-vertebrate)
     • Metabolic and Signaling Pathways
     • Human and other Vertebrate Genomes
     • Human Genes and Diseases
     • Microarray Data and other Gene Expression
     • Proteomics Resources
     • Other
     • Organelle
     • Plant
     • Immunological

  24. Resources are Becoming More Diverse (figure: NAR database categories, 2004 and 2006)

  25. NAR 2006: A Closer Look
     • Genome-scale databases have proliferated
     • Traditional sequence databases are now a small part
     • Databases around new specific data types are emerging
     • Pathway- and disease-oriented databases are emerging

  26. Database searches

  27. Using a database
     • How to get information out of a database:
        • Summaries: how many entries, average or extreme values
        • Browsing: no targeted information to retrieve
        • Search: looking for particular information
     • Searching a database:
        • Must have a key that identifies the element(s) of the database that are of interest.
           • Name of gene
           • Sequence of gene
           • Other information
     Larry Hunter
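
The difference between a summary and a keyed search can be sketched as follows. The protein names and lengths are invented for illustration.

```python
from statistics import mean

# Hypothetical database content: protein name -> length in amino acids.
protein_lengths = {"protein A": 158, "protein B": 230, "protein C": 1100}

# Summary: number of entries, average and extreme values.
summary = {
    "entries": len(protein_lengths),
    "mean_length": mean(protein_lengths.values()),
    "shortest": min(protein_lengths, key=protein_lengths.get),
    "longest": max(protein_lengths, key=protein_lengths.get),
}
print(summary)

# Search: the key (here, the name) identifies the element of interest.
found = protein_lengths.get("protein B")
missing = protein_lengths.get("protein Z")  # a key that is absent gives None
print(found, missing)
```

A summary looks at all entries at once, whereas a search needs a key; without a suitable key (name, sequence or other information) there is nothing to look up.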

  28. Searching sequence databases
     • Start from sequence, find information about it
     • Many kinds of input sequences
        • Could be amino acid or nucleotide sequence
        • Genomic or mRNA/cDNA or protein sequence
        • Complete or fragmentary sequences
     • Exact matches are rare (even uninteresting in many cases), so often the goal is to retrieve a set of similar sequences.
     • Both small (mutations) and large (required for function) differences within "similar" can be interesting.
     Larry Hunter

  29. What might we want to know about a sequence?
     • Is this sequence similar to any known genes? How close is the best match? Significance?
     • What do we know about that gene?
        • Genomic (chromosomal location, allelic information, regulatory regions, etc.)
        • Structural (known structure? structural domains? etc.)
        • Functional (molecular, cellular & disease)
     • Evolutionary information:
        • Is this gene found in other organisms?
        • What is its taxonomic tree?
     Larry Hunter

  30. What can be discovered about a gene by a database search?
     • A little or a lot, depending on the gene
     • Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
     • Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
     • Structural information: associated protein structures, fold types, structural domains
     • Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
     • Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
     Larry Hunter

  31. NCBI and Entrez

  32. NCBI and Entrez
     • One of the most useful and comprehensive sources of databases is the NCBI (National Center for Biotechnology Information), part of the NIH (National Institutes of Health).
     • NCBI provides interesting summaries, browsers for genome data, and search tools
     • Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
     • Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
     Larry Hunter
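
Entrez can also be queried from a program through NCBI's E-utilities web service, which the slides do not cover. The sketch below only builds a search URL for the `esearch` endpoint and does not send it; the helper name `build_esearch_url` and the search term are illustrative assumptions, and NCBI's usage guidelines should be checked before sending automated requests.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for database `db` and query `term` (no request is sent)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Illustrative keyword search restricted to one organism.
url = build_esearch_url("protein", "alcohol dehydrogenase AND Homo sapiens[Organism]")
print(url)

# Decoding the query string again shows the parameters the server would receive.
params = parse_qs(urlsplit(url).query)
print(params)
```

The same pattern (a request carrying a query, answered with a result) is what the web interface does behind the scenes when a search form is submitted.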

  33. BLAST: Searching with a sequence
     • The goal is to find other sequences that are more similar to the query than would be expected by chance (and are therefore presumed to be homologous).
     • Can start with nucleotide or amino acid sequence, and search for either (or both)
     • Many options
        • E.g. ignore low-information (repetitive) sequence, set the critical value for significance
     • Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
     Larry Hunter
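
"More similar than would be expected by chance" is usually expressed as an E-value, the number of hits with at least a given score that one would expect by chance alone. The slides do not give a formula; the sketch below uses the standard relation between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the total database length. Real BLAST output uses corrected "effective" lengths, so the numbers here are only approximate, and the example lengths are made up.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value for a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 158-residue query against a database of 10**9 residues.
query_length = 158
database_length = 10**9

for bit_score in (30, 40, 50):
    e_value = expected_hits(bit_score, query_length, database_length)
    print(bit_score, e_value)

# Every additional 10 bits lowers the expected number of chance hits about 1000-fold.
ratio = expected_hits(40, query_length, database_length) / expected_hits(
    50, query_length, database_length
)
print(ratio)
```

The formula also shows why the choice of database matters: the same alignment score is less significant when the database searched is larger.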

  34. Major choices:
     • Translation
     • Database
     • Filters
     • Restrictions
     • Matrix
     Larry Hunter

  35. Larry Hunter

  36. Larry Hunter

  37. Close hit: Rat ADH alpha Larry Hunter

  38. Distant hit: Human sorbitol dehydrogenase Larry Hunter

  39. Parameters (at bottom!) Larry Hunter

  40. Click on: Larry Hunter

  41. Larry Hunter

  42. BLAST searches online
     • http://www.ncbi.nlm.nih.gov/BLAST/
     • Sequences:
     >ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
     MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
     PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
     SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED
     >ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
     MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
     MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
     VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
     PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV

  43. BLAST output for ENSP00000002501

  44. BLAST output for ENSP00000002501

  45. BLAST output for ENSP00000314902

  46. BLAST output for ENSP00000314902

  47. Take home messages
     • There are a lot of molecular biology databases, containing a lot of valuable information
     • Not even the best databases have everything (or the best of everything)
     • These databases are moderately well cross-linked, and there are "linker" databases
     • Sequence is a good identifier, maybe even better than gene name!
     Larry Hunter

  48. Protein sequence databases
     • General sequence databases (e.g. UniProt)
     • Protein properties (e.g. PFD, Protein Folding Database)
     • Protein localization and targeting (e.g. NPD, Nuclear Protein Database)
     • Protein sequence motifs and active sites (e.g. BLOCKS, InterPro, PROSITE, PRINTS)
     • Protein domain databases; protein classification (e.g. InterPro, ProDom, SMART, Pfam)
     • Databases of individual protein families (e.g. Histone Database)
     • http://www3.oup.co.uk/nar/database/cat/1
