
Introduction to databases



Presentation Transcript


  1. Introduction to databases Tuomas Hätinen

  2. Topics
     • File formats
     • Databases
     • Primary structure: UniProt
     • Tertiary structure: PDB
     • Database integration system
     • Sequence retrieval system (e.g. SRS, hands-on session)

  3. File formats

  4. Fasta
     • FASTA format is very common.
     • Can be constructed by hand when in a hurry
     • Straightforward way to store multiple sequences: just concatenate FASTA files
     • Contents:
       • Line 1: > followed by all identifiers and descriptors
       • Remaining lines: sequence

     >1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
     XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
     XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
     TVVDLQRCLEEKTIECRDGIRYIIHVPTVVAPSAPIFNPQNPLKTGFEPVFNAXWNALXH
     SPKDIDGLIIPGLCTGYAGVPPIISCKSXAFALRLYXAGDHISKELKNVLIXYYLQYPFE
     PFFPESCKIECQKLGIDIEXLKSFNVEKDAIELLIPRRILTLDL

     Example of FASTA sequence for PDB 1njr. Note that X stands for 'any' amino acid.
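The layout described on this slide (one ">" header line, then sequence lines, with records simply concatenated) can be read with a few lines of Python. This is a minimal sketch, not a full FASTA implementation: the function name `parse_fasta` is an assumption made for illustration, the 1NJR record is cut to its first two sequence lines, and the second record is invented to show concatenation.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between concatenated files
        if line.startswith(">"):
            header = line[1:]  # line 1: identifiers and descriptors
            records[header] = ""
        elif header is not None:
            records[header] += line  # remaining lines: sequence
    return records


# First record: header and first two sequence lines from the slide.
# Second record: hypothetical, added only to show concatenation.
example = """>1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
>demo2 hypothetical second record
MKVLA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Keying the dictionary on the whole header line keeps the sketch short; a real tool would usually split the identifier from the description.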

  5. SwissPROT, EMBL, TrEMBL, UniProt format
     • Each line begins with a two-letter identifier
     • UniProt format closely resembles EMBL format, except that it provides considerably more information about physical and biochemical properties
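Because every line starts with a two-letter identifier, an entry in this family of formats can be split into fields by looking only at the first two characters. The sketch below assumes the usual flat-file layout (code in columns 1-2, content from column 6, "//" closing the entry). The entry text is invented for illustration and is not a real database record; `group_by_line_code` and the `SQ_DATA` label are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_line_code(entry_text: str) -> Dict[str, List[str]]:
    """Group the lines of one flat-file entry by their two-letter code."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates an entry
        code = line[:2]
        content = line[5:].strip()
        if not code.strip():
            code = "SQ_DATA"  # hypothetical label for unlabeled sequence lines
        fields[code].append(content)
    return dict(fields)


# Invented entry, used only to exercise the parser.
demo_entry = "\n".join([
    "ID   DEMO_PROT               Reviewed;          12 AA.",
    "AC   X00000;",
    "DE   Hypothetical demo protein.",
    "SQ   SEQUENCE   12 AA;",
    "     MKVLAAGIVG LL",
    "//",
])

fields = group_by_line_code(demo_entry)
print(fields["AC"])
print(fields["SQ_DATA"])
```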

  6. SwissPROT format Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes

  7. SwissPROT format Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes

  8. Databases

  9. Key concepts
     • Experimental database
       • Contains experimental measurements
       • E.g. EMBL, PDB
     • Derived database
       • Derived from experimental databases
       • E.g. UniProtKB
     • Database stability
     • Accession numbers
     • Non-redundancy
     • Annotation

  10. Nucleic sequence databases – experimental data
      [Diagram: three databases, each receiving submissions and updates: GenBank (NCBI, NIH, USA), EMBL (EBI, Europe), DDBJ (CIB, NIG, Japan).]

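  11. Protein sequence databases
      [Diagram. Labels: DNA sequence databases (GenBank, EMBL, DDBJ); protein sequence databases (GenPept, PIR-PSD, TrEMBL, SwissPROT, UniProt); translation ("Trans") from DNA to protein databases; submissions/updates ("Sub/Up"); raw data; retrieval through Entrez (NCBI) and SRS (EBI); NIH.]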

  12. UniProt
      • Universal Protein Resource
      • Protein sequence database
      • UniProt Consortium
        • European Bioinformatics Institute
        • Swiss Institute of Bioinformatics
        • PIR, Georgetown University
      • Mission
        • Maintain a high-quality, stable, comprehensive, fully classified and annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces

  13. Organization of UniProt databases
      • UniProt Archive (UniParc)
        • All available protein sequences
      • UniProt Knowledgebase (UniProtKB)
        • Annotated protein sequences
      • UniProt Reference Clusters (UniRef)
        • Reduced redundancy for faster searching
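To see why reduced redundancy speeds up searching, consider the simplest possible case: collapsing records whose sequences are exactly identical, so each distinct sequence is searched once. The sketch below illustrates only that idea and is not the actual UniRef clustering procedure; the function name and the identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_identical(records: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the identifiers that share it."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for identifier, sequence in records.items():
        clusters[sequence].append(identifier)
    return dict(clusters)


# Hypothetical identifiers and sequences, for illustration only.
archive = {
    "demoA": "MKVLAAGIVG",
    "demoB": "MKVLAAGIVG",  # same sequence as demoA
    "demoC": "MSTNPKPQRK",
}

clusters = cluster_identical(archive)
print(len(archive), "records ->", len(clusters), "distinct sequences")
```

A search then only has to scan the two distinct sequences, and a hit can be mapped back to every identifier in its cluster.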

  14. Database size comparison

  15. UniProtKB
      • Annotated entries
      • UniParc => UniProtKB
      • UniProt/TrEMBL
        • Automated annotation
      • UniProt/SwissProt
        • Manual annotation

  16. SWISSPROT
      • Started as part of a PhD thesis; first version released in 1986. Now a collaboration between the Swiss Institute of Bioinformatics and EBI.
      • Rich source of protein sequence data
      • A well-annotated source of sequences
      • Largely non-redundant
      • Updated daily; cross-referenced with more than 30 different databases.
      • Let us view a sample entry

  17. TrEMBL
      • 1996: TrEMBL (Translation of EMBL) released
      • Computer-annotated entries derived from the translation of all coding sequences in the EMBL database, except those already in SWISS-PROT
      • Complement to Swiss-Prot and sequence
      • Sequences are included in Swiss-Prot by annotators
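The "translation of coding sequences" step can be illustrated with the standard genetic code. This is a minimal sketch of conceptual translation, assuming a coding sequence that is already in frame; it is not the pipeline used to build TrEMBL. The function name `translate_cds` and the input sequences are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGAAAGTGTAA"))  # prints MKV
```

Codons containing ambiguous bases are reported as X, matching the 'any amino acid' convention in the FASTA example earlier in the deck.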

  18. Errors in databases
      • Be aware of errors in the databases:
        • Sequence errors:
          • genome projects' error rate is 1 per 10,000 nucleotides (nt);
          • ESTs' error rate is 1 per 100 nt.
        • Annotation errors:
          • Programs do not always give correct annotations.
          • SwissProt is a protein database curated and annotated manually by biologists.
          • Manual curation doe

  19. Errors in databases
      • Be aware of errors in the databases:
        • Sequence errors:
          • genome projects' error rate is 1 per 10,000 nucleotides (nt);
          • ESTs' error rate is 1 per 100 nt.
        • Annotation errors:
          • Automated computer programs do not always give correct annotations.
          • SwissProt is a protein database curated and annotated manually by biologists.
          • It is the most reliable database, but it is not up to date
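The two error rates quoted on this slide are easier to compare on a concrete sequence length. The sketch below works through a 1,000 nt sequence, which is an arbitrary length chosen for illustration. The probability calculation additionally assumes that errors at different positions are independent, which the slide does not state.

```python
GENOME_ERROR_RATE = 1 / 10_000  # from the slide: 1 error per 10,000 nt
EST_ERROR_RATE = 1 / 100        # from the slide: 1 error per 100 nt


def expected_errors(length_nt: int, error_rate: float) -> float:
    """Expected number of erroneous nucleotides in a sequence."""
    return length_nt * error_rate


def prob_at_least_one_error(length_nt: int, error_rate: float) -> float:
    """Chance of one or more errors, assuming independent positions."""
    return 1.0 - (1.0 - error_rate) ** length_nt


length = 1000  # illustrative sequence length in nucleotides
for label, rate in [("genome", GENOME_ERROR_RATE), ("EST", EST_ERROR_RATE)]:
    print(
        label,
        round(expected_errors(length, rate), 2),
        round(prob_at_least_one_error(length, rate), 3),
    )
```

For this length the genome-derived sequence is expected to carry about 0.1 errors, while the EST is expected to carry about 10, so an EST of that size almost certainly contains at least one wrong base.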
