
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core



Presentation Transcript


  1. Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases

  2. Questions to address • What are the main sequence databases? • Which one to use for: • Looking up a gene name/identifier from a paper • Identifiers • What should I use and why? • Coordinate based systems • Annotation • Protein domains • Gene Ontology

  3. Database Varieties • Sequence Warehouses • “everything under one roof” • Genome Databases • Containing single genome dataset(s) • Reference Sets • Often human curated, the 'standard' for a particular gene or protein from which variants are defined • Specialist • Short reads from next generation sequencing (Short read archive) • [EST] Expressed sequence tags and [GSS] Genome survey sequence

  4. Sharing primary data NCBI GenBank EMBL DDBJ

  5. NCBI • Warehouse • GenBank <live demo> • NR dataset: NR = non-redundant (but it is not...) • Reference Dataset • RefSeq • Genome Datasets • NCBI Genomes

  6. EMBL • Warehouse • EMBL • Historically • Protein set was called translated EMBL (TrEMBL) • Gold standard reference set was called SwissProt • Reference set = UniProt • UniProtKB/Swiss-Prot • Manually annotated and reviewed • UniProtKB/TrEMBL • Automatically annotated and not reviewed • Genome database • Ensembl <live demo>

  7. Live Demo • Search GenBank for human ADH4 • How many are there? • How many should there be? • Why are some different from those found in UniProt? • Are there better databases to use? • Which identifier should you use in your lab book?

  8. We should now be able to answer these: • What are the main sequence databases? • Which one to use for: • Looking up a gene identifier from a paper • Searching for a gene name • Searching for an orthologous gene from another species

  9. Or what to write in your lab book Identifiers

  10. How to identify a feature • Gene/protein name • Common name • Standardised name • Database identifier • Unique for each database • Some have revision numbers • Position in genome • Dependent on genome build • Position in a gene/protein • Protein domains

  11. Never use common names • Example of EPHB2

  12. Consortia identifiers • Most key species have a consortium / group / community that provides the key identifiers in the field • Humans • Was HUGO (Human Genome Organisation) • Now the HGNC (HUGO Gene Nomenclature Committee)

  13. Database Identifiers • Every dataset has its own system for identifying genes/proteins • Example: Human ADH4 • Ensembl • ENSG00000198099 ENST00000423445 ENSP00000397939 • SwissProt • ADH4_HUMAN P08319 • RefSeq • NM_000670.3 NP_000661.2 • GenBank • gi|71565152|ref|NP_000661.2|
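
As a rough illustration of how different these identifier styles look, the sketch below guesses which database an identifier comes from by its shape alone. The function name `classify_identifier` and the pattern table are hypothetical, and the patterns are simplified assumptions that cover only the human examples on this slide, not every identifier each database issues.

```python
import re

# Simplified, illustrative patterns (assumptions): they cover only the
# identifier styles shown on the slide, in the order they should be tried.
PATTERNS = [
    ("Ensembl gene", re.compile(r"ENSG\d{11}(\.\d+)?")),
    ("Ensembl transcript", re.compile(r"ENST\d{11}(\.\d+)?")),
    ("Ensembl protein", re.compile(r"ENSP\d{11}(\.\d+)?")),
    ("RefSeq mRNA", re.compile(r"NM_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"NP_\d+(\.\d+)?")),
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("UniProt entry name", re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")),
]


def classify_identifier(identifier: str) -> str:
    """Return a best-guess label for the database style of an identifier."""
    identifier = identifier.strip()
    # GenBank-style deflines are pipe-delimited and start with "gi|".
    if identifier.startswith("gi|"):
        return "GenBank GI defline"
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    # A bare gene symbol such as "ADH4" matches none of the patterns.
    return "unknown"


for example in ["ENSG00000198099", "P08319", "ADH4_HUMAN",
                "NM_000670.3", "gi|71565152|ref|NP_000661.2|", "ADH4"]:
    print(example, "->", classify_identifier(example))
```

A shape check like this only tells you which naming scheme an identifier follows; it does not confirm that the record exists or is current.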

  14. Keeping Track of Changes • Gene models can change • Will the ID you used yesterday still get the same sequence today? • Or: How do you get the latest version of a sequence?

  15. Keeping Track of Changes • GenBank: GI or “GenBank identifier” • GI number changes with each sequence revision; the old one is often removed when it is superseded • SwissProt: Accession and ID • The accession (P08319) is the stable identifier; the ID, or entry name (ADH4_HUMAN), is readable but can change • RefSeq and Ensembl • Revision-based IDs • NM_000670.3 ENSG00000198099.1 • XXX.number • XXX always retrieves the latest version • XXX.number retrieves that specific version
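
The "XXX.number" convention can be handled mechanically. The following is a minimal sketch, assuming the version is always a trailing integer after the last dot; the helper names `split_version` and `same_record` are hypothetical and not part of any database's tooling.

```python
from typing import Optional, Tuple


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'XXX.number' into ('XXX', number); version is None if absent."""
    stable, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return stable, int(version)
    # No dot, or the part after the dot is not a plain integer.
    return identifier, None


def same_record(id_a: str, id_b: str) -> bool:
    """True if both identifiers point at the same record, ignoring version."""
    return split_version(id_a)[0] == split_version(id_b)[0]


print(split_version("NM_000670.3"))
print(split_version("ENSG00000198099"))
print(same_record("NM_000670.3", "NM_000670.2"))
```

Recording the full versioned form in a lab book keeps the exact sequence recoverable, while the unversioned part is what you would search with to find the latest revision.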

  16. Demo: Retrieving old data

  17. Demo: Ensembl • Defining: Chromosome coordinates

  18. Chromosome Positions • Features identified by chromosome and position • File formats: BED, WIG, GFF, ... • All major genome databases store features as coordinates • Ubiquitous in deep sequencing studies • Note: coordinates change depending on the assembly • Always note the build number of the genome assembly if you are using coordinates
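
One way to make the "always note the build" advice hard to forget is to store the assembly name alongside every coordinate. The sketch below is an assumption-laden illustration: the class `GenomicInterval` is hypothetical, the coordinates are made up, and it relies on the standard conventions that BED uses 0-based, half-open intervals while GFF uses 1-based, inclusive ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInterval:
    """A feature position that always carries its genome build."""
    assembly: str  # genome build name, e.g. "GRCh37"
    chrom: str
    start: int     # 1-based, inclusive (GFF-style)
    end: int       # 1-based, inclusive (GFF-style)

    def to_bed(self) -> str:
        """Render as BED columns: 0-based start, end-exclusive."""
        return f"{self.chrom}\t{self.start - 1}\t{self.end}"

    def overlaps(self, other: "GenomicInterval") -> bool:
        """Compare positions, refusing to mix different assemblies."""
        if self.assembly != other.assembly:
            raise ValueError(
                f"cannot compare {self.assembly} with {other.assembly}")
        return (self.chrom == other.chrom
                and self.start <= other.end
                and other.start <= self.end)


# Made-up coordinates, for illustration only.
a = GenomicInterval("GRCh37", "chr4", 101, 200)
b = GenomicInterval("GRCh37", "chr4", 150, 300)
c = GenomicInterval("GRCh38", "chr4", 150, 300)

print(a.to_bed())
print(a.overlaps(b))
```

Calling `a.overlaps(c)` raises `ValueError`, because the two intervals are on different builds and would first need converting with a tool such as liftOver.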

  19. Coordinates • New concept of PATCH • This is an assembly update without changing the primary sequence • However, additional 'improved' contigs map to the reference • These will be in the next assembly: you may wish to use them • Genome assembly names can differ by institution but are the same underlying sequence: • GenBank/UCSC • DEMO liftOver

  20. Protein Domains: Protein Positions

  21. Protein Domains • InterPro • Site that stores information on known protein domains from different projects • Covered by InterPro • Similarities between proteins • Conserved region in an alignment • Conserved protein folds • Not covered by InterPro • Predicted features on primary protein sequence • Trans-membrane regions • Low complexity regions • Phosphorylation sites

  22. Domain Complexity Many different types of domains x Many different projects identifying them = Vast amounts of domain based data

  23. Old way of interacting with a database Request information Retrieve information From single source

  24. Distributed Annotation

  25. DAS clients • Different types of software can have a DAS client built in • Genome browsers: Ensembl, IGB, IGV... • Multiple alignment editors: Jalview, STRAP • 3D structures: Spice • 3D electron microscopy data: PeppeR • Demo

  26. Annotation

  27. Annotation • Problem: Many ways to name a gene • Reductase = oxidase = dehydrogenase • Gene Ontology Consortium [GO] • GO terms standardise naming • Note that errors may still occur in the assignment of terms • Found in RefSeq, UniProt and most genome databases • GO browsers e.g. AmiGO

  28. Gene Ontology • all [535063 gene products] • GO:0008150 : biological_process • [404412 gene products] • GO:0005575 : cellular_component • [372379 gene products] • GO:0003674 : molecular_function • [436597 gene products]

  29. Gene Ontology: directed acyclic graph (a term can have more than one parent)
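
The Gene Ontology is not a strict tree, because a term may have several parents. A toy sketch of walking such a structure is shown below; the term names in `PARENTS` are invented placeholders rather than real GO terms, and `ancestors` is a hypothetical helper.

```python
# Toy ontology (invented term names): each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    # Two parents: this is what makes the structure a graph, not a tree.
    "bifunctional": ["binding", "catalytic"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("bifunctional", PARENTS)))
```

A gene product annotated with a term is implicitly annotated with all of that term's ancestors, which is why the counts at the three root terms are so large.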

  30. Evidence Codes • Experimental • EXP: Inferred from Experiment • IDA: Inferred from Direct Assay • IPI: Inferred from Physical Interaction • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IEP: Inferred from Expression Pattern • Computational • ISS: Inferred from Sequence or Structural Similarity • ISO: Inferred from Sequence Orthology • ISA: Inferred from Sequence Alignment • ISM: Inferred from Sequence Model • IGC: Inferred from Genomic Context • RCA: Inferred from Reviewed Computational Analysis • Author Statement • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • Curator Statement • IC: Inferred by Curator • ND: No biological Data available • Automatically-assigned • IEA: Inferred from Electronic Annotation
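
Evidence codes make it possible to filter annotations by how they were derived. The sketch below groups the codes as on this slide and keeps only annotations from chosen categories; the annotation tuples are hypothetical examples, and the names `category_of` and `keep_by_category` are illustrative, not part of any GO tool.

```python
# Evidence codes grouped by category, as listed on the slide.
EVIDENCE_CATEGORIES = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Computational": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "Author Statement": ["TAS", "NAS"],
    "Curator Statement": ["IC", "ND"],
    "Automatically-assigned": ["IEA"],
}

# Reverse lookup: evidence code -> category name.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def category_of(code: str) -> str:
    """Return the category for an evidence code, or 'unknown'."""
    return CODE_TO_CATEGORY.get(code.upper(), "unknown")


def keep_by_category(annotations, allowed):
    """Keep (gene, go_term, code) tuples whose code is in an allowed category."""
    return [a for a in annotations if category_of(a[2]) in allowed]


# Hypothetical annotations: (gene product, GO term, evidence code).
annotations = [
    ("geneA", "GO:0003674", "IDA"),
    ("geneB", "GO:0003674", "IEA"),
    ("geneC", "GO:0008150", "ISS"),
]

experimental_only = keep_by_category(annotations, {"Experimental"})
print(experimental_only)
```

Filtering out `IEA` annotations is a common first step when you want to rely only on evidence that a person has reviewed.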

  31. Best annotation? • Use DAS clients to get more information on genomic, gene or protein features • Protein domains are especially useful • The Gene Ontology is useful for general classification • BUT be aware of where the annotation was derived from
