Protein Sequence Databases

Nathan Edwards, Department of Biochemistry and Mol. & Cell. Biology, Georgetown University Medical Center

Protein Sequence Databases • Link between mass spectra and proteins • A protein’s amino-acid sequence provides a basis for interpreting: • Enzymatic digestion • Separation protocols • Fragmentation • Peptide ion masses • We must interpret database information as carefully as mass spectra.
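As a rough illustration of how a database sequence feeds into enzymatic digestion and peptide ion masses, here is a minimal sketch. It assumes trypsin as the enzyme (cleave after K or R, but not before P), no missed cleavages, and no modifications; the slide does not name an enzyme. The names `MONO_MASS`, `tryptic_peptides`, `peptide_mass`, and `peptide_mz` are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added once per positive charge


def tryptic_peptides(sequence):
    """Split after K or R unless the next residue is P (assumed trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def peptide_mz(peptide, charge=1):
    """m/z of the peptide ion carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    # Toy sequence, not taken from any database entry.
    for pep in tryptic_peptides("AKRPGKDE"):
        print(pep, round(peptide_mass(pep), 4), round(peptide_mz(pep, 2), 4))
```

An error in the database sequence changes every mass computed this way, which is one reason the database deserves the same scrutiny as the spectra.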

More than sequence… Protein sequence databases provide much more than sequence: • Names • Descriptions • Facts • Predictions • Links to other information sources Protein databases provide a link to the current state of our understanding about a protein.

Much more than sequence • Names: Accession, Name, Description • Biological Source: Organism, Source, Taxonomy • Literature • Function: Biological process, molecular function, cellular component; known and predicted • Features: Polymorphism, Isoforms, PTMs, Domains • Derived Data: Molecular weight, pI

Database types

Swiss-Prot • From ExPASy • Expert Protein Analysis System • Swiss Institute of Bioinformatics • ~ 515,000 protein sequence “entries” • ~ 12,000 species represented • ~ 20,000 Human proteins • Highly curated • Minimal redundancy • Part of UniProt Consortium

TrEMBL • Translated EMBL nucleotide sequences • European Molecular Biology Laboratory • European Bioinformatics Institute (EBI) • Computer annotated • Only sequences absent from Swiss-Prot • ~ 10.5 M protein sequence “entries” • ~ 230,000 species • ~ 75,000 Human proteins • Part of UniProt Consortium

UniProt • Universal Protein Resource • Combination of sequences from • Swiss-Prot • TrEMBL • Mixture of highly curated (Swiss-Prot) and computer annotation (TrEMBL) • “Similar sequence” clusters are available • 50%, 90%, 100% sequence similarity

RefSeq • Reference Sequence • From NCBI (National Center for Biotechnology Information), NLM, NIH • Integrated genomic, transcript, and protein sequences. • Varying levels of curation • Reviewed, Validated, …, Predicted, … • ~ 9.7 M protein sequence “entries” • ~ 209,000 reviewed, ~ 90,000 validated • ~ 39,000 Human proteins

RefSeq • Particular focus on major research organisms • Tightly integrated with genome projects. • Curated entries: NP accessions • Predicted entries: XP accessions • Others: YP, ZP, AP
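The prefix convention above lends itself to a small lookup. The following is only a sketch: it covers the two prefixes the slide explains (NP and XP), lumps everything else into "other", and the function name `refseq_status` and the sample accessions are made up for illustration.

```python
# Prefix -> curation level, as described on the slide.
REFSEQ_PROTEIN_STATUS = {
    "NP": "curated",
    "XP": "predicted",
}


def refseq_status(accession):
    """Classify a RefSeq-style protein accession by its two-letter prefix."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PROTEIN_STATUS.get(prefix, "other")


if __name__ == "__main__":
    # Placeholder accessions, not real database entries.
    for acc in ["NP_123456.1", "XP_123456.1", "YP_123456.1", "P17947"]:
        print(acc, refseq_status(acc))
```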

IPI • International Protein Index • From EBI • For a specific species, combines • UniProt, RefSeq, Ensembl • Species specific databases • H-InvDB, VEGA, TAIR • ~ 87,000 (from ~ 307,000) human protein sequence entries • Human, mouse, rat, zebrafish, Arabidopsis, chicken, cow

MSDB • From Imperial College London • Combines • PIR, TrEMBL, GenBank, Swiss-Prot • Distributed with Mascot • …so well integrated with Mascot • ~ 3.2M protein sequence entries • “Similar sequences” suppressed • 100% sequence similarity • Not updated since September 2006 (obsolete)

NCBI’s nr • “non-redundant” • Contains • GenBank CDS translations • RefSeq Proteins • Protein Data Bank (PDB) • Swiss-Prot, TrEMBL, PIR • Others • “Similar sequences” suppressed • 100% sequence similarity • ~ 10.5 M protein sequence “entries”
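Suppressing "similar sequences" at 100% similarity amounts to collapsing entries with identical sequences into one record that keeps all source accessions. A minimal sketch of that idea follows; it is not NCBI's actual pipeline, and the function name `collapse_identical` and the toy records are hypothetical.

```python
def collapse_identical(records):
    """Merge (accession, sequence) pairs that share exactly the same sequence.

    Returns a list of (accessions, sequence) tuples in first-seen order.
    """
    merged = {}
    for accession, sequence in records:
        # Identical sequence -> same key, so accessions accumulate together.
        merged.setdefault(sequence.upper(), []).append(accession)
    return [(accessions, sequence) for sequence, accessions in merged.items()]


if __name__ == "__main__":
    # Toy records; accessions and sequences are invented.
    toy = [
        ("dbA|0001", "MKTAYIAK"),
        ("dbB|X9", "MKTAYIAK"),
        ("dbA|0002", "MKTAYIAR"),  # one residue differs, so it stays separate
    ]
    for accessions, sequence in collapse_identical(toy):
        print("|".join(accessions), sequence)
```

Note that a single-residue difference, such as a polymorphism, is enough to keep two entries apart under this rule.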

Others • HPRD • Manually curated integration of literature • PDB • Focus on protein structure • dbEST • Part of GenBank - EST sequences • Genome Sequences

Human Sequences • Number of Human genes is believed to be between 20,000 and 25,000

DNA to Protein Sequence Derived from http://online.itp.ucsb.edu/online/infobio01/burge

Genome Browsers • Link genomic, transcript, and protein sequence in a graphical manner • Genes, ESTs, SNPs, cross-species, etc. • UC Santa Cruz • http://genome.ucsc.edu • Ensembl • http://www.ensembl.org • NCBI Map Viewer • http://www.ncbi.nlm.nih.gov/mapview

UCSC Genome Browser • Shows many sources of protein sequence evidence in a unified display

PeptideMapper Web Service I’m Feeling Lucky

Unannotated Splice Isoform

Accessions • Permanent labels • Short, machine readable • Enable precise communication • Typos render them unusable! • Each database uses a different format • Swiss-Prot: P17947 • Ensembl: ENSG00000066336 • PIR: S60367 • GO: GO:0003700
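Because each database uses its own accession format, a simple pattern check can catch many typos before a lookup fails. The sketch below is an assumption-laden example: the UniProt pattern follows the accession format UniProt documents, while the Ensembl gene and GO patterns are inferred from the examples on the slide. PIR is left out because its format is not stated here. The name `classify_accession` is hypothetical.

```python
import re

# Pattern per database; only the UniProt one follows a published format,
# the other two are inferred from the slide's examples.
ACCESSION_PATTERNS = {
    "UniProt": re.compile(
        r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO term": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the first database whose pattern matches the whole string."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(text):
            return database
    return None  # unknown format, possibly a typo


if __name__ == "__main__":
    for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
        print(candidate, classify_accession(candidate))
```

A pattern match only says the string is well formed; a typo that yields another valid-looking accession still slips through.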

Names / IDs • Compact mnemonic labels • Not guaranteed permanent • Require careful curation • Conceptual objects • ALBU_HUMAN • Serum Albumin • RT30_HUMAN • Mitochondrial 28S ribosomal protein S30 • CP3A7_HUMAN • Cytochrome P450 3A7

Description / Name • Free text description • Human readable • Space limited • Hard for computers to interpret! • No standard nomenclature or format • Often abused…. • COX7R_HUMAN • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

FASTA Format

FASTA Format • Header line starts with “>” • Accession number • No uniform format • Multiple accessions separated by | • One line of description • Usually pretty cryptic • Organism of sequence? • No uniform format • Official Latin name not necessarily used • Amino-acid sequence in single-letter code • Usually spread over multiple lines.
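The layout described above can be read with a few lines of code. This is a minimal sketch, not a full parser: it treats everything up to the first space in the header as "|"-separated identifier fields (some of which may be database tags rather than accessions) and the rest as free-text description. The names `parse_fasta` and `_make_record` and the toy record are hypothetical.

```python
def _make_record(header, chunks):
    """Split a header into identifier fields and description; join the sequence."""
    identifier, _, description = header.partition(" ")
    id_fields = [field for field in identifier.split("|") if field]
    return id_fields, description, "".join(chunks)


def parse_fasta(lines):
    """Yield (id_fields, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    # Toy input; identifiers and sequence are invented.
    toy = [
        ">db|ACC0001|NAME_TOY Example protein, toy organism",
        "MKTAYIAKQR",
        "QISFVKSHFS",
    ]
    for id_fields, description, sequence in parse_fasta(toy):
        print(id_fields, description, len(sequence))
```

Since neither the identifier layout nor the organism field is standardized, anything beyond this split usually needs database-specific rules.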

Organism / Species / Taxonomy • The protein’s organism… • …or the source of the biological sample • The most reliable sequence annotation available • Useful only to the extent that it is correct • NCBI’s taxonomy is widely used • Provides a standard of sorts; Hierarchical • Other databases don’t necessarily keep up • Organism specific sequence databases starting to become available.

Organism / Species / Taxonomy • Buffalo rat • Gunn rats • Norway rat • Rattus PC12 clone IS • Rattus norvegicus • Rattus norvegicus8 • Rattus norwegicus • Rattus rattiscus • Rattus sp. • Rattus sp. strain Wistar • Sprague-Dawley rat • Wistar rats • brown rat • laboratory rat • rat • rats • zitter rats
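One way to cope with such free-text organism labels is to map them onto a single canonical name. The sketch below assumes a small hand-curated synonym table plus fuzzy matching for near-miss spellings; the table contents, the 0.9 cutoff, and the name `normalize_organism` are all illustrative assumptions, and a real pipeline would map to NCBI taxonomy identifiers instead.

```python
import difflib

# Hand-curated synonyms (hypothetical and incomplete), keyed in lower case.
SYNONYMS = {
    "rattus norvegicus": "Rattus norvegicus",
    "norway rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
    "laboratory rat": "Rattus norvegicus",
}


def normalize_organism(label, cutoff=0.9):
    """Return a canonical organism name, or None if no confident match."""
    key = " ".join(label.lower().split())  # lower-case, collapse whitespace
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to fuzzy matching to absorb small typos.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


if __name__ == "__main__":
    for label in ["Norway rat", "Rattus norwegicus", "Rattus norvegicus8", "Rattus sp."]:
        print(repr(label), "->", normalize_organism(label))
```

Fuzzy matching is a trade-off: a misspelling could also belong to a different species, so ambiguous labels such as "Rattus sp." are deliberately left unresolved.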

Controlled Vocabulary • Middle ground between computers and people • Provides precision for concepts • Searching, sorting, browsing • Concept relationships • Vocabulary / Ontology must be established • Human curation • Link between concept and object: • Manually curated • Automatic / Predicted


Ontology Structure • NCBI Taxonomy • Tree • Gene Ontology (GO) • Molecular function • Biological process • Cellular component • Directed acyclic graph (DAG) • Unstructured labels • Overlapping?
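The practical difference between a tree and a DAG is that a DAG term can have several parents, so "all ancestors" requires a graph traversal rather than a single walk to the root. A minimal sketch with a made-up four-term graph follows; the term names and the function `ancestors` are hypothetical and not real GO identifiers.

```python
# Toy DAG: each term lists its direct parents. "term_d" has two parents,
# which a tree (such as a taxonomy) would not allow.
PARENTS = {
    "term_a": [],
    "term_b": ["term_a"],
    "term_c": ["term_a"],
    "term_d": ["term_b", "term_c"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("term_d", PARENTS)))
```

A protein annotated with a specific term is implicitly annotated with all of that term's ancestors, which is why searching by a general concept can still find it.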


Protein Families • Similar sequence implies similar function • Similar structure implies similar function • Common domains imply similar function • Bootstrap up from small sets of proteins with well understood characteristics • Usually a hybrid manual / automatic approach


Protein Families • PROSITE, PFam, InterPro, PRINTS • Swiss-Prot keywords • Differences: • Motif style, ontology structure, degree of manual curation • Similarities: • Primarily sequence based, cross species

Gene Ontology • Hierarchical • Molecular function • Biological process • Cellular component • Describes the vocabulary only! • Protein families provide GO association • Not necessarily any appropriate GO category. • Not necessarily in all three hierarchies. • Sometimes general categories are used because none of the specific categories are correct.
