BIOLOGICAL DATABASES
M. Prasad Naidu, MSc Medical Biochemistry, Ph.D.

INTRODUCTION
The database
• must be maintained as a central shareable resource
• should provide easy-to-use software to access the information (web pages...)
• has to be structurally organised and fully annotated so that the information needed can be found
• should not contain redundant information
• should be error free

Levels of protein sequence databases and structural organisation
• Primary database: primary level, the sequence (e.g. AVILDRYFH)
• Secondary database: secondary level, a motif or pattern (e.g. [AS]-X-[IL]2-[DE])
• Structure database: tertiary level, a domain (e.g. Rossmann fold, GTP-binding domain...)

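The relationship between a primary sequence and a secondary-database pattern can be sketched in a few lines of Python. This is a minimal sketch, assuming the slide's notation ([AS] = A or S, X = any residue, a trailing number = repeat count); the helper names `slide_pattern_to_regex` and `find_motif` are hypothetical. PROSITE itself writes repeats in parentheses, e.g. [IL](2), which this sketch does not handle.

```python
import re

def slide_pattern_to_regex(pattern: str) -> str:
    """Translate a pattern written as on the slide into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Each element is a residue class like [AS] or a single letter,
        # optionally followed by a repeat count (e.g. [IL]2).
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Z])(\d*)", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residues, count = m.groups()
        token = "." if residues == "X" else residues  # X matches any residue
        parts.append(token + (f"{{{count}}}" if count else ""))
    return "".join(parts)

def find_motif(sequence: str, pattern: str):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    regex = re.compile(slide_pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

regex = slide_pattern_to_regex("[AS]-X-[IL]2-[DE]")
hits = find_motif("AVILDRYFH", "[AS]-X-[IL]2-[DE]")
print(regex)  # [AS].[IL]{2}[DE]
print(hits)   # [(0, 'AVILD')]
```

The slide's own example sequence AVILDRYFH contains the pattern at its first five residues (A, V, I, L, D), which is why the two examples sit side by side in the diagram.
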
Different Types of Databases
• Primary databases
• Composite databases
• Secondary databases

PRIMARY DATABASES
• In 1980, the flood of sequence information created a need to store sequence data.
• They contain sequence information.
• E.g.:
  Nucleic acid: EMBL, GenBank, DDBJ
  Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D

PIR
• Developed at the National Biomedical Research Foundation (NBRF) in the 1960s by Margaret Dayhoff, to investigate evolutionary relationships between proteins.
• Maintained by PIR, an association of macromolecular sequence data collection centres:
  • PIR at NBRF
  • Japan International Protein Information Database (JIPID)
  • Martinsried Institute for Protein Sequences (MIPS)

Quality of PIR Database
The database is split into 4 sections ranked according to quality:
• PIR1: fully classified and annotated entries
• PIR2: preliminary entries (may include redundancy)
• PIR3: unverified entries
• PIR4: conceptual translations

MIPS
• Collects and processes sequence data for the PIR.
• Also distributed with PATCHX, a supplement of unverified protein sequences from external sources.

SWISS-PROT database
• Produced by the Dept. of Medical Biochemistry at the University of Geneva and the EMBL in 1986.
• Was transferred to the EBI in 1994.
• Subsequently moved to the Swiss Institute of Bioinformatics (SIB).
• Has highly annotated entries, with descriptions of function, structure and post-translational modifications.

Example of a flat file: SWISS-PROT Q14790
ID   ICE8_HUMAN     STANDARD;      PRT;   479 AA.
AC   Q14790; Q14791; Q14792; Q14793; Q14794;
AC   Q14795; Q14796; Q15780; Q15806; Q9UQ81;
AC   O14676;
DT   01-NOV-1997 (Rel. 35, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   01-OCT-2000 (Rel. 40, Last annotation update)
DE   CASPASE-8 PRECURSOR (EC 3.4.22.-) (ICE-LIKE
DE   APOPTOTIC PROTEASE 5) (MORT1-ASSOCIATED CED-3
DE   HOMOLOG) (MACH) (FADD-HOMOLOGOUS ICE/CED-3-LIKE
DE   PROTEASE) (FADD-LIKE ICE) (FLICE)
DE   (APOPTOTIC CYSTEINE PROTEASE) (APOPTOTIC
DE   PROTEASE MCH-5) (CAP4).
GN   CASP8 OR MCH5.
Line codes: ID = identification (entry name of the form PROTEIN_SOURCE); AC = accession number (needed because ID codes can change); DT = date of entry; DE = description; GN = gene name.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata;
OC   Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini;
OC   Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A., AND ALTERNATIVE SPLICING.
RC   TISSUE=Thymus, and B-cell;
RX   MEDLINE=96279826; PubMed=8681376; [NCBI, ExPASy, EBI, Israel, Japan]
RA   Boldin M.P., Goncharov T.M., Goltsev Y.V., Wallach D.;
Line codes: OS = organism species; OC = organism classification; RN, RP, RC, RX, RA = references.
RT   "Involvement of MACH, a novel MORT1/FADD-interacting protease, in
RT   Fas/APO-1- and TNF receptor-induced cell death.";
RL   Cell 85:803-815(1996).
RN   [2]
RP   X-RAY CRYSTALLOGRAPHY (2.8 ANGSTROMS).
RX   MEDLINE=99451259; PubMed=10508784; [NCBI, ExPASy, EBI, Israel, Japan]
RA   Blanchard H., Kodandapani L., Mittl P.R.E., Di Marco S.,
RA   Krebs J.F., Wu J.C., Tomaselli K.J., Gruetter M.G.;
RT   "The three-dimensional structure of caspase-8: an initiator enzyme in
RT   apoptosis.";
RL   Structure 7:1125-1133(1999).
References continue in the same way: reference 2 and so on ...
CC   -!- FUNCTION: MOST UPSTREAM PROTEASE OF
CC       THE ACTIVATION CASCADE OF CASPASES
CC       RESPONSIBLE FOR THE FAS-RECEPTOR
CC       MEDIATED (CD95) AND TNFR-1 INDUCED CELL
CC       DEATH. BINDING TO THE ADAPTOR MOLECULE
CC       FADD RECRUITS IT TO EITHER RECEPTORS.
CC       THE RESULTING AGGREGATE CALLED THE
CC       DEATH-INDUCING SIGNALING COMPLEX (DISC)
CC       PERFORMS FLICE/MACH PROTEOLYTIC
CC       ACTIVATION. THE ACTIVE DIMERIC ENZYME IS
CC       THEN LIBERATED FROM THE DISC AND FREE TO
CC       ACTIVATE DOWNSTREAM APOPTOTIC PROTEASES.
CC       PROTEOLYTIC FRAGMENTS OF THE N-TERMINAL
CC       PROPEPTIDE (TERMED CAP3, CAP5 AND CAP6)
CC       ARE LIKELY RETAINED IN THE DISC. CLEAVES
CC       AND ACTIVATES CASPASE-3, -4, -6, -7, -9,
CC       AND -10. MAY PARTICIPATE IN THE GRANZYME B
CC       APOPTOTIC PATHWAYS. PROTEOLYTICALLY
CC       CLEAVES POLY(ADP-RIBOSE) POLYMERASE (PARP).
CC       HYDROLYZES THE SMALL-MOLECULE SUBSTRATE,
CC       AC-ASP-GLU-VAL-ASP-|-AMC. LIKELY TARGET
CC       FOR THE COWPOX VIRUS CRMA DEATH INHIBITORY
CC       PROTEIN.
CC   -!- SUBUNIT: HETERODIMER OF A 18 KDA (P18)
CC       AND A 10 KDA (P10) SUBUNIT. INTERACTS WITH
CC       CFLAR.
CC   -!- ALTERNATIVE PRODUCTS: 8 ISOFORMS;
CC       1-ALPHA (SHOWN HERE), 2-ALPHA/MCH5-BETA,
CC       3-ALPHA, 4-ALPHA, 1-BETA, 2-BETA, 3-BETA AND
CC       4-BETA; ARE PRODUCED BY ALTERNATIVE
CC       SPLICING.
CC   -!- TISSUE SPECIFICITY: ALPHA 1 AND BETA 1
CC       ISOFORMS ARE EXPRESSED IN A WIDE VARIETY
CC       OF TISSUES. HIGHEST EXPRESSION IN
CC       PERIPHERAL BLOOD LEUKOCYTES, SPLEEN,
CC       THYMUS AND LIVER. BARELY DETECTABLE IN
CC       BRAIN, TESTIS, AND SKELETAL MUSCLE.
CC   -!- PTM: GENERATION OF THE SUBUNITS
CC       REQUIRES ASSOCIATION WITH THE DISC,
CC       WHEREAS ADDITIONAL PROCESSING IS LIKELY
CC       DUE TO THE AUTOCATALYTIC ACTIVITY OF THE
CC       ACTIVATED PROTEASE. GRANZYME B AND
CC       CASPASE-10 CAN BE INVOLVED IN THESE
CC       PROCESSING EVENTS.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE
CC       FAMILY C14; ALSO KNOWN AS THE CASPASE
CC       FAMILY. CONTAINS 2 DEATH EFFECTOR
CC       DOMAINS (DED).
Line code: CC = comments, covering function, presence of subunits and of alternative proteins, tissue specificity, post-translational modifications and similarity.
DR   EMBL; X98172; CAA66853.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; X98173; CAA66854.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; X98174; CAA66855.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PDB; 1QDU; PRELIMINARY. [ExPASy / RCSB]
DR   SWISS-3DIMAGE; ICE8_HUMAN.
DR   InterPro; IPR001875; DED.
DR   Pfam; PF01335; DED; 2.
DR   Pfam; PF00655; ICE_p10; 1.
DR   Pfam; PF00656; ICE_p20; 1.
DR   PROSITE; PS50207; CASPASE_P10; 1.
DR   PROSITE; PS50208; CASPASE_P20; 1.
DR   PROSITE; PS50168; DED; 2.
DR   ProDom [Domain structure / List of seq. sharing at least 1 domain]
DR   BLOCKS; Q14790.
DR   DOMO; Q14790.
DR   PROTOMAP; Q14790.
DR   PRESAGE; Q14790.
DR   DIP; Q14790.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   Hydrolase; Thiol protease; Apoptosis;
KW   Zymogen; Alternative splicing;
KW   3D-structure.
Line codes: DR = database cross-references with accession numbers; KW = keywords.
FT   PROPEP        1    216
FT   CHAIN       217    374       CASPASE-8 SUBUNIT P18.
FT   PROPEP      375    384
FT   CHAIN       385    479       CASPASE-8 SUBUNIT P10.
FT   ACT_SITE    317    317
FT   ACT_SITE    360    360
FT   DOMAIN        2     80       DED 1.
FT   DOMAIN      100    177       DED 2.
FT   VARSPLIC    102    102       R -> RFHFCRMSWAEANSQCQTQSVPFWRRVDHLLIR
FT                                (IN ISOFORM 4 ALPHA).
FT   VARSPLIC                     MISSING (IN ISOFORM 2 ALPHA,
FT                                ISOFORM 4 ALPHA AND ISOFORM 4 BETA).
FT   CONFLICT    285    285       D -> H (IN REF. 3 AND 5).
FT   CONFLICT    294    294       E -> D (IN REF. 4).
Line code: FT = feature table, giving subunits (CHAIN), active site positions (ACT_SITE), and variants and sequence errors (VARSPLIC, CONFLICT).
SQ   SEQUENCE   479 AA;  55391 MW;  7A5FEAA6B39B582F CRC64;
     MDFSRNLYDI GEQLDSEDLA SLKFLSLDYI PQRKQEPIKD ALMLFQRLQE KRMLEESNLS
     FLKELLFRIN RLDLLITYLN TRKEEMEREL QTPGRAQISA YRVMLYQISE EVSRSELRSF
     KFLLQEEISK CKLDDDMNLL DIFIEMEKRV ILGEGKLDIL KRVCAQINKS LLKIINDYEE
     FSKERSSSLE GSPDEFSNGE ELCGVMTISD SPREQDSESQ TLDKVYQMKS KPRGYCLIIN
     NHNFAKAREK VPKLHSIRDR NGTHLDAGAL TTTFEELHFE IKPHDDCTVE QIYEILKIYQ
     LMDHSNMDCF ICCILSHGDK GIIYGTDGQE APIYELTSQF TGLKCPSLAG KPKVFFIQAC
     QGDNYQKGIP VETDSEEQPY LEMDLSSPQT RYIPDEADFL LGMATVNNCV SYRNPAEGTW
     YIQSLCQSLR ERCPRGDDIL TILTEVNYEV SNKDDKKNMG KQMPQPTFTL RKKLVFPSD
//
The same file in a Web-oriented view via SWISS-PROT

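The two-letter line codes make a flat file like this easy to process by program. The following is a minimal sketch, not an official parser: it assumes the layout shown in the example (line code in columns 1-2, content from column 6, sequence lines after SQ, `//` as terminator), and the names `parse_flat_file` and `SAMPLE` are hypothetical. Only an excerpt of the entry's lines is embedded.

```python
from collections import defaultdict

# Excerpt of the ICE8_HUMAN entry shown above (not the complete entry).
SAMPLE = """\
ID   ICE8_HUMAN     STANDARD;      PRT;   479 AA.
AC   Q14790; Q14791; Q14792; Q14793; Q14794;
AC   Q14795; Q14796; Q15780; Q15806; Q9UQ81;
AC   O14676;
GN   CASP8 OR MCH5.
FT   CHAIN       217    374       CASPASE-8 SUBUNIT P18.
FT   CHAIN       385    479       CASPASE-8 SUBUNIT P10.
SQ   SEQUENCE   479 AA;  55391 MW;  7A5FEAA6B39B582F CRC64;
     MDFSRNLYDI GEQLDSEDLA SLKFLSLDYI PQRKQEPIKD ALMLFQRLQE KRMLEESNLS
     FLKELLFRIN RLDLLITYLN TRKEEMEREL QTPGRAQISA YRVMLYQISE EVSRSELRSF
     KFLLQEEISK CKLDDDMNLL DIFIEMEKRV ILGEGKLDIL KRVCAQINKS LLKIINDYEE
     FSKERSSSLE GSPDEFSNGE ELCGVMTISD SPREQDSESQ TLDKVYQMKS KPRGYCLIIN
     NHNFAKAREK VPKLHSIRDR NGTHLDAGAL TTTFEELHFE IKPHDDCTVE QIYEILKIYQ
     LMDHSNMDCF ICCILSHGDK GIIYGTDGQE APIYELTSQF TGLKCPSLAG KPKVFFIQAC
     QGDNYQKGIP VETDSEEQPY LEMDLSSPQT RYIPDEADFL LGMATVNNCV SYRNPAEGTW
     YIQSLCQSLR ERCPRGDDIL TILTEVNYEV SNKDDKKNMG KQMPQPTFTL RKKLVFPSD
//
"""

def parse_flat_file(text):
    """Group line contents by line code and collect the sequence."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:                    # sequence lines carry no line code
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
        if code == "SQ":
            in_sequence = True
    return dict(fields), "".join(sequence_parts)

fields, sequence = parse_flat_file(SAMPLE)

id_tokens = fields["ID"][0].split()
entry_name = id_tokens[0]                  # e.g. ICE8_HUMAN
declared_length = int(id_tokens[-2])       # the number before "AA."

# Accession numbers are ';'-separated and may span several AC lines.
accessions = [a.strip() for ac in fields["AC"] for a in ac.split(";") if a.strip()]

# CHAIN features give 1-based, inclusive start and end positions.
chains = {}
for feature in fields["FT"]:
    key, start, end, description = feature.split(None, 3)
    if key == "CHAIN":
        chains[description] = sequence[int(start) - 1:int(end)]

print(entry_name, accessions[0], declared_length, len(sequence))
for description, subsequence in chains.items():
    print(description, len(subsequence))
```

Comparing `declared_length` with `len(sequence)` is a cheap consistency check, and slicing by the CHAIN coordinates recovers the p18 and p10 subunit sequences described in the feature table.
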
TrEMBL database
• Designed as a supplement to SWISS-PROT
• Provides translations of all coding sequences
• Consists of 2 sections:
  • SP-TrEMBL, with entries that will be incorporated into SWISS-PROT after annotation
  • REM-TrEMBL, with entries that are not destined to be included in SWISS-PROT (synthetic sequences, conceptual translations,…)
• In this way the new entries do not compromise the quality of SWISS-PROT

NRL-3D database
• Contains only protein sequences extracted from the Brookhaven Protein Databank (PDB), but includes:
  • bibliographic references and MEDLINE cross-references
  • secondary structure information
  • active and binding sites, modifications in the sequence
  • details on experimental method, resolution, R-factor,…

Composite protein sequence databases
• Render sequence searching more efficient
• Answer the question of which primary database is 'best' (the most up-to-date? which database to use?…)

Some of the composite protein sequence databases available, with their sources
• NRDB: PDB, SWISS-PROT, PIR, GenPept, SWISS-PROT update, GenPept update
• OWL: SWISS-PROT, PIR, GenBank, NRL-3D
• MIPSX: PIR, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
• SP+TrEMBL: SWISS-PROT, TrEMBL

NRDB
• NRDB (Non-Redundant Database) is built locally at the NCBI.
• It is a composite of:
  - GenPept (GenBank CDS translations)
  - PDB sequences
  - SWISS-PROT update (updates of SWISS-PROT)
  - PIR
  - GenPept update (daily updates of GenPept)
• NRDB is not prone to errors.
• NRDB is the database of the BLAST services.

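The idea of a composite, non-redundant database can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: redundancy is taken to mean exact sequence identity, sources are merged in a fixed priority order, and the function name `build_non_redundant`, the entry identifiers and the toy sequences are all hypothetical. The slides do not specify the criteria actually used to build NRDB.

```python
def build_non_redundant(sources):
    """Merge (database name, entries) pairs, keeping the first copy of each sequence.

    sources is an ordered list; databases listed earlier take priority.
    Each entry is an (entry_id, sequence) pair.
    """
    kept = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries:
            key = sequence.upper()          # compare sequences case-insensitively
            if key not in kept:             # exact-identity check only
                kept[key] = (db_name, entry_id)
    return kept

# Toy data, invented for this example.
toy_sources = [
    ("SWISS-PROT", [("sp1", "MKTAYIAK"), ("sp2", "MDFSRNLYDI")]),
    ("PIR", [("pir1", "MKTAYIAK"), ("pir2", "MSTNPKPQ")]),
    ("GenPept", [("gp1", "mdfsrnlydi"), ("gp2", "MKTAYIAR")]),
]

composite = build_non_redundant(toy_sources)
for sequence, (db_name, entry_id) in composite.items():
    print(db_name, entry_id, sequence)
```

With exact matching, a sequence that differs from another by a single residue (`gp2` versus `sp1` here) is retained as a separate entry, so a composite built this way removes identical copies but not near-duplicates.
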
OWL
• Non-redundant protein sequence database.
• Built at the University of Leeds in collaboration with the Daresbury Laboratory in Warrington.
• Composite of:
  - SWISS-PROT
  - PIR
  - GenBank
  - NRL-3D

MIPSX
• Merged database produced at the Max Planck Institute in Martinsried (Martinsried Institute for Protein Sequences).
• Composite of:
  - PIR
  - MIPSOwn
  - MIPSTrn
  - MIPSH
  - PIRMOD
  - NRL-3D
  - SWISS-PROT
  - EMTrans
  - GBTrans

SWISS-PROT + TrEMBL
• Database constructed at the EBI.
• Composite of SWISS-PROT and TrEMBL.
• Minimally redundant.
• SRS is used to retrieve the information.