Functional Genomics of Plant Phosphorylation

Bioinformatics Michael Gribskov Douglas W. Smith

Bioinformatics • Develop a plant phosphorylation database • Completely classify and annotate plant protein kinases and phosphatases • Develop data models and implement data handling procedures for experimental subprojects

Bioinformatics • Timeline
• Year 1: Implement Plant PP Functional Genomics Database; data design and implementation for knockouts project; classify kinases and phosphatases
• Year 2: Data design and implementation for proteomics; continue knockout acquisition; screen completed genome for additional targets; classify and annotate functional domains
• Year 3: Data design and implementation for interactions; continue knockout and add proteomics data acquisition; classify and annotate functional domains
• Year 4: Complete knockouts; map A. thaliana functional data to other plant genomes; integrate expression data from external sources; continue proteomics and interaction data acquisition
• Year 5: Continue proteomics and interaction data acquisition; update and extend classifications

Experimental Subprojects
• Gene Knockouts: Isolate knockout mutations for all the protein kinases and phosphatases encoded by the Arabidopsis genome
• Proteomics: Create a two-dimensional gel phosphoprotein database; integrate the phosphoprotein database with gene sequences
• Interactions of Signaling Components: Develop three-hybrid and split-hybrid screens for the analysis of plant protein kinases; begin genome-wide screening with individual Arabidopsis protein kinases using the three-hybrid and split-hybrid approaches

Communication • Electronic communication will be used to facilitate interactions between PPP functional genomics personnel • Features of COW • Responses organized by Topics • Topics permit detailed conversations focused on specific PPP issues or projects • Provides archive of conversations • Coordination & management of issues • Text and HTML possible: Web links to data

COW Interface and PPP Conference:

Tools and Approaches • Bioinformatics group will draw on a broad range of community experience and resources. • Resources developed at UCSD include • Protein Kinase Resource • Molecular Information Agent • Family Pairwise Search • Molecular Pattern Recognition • Profile/MEME/MAST • DictyDB • INFO

Plant PP Functional Genomics Database • http://www.sdsc.edu/mpr/plant_p • Data definitions in STAR for sequences and features derived from Protein Kinase Resource • Relational database (MySQL) • Currently populated with 850 plant kinases and kinase substrates • Preliminary classification into Hanks and Quinn groups

PKR/PPFGDB Data Model [Diagram of two linked record types, Sequence and Feature. Labels shown: UID; Name, alternates; Properties; Features List (Feature 1 to Feature 4, each with Positions and Ranges); Alignment; Method; Members List (Sequence 1 to Sequence 4, each with Positions and Ranges); Source; URL; Dates.]

PKR/PPFGDB Data Model • STAR data definition language • Used for CIF and mmCIF • Data model for PDB at SDSC/Rutgers/NIST • Sequence and features dictionaries include methods (Perl) to derive data from online sources (SwissProt, PIR, NCBI)
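The sequence/feature data model can be pictured as a small relational schema. The following is only a sketch: the table and column names are hypothetical, and SQLite is used in place of MySQL so that the example runs without a server. The sample rows reuse identifiers and positions from the ABL1_HUMAN feature table shown later in the talk.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical sequence/feature schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (uid TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE feature (uid TEXT PRIMARY KEY, name TEXT NOT NULL);
        -- One row per occurrence of a feature on a sequence, with its position range.
        CREATE TABLE sequence_feature (
            sequence_uid TEXT NOT NULL REFERENCES sequence(uid),
            feature_uid TEXT NOT NULL REFERENCES feature(uid),
            start_pos INTEGER NOT NULL,
            end_pos INTEGER NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO sequence VALUES (?, ?)", ("PKRPSEQ000008", "ABL1_HUMAN"))
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?)",
        [("PKRFEAT85", "SP-SH3"), ("PKRFEAT86", "SP-SH2")],
    )
    conn.executemany(
        "INSERT INTO sequence_feature VALUES (?, ?, ?, ?)",
        [
            ("PKRPSEQ000008", "PKRFEAT85", 61, 121),
            ("PKRPSEQ000008", "PKRFEAT86", 127, 217),
        ],
    )
    return conn


def features_of(conn, sequence_uid):
    """Return (feature name, start, end) rows for one sequence, ordered by start."""
    cursor = conn.execute(
        "SELECT f.name, sf.start_pos, sf.end_pos "
        "FROM sequence_feature sf JOIN feature f ON f.uid = sf.feature_uid "
        "WHERE sf.sequence_uid = ? ORDER BY sf.start_pos",
        (sequence_uid,),
    )
    return cursor.fetchall()


conn = build_demo_db()
print(features_of(conn, "PKRPSEQ000008"))
```

Keeping positions in a separate link table lets one feature list its member sequences and one sequence list its features, which is the two-way relationship the diagram suggests.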

Plant Phosphoprotein Functional Genomics Database
• 850 sequences - all matches to keyword “protein kinase” in Entrez
• 770 kinases
• 80 other (mostly kinase substrates)
• Classification (Hanks and Quinn, 1994)
• AGC (9): 72 total; I: 1, VIII: 46, Other: 25
• CAMK (3): 204 total; I: 177, II: 16, Other: 11
• CMGC (6): 157 total; I: 54, II: 45, III: 35, IV: 9, V: 14
• PTK (23): 23 total
• OPK (14): 297 total; II: 7, IV: 10, VIII: 5, X: 251, XII: 5, Other: 19
• Unassigned: 17
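As a quick consistency check, the classification counts can be summed. The sketch below assumes that each group's first number on the slide is its total and that the numbers after the subgroup labels are the subgroup counts, in order; the dictionary layout is only an illustration.

```python
# Counts transcribed from the classification slide: (group total, subgroup counts).
groups = {
    "AGC": (72, {"I": 1, "VIII": 46, "Other": 25}),
    "CAMK": (204, {"I": 177, "II": 16, "Other": 11}),
    "CMGC": (157, {"I": 54, "II": 45, "III": 35, "IV": 9, "V": 14}),
    "PTK": (23, {}),
    "OPK": (297, {"II": 7, "IV": 10, "VIII": 5, "X": 251, "XII": 5, "Other": 19}),
    "Unassigned": (17, {}),
}


def check_counts(groups):
    """Return the grand total and the groups whose subgroups do not sum to their total."""
    mismatches = [
        name
        for name, (total, subgroups) in groups.items()
        if subgroups and sum(subgroups.values()) != total
    ]
    grand_total = sum(total for total, _ in groups.values())
    return grand_total, mismatches


grand_total, mismatches = check_counts(groups)
print(grand_total, mismatches)
```

Under this reading every group is internally consistent and the group totals add up to the 770 kinases reported on the same slide.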

Plant Phosphoprotein Functional Genomics Database • Classification • 1449 kinases from PKR • Remove very similar sequences using WU Purge program • 141 probe sequences • Use FPS to calculate P-value of match to kinase, class, and subclass • 141 x 1850 Smith-Waterman (SW) comparisons on Bioccelerator

Family Pairwise Search (FPS) • Combines information from multiple queries • Identifies family membership based on a known panel • Does not require multiple alignment or “training” • Identifies motifs and folds using known panel (SCOP families, PROSITE motifs) • Identifies homologs based on similarity to the entire group of sequences • Not sensitive to spurious matches • Sensitive “product of P-values” statistic and effective family size • Server fps.sdsc.edu
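The "product of P-values" idea can be illustrated with a short calculation. The sketch below uses the standard closed form for the distribution of a product of n independent, uniformly distributed P-values; whether the FPS server uses exactly this form is an assumption, and the effective family size correction mentioned on the slide is not modeled. The function name is hypothetical.

```python
import math


def product_pvalue(pvalues):
    """Combined P-value for the product of n independent uniform P-values.

    Uses P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!, where p is the
    observed product. Independence of the pairwise comparisons is assumed.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one P-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    log_product = sum(math.log(p) for p in pvalues)
    x = -log_product
    term = 1.0   # (-ln p)^0 / 0!
    total = 1.0
    for i in range(1, n):
        term *= x / i   # builds (-ln p)^i / i! incrementally
        total += term
    return math.exp(log_product) * total


# One query compared against a panel of three family members (made-up P-values).
print(product_pvalue([0.01, 0.2, 0.05]))
```

Because the statistic pools evidence from every panel member, a single chance match to one member does not dominate the result, which is consistent with the slide's claim that the method is not sensitive to spurious matches.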

Plant Phosphoprotein Functional Genomics Database • Domains and Features found with Kinase catalytic domains: Fibronectin type-III domain; SH2 domain; SH3 domain; Phorbol-ester and DAG binding domain; Ig-like domain; C2 domain; Pleckstrin homology domain; LIM (lin-11 isl-1 mec-3) domain; DHR domain; Guanylate kinase domain; F5/8 type C (phospholipid-binding) domain; Dimerization and phosphorylation domain; Transmitter domain; Calmodulin-binding domain; Phospholipid binding domain; P21-binding domain; Nuclear localization signal; EGF-like domain; Polo-homology domain; Collagen-like; Death domain; GAP domain; Myosin domain; CBS domain; SAM domain; FHA domain; CUB domain; Actin-binding domain; RII-beta subunit binding domain; P10 binding site domain; DRBM domain; Receiver domain; PAC motif; Zinc-fingers; MADS domain; Metallopeptidase domain; HEAT repeats domain; Leucine-rich repeats; Coiled coil; Many low entropy regions

Protein Kinase Resource • http://www.sdsc.edu/Kinases

Protein Kinase Resource • PKR Goal • Integrate sequence, structure, genetics, function and disease information • How many kinases are there? • Currently about 4500 kinase sequences (approximately 2-fold redundancy) • 50+ three-dimensional structures

Protein Kinase Resource • CAMK II

Protein Kinase Resource • The SH3 (Src Homology 3) domain is a small conserved sequence of about 60 amino acid residues that interacts with proline-rich peptides to form protein aggregates. • Structurally, the SH3 domain folds as a compact beta-barrel of five to six anti-parallel beta-strands. The hydrophobic beta-strands are connected by hydrophilic loops to form two orthogonal beta-sheets, bringing the amino and carboxyl termini of the domain close to each other. The ligands of SH3 domains are peptides containing a ten-residue consensus sequence, XPXXPPPFXP (where X is any amino acid residue, F is phenylalanine and P is proline). This peptide forms a left-handed polyproline (PPII) helix that lies along the binding site of the SH3 domain, with its prolines interacting with the aromatic residues on the hydrophobic face of the SH3 domain. • Functionally, the SH3 domain is involved in cell-cell communication and signal transduction from the cell surface to the nucleus. It acts as part of an adapter molecule and recruits downstream proteins in a signalling pathway. For example, in the eye development pathway in Drosophila (the Sevenless pathway), a ligand from the R8 cell, Boss (Bride of Sevenless), binds two molecules of the Sev (Sevenless) receptor on the surface of the R7 cell. This binding dimerizes the receptors, which are protein tyrosine kinases, so they are brought close to one another and can transphosphorylate each other.
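The ten-residue ligand consensus can be turned into a simple pattern search. This is a minimal sketch that reads the consensus XPXXPPPFXP literally (X as any residue); real SH3 ligands are more variable than this pattern allows, and the function name and test peptide are made up for illustration.

```python
import re

# Literal reading of the slide's consensus XPXXPPPFXP, with X = any residue.
# The lookahead lets overlapping matches be reported.
SH3_LIGAND = re.compile(r"(?=(.P..PPPF.P))")


def find_sh3_ligand_sites(sequence):
    """Return (1-based start, matched peptide) for every match in the sequence."""
    return [(m.start() + 1, m.group(1)) for m in SH3_LIGAND.finditer(sequence.upper())]


# Made-up peptide containing one site that fits the consensus.
print(find_sh3_ligand_sites("MKAPLSPPPFGPRT"))
```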

STAR Formatted Data A Sample Data Block Using the Sequence Dictionary

STAR Data Block (Sequence)
data_ABL1_HUMAN
loop_
_sequence.id
_sequence.type
_sequence.name
_sequence.date_create
_sequence.update_sequence
_sequence.update_annotation
_sequence.synonym
_sequence.citation
_sequence.length
_sequence.mol_weight
_sequence.sequence
PKRPSEQ000008 PROTEIN ABL1_HUMAN 1986-07-21 1990-04-01 1997-02-01 'ABL1, ABL'
;
RN [1]
RP SEQUENCE FROM N.A.
RC TISSUE=FIBROBLAST;
RX MEDLINE; 90082420. [, Geneva]
RA FAINSTEIN E., EINAT M., GOKKEL E., MARCELLE C., CROCE C.M.,
RA GALE R.P., CANAANI E.;
RL ONCOGENE 4:1477-1481(1989).
;
1130 122944
;
MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSE
NDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVN
SLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTAS
DGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERT
DITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
LLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFI
;

STAR Data Block (uid)
loop_
_uid.id
_uid.status
_uid.date_create
_uid.update
_uid.keywords
_uid.description_short
_uid.description_long
PKRPSEQ000008 ACTIVE 1997-08-4 1997-08-4
;
TRANSFERASE, TYROSINE-PROTEIN KINASE, PROTO-ONCOGENE, ATP-BINDING,
PHOSPHORYLATION, SH2 DOMAIN, SH3 DOMAIN, CHROMOSOMAL TRANSLOCATION,
3D-STRUCTURE, ALTERNATIVE SPLICING
;
;
PROTO-ONCOGENE TYROSINE-PROTEIN KINASE ABL (EC ) (P150) (C-ABL)
;
;
CC -!- CATALYTIC ACTIVITY: ATP + A PROTEIN TYROSINE = ADP +
CC PROTEIN TYROSINE PHOSPHATE.
CC -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC -!- TISSUE SPECIFICITY: WIDELY EXPRESSED.
CC -!- DISEASE: PARTICIPATES IN A T(9;22)(Q34;Q11) CHROMOSOMAL
CC TRANSLOCATION THAT PRODUCES A BCR-ABL ONCOGENE RESPONSIBLE FOR
CC CHRONIC MYELOID LEUKEMIA (CML), ACUTE MYELOID LEUKEMIA (AML), AND
CC ACUTE LYMPHOBLASTIC LEUKEMIA (ALL).
CC -!- ALTERNATIVE PRODUCTS: TWO FORMS, IA AND IB HAVE ALTERNATIVE AMINO
CC TERMINI.
CC -!- SIMILARITY: TO OTHER PROTEIN-TYROSINE KINASES IN THE CATALYTIC
CC DOMAIN. BELONGS TO THE ABL SUBFAMILY.
CC -!- SIMILARITY: CONTAINS A COPY EACH OF THE SH2 AND SH3 DOMAINS.
;

STAR Data Block (xref)
_xref.id PKRPSEQ000008
loop_
_xref.dbname
_xref.primary
_xref.acc
_xref.name
_xref.date
_xref.type
_xref.source_db
_xref.source_accession
_xref.comment
SWISS yes P00519 ABL1_HUMAN 1997-02-01 PROTEIN . . 'blah blah'
EMBL no X16416 . 1997-02-01 NUCLEIC SWISS P00519 .
EMBL no M14752 . 1997-02-01 NUCLEIC SWISS P00519 .
EMBL no S69223 . 1997-02-01 NUCLEIC SWISS P00519 .
PIR no A25582 TVHUA 1997-02-01 PROTEIN SWISS P00519 .
PDB no 1AB2 1AB2 1997-02-01 STRUCTURE SWISS P00519 .
PDB no 1ABL 1ABL 1997-02-01 STRUCTURE SWISS P00519 .
OMIM no 189980 . 1997-02-01 NUCLEIC SWISS P00519 .

STAR Data Block (feature table)
_feature_table.id PKRPSEQ000008
loop_
_feature_table.feature_id
_feature_table.feature_name
_feature_table.feature_location
_feature_table.feature_type
_feature_table.feature_source
PKRFEAT85 SP-SH3 61-121 DOMAIN PRIMARY
PKRFEAT86 SP-SH2 127-217 DOMAIN PRIMARY
PKRFEAT87 SP-PROTEIN_KINASE 242-493 DOMAIN PRIMARY
PKRFEAT88 SP-NUCLEAR_LOCALIZATION 605-609 DOMAIN PRIMARY
PKRFEAT89 SP-PRO-RICH 782-1019 DOMAIN PRIMARY
PKRFEAT90 SP-ATP 248-256 SITE PRIMARY
PKRFEAT91 SP-ATP 271-271 SITE PRIMARY
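A loop_ block like the feature table above is easy to read into records. The following sketch handles only the simple case (whitespace-separated or single-quoted values, one loop per block) and does not support semicolon-delimited multi-line text fields; it is not the Perl tooling mentioned in the talk, and the function name is hypothetical.

```python
import shlex


def parse_star_loop(text):
    """Parse one simple STAR loop_ into a list of dicts keyed by data name.

    Tokens before "loop_" are ignored. Data names are the consecutive tokens
    after "loop_" that start with "_"; the remaining tokens are values, read
    row by row.
    """
    tokens = shlex.split(text)  # respects 'single quoted' values
    position = tokens.index("loop_") + 1
    names = []
    while position < len(tokens) and tokens[position].startswith("_"):
        names.append(tokens[position])
        position += 1
    values = tokens[position:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("loop values do not fill complete rows")
    width = len(names)
    return [dict(zip(names, values[i:i + width])) for i in range(0, len(values), width)]


FEATURE_TABLE = """
_feature_table.id PKRPSEQ000008
loop_
_feature_table.feature_id _feature_table.feature_name
_feature_table.feature_location _feature_table.feature_type
_feature_table.feature_source
PKRFEAT85 SP-SH3 61-121 DOMAIN PRIMARY
PKRFEAT86 SP-SH2 127-217 DOMAIN PRIMARY
PKRFEAT87 SP-PROTEIN_KINASE 242-493 DOMAIN PRIMARY
PKRFEAT88 SP-NUCLEAR_LOCALIZATION 605-609 DOMAIN PRIMARY
PKRFEAT89 SP-PRO-RICH 782-1019 DOMAIN PRIMARY
PKRFEAT90 SP-ATP 248-256 SITE PRIMARY
PKRFEAT91 SP-ATP 271-271 SITE PRIMARY
"""

for row in parse_star_loop(FEATURE_TABLE):
    print(row["_feature_table.feature_name"], row["_feature_table.feature_location"])
```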

Molecular Information Agent Transparently Linking Electronic Data Resources

Molecular Information Agent • MIA - Molecular Information Agent • Web resources often change URLs • Web resources often change available services • Laboratory scientists may only use services occasionally • May not know what services are available • May not know how to efficiently use them • Want quick answers (where possible)

Molecular Information Agent • MIA - Molecular Information Agent • Search for linked information starting from a single item • Automatic location and validation of web links • All links are visited and the presence of a usable page confirmed • Useful information extracted and used as a basis of further searches • Basis of sequence/structure/information crosslinking in MSD

Molecular Information Agent [Architecture diagram. Labels shown: Query; Query Manager; Template; Resource; Parser; Keywords; Synopsis & Reference.]

[Diagram: network of web resources linked through shared identifiers.]
• Identifier-style labels: PDB_CODE, SP_ID, SP_Name, PIR_ID, PIR_ENTRY, GI_Nuc, GI_PROT, NCBI_UID, NCBI_TAXONOMY, EMBL_UID, GENE_SYM, EC_PRECISE, EC_UNPRECISE, PATH_MAP_NUM, PRODOC
• Resource labels: PDB, PDB Report, PDBSum, MMDB, 3Dee, ProMotif, ProCheck, Protein Motions, Macromolecular File, STING, GRASS, ColumbiaPic, IMB JenaImage, Swiss 3dImage, SEView, DSSP, CATH, FSSP, SCOP, TOPS, Swiss-Prot, SwissModel, Swiss2D PAGE, PeptideMass, ProtScale, Compute pI/MW, ProfileScan, ProtPattern, ProDom, ProtoMap, DOMO, PROSITE, Scan PROSITE, ProDoc, ProClass, PIR, OMIM, WIT, ENZYME, BiochemicalPathways, NCBI, NCBITaxonomy, BLAST, Medline, GeneCards, EMBL, GDB, MGI, MouseGenome DB

MIA - Selected Data Resources • Categories shown: Sequence, Structure, Motifs/Domains, Other • Resources shown: MMDB, OMIM, TGD, CATH, PIR, NCBI, WIT, DSSP, PDB, HSSP, Medline, EMBL, SCOP, FSSP, EC Enzyme, SWISS_PROT, PROTOMAP, DOMO, PROSITE, PRODOM

Molecular Information Agent • Limit Searches • Do not requery resources • Time-outs and availability issues
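The search strategy described for MIA (start from one item, follow links, reuse what is found as the basis of further searches, never requery, tolerate time-outs) resembles a bounded breadth-first traversal. The sketch below is an assumption about the general shape of such an agent, not MIA's actual implementation; `crawl_links`, `demo_fetch`, and the in-memory link table are hypothetical stand-ins for real web queries, with identifiers borrowed from the ABL1_HUMAN cross-reference block.

```python
from collections import deque


def crawl_links(start, fetch_links, max_depth=2):
    """Breadth-first search over cross-references, starting from a single item.

    fetch_links(item) returns the items linked from `item`, or raises
    TimeoutError/LookupError when the resource is slow or unavailable.
    Each item is queried at most once, and items at max_depth are not expanded.
    """
    visited = {start}
    found, failed = [], []
    queue = deque([(start, 0)])
    while queue:
        item, depth = queue.popleft()
        if depth >= max_depth:
            continue  # limit the search
        try:
            links = fetch_links(item)
        except (TimeoutError, LookupError):
            failed.append(item)  # note the failure and move on
            continue
        for link in links:
            if link not in visited:  # do not requery resources
                visited.add(link)
                found.append(link)
                queue.append((link, depth + 1))
    return found, failed


# Hypothetical link table standing in for live resources.
DEMO_LINKS = {
    ("SP_ID", "P00519"): [("PDB_CODE", "1AB2"), ("EMBL_UID", "X16416")],
    ("PDB_CODE", "1AB2"): [("SP_ID", "P00519")],
}


def demo_fetch(item):
    if item == ("EMBL_UID", "X16416"):
        raise TimeoutError("simulated time-out")
    return DEMO_LINKS.get(item, [])


found, failed = crawl_links(("SP_ID", "P00519"), demo_fetch, max_depth=2)
print(found, failed)
```

The visited set is what prevents the Swiss-Prot entry from being fetched again when the PDB entry links back to it.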

MIA - Molecular Information Agent • Simplifies finding and using information • Search for all linked information beginning from a single item • Automatic location and validation of web links • All links are visited and the presence of a usable page confirmed • Useful information extracted and used as a basis of further searches • Basis of sequence/structure/information crosslinking in PDB

Query: Gene Name • Protein Sequences • DNA Sequences • Motifs and Domains

Motifs and Domains • Literature References • Physical Parameters • Images, Genes, Taxonomy ...

Macromolecular Pattern Recognition Profile Analysis, MEME & MAST

Molecular Pattern Recognition [Figure: Genes → Sequences → Aligned Sequences → Motifs, illustrated with the sequence fragments below; the original alignment spacing is not recoverable.]
FKEAFSLFDKDGDGTITTKELGTVMRSL
FKEAFSLFDKDGDGCITTKELGTVMRSL
IREAFRVFDKDGNGYISAAELRHVMTNL
IKAIIQKADANKDGKIDREEFMKLIKS.
IDAIIKKADGNNDGKIRVQEFVKMIESS
FNKAFELYDQDGDGYIDENELDALLKDL

Molecular Pattern Recognition • Motif Learning/Description
• Profile: Analytical calculation of motif description using finite mixture model; learning is fast (seconds on a workstation); database search is slow (hours on a workstation, about 90 sec on Compugen Bio-XLP)
• MEME: Unsupervised learning using expectation maximization; learning is quadratic (typically minutes on T3E); searching a database is linear (seconds on a workstation)

Profile Analysis • Describes protein structural and sequence motifs using a position-specific scoring matrix and position-specific gap penalties calculated from a sequence or multiple sequence alignment • Evolutionary Profile - sequence information • Structural Profile - structural information • Evolutionary/Structural Profile - sequence and structure information
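A position-specific scoring matrix can be sketched in a few lines. The example below is a deliberately simplified stand-in: it uses plain column frequencies with pseudocounts and a uniform background, not the evolutionary mixture model of the talk, and it omits position-specific gap penalties. The function names and the three aligned fragments are illustrative only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from equal-length aligned sequences.

    Uses a uniform background frequency; gap or unknown characters are skipped.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment if seq[column] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm, sequence):
    """Slide the matrix along the sequence without gaps; return (best score, 1-based start)."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(aa not in AMINO_ACIDS for aa in window):
            continue
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


# Three short aligned fragments used purely as an illustration.
pssm = build_pssm(["DKDGDG", "DKDGNG", "DQDGDG"])
print(best_window(pssm, "AAADKDGDGAAA"))
```

Scoring every window in every database sequence is what makes the search step expensive relative to building the matrix, in line with the timings quoted for profile searches.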

Evolutionary Profile • Mixture distribution using a biologically relevant model • Explicit evolutionary model for each aligned column • Sequences weighted for similarity • Find the group of preferred residues at each position • Weight mixture components by probability of observed data given the model distribution

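Evolutionary Profile • Table columns: Anc (w), PAM; entries grouped by alignment position (grouping inferred from the flattened table)
• Position 1: A (0.61) PAM 1; T (0.17) PAM 64; S (0.14) PAM 64
• Position 2: E (0.76) PAM 1; D (0.63) PAM 16; Q (0.28) PAM 64; N (0.16) PAM 128
• Position 3: D (0.84) PAM 1; E (0.55) PAM 32; N (0.36) PAM 32; Q (0.09) PAM 128
• Position 4: L (0.75) PAM 32; M (0.36) PAM 128; I (0.31) PAM 64; V (0.30) PAM 64
• Position 5: V (0.53) PAM 32; I (0.31) PAM 128; T (0.16) PAM 64; A (0.15) PAM 64; M (0.09) PAM 256; L (0.08) PAM 256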
