Oracle10g Search Features for Biological Discovery Gap

Can New Oracle10g Search Features Help Bridge the Biological Discovery Gap? Marcel Davidson Head of Data Management Jake Y. Chen, Ph.D. Head of Computational Proteomics & Principle Bioinformatics Scientist

Messages • New Informatics Challenges in Protein Interactomics R&D • Scale, integration, discovery issues • A data-driven discovery-oriented framework • “Enabling” Features in 10g? • Biological data integration? • Biological data analysis integration?

Outline • Data-driven Discovery-oriented Computational Framework • 10g Regular Expression Case Studies • 10g BLAST Case Studies

Why Myriad Maps Protein-Protein Interactions Conventional Drug Discovery Post-Genomic Drug Discovery GPCR enzyme Nucleus Nucleus hormone receptor novel, more specific targets non-specific targets novel, druggable targets target validation enhanced pre-validation target pool lead discovery, optimization

Principle of the Yeast Two-Hybrid (Y2H) System Scenario A: Human Proteins X and Y do Interact Activation Domain Prey Human Protein Y Human Protein X Reporter mRNA Reporter mRNA Readout: Yeast colonies grow Reporter mRNA DNA Binding Domain Bait Reporter Gene DNA Scenario B: Human Proteins X and Z do not Interact Prey Activation Domain Human Protein Z Human Protein X Readout: No growth of yeast colonies ( No Reporter Gene Activity ) DNA Binding Domain Bait Reporter Gene DNA

Data Collected from Y2H System Perform BLAST Against Human REFSEQ DB

Protein Interaction Network (Snapshot of ~8,000 interactions)

>1.5million sequence fragments • ~250,000 search experiments performed • several TB data storage Experimental Measurement ? • ~1000 relevant interactions for each interested pathways DistilledInformation • 1-10 novel drug targets per disease Marketable Knowledge Knowledge Discovery (KD) Challenges • ~80,000 unique interactions • ~100 biological data sources Protein Interaction Data $$$, drugs, … • Data-driven • Discovery-oriented

Data Cleansing • Statistical Data Analysis • Domain-specific • Data Modeling Genomics/Functional Genomics Data Reduce Data Noise Represent Interactions and Pathways • Data Integration • LIMS Programming E-RDBMS Collect raw sequences and lab condition measurements Organize Data in Regulatory Pathways • DB Querying • Visualization Select and Validate Drug targets • Knowledge Curation KD in Interaction-based Proteomics

Bioinformatics DB Framework Y2H Data Processing and Analysis DB Lab_Seq, Seq_Match, Y2H_Mart • Annotation DB • RefSeq, LocusLink, GO, OMIM, CGAP, Protein Kinase DB, GPCR DB, Ensemble, Curation, … • Y2H Interaction Data Mart • Y2h_Mart

A Schema Fragment to Manage Sequence Similarity Results Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems Journal, 28(4), p287-310.

Interaction Matrix using Randomly Ordered Locus IDs • 12,958 unique Interactions • 1955 bait loci • 2766 prey loci Jake Yue Chen, et al (2003) Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.

Oracle10g Regular Expressions: Powerful String Processing • RE new tools in Oracle10g • Search and manipulate data strings of arbitrary complexity • Prior database solutions • SQL LIKE operator • Java stored procedures, C external libraries • Prior non-database solutions: AWK, SED, GREP, PERL, etc. • Done now inside database • Facilitates rapid data-centric analysis

Case1: Retrieving Protein data from SGD (Saccharomyces Genome Database) ORF Identifier Associated Amino Acid Sequence

HTTP Raw Data </script> </head><body><body bgcolor='#FFFFFF'> <table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"> <input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /> </form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table><table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table><p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4 MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK* </pre><hr size="2" width="75%"> <table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"> <input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a> </form></td></tr></table></body></html> Need to parse out embedded AA Sequence

Function to Return AA Sequence Given ORF Parameterized ORF Id Web site URL create or replace function orf2seq ( p_orf in varchar2 ) return varchar2 is v_stream clob; strt number; begin -- Retrieve the HTTP stream: v_stream := httpuritype.getclob(httpuritype.createuri( 'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf|| '&flankl=0&flankr=0&map=p3map') ); -- Trim off the head of the stream: strt := dbms_lob.instr(v_stream, 'Submit', 1, 1); -- Strip out control characters, new lines, etc.: v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', ''); -- Return the AA sequence: return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}')); end; RegExp to remove control chars from HTTP stream RegExp to extract AA sequence

Amino Acid Sequence for ORF ‘YDR099W’ SQL> select orf2seq('YDR099W') from dual; ORF2SEQ('YDR099W') -------------------------------------------------------------------------------- MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQ VELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEI ATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK Elapsed: 00:00:01.24 Elapsed time <2 sec. (network latency) SQL> insert into pseq (orf_id, sequence) 2 values ('YDR099W', orf2seq('YDR099W'));

Case 2: Motif Searching in Proteins TKP TKP motif pattern PROSITE database of protein sequence motifs ID TYR_PHOSPHO_SITE; PATTERN. AC PS00007; DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE). DE Tyrosine kinase phosphorylation site. PA [RK]-x(2,3)-[DE]-x(2,3)-Y. CC /TAXO-RANGE=??E?V; CC /SITE=5,phosphorylation; CC /SKIP-FLAG=TRUE; DO PDOC00007; Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt • TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y. • R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA • Oracle10g Regular Expression Equivalent • [RK].{2,3}[DE].{2,3}[Y] 1 Aspartate or Glutamate 2 – 3 Any 2 – 3 Any 1 Arginine or Lysine 1 Tyrosine

SQL Example: Retrieving all Interacting Proteins with TKP select distinct substr(a.refseq_id, 1, 9) refseq_id, length(a.seq_string_varchar) seq_length, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4 from target_db a, y2h_interaction_p b where a.refseq_id like 'NP%' and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]') and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) = b.prey_refseq) ; Returns first 4 instances of TKP in each sequence Returns all rows with TKP site

SQL Example Output Motif #2 at offset 50 Motif #3 at offset 62 Motif #1 at offset 8 REFSEQ_ID SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS ------------ ---------- ----------- ----------- ----------- ----------- NP_003961 1465 14 202 347 537 NP_003968 330 241 0 0 0 NP_003983 490 8 50 62 93 NP_004001 3562 3085 0 0 0 ... MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFG EDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVG NLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLG KNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVAD FGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHR TRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEER SFHTSRNPSR Motif #4 at offset 93 [RK].{2,3}[DE].{2,3}[Y] Result: 702 (56%) interacting proteins with TKP site

Is 56% TKP in interacting proteins significant? • Random sample test of all NP entries • N = 33 random samples • Sample size 7.4% (~1251) • Sample mean = 515 • SD = 17.2 • Significance level < 1E-30

Similarity Search (Sequence Comparison): A Routine Biology Task k Pair-wise Comparison Results nTarget Sequences A Query Sequence Similarity Search has not been integrated into the DB system.

Using BLAST can be a laborious process & a data-management hell • Custom setup of BLAST target database • Iterate through query sequences: “Batch BLAST” • Export/parse/filter/import data <-> DBMS • Integration of results with external data

Case 1: Oracle 10g BLASTN as a sequence identification tool -- A sequence fragment with a sequence_id = 100 -- Sequence is stored in the query_db table. TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC SELECT t.t_seq_id, t.expect FROM TABLE ( BLASTN_MATCH ( (select sequence FROM query_db where sequence_id = 100), CURSOR (select refseq_id, sequence_string FROM target_db where refseq_id like 'NM_%') ) ) t WHERE t.expect < 1E-20; T_SEQ_ID EXPECT ----------------- -------------- NM_016333.2 0

Case 2: Discovering “Interlogs” Yeast Protein Interactome Human Protein Interactome Homology Mapping A X C B Y Z Interlogs: (A|X, B|Y) and (A|X, B|Z)

A Computational Intensive Task • Data to use • Yeast Protein-Protein Interaction Data • Yeast Protein Sequences • Human Protein-Protein Interaction Data • Human Protein Sequences & Annotations • Analysis to prepare • Homology search: yeast vs. human proteins • Things to consider • Collect/parse public data from web • Import/export data for BLAST • Connect analysis result to internal data  Missing Data  Laborious Traditional way? Or inside DBMS?

Pipelining Missing Data Directly into BLASTP Searches insert into yeast_human_homolog select 'YDR099W‘ Yeast_ORF_name, t.t_seq_id Human_refseq, t.expect E_Value from TABLE ( BLASTP_MATCH ( (SELECT orf2seq ('YDR099W') FROM dual), CURSOR (SELECT refseq_id, sequence_string FROM target_db WHERE refseq_id LIKE 'NP_%') ) ) t WHERE t.expect < 0.0001 ;  BLAST in DBMS  Online Data Integration  BLAST Target DB Customization -- Note: Iterate through Yeast ORF Names to perform batch BLAST.

Mission Impossible: Accomplished SELECT a.orf_1, a.orf_2, b.human_refseq, b.e_value , c.human_refseq, c.e_value FROM yeast_interaction a, yeast_human_homolog b, yeast_human_homolog c, y2h_interaction_p d WHERE a.orf_1 = b.yeast_ORF_name and a.orf_2 = c.yeast_ORF_name and ( (b.human_refseq = d.bait_refseq and c.human_refseq = d.prey_refseq) or (b.human_refseq = d.prey_refseq and c.human_refseq = d.bait_refseq) ) ; ORF_1 ORF_2 HUMAN_REFSEQ E_VALUE HUMAN_REFSEQ E_VALUE ------------------------- ------------------------- --------------- ---------- --------------- ---------- YCR002C YHR107C NP_xxxxx1 5.9279E-44 NP_yyyyy1 3.7130E-46 YCR002C YJR076C NP_xxxxx2 5.9279E-44 NP_yyyyy2 1.7807E-48 YJR076C YHR107C NP_xxxxx3 1.9734E-39 NP_yyyyy3 3.7130E-46 YCR002C YHR107C NP_xxxxx4 2.3257E-48 NP_yyyyy4 7.4988E-39 YCR002C YJR076C NP_xxxxx5 2.3257E-48 NP_yyyyy5 1.9734E-39 YJR076C YHR107C NP_xxxxx6 1.7807E-48 NP_yyyyy6 7.4988E-39

Conclusion • Data-driven discovery-oriented bioinformatics framework demands rich bio-specific DBMS support • 10g Regular Expression and BLAST in DBMS features benefit our scientific discovery tasks in interactome studies • Additional enhancements

References • Jake Yue Chen, et al (2003) Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain.Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA. • Sudhir Sahasrabudhe and Chen, Jake Yue (2003) Extracting Biological Information from System-scale Protein Interactome Data.Tutorial at the 11th International Conference on Intelligent Systems in Molecular Biology. Brisbane, Australia. • Jake Yue Chen and John Carlis (2003) Similar_Join: Extending DBMS with a Bio-specific Operator.Proceedings of the 2003 ACM Symposium on Applied Computing. Melbourne, Florida. • Jake Yue Chen and John Carlis (2003) Genomic Data Modeling.Information Systems, Vol 28, issue 4: p287-310.

Oracle10g Search Features for Biological Discovery Gap