PROGRESS REVIEW Mike Langston’s Research Team

Department of Computer Science
University of Tennessee
with collaborative efforts at
Oak Ridge National Laboratory
November 22, 2005

Team Members in Attendance: Bhavesh Borate, Elissa Chesler, John Eblen, Roumyana Kirova, Mike Langston, Andy Perkins, Yun Zhang
Team Members Absent: Xinxia Peng, Jon Scharff, Josh Steadmon

Mike Langston’s Progress Report, Fall 2005
• Team Changes
    New Students: Belma Ford (GST), Peter Shaw (Australia)
    New Colleagues: Elissa Chesler & Roumyana Kirova (ORNL)
    Graduating Soon: Xinxia Peng (December), Jon Scharff (May)
    Moved Collaborators: Jay Snoddy (Vanderbilt)
• Recent Conferences/Talks
    ACiD (England), Dagstuhl (Germany), COCOON (China), Purdue, Supercomputing (Seattle)
• Upcoming Visits/Talks
    RECOMB WS (San Diego), Texas A&M, Carleton (Canada), Göteborg (Sweden), AICCSA-06 (UAE), ACM SAC (France)
• Support
    NIH (John), ORNL (Yun), Science Alliance (Andy)
    Proposals Outstanding
• Sample Projects
    Eukaryotes: Allergy (Human), Diabetes (Mice), IR (Mice), Neuroscience (Mice), others
    Prokaryotes: Operon (R. palustris), Shock (Shewanella)

Yun Zhang
• Recent conferences/talks
    • Prepared slides for COCOON 2005, China
    • Presented at SC05 (Supercomputing), Seattle
• Upcoming events
    • Cray MTA (Multithreaded Architecture) Workshop, ORNL
• Projects: maximal clique enumeration
    • Comparison of multithreaded implementations on Altix vs. Cray vs. IBM
        • Cray: vectorization of for-loops
    • Implementations on distributed-memory machines
        • Using MPI vs. Global Arrays
        • Load balancing using master/slave vs. peer-to-peer model
    • Comparison of MPI vs. multithreaded implementations

Parallel Clique Enumeration
• Objective
    • Minimize data communication vs. maximize balanced load
• Dynamic load balancing (DLB)
    • Data transfer: peer-to-peer
    • DLB strategies: master/slave vs. peer-to-peer
[Figure: search tree with levels k = 1 through k = 5; a task needed to be transferred from slave1 to slave5]

Clique Enumeration
• Methods to speed up the computation core
    • Bit compression to save memory, with corresponding bitwise operations on compressed bitmaps
[Figure: example graph on vertices a through g; labels: Vertices dense, Cliques sparse, (a, b, c, d)]
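The bitmap idea can be illustrated with a small sketch. It stores each vertex's neighborhood as one Python integer and runs a basic Bron-Kerbosch enumeration using only bitwise AND/OR. This is an illustration under assumptions, not the team's implementation: the function names and the example edge list are hypothetical, and no pivoting, compression scheme, or parallelism is shown.

```python
def build_bitmaps(n, edges):
    """One integer bitmap per vertex: bit j of adj[i] is set when i-j is an edge."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    return adj


def bits(mask):
    """Yield the indices of the set bits in mask, lowest first."""
    while mask:
        low = mask & -mask          # isolate the lowest set bit
        yield low.bit_length() - 1
        mask ^= low


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch on bitmaps."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if p == 0 and x == 0:
            found.append(sorted(bits(r)))
            return
        for v in bits(p):           # iterates over a snapshot of p
            bit = 1 << v
            # Intersecting with adj[v] is a single bitwise AND.
            expand(r | bit, p & adj[v], x & adj[v])
            p &= ~bit
            x |= bit

    expand(0, (1 << len(adj)) - 1, 0)
    return found


# Hypothetical example: vertices a..g numbered 0..6, with a 4-clique on a, b, c, d.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
CLIQUES = maximal_cliques(build_bitmaps(7, EDGES))
```

Each candidate-set update costs one AND over the bitmap, which is where the memory and speed benefit of the bit representation comes from.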

Andy Perkins
Projects
• Low dose
• Allergy
• Shewanella
• HRT

Microarray Data • Normalization • Filtering low or unchanging expression values • Control spots

Differential Analysis
• Cliquification
    • Present in a large percentage of cliques in one group and in few cliques in the other
• Expression
    • 2-fold change in expression between groups
• Correlation
    • Correlation value >= 0.85 in one group and <= 0.25 in the other

Differential Analysis
Red edge: >= 0.85 in dose and <= 0.25 in control
Blue edge: >= 0.85 in control and <= 0.25 in dose
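The red/blue edge rule can be sketched as follows. This is a minimal illustration, assuming Pearson correlation on signed values (the slides do not say whether absolute values are used) and assuming genes with unchanging expression have already been filtered out, since a constant profile would make the correlation undefined. The function names and data layout are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def differential_edges(dose, control, high=0.85, low=0.25):
    """dose, control: {gene: [expression values]} over the same genes.

    Returns (gene, gene, color) triples using the thresholds from the slide.
    """
    edges = []
    for g, h in combinations(sorted(dose), 2):
        r_dose = pearson(dose[g], dose[h])
        r_ctrl = pearson(control[g], control[h])
        if r_dose >= high and r_ctrl <= low:
            edges.append((g, h, "red"))    # correlated in dose only
        elif r_ctrl >= high and r_dose <= low:
            edges.append((g, h, "blue"))   # correlated in control only
    return edges
```

Pairs that are strongly correlated in both groups, or in neither, produce no edge.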

Other research • Thresholding • Pearson’s vs Spearman’s • Random graphs

Papers
• “Computational Analysis of Mass Spectrometry Data Using Novel Combinatorial Methods,” Proceedings, ACS/IEEE International Conference on Computer Systems and Applications, Dubai, United Arab Emirates, March 2006, with A. Fadiel, M. A. Langston, F. Naftolin, X. Peng, P. Pevsner, H. S. Taylor, O. Tuncalp, and D. Vitello.
• “Innovative Computational Methods for Transcriptomic Data Analysis,” Proceedings, ACM Symposium on Applied Computing, Dijon, France, April 2006, with M. A. Langston, A. M. Saxton, J. A. Scharff, and B. H. Voy.

John Eblen
Clique Analysis Tool Chain
• Projects
    • Gerling Data – NOD mice
    • Shewanella Data
• Three Interesting Problems
    • Aggregating Maximal Cliques
    • Thresholding
    • Biological Analysis of Clique Results

Aggregating Maximal Cliques
• The Problem
    • A great deal of overlap among maximal cliques
    • Many cliques differ by only a few nodes
• Solutions
    • Paraclique (Dr. Langston)
    • Nucleated Clique (Jon Scharff)
    • Clique Difference or “Nonoverlap”
    • Others
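One way to picture the paraclique idea is the sketch below. The slides do not define the procedure, so this rests on an assumption: starting from a clique, a vertex is absorbed when it is adjacent to all but at most `glom` members of the current set. The parameter name `glom`, the function name, and the adjacency layout are hypothetical, and the team's actual rule may differ.

```python
def paraclique(adj, clique, glom=1):
    """Grow `clique` into a denser-than-average cluster.

    adj: {vertex: set of neighbors}
    A vertex joins when it misses at most `glom` edges to the current members
    (an assumed rule, used here only for illustration).
    """
    members = set(clique)
    grew = True
    while grew:
        grew = False
        for v in sorted(set(adj) - members):
            missing = len(members - adj[v])   # members v is NOT adjacent to
            if missing <= glom:
                members.add(v)
                grew = True
    return members
```

With `glom=0` nothing can be added to a maximal clique, so the parameter controls how much near-clique overlap is merged into one cluster.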

Direct Maximum Clique • Parallel version scales well on Altix supercomputer, shared memory machines • Currently working on base serial code efficiency • Ultimate goal is speed • Best algorithm possible • Smart implementation(s)

Keller 7 Conjecture
• Goal is to find, or prove the nonexistence of, a 128-clique in the Keller 7 graph
• Current approach
    • Found a set of 128 nonoverlapping independent sets (ISs)
    • Currently searching for more
    • Should greatly reduce the search space
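For readers unfamiliar with the graph, here is a small sketch that builds a Keller graph from the standard definition: vertices are n-tuples over {0, 1, 2, 3}, and two vertices are adjacent when they differ in at least two coordinates and differ by exactly 2 (mod 4) in at least one. The function names are hypothetical, and the brute-force edge listing is only practical for small n; dimension 7 has 4^7 = 16384 vertices and would need a more careful representation.

```python
from itertools import combinations, product


def keller_adjacent(u, v):
    """Adjacency test for two Keller-graph vertices (tuples over 0..3)."""
    diffs = [(a - b) % 4 for a, b in zip(u, v)]
    differing = sum(d != 0 for d in diffs)
    return differing >= 2 and 2 in diffs


def keller_graph(n):
    """Return (vertices, edges) of the dimension-n Keller graph by brute force."""
    vertices = list(product(range(4), repeat=n))
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if keller_adjacent(u, v)]
    return vertices, edges


# Small case for checking: dimension 2 has 16 vertices and 40 edges.
VERTICES_2, EDGES_2 = keller_graph(2)
```

A clique of size 2^n in the dimension-n graph is the target of the search, which matches the 128-clique stated for dimension 7.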

Bhavesh Borate
Thresholding
• GO Pairwise Similarity Analysis
• Percentage of Cliques with Biological Meaning at each threshold
• Confidence Intervals
• Graph Properties (Edge Density, Maximal Cliques, Maximum Clique)
• Spectral Graph Theory
• Bayesian Statistics
• Control Spot Threshold verification
• Utilization of Info from Pathway Databases
• Combinatorial Strategy
• Kentucky Windage ;)

[Figure: graph of GO pairwise scores vs. correlation values, Shewanella data]

GO Pairwise Similarity Analysis
• For each pair of genes, find the GO category X that covers both genes and has the minimum total number of genes
• Compute a GO score for each pair of genes
• Bin the pairs by correlation value: 1, 0.99, 0.98, …, 0
• Average the GO scores of the pairs in each bin
• Plot
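The steps above can be sketched as follows. Two things are assumptions rather than facts from the slides: the GO score is taken to be simply the size of the smallest shared category (the slides do not give the scoring formula), and pairs are binned by the absolute correlation rounded to two decimals. Category names and function names are hypothetical.

```python
from collections import defaultdict


def go_pair_score(g, h, categories):
    """Size of the smallest category containing both genes, or None if none does.

    categories: {category name: set of member genes}
    """
    sizes = [len(members) for members in categories.values()
             if g in members and h in members]
    return min(sizes) if sizes else None


def binned_go_scores(pairs, categories):
    """pairs: iterable of (gene, gene, correlation).

    Returns {bin: mean score}, with bins at 0.01 resolution, highest bin first.
    Pairs with no shared category are skipped (an assumed policy).
    """
    bins = defaultdict(list)
    for g, h, r in pairs:
        score = go_pair_score(g, h, categories)
        if score is not None:
            bins[round(abs(r), 2)].append(score)
    return {b: sum(s) / len(s) for b, s in sorted(bins.items(), reverse=True)}
```

The returned dictionary is what would be plotted: bin value on one axis, mean GO score on the other. Under this scoring, a smaller mean indicates that highly correlated pairs share more specific categories.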

Pairwise Scores
• Score for each clique
• Get p-value for each clique
• For each threshold 0.80, 0.81, …, 0.95 (0.8:0.01:0.95)
    • At each threshold, calculate the % of cliques with p-value < 0.01

Updates from Xinxia • Kevin was born in May • Defended in October • Graduating in December • Working on publications • Starting a job in December Thank you all and Keep in touch!

Suman Duvvuru
Data analysis
• Effect of strain: currently working on Dr. Brynn’s mouse strain data, writing SAS code to see which strain produces strong correlation in the data.
• The problem with microarray data
    • The number of variables is much higher than the number of observations. This causes many eigenvalues of the covariance matrix to be 0, so the correlation matrix is problematic.
• Can be corrected using
    • Shrinkage-based correlation
    • Information-criteria-based methods (using smooth covariance estimators)
    • (Implementation of these methods currently in progress)
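A toy example may help show why more variables than observations is a problem and what shrinkage does. The sketch below is an assumption-laden illustration: it shrinks the sample correlation matrix toward the identity with a fixed, hand-picked intensity `lam`, whereas a real shrinkage estimator would choose the intensity from the data. All function names are hypothetical, and the determinant helper only handles the 3x3 case used here.

```python
from math import sqrt


def correlation_matrix(rows):
    """rows: list of observations, each a list of p variable values."""
    n, p = len(rows), len(rows[0])
    cols = [[row[j] for row in rows] for j in range(p)]
    dev = [[x - sum(c) / n for x in c] for c in cols]
    norm = [sqrt(sum(d * d for d in dv)) for dv in dev]
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
             for j in range(p)] for i in range(p)]


def shrink(corr, lam):
    """Shrink toward the identity: (1 - lam) * R off the diagonal, 1 on it."""
    p = len(corr)
    return [[1.0 if i == j else (1 - lam) * corr[i][j] for j in range(p)]
            for i in range(p)]


def det3(m):
    """Determinant of a 3x3 matrix (enough for this toy example)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Two observations of three variables: every sample correlation is +1 or -1,
# so the sample correlation matrix is singular.
ROWS = [[1, 2, 3], [2, 4, 1]]
R = correlation_matrix(ROWS)
R_SHRUNK = shrink(R, 0.2)   # lam = 0.2 is an arbitrary illustrative choice
```

Because the shrunk matrix equals (1 - lam) R + lam I, each eigenvalue e of R becomes (1 - lam) e + lam, which is at least lam, so the zero eigenvalues are lifted and the matrix becomes invertible.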

Roumyana Kirova
mRNA expressions and Linkage
Gene expression data: N genes, K strains
Probe    BXD1   BXD2   BXD3   BXD5   BXD8   …
1        4.46   5.30   5.80   5.51   4.90   …
2        4.10   4.49   4.24   4.06   4.46   …
3        5.15   4.74   5.04   6.10   5.20   …
4        6.45   6.03   5.79   6.56   7.32   …
5        4.06   5.06   4.35   4.09   4.09   …
…
12000    4.16   4.06   5.37   5.28   5.31   …
Polymorphisms
Marker   BXD1   BXD2   BXD3
M1       AA     aa     AA
M2       AA     AA     AA
M3       aa     AA     AA
M4       AA     aa     AA
M5       AA     AA     aa
…
M3000    aa     aa     AA

Model: QTL mapping

[Figure: diagram of Model 1 and Model 2; labels: Expressions (0.46, 0.30, 0.80, 1.51, 0.90), Paraclique1, Paraclique2, Clique 2, Regulatory, C1, C2]

[Figure: panels QTL Model 1 and QTL Model 2; labels: Principal components, Paraclique1, Paraclique2, C1, C2]

[Figure: panels QTL Model 1 and QTL Model 2; labels: Principal components, Meta component, Common QTL]

Open questions:
• How stable are the paracliques and QTL models if we choose different samples (not the average of the replicates)?
    • Generate samples of the data by randomly choosing replicates, and build confidence intervals.
    • Fit a multi-variance model: Expression ~ Strain + Sample + Strain:Sample
    • Add covariates to the QTL model to adjust for the gender effect.
• Power issues: how many strains, replicates, and terms in the model?
    • Simulate expression data and calculate power as a function of the sample size.
• Parametric vs. non-parametric analysis.
• Multiple-test adjustments.
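The replicate-resampling idea in the first question could look roughly like the sketch below. It is an illustration under assumptions: the statistic is the Pearson correlation between two genes across strains, one replicate is drawn per strain in each resample (the same replicate index for both genes, as if they came from the same array), and the interval is a simple percentile interval. The function names, data layout, and defaults are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def replicate_ci(gene_a, gene_b, n_resamples=1000, alpha=0.05, seed=0):
    """gene_a, gene_b: {strain: [replicate values]} with matching replicates.

    Returns a (low, high) percentile interval for the correlation across strains.
    """
    rng = random.Random(seed)           # seeded so runs are repeatable
    strains = sorted(gene_a)
    stats = []
    for _ in range(n_resamples):
        # Pick one replicate per strain instead of averaging the replicates.
        picks = [rng.randrange(len(gene_a[s])) for s in strains]
        x = [gene_a[s][k] for s, k in zip(strains, picks)]
        y = [gene_b[s][k] for s, k in zip(strains, picks)]
        stats.append(pearson(x, y))
    stats.sort()
    k = int(n_resamples * alpha / 2)    # number of values trimmed from each tail
    return stats[k], stats[n_resamples - 1 - k]
```

A wide interval would suggest that edges near the correlation threshold, and hence the paracliques built from them, depend heavily on which replicates are used.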

Ontological Discovery for Ethanol Research
(…the new acronym stinks)
Elissa J. Chesler
Department of Anatomy and Neurobiology
Center for Genomics and Bioinformatics
University of Tennessee-Memphis Health Science Center

Ontological Discovery for Ethanol Research
SPECIFIC AIMS
• Aim 1: To develop a data archive of ethanol, brain and behavior related gene sets that have been derived both empirically and through literature review.
• Aim 2: To develop a tool that allows cross-species, cross-molecule type gene set comparison.
• Aim 3: To develop a Web interface to the data archive and analysis that is aimed toward behavioral neuroscientists.
[Figure: The Seizure Related Phenotype Landscape; labels: Cocaine & PTZ, Audiogenic, EtOH Withdrawal, ATPases, T4 Pressure. Highly related phenotypes share many common mRNA correlates.]

Ontological Discovery from Phenotype-Centered Gene Sets
Phenotypes are operationally defined, based on phenomenology.
Gene sets can be empirically associated with phenotypes.
But what “IS” the underlying construct, really? Can we identify it by examining shared biological substrates of related processes?
ERGO: Ethanol Related Gene Ontology

Aim 1: Gene set assembly and archive
• Attributes of each gene set include:
    • Type (mRNA, lit, protein)
    • Species
    • Free text description
    • Structured description, e.g. MPO
    • Source DB (GO, KEGG, WebQTL100)
    • Associated document (e.g. abstract, publication)
• Gene set is broadly defined.
    • mRNA differential expression
    • mRNA correlation
    • Literature review
    • KO, mutants with trait effects
• Search
    • by gene
    • by descriptor
    • by set matching

Aim 2: Analytic tool
• Translates gene sets to a common reference species via homology.
• Similar to existing tools, but archives more information about each gene set.
• Allows multiple set comparisons (intersection analyses are not limited to two sets).
• Percent positive matching allows estimation of the relation of gene sets without specific regard to the identity of the genes. This provides a basis for clustering phenotypes based on gene annotation.
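The comparison steps above can be sketched in a few lines. This is a hedged illustration, not the tool's design: it assumes that "positive matching" is the Jaccard coefficient that the Future Directions slide mentions, that homology is available as a simple lookup table, and that genes without a homolog are dropped. All function names and gene identifiers in any usage are hypothetical.

```python
def translate(gene_set, homology):
    """Map genes to reference-species identifiers via a homology table.

    homology: {source gene: reference gene}. Genes with no entry are dropped,
    which is an assumed policy; the slides leave this case open.
    """
    return {homology[g] for g in gene_set if g in homology}


def jaccard(a, b):
    """Shared genes divided by all genes in either set (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_genes(*gene_sets):
    """Intersection of any number of gene sets, not limited to two."""
    return set.intersection(*gene_sets) if gene_sets else set()


def similarity_matrix(named_sets):
    """Pairwise Jaccard scores for {name: gene set}, as a nested dict."""
    names = sorted(named_sets)
    return {p: {q: jaccard(named_sets[p], named_sets[q]) for q in names}
            for p in names}
```

The matrix from `similarity_matrix` depends only on overlap sizes, so phenotypes can be clustered from it without inspecting which genes are shared.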

GeneKeyDB can be used to generate translation tables across species

Aim 3: Behavioral-Neuroscience-Friendly Interface
• Does the world need another boutique?
• Making genomics accessible to the broader research community.
• Text searching to retrieve, e.g., all gene sets related to ‘stress’.
• Text mining
• Apparatus-specific details
• OUR GOAL IS TO CREATE A TOOL FOR PHENOTYPIC ANALYSIS; GENES CAN BE A BLACK BOX THAT GETS US THERE!

Future Directions: Bleeding Edge
From a matrix of set-set correlations estimated by Jaccard’s positive match, can we draw and analyze graphs of gene set relations?
From a set of documents associated with overlapping gene sets, can we mine text for frequently occurring terms? E.g., to answer “What term occurs most commonly in the set of sets extracted by match to expression upregulation in response to handling stress?”

Research challenges
• Translation of genes across species:
    • Homology is not perfect; how do we match when no homologues are found?
• Reference set
    • What is the “reference set” for category representation analysis when gene sets are drawn from diverse sources?
    • Reference sets lack comprehensiveness, e.g. a list of KO mice does not include all genes screened.
• Generation and curation of gene sets: establishing meaningful protocols and definitions to increase quality and utility
    • Use GenMAPP or Stanford models.

Gene set overlap unites diverse phenomena
Induction of a research question: “If I antagonize the gene product of a consumption correlate in socially isolated monkeys, consumption will decrease.”
[Figure: overlapping gene sets, labeled “ontology”: Gene Expression Correlates of Htr1b; Consumption Correlates in RI lines; Upregulated in Social Isolation; Upregulated in P vs NP; Literature on Neuroactive Steroid Synthesis]
“Hey, you put your social isolation in my NP mice! Yeah, well, you put your P mice in my binge drinking!”
