Bioinformatics Research Centre, University of Glasgow
David Gilbert, www.brc.dcs.gla.ac.uk
Department of Computing Science, University of Glasgow
BRC Glasgow
Bioinformatics
• Bio - Molecular Biology
• Informatics - Computer Science
• Bioinformatics - the study of the application of
  - molecular biology, computer science, artificial intelligence, statistics and mathematics
  - to model, organise, understand and discover interesting information associated with the large-scale molecular biology databases,
  - to guide assays for biological experiments.
• (Computational Biology - USA)
Bioinformatics in context - a new discipline?
[Diagram: Maths & Stats, Computing, Physical Sciences, Life Sciences; ?Psychology?]
Bioinformatics in context (applications)
How can we analyse the flood of data?
Data: don't just store it, analyse it!
By comparing sequences, one can find out about things like
• How organisms are related & evolution
• How proteins function
• Population variability
• How diseases occur
Separating sheep from goats...
Dirty data?
Big Horn Sheep [Ovis canadensis]
The Big Horn Sheep [Ovis canadensis] is a large North American species with a brown coat, which turns to bluish-grey in winter. It is so named from the size of the horns of the ram, which often measure over 1 m/3.3 ft round the curve.
Classification: Ovis canadensis is in family Bovidae, order Artiodactyla
Data, information, knowledge …
[Diagram: gene region with control statements, TATA box, start, gene, termination (stop)]
• data: nucleotide sequence
• information: where are the “genes”? Found using a classifier, pattern or rule which has been mined/discovered
• knowledge: facts and rules
  - If a gene X has a weak psi-blast assignment to a function F
  - and that gene is in an expression cluster
  - and sufficient members of that cluster are known to have function F,
  - then believe assignment of F to X.
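The rule on this slide can be written down as a small predicate. The sketch below is only an illustration: the function name, the data structures and the `min_fraction` threshold (one possible reading of "sufficient members") are assumptions, not part of the original slides.

```python
def believe_assignment(gene, function, weak_hits, cluster_of, known_function,
                       min_fraction=0.5):
    """Return True if the slide's rule supports assigning `function` to `gene`.

    weak_hits      : set of (gene, function) pairs with a weak psi-blast hit
    cluster_of     : dict mapping gene -> expression cluster id
    known_function : dict mapping gene -> experimentally known function
    min_fraction   : hypothetical threshold for "sufficient members"
    """
    # Condition 1: gene X has a weak psi-blast assignment to function F.
    if (gene, function) not in weak_hits:
        return False
    # Condition 2: the gene is in an expression cluster.
    cluster = cluster_of.get(gene)
    if cluster is None:
        return False
    # Condition 3: enough other members of that cluster are known to have F.
    members = [g for g, c in cluster_of.items() if c == cluster and g != gene]
    if not members:
        return False
    supporting = sum(1 for g in members if known_function.get(g) == function)
    return supporting / len(members) >= min_fraction


# Toy example (made-up gene names and functions).
weak_hits = {("geneX", "kinase")}
cluster_of = {"geneX": 1, "g1": 1, "g2": 1, "g3": 1, "g4": 2}
known_function = {"g1": "kinase", "g2": "kinase", "g3": "transporter", "g4": "kinase"}
print(believe_assignment("geneX", "kinase", weak_hits, cluster_of, known_function))
```

Raising `min_fraction` makes the rule more conservative; the slide does not say what counts as "sufficient", so the default here is arbitrary.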
Some projects at the Bioinformatics Research Centre
Rat-Mouse-Human
Indexing
Ela Hunt, ela@brc.dcs.gla.ac.uk
• String indexing structures can be used to index DNA, proteins, XML and phylogenetic trees
• All data is read once, index is created on disk
• Index reduces the search space of the query (we read a % of disk only)
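To illustrate how a string index narrows a query to a small part of the data, here is a minimal sketch using an in-memory suffix array. This is a stand-in chosen for brevity: the slide does not say which index structure is used, and the project's index lives on disk. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Read the data once and return the start positions of all suffixes, sorted."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the index so only O(log n) suffixes are compared."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


dna = "GATTACAGATTA"
index = build_suffix_array(dna)
print(find_occurrences(dna, index, "ATTA"))
```

The naive construction above is quadratic in the worst case and is only meant for small examples.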
Distributed databases and computation: Cardiovascular Functional Genomics
• £5.4 million project, 5 UK universities: Glasgow, Leicester, Edinburgh, Oxford, Imperial; + Maastricht
• Led by clinicians
• Combined studies:
  - scientific models of disease (Rat)
  - parallel studies of patients
  - large family and population DNA collections
• 3-pronged approach:
  - Targeted transcript sequencing
  - Microarray gene expression profiling
  - Comparative genome analysis
• Data generated at each of the 5 sites & made available for analysis:
  - Issues of distributed data and computation.
• Mapping gene sequences between Rat, Mouse and Human:
  - an added layer of complexity in the computation.
Wellcome Trust: Cardiovascular Functional Genomics
[Diagram: public curated data and shared data; sites: Glasgow, Edinburgh, Leicester, Oxford, Netherlands, London]
BRIDGES: BioMedical Research Informatics Delivered by Grid Enabled Services
• National e-Science Centre, Bioinformatics Research Centre, IBM UK Life Sciences
• Incrementally develop and explore database integration over 6 geographically distributed research sites within the framework of the large Wellcome Trust biomedical research project Cardiovascular Functional Genomics.
• Three classes of integration will be developed to support a sophisticated bioinformatics infrastructure supporting:
  - data sources (both public and project generated),
  - bioinformatics analysis and visualisation tools,
  - research activities combining shared and private data.
• The inclusion of patient records and animal experiment data means that privacy and access control are particular concerns.
• An exploration of index factories accelerating sequence processing will test the hypothesis that the Grid makes a new class of e-Science indexes feasible. Both OGSA-DAI and IBM DiscoveryLink technology will be employed and a report will identify how each performed in this context.
Functional Genomics
• ~44,000 genes
• ~33% of genes have unknown function
Solution…
Ali Al-Shahib, Chao He, Mark Girolami
• Solve the problem of the twilight zone (sequence alignments below 30% sequence identity)
• How?
• Predict protein function using an alternative method to BLAST:
• Predict protein functional class from sequence, structural and phylogenetic features using machine learning
• Combining these features (computationally and statistically) would give biologists the most accurate functional prediction for proteins that fall in the twilight zone.
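As a small illustration of the 30% threshold mentioned above, the sketch below computes percent identity for two already-aligned sequences and flags pairs that fall in the twilight zone. Definitions of percent identity vary; this one (matches divided by aligned, gap-free columns) and the function names are assumptions made for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when identity is below the slide's 30% threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold


print(percent_identity("AAAA", "AAAT"))
print(in_twilight_zone("ACDEFGHIKL", "AWYVMNPQRS"))
```

Pairs flagged this way are the ones for which the slide proposes machine learning on sequence, structural and phylogenetic features instead of relying on alignment alone.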
Locating genome duplications
Rod Page, r.page@bio.gla.ac.uk
Molecular Evolution: A Phylogenetic Approach
[Figure: gene tree with tips Human, Mouse, Reptiles + Birds, Lungfish, Teleosts, Sharks & Rays, Lamprey; annotation: gene duplication happened somewhere here]
Q: did one or more genome-wide events affect all gene families?
TOPS Protein topology
David Gilbert, Juris Viksna, Gilleain Torrance (BRC, Glasgow), David Westhead and Ioannis Michalopoulos (Leeds)
BBSRC/EPSRC funded
Pattern search: TIM Barrel
Structure comparison 2bop (probe) against (subset of) CATH
TOPS comparison server: www.tops.leeds.ac.uk
[Diagram labels: PDB file; TOPS diagram (graph); pairwise comparison to structures in database; matches to motif library; speed annotations: (v. fast), (slower)]
Protein design
Design of a Novel Globular Protein Fold with Atomic-Level Accuracy
Brian Kuhlman,1 Gautam Dantas,1 Gregory C. Ireton,4 Gabriele Varani,1,2 Barry L. Stoddard,4 David Baker1,3
“A major challenge of computational protein design is the creation of novel proteins with arbitrarily chosen three-dimensional structures. Here, we used a general computational strategy that iterates between sequence design and structure prediction to design a 93-residue α/β protein called Top7 with a novel sequence and topology. Top7 was found experimentally to be folded and extremely stable, and the x-ray crystal structure of Top7 is similar (root mean square deviation equals 1.2 angstroms) to the design model. The ability to design a new protein fold makes possible the exploration of the large regions of the protein universe not yet observed in nature.”
1 Department of Biochemistry, University of Washington, Seattle, WA 98195, USA. 2 Department of Chemistry, University of Washington, Seattle, WA 98195, USA. 3 Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA. 4 Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA
Science. 2003 Nov 21;302(5649):1364-8
Protein design
Generation of starting models.
“The target structure for the de novo design process can range from a detailed backbone model to a back-of-the-envelope sketch.”
“Because we aimed to create a novel protein fold, we selected a topology not present in the PDB according to the Topology of Protein Structure (TOPS) server (17).”
Use of TOPS for protein design
User = bkuhlman@email.unc.edu
Submitted at 20:29:51 on 3/06/03
Structure code = top7a type = PDB (user declared), Database = atlas
Details of sheets etc (including all connected SSEs): Sheet: [6,7,4,1,2]
======================================================
Comparison time : 43 sec
Domain    Code             Rank
top7a     target_query     0
1bbi00    4.10.100.10.1    7
1pi200    4.10.100.10.1    7
1sro00    2.40.29.10.1     7
1atx00    2.20.20.10.1     9
2sh100    2.20.20.10.1     9
1vcc00    3.30.66.10.1     11
1hpm02    3.10.140.10.1    12
1csp00    2.40.50.40.1     13
2snv01    2.40.10.20.3     13
3tss02    2.40.50.50.3     13
1bcpF0    2.40.50.50.2     14
1bovA0    2.40.50.30.2     14
1tle00    2.10.25.10.1     14
1cdb00    2.60.40.10.1     15
1ckmA3    4.10.87.10.1     15
1kxf01    2.40.10.20.3     15
1svpA1    2.40.10.20.3     15
2pkaX0    2.40.10.20.1     15
1apo00    2.10.25.10.6     16
1ate00    2.10.40.10.1     16
1aww00    2.30.30.10.1     16
1cuk01    2.40.50.80.1     16
Top7a NEEheEC 1:2A 1:4A 2:4R 4:6R 4:7A 6:7A 1:4R 4:6R
Use of TOPS for protein design
Systems biology – some definitions
• Systems biology is the study of all the elements in a biological system (all genes, mRNAs, proteins, etc) and their relationships one to another in response to perturbations.
• Systems approaches attempt to study the behaviour of all of the elements in a system and relate these behaviours to the systems or emergent properties
A Framework for Systems Biology (Ideker, Galitski & Hood, 2001)
• Define all of the components of the system
• Systematically perturb and monitor components of the system
• Reconcile the experimentally observed responses with those predicted by the model
• Design and perform new perturbation experiments to distinguish between multiple or competing model hypotheses
New database technologies for storing the output from high-throughput biological experiments
Andrew Jones
• Proteomics – study the set of proteins expressed in a sample
• Complex, variable output:
  - High-resolution images
  - Numerical data generated by lab equipment and software
  - Human annotation
• The data is not suitable for storage in a standard relational database
• Storage, retrieval and exchange of data is important
• XML (Extensible Markup Language) is being investigated for storing such data
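To make the XML idea concrete, the sketch below stores one mixed record (an image reference, a number from lab software, and a human annotation) as XML using Python's standard library. The element and attribute names (`spot`, `image`, `intensity`, `annotation`) and the sample values are invented for illustration; they are not taken from the project or from any proteomics standard.

```python
import xml.etree.ElementTree as ET


def spot_record(spot_id, image_file, intensity, annotation):
    """Build one hypothetical record combining the three kinds of output."""
    spot = ET.Element("spot", id=spot_id)
    ET.SubElement(spot, "image", href=image_file)        # high-resolution image
    ET.SubElement(spot, "intensity").text = str(intensity)  # numerical data
    ET.SubElement(spot, "annotation").text = annotation     # human annotation
    return spot


# Serialise for storage or exchange, then parse it back for retrieval.
xml_text = ET.tostring(
    spot_record("s1", "gel01.tif", 1523.4, "putative kinase"),
    encoding="unicode",
)
parsed = ET.fromstring(xml_text)
print(xml_text)
```

Because each record carries its own structure, fields that only some experiments produce can be added without changing a fixed table schema.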
• Maintained by National Library of Medicine
• Free of charge, since 1997
• > 10 million references since 1971
• > 4000 biomedical journals
• > 80% in English
• > 80% have an abstract
"Biochemical Network Data Mined from Scientific Texts"
Te Ren (PhD student) with CXR Biosciences.
Data complexity: Methionine Biosynthesis in E.coli
[Figure: pathway diagram. Metabolites, in order: L-Aspartate, L-Aspartate-4-P, L-Aspartate semialdehyde, L-Homoserine, alpha-succinyl-L-Homoserine, Cystathionine, Homocysteine, L-Methionine, S-Adenosyl-L-Methionine. Reactions (EC numbers): 2.7.2.4, 1.2.1.11, 1.1.1.3, 2.3.1.46, 4.2.99.9, 4.4.1.8, 2.1.1.13, 2.1.1.14, 2.5.1.6. Enzymes: aspartate kinase II/homoserine dehydrogenase II, aspartate semialdehyde dehydrogenase, homoserine-O-succinyltransferase, cystathionine-gamma-synthase, cystathionine-beta-lyase, cobalamin-independent and cobalamin-dependent homocysteine transmethylase. Genes and regulators: metBL operon, metL, asd, metA, metB, metC, metE, metH, metJ, metR (metR activator), aporepressor, holorepressor. Cofactors: ATP, ADP, NADPH, NADP+, Succinyl SCoA, HSCoA, L-Cysteine, Succinate, H2O, Pyruvate, NH4+, 5-Methyl THF, THF, Pi, PPi. Branches to aspartate, lysine and threonine biosynthesis. Edge types: catalyzes, expression codes for, is part of, represses, inhibits, up-regulates.]
Biochemical networks
• Pathway navigation
• Pathway comparison
• Pathway motif discovery
• Pathway simulation
• High-level abstraction inferred from low-level descriptions
• Novel pathways from gene expression experiments
[Flow diagram, steps in order: DNA chip experiment; transcription profiles; visualization and clustering; clusters of co-regulated genes; functional meaning?; pathway extraction in metabolic reaction graph; putative metabolic pathways; matching against metabolic pathway database; known pathways and novel pathways]
A Software System for Pattern Matching and Motif Discovery in Biochemical Networks
Sebastian Oehm, oehms@dcs.gla.ac.uk
• Design a suitable data model using bipartite graphs
• Define patterns and develop algorithms for pattern matching in biochemical networks
• Define pathway motifs and develop algorithms for motif searching in biochemical networks
• Develop algorithms for automated motif discovery
• Develop algorithms to search for the largest common part of two or more biochemical networks
• Develop a measure of similarity for pathway comparison
[Figure: methionine biosynthesis in S.cerevisiae and E.coli side by side. Shared steps: L-Aspartate, 2.7.2.4, L-aspartyl-4-P, 1.2.1.11, L-aspartic semialdehyde, 1.1.1.3, L-Homoserine; then Homocysteine, 2.1.1.14, L-Methionine, 2.5.1.6, S-Adenosyl-L-Methionine. Between L-Homoserine and Homocysteine the routes differ: E.coli goes via 2.3.1.46, Alpha-succinyl-L-Homoserine, 4.2.99.9, Cystathionine, 4.4.1.8; S.cerevisiae goes via 2.3.1.31, O-acetyl-homoserine, 4.2.99.10.]
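The two pathways in the figure can be used to sketch the bipartite data model and a very simple pathway comparison. This is only an illustration, not the project's algorithms: it assumes that identical labels denote the same compound or reaction in both organisms, takes the "common part" to be the set of shared edges, and uses Jaccard similarity as one possible similarity measure. All function names are hypothetical.

```python
# Each step is (substrate, EC number, product), read off the slide's figure.
SHARED_START = [
    ("L-Aspartate", "2.7.2.4", "L-aspartyl-4-P"),
    ("L-aspartyl-4-P", "1.2.1.11", "L-aspartic semialdehyde"),
    ("L-aspartic semialdehyde", "1.1.1.3", "L-Homoserine"),
]
SHARED_END = [
    ("Homocysteine", "2.1.1.14", "L-Methionine"),
    ("L-Methionine", "2.5.1.6", "S-Adenosyl-L-Methionine"),
]
E_COLI = SHARED_START + [
    ("L-Homoserine", "2.3.1.46", "Alpha-succinyl-L-Homoserine"),
    ("Alpha-succinyl-L-Homoserine", "4.2.99.9", "Cystathionine"),
    ("Cystathionine", "4.4.1.8", "Homocysteine"),
] + SHARED_END
S_CEREVISIAE = SHARED_START + [
    ("L-Homoserine", "2.3.1.31", "O-acetyl-homoserine"),
    ("O-acetyl-homoserine", "4.2.99.10", "Homocysteine"),
] + SHARED_END


def bipartite_edges(steps):
    """Compound nodes connect only to reaction nodes, and vice versa."""
    edges = set()
    for substrate, reaction, product in steps:
        edges.add((substrate, reaction))   # compound -> reaction
        edges.add((reaction, product))     # reaction -> compound
    return edges


def common_part(steps_a, steps_b):
    """Edges present in both networks (label-based, not general subgraph matching)."""
    return bipartite_edges(steps_a) & bipartite_edges(steps_b)


def jaccard_similarity(steps_a, steps_b):
    a, b = bipartite_edges(steps_a), bipartite_edges(steps_b)
    return len(a & b) / len(a | b)


print(len(common_part(E_COLI, S_CEREVISIAE)))
print(jaccard_similarity(E_COLI, S_CEREVISIAE))
```

A general solution would need graph matching that tolerates differing labels and branching reactions; the label-based intersection here only works because both pathways use the same names.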
Biochemical Pathway Simulator
A Software Tool for Simulation & Analysis of Biochemical Networks
Muffy Calder, David Gilbert, Walter Kolch, Keith van Rijsbergen, Brian Ross, Oliver Sturm
DTI ‘Beacon’ project, £0.9M, 4 years
Not a toy problem! Experimental Data Analysis
Complexity: real bioinformatics
Closing the loop from wet lab to in-silico
[Diagram labels: Literature; Lab; Bio Lab/Literature; Text miner; Rules Database; Database (MAPK, Apoptosis); DATA; Abstract model; Simulator; Concurrency theory; Analysis; Pathway Editor; User Interface; Web portal; Bioinformatics tools, database, interface; Human feedback (in-the-loop)]
Proliferation (cell division) vs Differentiation (neurite outgrowth) in PC12 cell model
• NGF (50 ng/ml): differentiation into nerve cell type, neurite outgrowth
• EGF (50 ng/ml): proliferation, cell division stimulated without neurite outgrowth
Dynamic Behaviour of the Network
[Diagram, three panels: Receptor, Ras, Raf-1, MEK1,2, ERK1,2, cell growth; Receptor, cAMP, PKA, Ras, Raf-1, MEK1,2, ERK1,2, growth arrest; Receptor, cAMP, PKA, Ras, B-Raf, Raf-1, MEK1,2, ERK1,2, cell growth]
• Raf-1 is expressed in all cells, and its activation induces ERK activation.
• Many receptors that activate ERK also elevate cAMP levels, leading to activation of PKA. PKA inhibits Raf-1 and blocks ERK activation.
• However, cAMP induces activation of B-Raf. In cells which express B-Raf, cAMP activates the ERK pathway despite Raf-1 inhibition.
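The three cases described on this slide can be checked with a small Boolean model. This is a deliberately crude sketch with on/off states only, not the project's simulator; the function and parameter names are assumptions made for the example.

```python
def erk_active(receptor_on, camp_elevated, expresses_b_raf):
    """Boolean reading of the slide: is ERK1,2 activated?"""
    ras = receptor_on                       # receptor activates Ras
    pka = camp_elevated                     # cAMP activates PKA
    raf1 = ras and not pka                  # PKA inhibits Raf-1
    b_raf = camp_elevated and expresses_b_raf  # cAMP activates B-Raf where expressed
    mek = raf1 or b_raf                     # either Raf can activate MEK1,2
    return mek                              # MEK1,2 activates ERK1,2


def outcome(receptor_on, camp_elevated, expresses_b_raf):
    active = erk_active(receptor_on, camp_elevated, expresses_b_raf)
    return "cell growth" if active else "growth arrest"


print(outcome(True, False, False))  # Raf-1 only, no cAMP
print(outcome(True, True, False))   # cAMP elevated, no B-Raf
print(outcome(True, True, True))    # cAMP elevated, B-Raf expressed
```

A Boolean model ignores concentrations and timing, which is exactly what a quantitative pathway simulator adds.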
Mobility
Sometimes a signal sent in a communications network can change the connections or topology of that network. In the example below, a cell-phone is being carried out of range of Cell 1. The base station must send the frequency of the appropriate new cell (Cell 2) to the phone. The phone connects to Cell 2 and discards its previous link to Cell 1.
[Diagram, before and after handover: Base, Cell 1, Cell 2, with Conversation and Frequency links]
In biochemical networks, a protein can be granted or denied the opportunity to interact with certain other molecules by exchange factors, effectively changing the network topology dynamically. In the example below, the protein Ras is bound to a molecule of GDP, which renders Ras inactive. A molecule of SoS can interact with this Ras-GDP complex, causing the GDP to be exchanged for GTP. The Ras-GTP complex is active, permitting interaction with the protein Raf.
[Diagram: Ras-GDP and SoS; exchange of GDP for GTP; Ras-GTP interacting with Raf]
Reusable Subcomponents of a Solution for Offline Integration of 3rd party Databases
[Diagram labels: Input; Schemas; Local Schemas; MAPK source data; cAMP PK source data; aMaze DB; Extracted Lit. Data; Integrator; Schema Translator; Record Matcher; Record Merger; Record Matching Rules; Conflict Resolution Rules; Default Values; Cross-ref Index; Target Schema; Integrated Database]
• By-products of the total process may correspond to other reusable sub-services
• Schema Translation – various schema definition languages are translated into one common, interpretable schema language.
• Record Matching – builds a cross-reference index that identifies records about the same entity and records the source and location of the matching records. Two or more records may match.
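The record-matching step can be sketched as follows. The matching rule used here (compare names after lower-casing and dropping punctuation) is a placeholder for the slide's "Record Matching Rules", and the source names and records are made up for the example.

```python
from collections import defaultdict


def normalise(name):
    """Hypothetical matching rule: ignore case, spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_cross_reference(sources):
    """Map each entity key to the (source, location) of every matching record.

    sources: dict mapping source name -> list of records (dicts with a "name").
    Only entities found in two or more records are kept.
    """
    index = defaultdict(list)
    for source, records in sources.items():
        for location, record in enumerate(records):
            index[normalise(record["name"])].append((source, location))
    return {key: hits for key, hits in index.items() if len(hits) >= 2}


# Toy input standing in for two third-party databases.
sources = {
    "mapk_db": [{"name": "Raf-1"}, {"name": "MEK1"}],
    "camp_pk_db": [{"name": "RAF 1"}, {"name": "PKA"}],
}
cross_ref = build_cross_reference(sources)
print(cross_ref)
```

The index stores only where the matching records live, so a later Record Merger step can fetch and reconcile them using its own conflict resolution rules.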
Validation: Current Bottlenecks in Drug Development
EMPIRICAL, SLOW, EXPENSIVE
• Drug target discovery: What is a good drug target? How do we select it?
• Drug target validation: Does hitting the target change the biological response?
• Side effects: What else is affected when the selected target is hit?
• Lead Compound Selection: Which compounds should be taken further for development? What properties should the drug have?
Validation: Current Bottlenecks in Drug Development
A robust Pathway Simulation Software can help to …
• Drug target discovery: What is a good drug target? How do we select it?
  - Select targets by defining their topology & function in the regulatory networks.
• Drug target validation: Does hitting the target change the biological response?
  - Validate the target by predicting how the biological response should change.
• Side effects: What else is affected when the selected target is hit?
  - Predict side effects to allow early and targeted testing.
• Lead Compound Selection: Which compounds should be taken further for development? What properties should the drug have?
  - Predict the optimal drug profile to improve selection criteria.
Validation: What we propose …
PC12 cell model of neuronal differentiation
• Target Validation: Predict & test the effect of Raf-1 and B-Raf inhibitors on the biological response to EGF vs. NGF.
• Lead Compound Selection: Predict & test which inhibitory efficacy is necessary and sufficient to achieve the desired biological response.
Bionanotechnology & Bioinformatics
[Diagram labels: Fab methodology; Nanofab & cell culture; Physical substrate; Biochemical environment (other cells + biochemicals); Measured cell behaviour (dynamic behaviour, morphology, adhesion, cell shape, gene expression, proteome); Model of cell behaviour; Bioinformatics; Genetic engineering; External databases; Other pathway data]
Machine Learning for Bioinformatics
• Classification
• Clustering
• Characterisation
• Techniques:
  - ensemble methods
  - decision trees
  - inductive logic programming
  - pattern discovery
  - Statistical approaches
  - SVMs
Cancer Classification Problem (Golub et al 1999)
• AML: acute myeloid leukemia (myeloid precursor)
• ALL: acute lymphoblastic leukemia (lymphoid precursors)
Machine Learning Approach
[Diagram: gene expression profiles (ALL, AML) are fed to a machine learning classifier (C4.5, SVM, k-NN, ANN), which outputs ALL or AML]
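Of the classifiers named on this slide, k-NN is the simplest to sketch. The example below classifies a sample as ALL or AML from its expression profile by majority vote among the nearest training profiles. The three-gene profiles are invented toy values, not data from Golub et al., and the function name is hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training profiles."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy expression profiles (three made-up genes per sample).
TRAINING = [
    ((2.0, 0.1, 0.3), "ALL"),
    ((1.8, 0.2, 0.4), "ALL"),
    ((2.2, 0.0, 0.2), "ALL"),
    ((0.2, 1.9, 1.5), "AML"),
    ((0.1, 2.1, 1.7), "AML"),
    ((0.3, 1.7, 1.6), "AML"),
]

print(knn_predict(TRAINING, (1.9, 0.1, 0.3)))
print(knn_predict(TRAINING, (0.2, 2.0, 1.6)))
```

Real expression data has thousands of genes per sample, so feature selection and cross-validation would be needed before any of the listed classifiers could be compared fairly.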