Automated Discovery in Biological Sciences

Erika Timar

Presentation Transcript


  1. Automated Discovery in Biological Sciences Erika Timar

  2. Central Dogma of Molecular Biology

  3. Definition of Gene • The fundamental physical and functional unit of heredity, responsible for specific traits such as eye color • A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule) • A segment of DNA that is involved in producing a polypeptide chain; it can include regions preceding and following the coding DNA as well as introns between the exons • The functional unit of DNA (deoxyribonucleic acid). Genes are segments of chromosomes found in the nucleus of cells. This hereditary information usually directs the formation of a protein • A natural unit of the hereditary material, which is the physical basis for the transmission of the characteristics of living organisms from one generation to another

  4. Causes of the rising need for Computational Discovery in Biology • Recent technologies have caused an exponential explosion in the information available • PCR • Microarrays

  5. Microarray • GREEN represents Control DNA, from normal tissue, which is hybridized to the target DNA. • RED represents Sample DNA, from diseased tissue, which is hybridized to the target DNA. • YELLOW represents a combination of Control and Sample DNA, where both hybridized equally to the target DNA. • BLACK represents areas where neither the Control nor the Sample DNA hybridized to the target DNA.
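The color coding on this slide can be read as a rule over the two channel intensities of a spot. The sketch below is only an illustration: the function name `classify_spot` and the `background` and `fold` thresholds are assumptions, not values from the presentation.

```python
import math

def classify_spot(red, green, background=100.0, fold=2.0):
    """Label a microarray spot from its red (sample) and green (control) intensities.

    `background` and `fold` are illustrative thresholds, not values from the slides.
    """
    # Neither channel rises above background: nothing hybridized to the target.
    if red < background and green < background:
        return "BLACK"
    # Log2 ratio of sample to control; the +1 offset avoids log(0).
    log_ratio = math.log2((red + 1.0) / (green + 1.0))
    if log_ratio >= math.log2(fold):
        return "RED"     # sample (diseased tissue) DNA dominates
    if log_ratio <= -math.log2(fold):
        return "GREEN"   # control (normal tissue) DNA dominates
    return "YELLOW"      # both hybridized about equally
```

For instance, `classify_spot(5000, 400)` falls in the RED branch because the sample channel is more than twice the control channel.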

  6. Growth in Data

  7. Current Computational Problems in Biology • Gene detection • Gene function • Protein structure • Protein function • Evolutionary relationships • Biomolecular pathways

  8. Probabilistic Methods • Identifying gene modules and gathering information from them • Module Networks of regulation • Conditional expression modules in cancer • Evolutionarily conserved networks

  9. Identifying regulatory modules • Goal: to predict the functions of regulators, their targets, and the conditions under which this regulation occurs • Regulatory module: a set of genes that are regulated in concert as a function of the expression level of a small set of regulators • Regulation in biology is diverse

  10. Negative Feedback

  11. Pathways can be highly complex • http://www.biocarta.com/genes/PathwayGeneSearch.asp?geneValue=g

  12. Module Networks: identifying regulatory modules • Input: Segal et al. used a gene expression data set of 2355 genes in 137 arrays, together with a large precompiled set of candidate regulatory genes for Saccharomyces cerevisiae (yeast)

  13. Process • The algorithm searches for a partition of the genes into modules and for a regulation program for each module • Iterative procedure with 2 steps: • Search for a regulation program for each module • Reassign each gene to the module whose proposed model best fits the gene's data • A Bayesian score evaluates the fit, and the Expectation Maximization algorithm searches for the model with the highest score
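The two-step iteration can be sketched in a heavily simplified form. This is not the method of Segal et al.: as a stand-in, each module's "regulation program" is reduced to the mean expression profile of its genes, and fit is measured by squared error rather than a Bayesian score. All function names here are hypothetical.

```python
def mean_profile(profiles):
    """Average expression profile of a module's genes (stand-in for a regulation program)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def squared_error(profile, model):
    """How badly a gene's profile fits a module model (lower is better)."""
    return sum((x - m) ** 2 for x, m in zip(profile, model))

def learn_modules(expression, assignment, max_iters=50):
    """expression: {gene: [value per array]}; assignment: {gene: module id} (initial partition)."""
    assignment = dict(assignment)
    for _ in range(max_iters):
        # Step 1: fit a model for each module from the genes currently assigned to it.
        members = {}
        for gene, module in assignment.items():
            members.setdefault(module, []).append(expression[gene])
        models = {module: mean_profile(p) for module, p in members.items()}
        # Step 2: reassign each gene to the module whose model fits its data best.
        new_assignment = {
            gene: min(models, key=lambda m: squared_error(profile, models[m]))
            for gene, profile in expression.items()
        }
        if new_assignment == assignment:  # no gene moved, so the partition is stable
            break
        assignment = new_assignment
    return assignment
```

Starting from a poor initial partition, the loop alternates the two steps until no gene changes module, which mirrors the hard-assignment flavor of the procedure described on the slide.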

  14. Process Flowchart

  15. Results • From the stress data set, the program inferred 50 modules, which were then evaluated against external data sources to check that the gene products and regulation products were correct • In addition, three hypotheses about uncharacterized regulators were examined and validated by experiments followed by statistical analysis

  16. Cancer • Typical cause: a malfunction in the cell’s regulatory ability • Prevention of cell death • Over-proliferation • Finding similarities gives targets for medications

  17. Conditional activity of expression modules in cancer • Input: a cancer compendium of 1975 microarrays containing 14145 genes and spanning 22 tumor types • Preprocessing: division into gene sets • Process: statistical analysis of gene-set pairs, followed by hierarchical clustering; the resulting clusters are tested for consistency and then refined into modules
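Hierarchical clustering, the step named in the process above, can be sketched generically as follows. This is a plain average-linkage agglomerative procedure on numeric vectors, offered as an assumption about the general technique; the slides do not state which linkage or distance the cancer study used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges): the final clusters as lists of point indices,
    and the merge history, which is the information a dendrogram displays.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges
```

The quadratic search for the closest pair keeps the sketch short; it would be too slow for a compendium of the size quoted on the slide.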

  18. Hierarchical Clustering

  19. Hierarchical Clustering

  20. Results • Identification of 456 modules spanning different processes and functions, including: • Similarities in hematologic tumors and hepatocellular carcinoma • Acute leukemia • Osteoblastic tumors: tumor proliferation and metastasis

  21. Conserved Genetic Modules • Input: 3182 microarrays from humans, flies, worms and yeast • Process: orthologs identified with BLAST to define metagenes • Co-expression of metagene pairs computed statistically • All paired metagenes combined into networks • Results: the network contained 3416 metagenes connected by 22163 expression interactions, which were confirmed through other statistical and laboratory means
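The pairing step can be illustrated with a simplified stand-in. The sketch below links two metagenes when their Pearson correlation is high in every organism for which both have data; the correlation measure, the `threshold` value, and the data layout are all assumptions, since the slide only says that co-expression was "statistically computed".

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)

def coexpression_network(metagenes, threshold=0.9):
    """metagenes: {name: {organism: [expression values]}}. Returns a list of edges."""
    edges = []
    for a, b in combinations(sorted(metagenes), 2):
        shared = set(metagenes[a]) & set(metagenes[b])
        # Keep the pair only if it is co-expressed in every shared organism.
        if shared and all(
            pearson(metagenes[a][org], metagenes[b][org]) >= threshold
            for org in shared
        ):
            edges.append((a, b))
    return edges
```

Requiring agreement across organisms is what makes an edge "conserved" in this toy version; a real analysis would combine evidence across organisms statistically rather than with a hard threshold.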

  22. Protein Function • Approaches • Sequence Classification • Nearest neighbor • Motif (amino acid sequences) • Groups of motifs called fingerprints • Profiles: position scoring based on HMM or MSA • Structural Classification • Tools • Local multiple sequence alignment: MEME • Combinatorial approach

  23. Discovery of Motif-based Protein Function Classifiers • A data-driven approach using machine learning to discover rules for assigning protein sequences to functional families on the basis of the presence or absence of specific motifs or combinations of motifs.

  24. Method • Input: Prosite and MEME protein data, split into training and test sets (80% used to train) • Process: a family of decision tree induction algorithms is used to create a decision tree, which is then translated into rules • Uses a greedy procedure discussed in class • Post-pruning compensates for any overfitting that may have occurred
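A minimal sketch of the idea: grow a tree greedily over motif presence/absence features, then read each root-to-leaf path as a rule. The split criterion used here (ID3-style information gain) is an assumption, post-pruning is left out, and the function names and motif labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(examples, motifs):
    """examples: list of (set of motifs present, family). Returns a nested dict or a family label."""
    labels = [family for _, family in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not motifs:
        return majority

    def gain(motif):
        present = [f for m, f in examples if motif in m]
        absent = [f for m, f in examples if motif not in m]
        if not present or not absent:
            return 0.0  # the motif does not split these examples
        remainder = (len(present) * entropy(present)
                     + len(absent) * entropy(absent)) / len(labels)
        return entropy(labels) - remainder

    # Greedy step: commit to the single most informative motif, never backtrack.
    best = max(motifs, key=gain)
    if gain(best) <= 1e-12:
        return majority
    rest = [m for m in motifs if m != best]
    return {
        "motif": best,
        "present": build_tree([e for e in examples if best in e[0]], rest),
        "absent": build_tree([e for e in examples if best not in e[0]], rest),
    }

def tree_to_rules(tree, conditions=()):
    """Translate each root-to-leaf path into (conditions, family)."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = tree_to_rules(tree["present"], conditions + ((tree["motif"], True),))
    rules += tree_to_rules(tree["absent"], conditions + ((tree["motif"], False),))
    return rules

def classify(tree, motifs_present):
    while isinstance(tree, dict):
        tree = tree["present" if tree["motif"] in motifs_present else "absent"]
    return tree
```

Each rule reads as "if motif X is present (or absent) and ..., assign family F", which is the form of rule the method slide describes.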

  25. Process Flowchart

  26. Results

  27. Results • Results were measured in terms of accuracy, precision and recall • MEME: single-best was better in precision and comparable in accuracy, but worse in recall • Prosite: followed the same pattern as MEME but did not fit as well • MEME-based decision trees outperform Prosite-based ones • Clans outperform single-best motifs • The program could group functionally important structures based on combinations of motifs
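For reference, the three measures named above can be computed from predicted and true family labels as in the sketch below. The function name `evaluate` and the one-versus-rest treatment of a single `positive` family are illustrative choices, not details from the presentation.

```python
def evaluate(predicted, actual, positive):
    """Return (accuracy, precision, recall) for one family treated as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)  # true positives
    fp = sum(1 for p, a in pairs if p == positive and a != positive)  # false positives
    fn = sum(1 for p, a in pairs if p != positive and a == positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of those predicted positive, how many are
    recall = tp / (tp + fn) if tp + fn else 0.0     # of the actual positives, how many were found
    return accuracy, precision, recall
```

A classifier that predicts a family rarely but correctly scores high precision and low recall, which is the trade-off reported for the single-best motif classifiers.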

  28. General Automated Discovery • Goal: develop an autonomous discovery system that peruses large collections of data to find hypotheses that are interesting enough to warrant the expenditure of laboratory resources and subsequent publication. • HAMB: a prototype discovery program with domain-independent heuristics that guide the program’s choice of relationships in data that are potentially interesting

  29. HAMB • An agenda- and justification-based framework • Consists of an agenda of tasks prioritized by their plausibility • RL, an inductive generalization program, generates plausible hypotheses • Each task has justifications, called reasons, and each reason must have a strength • Tasks are performed using heuristics

  30. Algorithm • Discovery cycle: loop (top-level control) (1) calculate the plausibilities of the tasks (2) select the task with the greatest plausibility (3) perform the task • At the end of each iteration of this loop (called a discovery cycle), a stopping condition is checked: • the plausibility of all tasks on the agenda falls below a user-specified threshold • or the number of completed discovery cycles exceeds a user-defined threshold • In addition, deadlocks are looked for and, if found, broken by proceeding to the next most interesting task
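The discovery cycle can be sketched as a small agenda loop. This is not HAMB's implementation: tasks are plain dictionaries, plausibility is assumed to be the sum of a task's reason strengths, `max_cycles` acts as a simple cap on the cycle count, and deadlock handling is omitted. All names are hypothetical.

```python
def plausibility(task):
    # Assumption: a task's plausibility is the sum of the strengths of its reasons.
    return sum(task["reasons"].values())

def discovery_loop(agenda, perform, min_plausibility=0.5, max_cycles=100):
    """agenda: list of task dicts; perform(task) returns a list of newly proposed tasks."""
    agenda = list(agenda)
    performed = []
    cycles = 0
    while agenda:
        # (1) + (2): score every task and select the most plausible one.
        best = max(agenda, key=plausibility)
        # (3): perform it; performing a task may add new tasks to the agenda.
        agenda.remove(best)
        agenda.extend(perform(best))
        performed.append(best["name"])
        cycles += 1
        # End of the discovery cycle: check the two stopping conditions.
        if cycles >= max_cycles:
            break
        if all(plausibility(t) < min_plausibility for t in agenda):
            break
    return performed
```

Because plausibilities are recomputed on every cycle, a task added late with strong reasons can overtake older tasks, which is the point of an agenda prioritized by plausibility.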

  31. X-ray Crystallography • A technique in which the pattern produced by the diffraction of x-rays through the closely spaced lattice of atoms in a crystal is recorded and then analyzed to reveal the nature of that lattice • Image: crystallized DNA micrograph (Davidson/FSU)

  32. Attributes of Macromolecules • The attributes in our augmented dataset include: • macromolecular properties: macromolecule name, macromolecule-class name, and molecular weight; • experimental conditions: pH, temperature, crystallization method, macromolecular concentration, and concentrations of chemical additives in the growth medium; • characteristics of the grown crystal (if any): descriptors of the crystal’s shape (for example, crystal-form and space-groups-description) and its diffraction-limit (which measures how well the crystal diffracts x-rays).

  33. Results

  34. Verification • Some information in categories II and III is not novel. • This is still of interest: some of the discoveries are known techniques in X-ray crystallography, which verifies the discoveries made by HAMB

  35. Heuristics • The general heuristics in HAMB can be divided into three classes: (1) heuristics that select rule-induction targets and other goals worth pursuing, (2) heuristics that keep an item’s properties and relationships sufficiently up-to-date, (3) heuristics that reference domain-specific properties to improve the quality of reported discoveries.

  36. Other applications and evaluations • Another study, carried out on 930 cases of patients in rehabilitation after a medical disability such as stroke or amputation, also showed promising results • An extensive evaluation of the features of HAMB was carried out by Livingston et al. • Domain-independent heuristics and user-modified parameters allow the flexibility needed for biological discovery
