Information: The Language of Biology

Information: The Language of Biology Gary Strong NSF, ITR Program

Cell Human Language Suggestive Biology-Language Homologies

Goals Leading Toward Predictive Biology Gene Sequence Data Gene Identification Structure Prediction Protein Circuit & Regulatory Network Discovery Biosimulation

Natural Language Processing and Bioinformatics are Already Related • Both NL and biology are faced with data mining over massive amounts of data • Applying NLP tools to biology • Hidden Markov gene finders • Protein grammars to predict function from sequence • Protein circuit extraction from scientific literature • Convergence of biological and language mining • PSI Blast homology searching in genome augmented by medical literature

Model-based GENSCAN Is Best Among HMM Gene Finders Sn = Sensitivity Sp = Specificity Ac = Approximate Correlation ME = Missing Exons WE = Wrong Exons GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html

The Chomsky Hierarchy Language Automaton Grammar Recognition Dependency Biology Recursively Enumerable Languages Turing Machine Unrestricted BaaA Undecidable Arbitrary Unknown Context- Sensitive Languages Linear-Bounded Context-Sensitive AtaA NP-Complete Crossing Pseudoknots, etc. Context- Free Languages Pushdown (stack) Context-Free SgSc Polynomial Nested Orthodox 2o Structure Regular Languages Finite-State Machine Regular AcA Linear Strictly Local Central Dogma From D. Searls

Mildly CSG’s for Structure Modeling • Tree Adjunct Grammars (TAG) have been applied to modeling RNA secondary structures including pseudoknots. • An efficient parsing algorithm for this grammar was developed, and applied to some computational problems concerning RNA secondary structures. • Further, a (-1) frame shift grammar is constructed based on a biological observation that a (-1) frame shift might be caused from some structural features of RNA sequences. • The proposed grammar was used to find candidate sequences for (-1) frame shift in Human spumaretrovirus gag and pol genes. Yasuo UEMURA et al.

Phenylalanine tRNA of Yeast secondary structure tertiary structure Complementary base-pair interactions are shown by long gray bars. Complementary base-pair interactions are shown by short gray bars. Unusual base-pair interactions shown in red (which are crossed dependencies) lead to complex structures common in proteins

Structural similarity of unusual base-pair interactions to long-distance relations in NL

Data Mining for Proteins and Biological Function • Regulatory pathway research is growing exponentially, generating huge amounts of information. • “Cell cycle” and “apoptosis” produced 169,293 and 29,961 hits respectively on PubMed. • Going through articles manually is prohibitively time-consuming. • Investigators can benefit from systematic compilation and integration of 1000’s of pieces of discrete information on signal transduction. • Comparisons across species can be useful for cross validation and as an aid in hypothesis generation about yet undiscovered genes. • Automatic techniques can be applied to the literature with enormous benefit - qualitative model construction, for example. ... staurosporine activates a JNK isoform ... ... STAT4 is not activated by IL-2 ... automatic information extraction ACTIVATION staurosporine + a JNK isoformIL-2 not STAT4

Some chemical reactions in a cell: As many as 10,000 proteins act as regulatory enzymes in mammals.

Including Biological Literature Improves Homology Search Jeffrey T Chang, Soumya Raychaudhuri, and Russ B Altman Stanford Medical Informatics Stanford University

NLP Approaches to Biological Data Modeling • Gene Identification: • Identification of exons, introns, and non-coding regions (approximately 10 exons per gene and 30,000 genes from Human Genome data) • Sequence analysis and annotation of homologies, e.g. sequences that lead to similar 3D structural features that have implications for activity of the protein • Evolutionary relationships across species • Structure prediction: • Context sensitive grammars to predict structure from coding sequences • Potential for modeling dynamic proteins, such as those that exhibit antigenic variability • Computational models of protein evolution, e.g. for structure prediction from SNP sequences • Combinatorial models for non-coding DNA role in regulatory circuits, such as timing of gene co-expression • Circuit Discovery: • Tools for assigning meaning to protein interaction patterns • Template extraction of protein-protein interactions within cells from scientific literature • Construction of signal transduction networks • Prediction of function effects from network disruption

Important Programmatic Dimensions from DARPA NLP Program POV

Components of a Successful Program • Data need to be shared across research community, implying annotation standards. • Objective, community-wide evaluations on common tasks are needed to gauge progress. • Technology transfer is not just an end-product of research, but a constant driver. • Large, challenge problems bring government resources from multiple agencies.

Annotation Standards and Tools

Corpus of Annotations Annotation Annotation Annotation Content Content Content Region Region Region Signal ATLAS Annotation Framework • Associate derived information (annotation “content”) with some region of a signal

Current database projects, such as Cancer Genome Anatomy Project (CGAP) of NCI, rely on unifying data via common UNIGENE ID’s for EST’s. Need to unify data via annotation of genome itself.

Community Evaluations

Source: Pallett, D. Garofolo, J. and Fiscus, J. (NIST) Measurements in Support of Research Accomplishments. Forthcoming, Feb 2000. Communications of the ACM: Special Section on Broadcast News Understanding. Speech Recognition: Progress Research results for some foreign languages lag English, but -- given sufficient resources -- would catch up Difficulty increase with: - Vocabulary size - Style Read vs. Careful vs. Conversational The speech community has consistently reduced word error rate on successively harder problems, as measured by government-funded evaluations

Progress in Information Extraction Names in English Names from audio @ 0%15% word error Names in JapaneseNames in Chinese Relations Question Answering Event extraction There have been many evaluations of natural language components...

There is an established evaluation community focused on structure prediction

Technology Transfer

HLS Human Language Systems Program SLS Program 94-93 95 Dragon Nuance Corporation TIPSTER Project Lockheed Martin TRVS Project 96 IBM Communicator Program 97 98 AT&T OASIS Project 99 BNN 00 Human language technology innovations have launched many transitions

TREC participants from around the world DARPA evaluations draw global participants.

“National Challenge” Problems

Workshop on Language Modeling of Biological Data, Feb. 26-27, 2001 @UPenn • STRUCTURE OF GENES AND GENOMES • How much do strongly model-based, "syntactic" algorithms enhance gene identification and genome characterization? Can general-purpose or domain-specific parsing methods find application in genome analysis? • STRUCTURE OF MACROMOLECULES • What is the practical utility of non-regular stochastic grammars in recognizing RNA secondary structures? Can stronger (as compared to CFG) formal systems be useful in protein structural studies? • PATTERN SEARCH AND ANALYSIS • What do stochastic methods add to sequence search and analysis? Are there uses for recent statistical linguistic methods? Do linguistic methods apply to the analysis of regulatory regions? • INFERENCES FROM GENOMES • Are there lessons from comparative linguistics for comparative genomics? How does phylogenetic reconstruction resemble classical linguistics? How can multiple genomes be used to infer phylogenetic relationships, protein interactions, etc? Are there techniques from linguistics and/or machine learning that might bear on the analysis of gene expression? • DEVELOPMENTS IN PARSING • What are some of the recent developments in statistical parsing in computational linguistics that may be of relevance to CB? • STRONG MODELS OF GRAMMARS (Grammars with structured primitives) • What are some of the structural linguistic aspects motivating the strong models? Computational and stochastic implications. Relevance to topological structures in CB. Are there technologies that could be successfully adopted, patterned after the success of HMM's in biology?

Some workshop recommendations on areas of special emphasis • Tool kit of techniques: dimensionality reduction; discriminative vs. generative models and their combinations; lexicalization; clustering algorithms. • Formal language theory of macromolecules. • Protein fold recognition, comparative modeling, and structural predictions, in general; overlap of search strategies and global energy optimization in structure prediction. • Evolutionary models (“Tree of Life?”).

A Specific Challenge Problem • Although the prion gene sequence does not differ between individuals of a species, the shape it encodes can differ. • Transmissible SE can be transmitted from cattle to humans, crossing the species barrier. • Certain prion forms have an enhanced ability to cross the species barrier, such as from cattle to humans (Nature, 410:162). Can we predict proteins’ multiple, stable conformations and ways to inhibit or promote them?

Conclusions • There are strong parallels between language and biological data affording the development and use of common tools. • Some language technology tools have been highly successful • Other language technology tools are promising • New field of computational sequence analysis (computational biolinguistics, sequence-to-structure technology, …) at intersection? • DARPA-style program elements may be appropriate for bioinformatics programs. • Annotation standards and tools • Community evaluations on common data and tasks • Technology transfer intrinsic to the program • National challenge problems exist and can unify Federal support

Information: The Language of Biology

Information: The Language of Biology

Presentation Transcript

DNA: The Carrier of Genetic Information

Fundamentals in Sequence Analysis 1.(part 1)

Natural Language Processing Applications

PSYCHOLOGY OF LANGUAGE

Chapter 8: Language and Society

Graphs and Graph Theory in Computational Biology

Chinese Language Intermediate 1 Lifestyle/Education and Work

Animal Behavior

Genes, Proteins and Literature Searching

About Plant Biology

Machine Reading

Biology 323 Human Anatomy for Biology Majors Lecture 1 Dr. Stuart S. Sumida

DNA AND THE LANGUAGE OF LIFE

Language

Teaching English Language Learners

Biology EOCT Review

Towards a Top-Level Ontology for Molecular Biology

Albania

NCBI Molecular Biology Resources

Computational Methods in Systems Biology

Higher Biology

BIOLOGY EOC REVIEW