Computational Molecular Biology

Bin Liu
Intelligent Computing Research Center
TEL: 18038100727
bliu@insun.hit.edu.cn, binliu@hitsz.edu.cn

Before we start
• Course name: Computational Molecular Biology
• Instructor: Bin Liu
• Office hours: by appointment; Office: C303B
• Evaluation: attendance and presentation (30%); projects and report (30%); examination (40%)
• Class hours: 32; Credits: 2
• Audience: master's students in Computer Science and related majors
• Course website: http://bioinformatics.hitsz.edu.cn/course/


Why should we study this course?
• To understand ourselves
• Most biologists don't know computer science, and most computer scientists don't know biology.
• For study: it is very easy to find a position at top universities.
• For jobs: in academia and in industry.

References (not limited to)
• Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics
• Marketa Zvelebil and Jeremy O. Baum, Understanding Bioinformatics
• Carlos Setubal and Joao Meidanis, Introduction to Computational Molecular Biology

Definitions
• "Biology easily has 500 years of exciting problems to work on." -- Donald E. Knuth (高德纳), Professor Emeritus of The Art of Computer Programming at Stanford University
• Names:
  1. Bioinformatics: an interdisciplinary field that develops and improves methods for storing, retrieving, organizing and analyzing biological data. A major activity in bioinformatics is developing software tools that generate useful biological knowledge.
  2. Computational Biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
• Participating fields:
  1. Computer Science: (1) algorithms; (2) AI; (3) databases
  2. Biological Science
  3. Mathematics

Preface
Computational Molecular Biology
• Definition: the development and use of mathematical and computer science techniques to help solve problems in molecular biology.
• Biologists: the creators and ultimate users of the data.
• Mathematicians and computer scientists: needed to cope with the sheer size and complexity of the data.
• Techniques:
  • Databases: new database models to record changes
  • Pattern recognition: to understand molecular sequences
  • AI algorithms

Preface
• History:
  • 1953: the structure of DNA was unraveled ⇒ tremendous advances in molecular biology ⇒
    (1) a huge amount of data (papers and databases such as GenBank, EMBL, DDBJ)
    (2) data that has become more complicated

Preface
[Figure: growth rate comparison between protein sequence data and protein structure data. The two curves, labeled "Protein sequence" and "Protein structure", are unbalanced.]

Preface
Can Biology Help Computing?
• Computational techniques inspired by biology:
  • Neural networks (artificial intelligence)
  • Genetic algorithms
• A new driver of computer science:
  • Better hardware (supercomputers)
  • New data representations
  • New driver for algorithm development
• New theoretical frameworks:
  • DNA computing
  • Network communication, ant colony algorithm (communication between ants)

Preface
This course:
• Presents a representative sample of computational problems in molecular biology (MB)
• Presents efficient algorithms for these problems
Algorithms:
• Definition: a step-by-step procedure that tries to solve a well-defined problem within a bounded amount of time
• Efficient algorithms: they should not take “too long” to solve a problem, even a large one
• E.g., sequence comparison ⇒ Chapter 2

Why does computation work?
• Who invented the first digital computer?
  • Analog signals degrade over time
  • Digital information can be propagated unaltered
  • The cell is a mixture of analog and digital components
• The digital molecules of life
  • DNA: passes genetic information across generations
  • RNA: carries temporary information as messages within the cell
  • Protein: executes molecular processes as dictated by the code
• Properties of each molecule are tailored to its role
  • DNA: highly stable, protected, self-complementary
  • RNA: quickly degraded, single-stranded, mobile
  • Protein: versatile code (20 possible amino acids at each of n positions), complex 3D structure
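The "self-complementary" property of DNA can be made concrete with a small sketch. Treating a strand as a digital string over A, C, G, T, the opposite strand is fully determined by base pairing (A with T, G with C). The function name `reverse_complement` and the example sequences are illustrative assumptions, not part of the course material.

```python
# Minimal sketch: DNA as a digital string whose second strand is
# determined by base pairing (A<->T, C<->G).
# Assumption: input contains only the letters A, C, G, T (any case).

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # Complement each base, then reverse because the strands are antiparallel.
    return dna.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))    # GCAT
    print(reverse_complement("GAATTC"))  # GAATTC (equal to its own reverse complement)
```

Because one strand can be recomputed from the other, either strand serves as a template for copying or repairing the information, which is one way to read the "digital" description above.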

History of bioinformatics
• Dr. Hwa A. Lim is often credited with coining the word “bioinformatics” in 1987; Paulien Hogeweg and Ben Hesper had used the term as early as 1970.


History of bioinformatics: 1950s, the first period
• The base ratios A=T and G=C in DNA were discovered in 1949.
• Pauling and Corey proposed the α-helix and β-sheet structures of proteins in 1951.
• Watson and Crick proposed the structure of DNA in 1953.
• The first bioinformatics meeting was held in the USA in 1956.

History of bioinformatics: 1960s and 1970s, the second period
The basic concept of bioinformatics: sequence comparison.
• Margaret Dayhoff
  • Collected protein family data
  • In the 1970s, proposed the PAM (Point Accepted Mutation) matrices
• Needleman & Wunsch: sequence comparison algorithm, 1970

History of bioinformatics: 1980s
• EMBL, GenBank, DDBJ
• Smith & Waterman: algorithm for local alignments
• Pearson & Lipman: the FASTA tool

History of bioinformatics: 1990s
• Human Genome Project (HGP)
• Other genome projects: Mus musculus (house mouse), C. elegans (nematode), …
• Lipman and colleagues developed the BLAST tool and, later, PSI-BLAST.

Bioinformatics in China
• Research started early, at the end of the 1960s.
• The first bioinformatics center was established in the life science department of Peking University in 1996.

Bioinformatics websites

National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/ Databases, bioinformatics tools and software.

European Bioinformatics Institute (EBI) http://www.ebi.ac.uk/

DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/

Sanger: http://www.sanger.ac.uk Tools.

SIB Swiss Institute of Bioinformatics: http://www.isb-sib.ch/

Peking University Center for Bioinformatics: http://www.cbi.pku.edu.cn The Chinese node of EMBnet and of the Asia Pacific Bioinformatics Network (APBioNet).

Bioinformatics Center of the Shanghai Institutes for Biological Sciences: http://www.biosino.org/

Hong Kong Bioinformatics Centre (HKBIC), The Chinese University of Hong Kong: http://www.hkbic.bch.cuhk.edu.hk/

Taiwan Molecular Information Center: http://bioinfo.life.nctu.edu.tw/index.php

http://www.chgc.sh.cn/

Websites with biological information
• http://emuch.net/ (Xiaomuchong)
• http://www.dxy.cn/ (DXY, Dingxiangyuan)
• http://www.bioon.com/ (Bioon)
• http://www.bio-soft.net (biological software)

Course overview
Chapter 1: fundamental concepts from molecular biology
• Basic structure and function of proteins and nucleic acids
• Mechanisms of molecular genetics
• The most important laboratory techniques for studying the genome of organisms
• An overview of existing sequence databases
Chapter 2: strings and graphs
• Strings and graphs: two of the most important mathematical objects used in the course
• Algorithms: a brief exposition of general concepts, their analysis, and NP-completeness

Course overview
Chapter 3: sequence comparison
• The two-sequence problem: the classic dynamic programming algorithm
• More general cases of the problem: extensions of the algorithm
• The multiple-sequence comparison problem
• Programs used in database searches
• Some other miscellaneous issues
Chapter 4: phylogenetic trees
• Proteins and nucleic acids also evolve through the ages; an important tool for studying this is the phylogenetic tree
• Phylogenetic trees help in understanding protein function
• Some of the mathematical problems related to phylogenetic tree reconstruction
• Simple algorithms for certain special cases
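As a preview of the classic dynamic programming algorithm for the two-sequence problem listed under Chapter 3, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are assumptions chosen for illustration; the course may use a different scoring scheme.

```python
# Minimal sketch of global alignment scoring by dynamic programming.
# Assumptions: linear gap penalty, match=+1, mismatch=-1, gap=-1.

def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the best global alignment score of strings a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned with a gap
            left = curr[j - 1] + gap  # b[j-1] aligned with a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))  # 4
    print(global_alignment_score("ACGT", "AGT"))   # 2
```

The sketch keeps only two rows of the table, so it uses memory proportional to the shorter dimension but returns only the score; recovering the alignment itself would require keeping the full table for traceback. Running time is proportional to the product of the two sequence lengths.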

Course overview
Chapter 5: genome rearrangements
• An important new field: some organisms differ genetically not so much at the sequence level as in the order in which large similar chunks of DNA appear in their respective genomes
• Mathematical models
Chapter 6: molecular structure prediction
• Methods that try to predict a molecule's structure from its primary sequence
• RNA structure prediction: dynamic programming algorithms
• Protein structure prediction: difficulties
• Protein threading: attempts to align a protein sequence with a known structure
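The slides do not name a specific algorithm for RNA structure prediction. As one illustration of the dynamic programming approach mentioned under Chapter 6, the sketch below uses Nussinov-style base-pair maximization, which counts the largest number of non-crossing base pairs a sequence can form. The function name `max_base_pairs`, the `min_loop` parameter, and the decision to allow G-U wobble pairs are assumptions made for this example; real predictors use energy models rather than simple pair counts.

```python
# Minimal sketch of Nussinov-style base-pair maximization.
# Assumptions: pairs A-U, G-C and G-U are allowed; at least `min_loop`
# unpaired bases must separate the two bases of any pair.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Return the maximum number of non-crossing base pairs in rna."""
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the substring rna[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k to its left
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))  # 3
```

For the example, the three pairs are G-C, G-C and G-U enclosing the AAA loop. The table has a quadratic number of cells and each cell scans a linear number of split points, so the running time grows cubically with sequence length.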

Course overview
Chapter 7: data-driven machine learning approaches for bioinformatics
