CS 466 Introduction to Bioinformatics

CS 466Introduction to Bioinformatics Saurabh Sinha

What is the course about? • Algorithmic concepts, applied to sample (toy) problems in “bioinformatics” • Follows the text book • “Real” bioinformatics research • Follows the best journals • Not about practical training in the use of popular bioinformatics software

Grading • Assignments: 40% • About one every two weeks • Mid Term: 30% • Final: 30%

Expectations • Some programming skills • Any programming language is fine

Administrative Details • Instructor: • Saurabh Sinha • Room 2122, Siebel Center • Email: sinhas@illinois.edu • Class hrs: Tue & Thu, 3:30pm-4:45pm, 1131SC • Office hrs: Tue, before class (2:30 - 3:30 pm) 2122SC • Web site: http://veda.cs.uiuc.edu/courses/fa08/cs466/ • Welcome to sit in, if not taking for credit

Text books • Jones and Pevzner: required • Durbin et al.: recommended

Other course • CS 591 BIO: weekly seminar on biomedical informatics • Fridays at 10:30 am.

Motivating bioinformatics

Special issue of journal Science, July 1, 2005.

>What Is the Universe Made Of?>What is the Biological Basis of Consciousness?>Why Do Humans Have So Few Genes?>To What Extent Are Genetic Variation and Personal Health Linked?>Can the Laws of Physics Be Unified?>How Much Can Human Life Span Be Extended?>What Controls Organ Regeneration?>How Can a Skin Cell Become a Nerve Cell?>How Does a Single Somatic Cell Become a Whole Plant?>How Does Earth's Interior Work?>Are We Alone in the Universe?>How and Where Did Life on Earth Arise?>What Determines Species Diversity?>What Genetic Changes Made Us Uniquely Human?>How Are Memories Stored and Retrieved?>How Did Cooperative Behavior Evolve?>How Will Big Pictures Emerge from a Sea of Biological Data?>How Far Can We Push Chemical Self-Assembly?>What Are the Limits of Conventional Computing?>Can We Selectively Shut Off Immune Responses?>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?>Is an Effective HIV Vaccine Feasible?>How Hot Will the Greenhouse World Be?>What Can Replace Cheap Oil -- and When?>Will Malthus Continue to Be Wrong?

Many of the most profound scientific questions of today are within the realm of bioinformatics research

“Why do humans have so few genes ?”

A simple organism Environmental signal GENE Response (protein) Raw materials

A simple organism GENE1 GENE2 GENE3

GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10 A simple organism

GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10 A complex organism Complex circuit of interactions

Regulatory networks • This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity) • Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information

Decoding the regulatory network • Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance • Statistics on DNA sequences and words • Knowing these can tell us about edges in the regulatory network

Decoding the regulatory network • An example computational problem: • Given a string of length 10,000 over the alphabet {A,C,G,T} • Count the number of occurrences Nwof every 6 letter word w • Are there specific words that occur more frequently than expected by chance?

Decoding the regulatory network • What is expected by chance? • What is “more frequently”? • Interesting mathematical questions • The Moby Dick example

Decoding the regulatory network • What is expected by chance? • What is “more frequently”? • Interesting mathematical questions • The Moby Dick example • This is called “motif finding”, and helps decode the regulatory network!

Comparing DNA • Humans are about 99.9% identical to each other, DNA-wise. • How do we know that ? • Compare the genome of two individuals. • The computational problem: Are two sequences similar ?

Sequence alignment • Why is this a problem? • The two sequences will differ by “substitutions”, “insertions” and “deletions” accumulated during evolution • The comparison algorithm has to be robust to such possibilities. • A special technique called “dynamic programming” does all this, and is “efficient”

Sequence alignment • Why should we care? • Compare human genome with fish. You’ll see some portions that are highly similar. • These “conserved” portions are often genes … • … or regulatory sequences! The regulatory network again.

On counting genes • The original question was “Why do humans have so few genes?” • How do we know how many genes there are in the human genome ? (And where they are in the genome) • Experiments can be designed, but bioinformatics plays a major role

Gene prediction • The task of predicting the locations of genes in a new genome (“annotation”) • Gene prediction software • The more sophisticated ones use “Hidden Markov models” (HMM) and multiple species comparison

HMM for Gene Prediction What is this graph? It captures the “architecture” of a gene. It translates into a “probabilistic model”. It leads directly to a gene finding algorithm http://researchweb.watson.ibm.com/journal/rd/453/birney.html

“What controls organ regeneration ?”“How does a single somatic cell become a whole plant ?”

Development and Regeneration • Developmental biology • The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult • A paradox : All cells in the adult body have the same DNA, then how come different cells are different ?

Regulatory networks again • Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo • Hidden Markov models used to detect such regions • Multiple species comparison aids discovery

“How did cooperative behavior evolve?”

Social behavior and bioinformatics? • Social behavior in honey bees • Young worker bees are nurses in the hive; older ones go out to forage • This behavioral maturation is determined by needs of colony. What is the genetic basis of this ?

Social behavior and bioinformatics • Illinois team scanned the genome to understand this (2006) • Regulatory network of social behavior • Statistical tools, such as Hypergeometric test • Machine learning tools such as “support vector machine classification”

Other challenges

Protein structure prediction http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif

Protein structure prediction • Can we predict the 3-D structure of a protein from its amino acid sequence ? • Why ? • One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function • We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !! • Neural networks, a popular technique in computer science, applied to this problem

Metagenomics • Most studies to date are on genomes of one species • A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ? • The Sorcerer II expedition • http://www.sorcerer2expedition.org/version1/HTML/main.htm

Many more challenges • New types of data come due to technological breakthroughs in biology • High throughput data carries unprecedented amount of information • Too much noise • Bioinformatics removes the noise and reveals the truth

Bioinformatics • Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.) • Is about a family of very different problems, all related to biology, all related to each other • How can computers help solve any of this family of problems ?

Bioinformatics and You • You can learn the tools of bioinformatics • These tools owe their origin to computer science, information theory, probability theory, statistics, etc. • You can learn the language of biology, enough to understand what the problems are • You can apply the tools to these problems and contribute to science

CS 466 Introduction to Bioinformatics

CS 466 Introduction to Bioinformatics

Presentation Transcript

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to BioInformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

CS 5263 Bioinformatics CS 4593 AT: Bioinformatics

CS 177 Introduction to Bioinformatics Fall 2004

Introduction to Bioinformatics

CS 5263 Bioinformatics CS 4593 AT: Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics