IBGP/BMI 730 Introduction to Bioinformatics Director: Prof. Victor Jin

Presentation Transcript


  1. IBGP/BMI 730 Introduction to Bioinformatics Director: Prof. Victor Jin

  2. Basic Molecular Biology • All living things are made of Cells • Prokaryote, Eukaryote • Cell Signaling • What is Inside the cell: From DNA, to RNA, to Proteins

  3. Cells • Fundamental working units of every living system. • Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells. • Prokaryotes and Eukaryotes are descended from the same primitive cell. • All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.

  4. Cell Structure • A cell is the smallest structural unit of an organism that is capable of independent functioning • All cells have some common features

  5. Cell Cycle • Born, eat, replicate, and die

  6. The Tree of Life • According to the most recent evidence, there are three main branches to the tree of life. • Prokaryotes include Archaea (“ancient ones”) and Bacteria. • Eukaryotes make up the domain Eukarya, which includes plants, animals, fungi and certain algae.

  7. Prokaryotes and Eukaryotes

  8. Signaling Pathways: Control Gene Activity • Instead of having brains, cells make decisions through complex networks of chemical reactions, called pathways • Synthesize new materials • Break other materials down for spare parts • Signal to eat or die

  9. An Example -- Cell Cycle Signaling

  10. Cell Information and Machinery • A cell stores all the information needed to replicate itself • The human genome is around 3 billion base pairs long • Almost every cell in the human body contains the same set of genes • But not all genes are used or expressed by those cells • Machinery: • Collect and manufacture components • Carry out replication • Kick-start its new offspring

  11. Terminology • Genome: an organism’s genetic material • Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA • Genotype: the genetic makeup of an organism • Phenotype: the physically expressed traits of an organism • Nucleic acid: biological molecules (RNA and DNA) that allow organisms to reproduce • Amino acid: organic molecules that are the building blocks of proteins • Protein: a large, complex molecule that is an essential part of organisms, participates in every process within cells, and achieves a particular function.

  12. Three critical molecules • DNAs • Hold information on how the cell works • RNAs • Act to transfer short pieces of information to different parts of the cell • Provide templates for protein synthesis • Proteins • Form enzymes that send signals to other cells and regulate gene activity • Form the body’s major components (e.g. hair, skin, etc.)

  13. Overview of DNA to RNA to Protein • A gene is expressed in two steps • Transcription: RNA synthesis • Translation: Protein synthesis
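The two steps of gene expression can be illustrated with a minimal Python sketch. It assumes the input is the coding (sense) strand written 5' to 3', so transcription reduces to replacing T with U. The function names and the four-entry codon table are illustrative only; a real translation needs the full 64-codon genetic code.

```python
# Partial codon table: only the codons used in the example (assumption:
# a real implementation would carry all 64 codons of the genetic code).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Transcription: return the mRNA matching the coding strand (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTGGTAA")
protein = translate(mrna)
print(mrna, protein)
```

Here the DNA string is a made-up example; the sketch ignores splicing, start-codon search, and reading-frame selection.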

  14. DNA: the Genetic Makeup • Genes are inherited and are expressed • genotype (genetic makeup) • phenotype (physical expression) • On the left are the eye phenotypes of the green and black eye genes.

  15. Central Dogmas of Molecular Biology 1) The concept of genes is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance). 2) The DNA of an organism encodes the genetic information. It is a double-stranded helix with a deoxyribose sugar-phosphate backbone carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). [Note that only 4 values need be encoded (A, C, G, T), which can be done using 2 bits. But to allow redundant letter combinations (like N, meaning any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
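The 2-bit versus 4-bit remark in the bracketed note can be made concrete with a small Python sketch. The particular bit assignments below are an assumption chosen for illustration (the slide does not specify any); the idea is that giving each base its own bit lets an ambiguity code such as N be the bitwise OR of the bases it stands for.

```python
# 2-bit codes: enough for the four unambiguous bases (assignment is arbitrary).
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# 4-bit codes: one bit per base (assumed layout), so ambiguity codes are ORs.
FOUR_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
FOUR_BIT["N"] = FOUR_BIT["A"] | FOUR_BIT["C"] | FOUR_BIT["G"] | FOUR_BIT["T"]  # any base
FOUR_BIT["R"] = FOUR_BIT["A"] | FOUR_BIT["G"]  # IUPAC code for a purine


def pack_2bit(seq: str) -> int:
    """Pack an unambiguous sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | TWO_BIT[base]
    return value


def matches(code_a: str, code_b: str) -> bool:
    """Two 4-bit symbols are compatible if they share at least one base."""
    return (FOUR_BIT[code_a] & FOUR_BIT[code_b]) != 0


packed = pack_2bit("ACGT")
print(bin(packed), matches("N", "G"), matches("R", "C"))
```

With the 2-bit table there is no spare value left for N, which is why the 4-bit alphabet is used when ambiguity must be represented.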

  16. Central Dogmas of Molecular Biology 3) Each base on one side of the double helix faces its complementary base: A pairs with T, and G pairs with C. 4) Biochemical processes that copy the DNA (replication and transcription) always build the new strand in the 5' to 3' direction, so sequences are written and read from the 5' side towards the 3' side. 5) A gene can be located on either the 'plus strand' or the 'minus strand'. But rule 4) imposes the orientation of reading, and rule 3) (complementarity) tells us to complement each base. E.g. if the sequence on the + strand is ACGTGATCGATGCTA, the – strand must be read off by reading the complement of this sequence going 'backwards', i.e. TAGCATCGATCACGT
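Rules 3) to 5) amount to taking the reverse complement of the plus-strand sequence. A minimal Python sketch, assuming an uppercase sequence containing only A, C, G and T (the function name is illustrative):

```python
# Translation table implementing rule 3: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


minus_strand = reverse_complement("ACGTGATCGATGCTA")
print(minus_strand)  # TAGCATCGATCACGT, the minus-strand sequence from the slide
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.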

  17. Central Dogmas of Molecular Biology 6) DNA information is copied over to mRNA, which acts as a template to produce proteins. We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules (but let's not forget the various RNA genes).

  18. Bioinformatics Bioinformatics (computational biology) solves biological problems on the molecular level with the use of techniques including: • applied mathematics • statistics • computer science • artificial intelligence

  19. Bioinformatics • Biological Data + Computer Calculations

  20. Molecular Biology as an Information Science • Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype • Molecules: Sequence, Structure, Function • Processes: Mechanism, Specificity, Regulation • Central Paradigm for Bioinformatics: Genomic Sequence -> Transcript -> Protein Structure -> Protein Function • Large Amounts of Information • Data Management • Computer Algorithms • Statistical Methods

  21. Major research efforts • Sequence alignment • Gene finding • Genome assembly • RNA structure prediction • Protein structure prediction • Analysis of gene regulation • Prediction of protein-protein interactions • Modeling of evolution

  22. Major research areas • Sequence analysis • Genome annotation • Computational evolutionary biology • Measuring biodiversity • Analysis of gene expression • Analysis of regulation • Analysis of protein expression • Analysis of mutations in cancer • Analysis of epigenetics in cancer • High-throughput in vivo binding analysis • Prediction of protein structure • Comparative genomics • Modeling biological systems • High-throughput image analysis • Protein-protein docking • Software and tools • Databases • Web services in bioinformatics

  23. Data types • DNA sequences • RNA sequences • Protein sequences • Gene Expression • cDNA, mRNA microarray data • Now tiling array technology • 50 M data points to tile the human genome at ~50 bp resolution • Can only sequence a genome once but can do an infinite variety of array experiments • Protein-DNA interactions • ChIP-chip, ChIP-seq, ChIP-PET and so on • Phenotype Experiments • Knockouts (KOs) • Protein Interactions • Yeast two-hybrid • Proteomics

  24. Other Integrative Data • Information to understand genomes • Metabolic Pathways • Regulatory Networks • Signaling Networks • Whole Organisms Phylogeny • The Literature (MEDLINE)

  25. GenBank Growth

  26. Exponential Growth of Data Matched by Development of Computer Technology • CPU vs Disk & Internet • Driving Force in Bioinformatics • [Figure panels: Internet Hosts; No. of Protein Domain Structures]

  27. Types of Relational databases • The Internet can be thought of as one enormous relational database. • The “links”/URLs are the primary keys. • SQL (Structured Query Language) • Sybase; Oracle; Access (database systems) • Sybase used at NCBI. • SRS (one type of database querying system used in biology)
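To show what a primary key and an SQL query look like in practice, here is a minimal sketch using Python's built-in sqlite3 module rather than any of the systems named on the slide. The table name, columns and rows are hypothetical and exist only for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# gene_id is the primary key: it uniquely identifies each row.
conn.execute(
    "CREATE TABLE gene ("
    "gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL, organism TEXT NOT NULL)"
)

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [(1, "geneA", "human"), (2, "geneB", "yeast"), (3, "geneC", "human")],
)

# A parameterized SELECT: all gene symbols recorded for one organism.
rows = conn.execute(
    "SELECT symbol FROM gene WHERE organism = ? ORDER BY gene_id", ("human",)
).fetchall()
symbols = [row[0] for row in rows]
conn.close()

print(symbols)
```

Inserting a second row with an existing gene_id would raise sqlite3.IntegrityError, which is the uniqueness guarantee a primary key provides.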

  28. XML Database and vocabularies for life science • HTML: Hypertext Markup Language • XML: a general-purpose specification for creating custom markup languages. It is classified as an extensible language, because it allows the user to define the mark-up elements • BSML: an extensible language specification and container for bioinformatic data. BSML was developed under a 1997 grant from the National Human Genome Research Institute (NHGRI) as an evolving public domain standard for the bioinformatics community

  29. Examples of XML • <?xml version="1.0" encoding="UTF-8"?> • <element_name attribute_name="attribute_value">Element Content</element_name> • <book>This is a book... </book>
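As a rough illustration of how such a document is read by a program, the sketch below parses the generic element from the slide with Python's standard xml.etree.ElementTree module. The variable names are illustrative; the element and attribute names are the placeholders used on the slide.

```python
import xml.etree.ElementTree as ET

# The declaration and element from the slide, as one small document (bytes,
# so the encoding declaration is honored by the parser).
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<element_name attribute_name="attribute_value">Element Content</element_name>'
)

root = ET.fromstring(doc)

tag = root.tag                      # name of the element
attr = root.get("attribute_name")   # value of the attribute
text = root.text                    # text between the opening and closing tags

print(tag, attr, text)
```

A vocabulary such as BSML defines which element and attribute names are allowed; the parsing mechanics stay the same.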

  30. Primary Databases • A primary database is a repository of data derived from experiments or from research knowledge. • GenBank (nucleotide repository) • Protein DB, Swiss-Prot • PDB (MMDB) • PubMed (literature) • Genome mapping databases • KEGG database (pathways)

  31. Secondary Databases • A secondary database contains information derived from other sources. • RefSeq (curated collection of GenBank at NCBI) • UniGene (clustering of ESTs at NCBI) • GeneID (unique ID for each gene at NCBI) • Organism-specific databases are often a mix between primary and secondary.

  32. Biological Databases • Nucleotide databases: • GenBank: International Collaboration • NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia) • A “bank”: no curation. Submission to these databases is required for publication in a journal. • Organism-specific databases (Quick quiz: find URLs using search engines) • FlyBase • ChickGBASE • pigbase • wormpep • YPD (Yeast Protein Database) • SGD (Saccharomyces Genome Database)

  33. Protein Databases: • NCBI: more next week • Swiss-Prot: (free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. 1998 version free of any licensing) • http://www.expasy.ch (latest pay version) • NCBI has the latest free version. • Translated proteins from GenBank submissions • EMBL • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT • PIR

  34. Structure databases: • PDB: Protein structure database. • Http://www.rscb.org/pdb/ • MMDB: NCBI’s version of PDB with entrez links. • Http://www.ncbi.nlm.nih.gov • Genome mapping information: • http://www.il-st-acad-sci.org/health/genebase.html • NCBI (Human) • Genome Centers: Stanford, Washington University, UC Berkeley • Research Centers and Universities

  35. Literature databases: • NCBI: PubMed: all biomedical literature. • www.ncbi.nlm.nih.gov • Abstracts and links to publisher sites for • full text retrieval/ordering • journal browsing. • Publisher web sites. • Biomednet: commercial site for literature search. • Pathways database: • KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html • Genome Search and Visualization database: • UCSC Genome Browser (genome.uscs.edu/)

  36. Information techniques • Databases • Building, Querying • Complex data • Text String Comparison • Text Search • 1D Alignment • Significance Statistics • Alta Vista, grep • Finding Patterns • Machine Learning • Clustering • Data mining • Geometry • Robotics • Graphics (Surfaces, Volumes) • Comparison and 3D Matching (Vision, recognition) • Physical Simulation • Newtonian Mechanics • Electrostatics • Numerical Algorithms • Simulation

  37. Bioinformatics as New Paradigm for Scientific Computing • Physics • Prediction based on physical principles • EX: Exact Determination of Rocket Trajectory • Emphasizes: Supercomputer, CPU • Biology • Classifying information and discovering unexpected relationships • EX: Gene Expression Network • Emphasizes: networks, “federated” database

  38. Topics -- Genome Sequence • Finding Genes in Genomic DNA • introns • exons • promoters • Characterizing Repeats in Genomic DNA • Statistics • Patterns • Duplications in the Genome • Large scale genomic alignment • Whole-Genome Comparisons • Finding Structural RNAs

  39. Topics -- Protein Sequence • Sequence Alignment • How to align two strings optimally via Dynamic Programming • Local vs Global Alignment • Suboptimal Alignment • Hashing to increase speed (BLAST, FASTA) • Amino acid substitution scoring matrices • Multiple Alignment and Consensus Patterns • How to align more than one sequence and then fuse the result in a consensus representation • HMMs, Profiles • Motifs • Scoring schemes and Matching statistics • How to tell if a given alignment or match is statistically significant (a P-value or an E-value)? • Score Distributions • Low Complexity Sequences • Evolutionary Issues • Rates of mutation and change
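Aligning two strings optimally via dynamic programming can be sketched as a global (Needleman-Wunsch style) alignment score. This is a minimal illustration, not a method prescribed by the slides: the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions, and real protein alignment would use a substitution matrix and usually affine gap penalties.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score of a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 2: three matches and one gap
```

A local alignment (Smith-Waterman style) differs mainly in clamping each cell at zero and taking the maximum over the whole table; recovering the alignment itself requires a traceback, which is omitted here.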

  40. Topics -- Structures • Secondary Structure “Prediction” • via Propensities • Neural Networks, Genetic Alg. • Simple Statistics • TM-helix finding • Assessing Secondary Structure Prediction • Structure Prediction: Protein vs RNA • Tertiary Structure Prediction • Fold Recognition • Threading • Ab initio • Direct Function Prediction • Active site identification • Relation of Sequence Similarity to Structural Similarity

  41. Topics -- Structures • Structure Comparison • Basic Protein Geometry and Least-Squares Fitting • Distances, Angles, Axes, Rotations • Calculating a helix axis in 3D via fitting a line • LSQ fit of 2 structures • Molecular Graphics • Calculation of Volume and Surface • How to represent a plane • How to represent a solid • How to calculate an area • Hinge prediction • Packing Measurement • Structural Alignment • Aligning sequences on the basis of 3D structure • DP does not converge, unlike sequences; what to do? • Other Approaches: Distance Matrices, Hashing • Fold Library • Docking and Drug Design as Surface Matching

  42. Topics -- Functional Genomics • Expression Analysis • Time Courses clustering • Measuring differences • Identifying Regulatory Regions • Large scale cross referencing of information • Function Classification and Orthologs • The Genomic vs. Single-molecule Perspective • Genome Comparisons • Ortholog Families, pathways • Large-scale censuses • Frequent Words Analysis • Genome Annotation • Identification of interacting proteins • Networks • Global structure and local motifs • Structural Genomics • Folds in Genomes, shared & common folds • Bulk Structure Prediction • Genome Trees

  43. Bioinformatics tools • Sequence comparison (pairwise and multiple alignments, e.g. ClustalW, Blastz) • Phylogenetic reconstruction (e.g. Phylip, IQPNNI, SplitsTree) • Database search (e.g. BLAST, HMMer) • Comparative sequence assembly (e.g. OSLay) • Gene finding (e.g. genscan, FirstEF) • Motif discovery (e.g. MEME, Weeder) • Protein structure (e.g. CE)

  44. Bioinformatics algorithms • Dynamic Programming • EM algorithms • Neural Networks • Hidden Markov Models • Support Vector Machine • Phylogenetic Trees • Clustering

  45. Bioinformatics Topics? • (YES?) Digital Libraries • Automated Bibliographic Search and Textual Comparison • Knowledge bases for biological literature • (YES) Motif Discovery Using Gibbs Sampling • (YES) Metabolic Pathway Simulation • (YES) Gene identification by sequence inspection • Prediction of splice sites • (YES) Linkage Analysis • Linking specific genes to various traits • (YES) RNA structure prediction • Identification in sequences • (YES) Homology modeling
