1 / 59

生物信息学基础讲座

生物信息学基础讲座. 第 1 讲 生物信息学研究 http://cbb.sjtu.edu.cn/~mywu/bi217. 什么是生物信息学 ?. Concepts of Bioinformatics. Biology + computer?. (Molecular) Bio - informatics

Télécharger la présentation

生物信息学基础讲座

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 生物信息学基础讲座 第1讲 生物信息学研究 http://cbb.sjtu.edu.cn/~mywu/bi217

  2. 什么是生物信息学? Concepts of Bioinformatics

  3. Biology + computer? • (Molecular)Bio - informatics • Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (applied math, CS, and statistics) to understand and organize the informationassociated with these molecules, on alarge-scale. • Bioinformatics is a practical discipline with many applications.

  4. 生物分子信息 生物信息的载体 分子 存贮、复制、传递和表达 遗传信息的系统 细胞

  5. 两种信息载体 • 核酸分子 • 蛋白质分子

  6. Protein Machines

  7. Cell  Protein Machines

  8. 三种信息 与功能相关的结构信息 遗传信息 进化信息

  9. 实验数据信息知识 收集表示分析建模 刻画特征比较推理 应 用 基因工程 蛋白质设计 疾病诊断 疾病治疗 开发新药

  10. 交叉学科 • 以生物医学数据和过程为研究对象 • 采用数学、统计学方法和手段 • 借助现代计算机技术和方法 • 解读生物学数据 • 建立数学模型 • 对生物过程进行预测和控制

  11. 研究性学科中的应用学科 • 该学科目前还处于初步阶段 • 大部分还停留在实验室的水平 • 工业应用仍有许多不足

  12. 研究思路与方法 General Approaches to Bioinformatics

  13. 一般研究策略 • 提出生物学问题 • 数学描述与建模:前提、假设、问题的核心分别是什么 • 计算机分析:有时需要采取数值方法进行求解,有时需要采取蒙特卡洛的方法 • 分析结果的生物学意义

  14. 生物信息学的研究内容 Working Around Central Dogma

  15. 生物序列分析若干问题

  16. Genomes • Integrative Data 1995, HI (bacteria): 1.6 Mb & 1600 genes done 1997, yeast: 13 Mb & ~6000 genes for yeast 1998, worm: ~100Mb with 19 K genes 1999: >30 completed genomes! 2003, human: 3 Gb & 100 K genes...

  17. 1995 • Bacteria, 1.6 Mb, ~1600 genes [Science269: 496] 1997 • Eukaryote, 13 Mb, ~6K genes [Nature 387: 1] 1998 • Animal, ~100 Mb, ~20K genes [Science282: 1945] 2000? • Human, ~3 Gb, ~100K genes [???]

  18. 人类基因组与其它生物基因组比较

  19. DNA • Raw DNA Sequence • Coding or Not? • Parse into genes? • 4 bases: AGCT • ~1 K in a gene, ~2 M in genome • ~3 Gb Human atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca gacgctggtatcgcattaactgattctttcgttaaattggtatc . . . . . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

  20. Motif discovery

  21. Protein Sequence • 20 letter alphabet • ACDEFGHIKLMNPQRSTVWY but not BJOUXZ • Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain • >1M known protein sequences (uniprot) d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

  22. 研究对象 • Basic building blocks • Genome/DNA • Genes, proteins • Cis-regulatory elements • Basic mechanisms • Transcription • Splicing • Translation (step 1~3 = gene expression) • Post-translational modification / protein folding • Transcriptional regulatory mechanisms and other regulatory mechanisms • Alternative splicing • microRNAs • DNA/protein modification

  23. 研究内容 • Sequencing • similarity search: find similar sequences in databases • global search • local search • Multiple alignment • Motif discovery • Gene finding:determining gene and gene structure • Phylogenetic analysis • Comparative studies

  24. 生物大分子结构分析研究 Extending Studies to 3-Dimensions

  25. Hierarchy of Protein Structure Tertiary

  26. Protein 3D Structure 3D Structure of Protein Turn or coil Alpha-helix Beta-sheet Loop and Turn

  27. Proteins Structural Classes

  28. 结构分析Structural Analysis • 二级结构预测secondary structure prediction • 三级结构预测 tertiary structure prediction • 分子对接molecular docking • 分子动力学模拟 molecular dynamics (MD) simulation

  29. 表达数据分析 Expression studies

  30. Microarray data analysis Low-expression gene High-expression gene Highly expressed in control, but not treatment cells Highly expressed in treatment, but not control cells

  31. Supervised Classification

  32. Unsupervised Clustering

  33. 二维电泳图

  34. Finding Genes in Genomic DNA Introns/exons/promotors Characterizing Repeats in Genomic DNA Statistics Patterns Duplications in the Genome Large scale genomic alignment Whole-Genome Comparisons Finding Structural RNAs Genome Sequence

  35. Sequence Alignment non-exact string matching, gaps Dynamic Programming Local vs Global Alignment Suboptimal Alignment Hashing to increase speed (BLAST, FASTA) Amino acid substitution scoring matrices Multiple Alignment and Consensus Multiple alignment Transitive Comparisons HMMs, Profiles Motifs Scoring schemes and Matching statistics How to tell if a given alignment or match is statistically significant A P-value (or an e-value)? Score Distributions(extreme val. dist.) Low Complexity Sequences Evolutionary Issues Rates of mutation and change Protein Sequence

  36. Secondary Structure “Prediction” via Propensities Neural Networks, Genetic Alg. Simple Statistics TM-helix finding Assessing Secondary Structure Prediction Structure Prediction: Protein v RNA Tertiary Structure Prediction Fold Recognition Threading Ab initio (Quaternary structure prediction) Direct Function Prediction Active site identification Relation of Sequence Similarity to Structural Similarity Sequence  Structure

  37. Structure Comparison Basic Protein Geometry and Least-Squares Fitting Distances, Angles, Axes, Rotations Calculating a helix axis in 3D via fitting a line LSQ fit of 2 structures Molecular Graphics Calculation of Volume and Surface How to represent a plane How to represent a solid How to calculate an area Hinge prediction Packing Measurement Structural Alignment Aligning sequences on the basis of 3D structure. DP does not converge, unlike sequences, what to do? Other Approaches: Distance Matrices, Hashing Fold Library Docking and Drug Design as Surface Matching Structures

  38. Relational Database Concepts Keys, Foreign Keys SQL, OODBMS, views, forms, transactions, reports, indexes Joining Tables, Normalization Natural Join as "where" selection on cross product Array Referencing (perl/dbm) Forms and Reports Cross-tabulation DB interoperation What are the Units ? What are the units of biological information for organization? sequence, structure motifs, modules, domains How classified: folds, motions, pathways, functions? Clustering and Trees Basic clustering UPGMA single-linkage multiple linkage Other Methods Parsimony, Maximum likelihood Evolutionary implications Visualization of Large Amounts of Information The Bias Problem sequence weighting sampling DBs/Surveys

  39. Mining • Information integration and fusion • Dealing with heterogeneous data • Dimensionality Reduction (PCA etc)

  40. Expression Analysis Time Courses clustering Measuring differences Identifying Regulatory Regions Large scale cross referencing of information Function Classification and Orthologs The Genomic vs. Single-molecule Perspective Genome Comparisons Ortholog Families, pathways Large-scale censuses Frequent Words Analysis Genome Annotation Identification of interacting proteins Networks Global structure and local motifs Structural Genomics Folds in Genomes, shared & common folds Bulk Structure Prediction Genome Trees (Func) Genomics

  41. Molecular Simulation Geometry -> Energy -> Forces Basic interactions, potential energy functions Electrostatics VDW Forces Bonds as Springs How structure changes over time? How to measure the change in a vector (gradient) Molecular Dynamics & MC Energy Minimization Parameter Sets Number Density Simplifications Poisson-Boltzman Equation Lattice Models and Simplification Simulation

  42. Systems biology: E-cell

  43. Metabolism network

  44. Protein-protein interaction (PPI)

  45. Nature 406 2000

  46. Application 1Novel Drug design • Understanding How Structures Bind to Small Molecules (Function) • Designing Inhibitors • Docking, Structure Modeling

  47. 生物信息学中的informatics

  48. 常用计算机技术 • 数据库技术:建库与搜索 Database • 统计 statistics • 数据挖掘技术data mining • 人工智能Artificial Intelligence • 模式识别 pattern recognition • 机器学习machine learning

More Related