Computational Biology. Prof. Bao Jiali, Ph.D., Bioelectromagnetics Key Laboratory, School of Medicine. Tel: 88208171, 13018905641. Email: baojl@zju.edu.cn. Databases. Software. Data algorithms. The three elements of bioinformatics. Data algorithms: the soul of bioinformatics.
The three elements of bioinformatics: databases, software, and data algorithms
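Data algorithms: the soul of bioinformatics. As the Human Genome Project has been carried out and extended, the accumulation of biological data has made an unprecedented leap. The volume of data has grown exponentially, and the nature of the data has also changed: from physiological and biochemical measurements to genetic information, and further to information about how genetics relates to structure and function. Such a rapid, massive accumulation of scientific data has no precedent in the history of scientific research. Extracting useful knowledge from this volume of biological data has become a major challenge for biologists, mathematicians, and computer scientists alike. This challenge has given rise to a new discipline: computational biology.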
Topics in computational biology
- Information retrieval with Entrez and Web browsers
- Statistics of sequence patterns
- Basics of machine learning for molecular biology
- Pairwise sequence alignment
- Comparison with sequence databases
- Finding sequence motifs
- Finding protein coding regions
- Finding genes
- Clustering genes by expression
- Prediction of macromolecular properties
- Retrieving and displaying macromolecular structures
Algorithms
• Recursion
• Graph trees
• Dynamic programming
• Classification
• Decision trees
• Bounded search
Mathematical and statistical analysis
• Optimization
• Combinatorial methods
• Cluster analysis
• Classification
• Bayesian inference
• Decision trees
• Stochastic context-free grammars
Artificial intelligence/machine learning
• Neural networks
• Genetic networks
• Natural language processing
Data representation
• Knowledge representation
• Databases and knowledge bases
• Programming languages
• Graphics and image analysis
• Modeling
• Usability engineering
Technology support
• Crystallography
• Microarrays
• Mass spectrometry
• NMR
• Sequence alignments
• Structure alignments
• Phylogenetic tree construction
• Fragment and whole genome assembly
• Genome comparison
• Biological databases
• Expression analysis
• Feature extraction from sequences and structures
• Structure prediction (RNA, DNA and protein)
• Docking
• Knowledge extraction
• Protein-protein interactions
• Interaction networks
• Integrated systems
• Protein and genomic sequences
• Gel electrophoresis
• Structures (coordinates, structure factors, NMR constraints)
• Expression data (microarrays)
• Spectroscopic data (mass spectrometry, circular dichroism)
• Kinetic data
• Thermodynamic data
• Interaction data (binding constants)
• Images
Probabilities and probabilistic models
• Alphabet: $A = \{A, C, G, T\}$
• Letter probabilities: $p_A, p_C, p_G, p_T$
• $p_i \ge 0$ for $i = A, C, G, T$, with $\sum_i p_i = 1$
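A mathematical model of biological sequences
• Assume the data are a single observed sequence of length $N$: $D = \{O\}$, where $O$ is a string of $N$ letters from the alphabet $\{A, C, G, T\}$
• The model $M$ has four parameters: $p_A, p_C, p_G, p_T$
• Likelihood: $P(D \mid M)$
• Posterior probability: $P(M \mid D)$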
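As a sketch of what the likelihood and posterior look like for this four-parameter model, assume that the positions of the sequence are independent and identically distributed (an assumption not stated on the slide). Here $n_i$ is the number of times letter $i$ occurs in the observed sequence, and $M$ is simply a label for the model.

```latex
\begin{align}
P(D \mid M) &= \prod_{i \in \{A,C,G,T\}} p_i^{\,n_i}
             = p_A^{n_A}\, p_C^{n_C}\, p_G^{n_G}\, p_T^{n_T},
             \qquad \sum_i n_i = N \\
P(M \mid D) &= \frac{P(D \mid M)\, P(M)}{P(D)}
\end{align}
```

The second line is Bayes' theorem applied to the model: $P(M)$ is the prior on the model and $P(D)$ normalizes the result.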
Exercise: write the mathematical model for the following sequence: AAAGGTTGGACCCCTTTAAA
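One way to approach the exercise is to count the letters and use the relative frequencies as the four model parameters. The sketch below assumes an independent-letter model; the function names `fit_iid_model` and `log_likelihood` are illustrative and do not come from the slides.

```python
from collections import Counter
from math import log

ALPHABET = "ACGT"

def fit_iid_model(seq):
    """Return letter counts and relative frequencies for a DNA string."""
    counter = Counter(seq)
    n = len(seq)
    counts = {x: counter[x] for x in ALPHABET}
    freqs = {x: counter[x] / n for x in ALPHABET}
    return counts, freqs

def log_likelihood(seq, probs):
    """log P(D | M) under the independent-letter model: sum of n_i * log p_i."""
    counter = Counter(seq)
    # Letters that never occur contribute nothing, so skip them.
    return sum(counter[x] * log(probs[x]) for x in ALPHABET if counter[x] > 0)

seq = "AAAGGTTGGACCCCTTTAAA"
counts, freqs = fit_iid_model(seq)
print(counts)
print(freqs)
print(log_likelihood(seq, freqs))
```

For this sequence the counts are A = 7, C = 4, G = 4, T = 5 over N = 20 letters, so the frequency-based parameters are 0.35, 0.20, 0.20 and 0.25.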
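Maximum likelihood estimation
Model parameters: $\theta$
Sequence data: $D$
Maximum likelihood estimate: the value of $\theta$ for which $P(D \mid \theta)$ is largest, $\theta^{ML} = \arg\max_{\theta} P(D \mid \theta)$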
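As a worked illustration, take $\theta = (p_A, p_C, p_G, p_T)$ and assume independent positions, so that the likelihood depends on the data only through the letter counts $n_i$ (this independence assumption is not stated on the slide). Maximizing the log-likelihood under the constraint $\sum_i p_i = 1$ with a Lagrange multiplier $\lambda$ gives the observed frequencies:

```latex
\begin{align}
\log P(D \mid \theta) &= \sum_{i} n_i \log p_i \\
\mathcal{L}(\theta, \lambda) &= \sum_{i} n_i \log p_i
    - \lambda \Bigl( \sum_{i} p_i - 1 \Bigr) \\
\frac{\partial \mathcal{L}}{\partial p_i} &= \frac{n_i}{p_i} - \lambda = 0
    \;\Rightarrow\; p_i = \frac{n_i}{\lambda} \\
\sum_i p_i = 1 &\;\Rightarrow\; \lambda = N,
    \qquad p_i^{ML} = \frac{n_i}{N}
\end{align}
```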
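Conditional, joint, and marginal probabilities
Conditional probability: $P(i \mid D_1)$
Joint probability: $P(i, D_j)$, $j = 1, 2$, with $P(i, D_j) = P(D_j)\,P(i \mid D_j)$
In general: $P(X, Y) = P(X \mid Y)\,P(Y)$
Total (marginal) probability: $P(X) = \sum_Y P(X, Y) = \sum_Y P(X \mid Y)\,P(Y)$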
Exercise: Consider an occasionally dishonest casino that uses two kinds of dice. Of the dice, 99% are fair but 1% are loaded so that a six comes up 50% of the time. We pick up a die from a table at random. What are $P(\text{six} \mid D_{\text{loaded}})$ and $P(\text{six} \mid D_{\text{fair}})$? What are $P(\text{six}, D_{\text{loaded}})$ and $P(\text{six}, D_{\text{fair}})$? What is the probability of rolling a six from the die we picked up?
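The exercise can be checked numerically. The sketch below uses only the numbers in the exercise statement plus the standard assumption that a fair die shows a six with probability 1/6; the variable names are illustrative. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Prior probability of picking each kind of die (from the exercise).
p_die = {"fair": Fraction(99, 100), "loaded": Fraction(1, 100)}

# Conditional probability of a six given the die:
# 1/6 for a fair die (assumed), 1/2 for the loaded die (from the exercise).
p_six_given = {"fair": Fraction(1, 6), "loaded": Fraction(1, 2)}

# Joint probability: P(six, D) = P(D) * P(six | D)
p_six_and = {d: p_die[d] * p_six_given[d] for d in p_die}

# Marginal probability: P(six) = sum over dice of P(six, D)
p_six = sum(p_six_and.values())

for d in p_die:
    print(d, "P(six|D) =", p_six_given[d], " P(six,D) =", p_six_and[d])
print("P(six) =", p_six, "=", float(p_six))
```

The joint probabilities come out as 33/200 for the fair die and 1/200 for the loaded die, so the marginal probability of a six is 17/100.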
Bayes' theorem and model comparison
Conditional probability: $P(i \mid D_1)$
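Bayes' theorem, $P(Y \mid X) = P(X \mid Y)\,P(Y) / P(X)$, lets us compare the two hypotheses "the die is loaded" and "the die is fair" after seeing some rolls. The sketch below reuses the casino numbers from the exercise (1% loaded dice, a six with probability 1/2 on a loaded die) and asks how likely the die is to be loaded after a run of sixes. The run lengths tried and the function name `posterior_loaded` are illustrative choices, not part of the slides.

```python
from fractions import Fraction

def posterior_loaded(n_sixes, prior_loaded=Fraction(1, 100)):
    """P(D_loaded | n_sixes sixes in a row), computed with Bayes' theorem."""
    like_loaded = Fraction(1, 2) ** n_sixes   # P(data | D_loaded)
    like_fair = Fraction(1, 6) ** n_sixes     # P(data | D_fair)
    # P(data) by the law of total probability over the two kinds of dice.
    evidence = like_loaded * prior_loaded + like_fair * (1 - prior_loaded)
    return like_loaded * prior_loaded / evidence

for n in (1, 3, 6):
    post = posterior_loaded(n)
    print(n, post, float(post))
```

After a single six the posterior is only 1/34, and after three sixes it is 3/14: because loaded dice are rare, a fair die remains the more probable explanation until the run gets longer.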
Bayesian parameter estimation
Conditional probability: $P(i \mid D_1)$
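In Bayesian parameter estimation the parameters get a prior distribution, and the estimate is read off the posterior instead of the likelihood alone. The slide does not say which prior is intended; the sketch below assumes a Dirichlet prior on the four letter probabilities, for which the posterior mean is $(n_i + \alpha_i) / (N + \sum_k \alpha_k)$. The function name `posterior_mean` and the default pseudocounts $\alpha_i = 1$ are illustrative assumptions, and the input is expected to contain only the letters A, C, G, T.

```python
from collections import Counter
from fractions import Fraction

ALPHABET = "ACGT"

def posterior_mean(seq, pseudocounts=None):
    """Posterior mean of letter probabilities under a Dirichlet prior.

    pseudocounts maps each letter to an integer Dirichlet parameter alpha_i.
    The default alpha_i = 1 (a uniform prior) is an illustrative choice.
    """
    if pseudocounts is None:
        pseudocounts = {x: 1 for x in ALPHABET}
    counts = Counter(seq)
    total = len(seq) + sum(pseudocounts.values())
    return {x: Fraction(counts[x] + pseudocounts[x], total) for x in ALPHABET}

# Sequence taken from the modelling exercise earlier in the slides.
estimates = posterior_mean("AAAGGTTGGACCCCTTTAAA")
for letter, p in estimates.items():
    print(letter, p, float(p))
```

With one pseudocount per letter the estimates are 1/3, 5/24, 5/24 and 1/4 for A, C, G and T, slightly closer to uniform than the raw frequencies 7/20, 4/20, 4/20 and 5/20.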