Statistical Genetics

Statistical Genetics • 2009/09/01, 2009/09/02, 2009/09/03, 2009/09/04 • Center for Genomic Medicine

Take notes yourself • Taking notes in your own words is the way to understand the contents actively.

Contents • 1. Preparation of your computer • 2. Genetic polymorphisms • 3. Check experimental data • 4. Test individual markers • 5. Test multiple markers

1 & 2 • Understand mutation, recombination, and drift, the processes that generate polymorphisms, and learn the basics of data simulation in EXCEL using drift as the example. • Learn to download and set up free applications from the internet.

Genetic polymorphisms • DNA • A, T, G, C • Variations • Inter-species variations • Intra-species variations

Genetic heterogeneity (diversity) • Mutation • Mutant • Recombination • Recombinant • Genetic drift • Change in allele frequency • Fixation of an allele

Simulate mutation with EXCEL • Parameter • Mutation rate: the probability that a mutation occurs per locus per generation (or unit of time) • Columns • The first column: sum of mutated loci • The second and later columns: loci • Rows • Generations

The first generation • All loci are 0

Make the second and later generations • RAND(): random values from the uniform distribution on 0 to 1

Uniform distribution • Basis of statistical tests • Check that the values really are uniform • Paste values (copy and paste without letting the values recalculate) • Sort • Plot

Mutate alleles with RAND() • “=IF(RAND()<$B$1,B3+1,B3)” • When RAND() is less than the mutation rate $B$1, increase the allele value by one • “=IF(RAND()<$B$1,IF(B3+1=2,0,1),B3)” • When the allele value would become 2, change it back to 0

Count loci with allele 1 in each generation • “=SUM(B3:EF3)” • Draw a graph • Change the mutation rate
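The spreadsheet mutation simulation can also be sketched in R. This is only an illustration under assumptions: the function name `simulate_mutation` and its argument names are hypothetical, the 135 loci correspond to columns B through EF in the SUM formula, and the default mutation rate and number of generations are arbitrary.

```r
# Sketch of the spreadsheet mutation simulation for loci with alleles 0 and 1.
# simulate_mutation and its defaults are illustrative, not from the slides.
simulate_mutation <- function(n_loci = 135, n_gen = 100, mutation_rate = 0.01) {
  # Rows are generations, columns are loci; the first generation is all 0.
  alleles <- matrix(0L, nrow = n_gen, ncol = n_loci)
  for (g in seq_len(n_gen)[-1]) {
    # Same test as RAND() < $B$1 in the sheet, one draw per locus.
    mutated <- runif(n_loci) < mutation_rate
    # A mutated locus flips 0 -> 1 or 1 -> 0; the others copy the previous row.
    alleles[g, ] <- ifelse(mutated, 1L - alleles[g - 1, ], alleles[g - 1, ])
  }
  alleles
}

set.seed(1)
sim <- simulate_mutation()
# Same role as =SUM(B3:EF3): number of loci carrying allele 1 per generation.
mutated_per_generation <- rowSums(sim)
plot(mutated_per_generation, type = "l",
     xlab = "Generation", ylab = "Loci with allele 1")
```

Rerunning with a different `mutation_rate` plays the same role as changing $B$1 in the sheet.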

Simulate recombination with EXCEL • Parameter • Recombination rate: the probability that recombination occurs per interval between adjacent loci per generation (or unit of time) • Columns • Loci • Rows • Generations

The first generation • All loci are 0 • The second and later generations • The first locus • “=IF(RAND()<$B$1,IF(A2+1=2,0,1),A2)” • The second and later loci • “=IF(RAND()<$B$1,IF(B3=B2,IF(C2+1=2,0,1),C2),IF(B3=B2,C2,IF(C2+1=2,0,1)))” • Whether or not the previous locus was recombinant determines how the allele at the current locus is treated

Color segments • Use cell formatting to change the color of cells with allele 1 • Zoom out to view the whole sheet
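The recombination sheet and its coloring can be approximated in R. The sketch below assumes the same rule as the spreadsheet formula: a locus changes state relative to the previous generation when an odd number of crossovers lies at or before it. The name `simulate_recombination` and the default values are hypothetical.

```r
# Sketch of the spreadsheet recombination simulation (alleles 0 and 1).
# simulate_recombination and its defaults are illustrative assumptions.
simulate_recombination <- function(n_loci = 100, n_gen = 50,
                                   recombination_rate = 0.02) {
  # Rows are generations, columns are loci; the first generation is all 0.
  alleles <- matrix(0L, nrow = n_gen, ncol = n_loci)
  for (g in seq_len(n_gen)[-1]) {
    # One RAND() < $B$1 test per locus (per interval to the previous locus).
    crossover <- runif(n_loci) < recombination_rate
    # A locus is switched when the number of crossovers up to it is odd,
    # which is what the nested IF on the previous locus computes.
    switched <- cumsum(crossover) %% 2 == 1
    alleles[g, ] <- ifelse(switched, 1L - alleles[g - 1, ], alleles[g - 1, ])
  }
  alleles
}

set.seed(2)
rec <- simulate_recombination()
# Counterpart of coloring cells with allele 1 and zooming out.
image(x = seq_len(ncol(rec)), y = seq_len(nrow(rec)), z = t(rec),
      col = c("white", "steelblue"), xlab = "Locus", ylab = "Generation")
```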

Simulate drift with EXCEL • Note: as a model of human populations this simulation is inappropriate in many respects, for example diploidy and mating constraints.

Simulate drift with EXCEL • Initial inputs • Allele frequency: af • No. chromosomes: nc • No. generations: ng • Deviation from neutrality: w

Initial inputs • Once your EXCEL file is complete, you change only these initial inputs to run a simulation.

Make a frame of No. samples x No. generations • Filling a column and a row with consecutive integers (0), 1, 2, … is simple • Control the number of samples and generations with “IF” • “$D$1”: No. chromosomes, “$F$1”: No. generations • “=IF(i<=$D$1,1,0)”, “=IF(j<=$F$1,1,0)” (i: chromosome number, j: generation number)

Set alleles from random values • “=IF(RAND()<$B$1,1,0)” • When the random value is less than the allele frequency, the allele is 1; otherwise 0. • “=IF(AND($B6=1,C$3=1),IF(RAND()<$B$1,1,0),0)” • A chromosome with 1 in column B is assigned an allele; otherwise the cell is 0. • Drag to fill!

Calculate allele frequency • Insert two rows for the calculation • “=SUM(C6:C10000)” • “=C4/$D$1”

Simulate the next generation • Use the allele frequency of generation k-1 to make generation k. • Copy the content of cell C6 from the formula bar (the box below the toolbar) and enter it into D6 with the appropriate modification: the previous generation's allele frequency (C$5) takes the place of the initial allele frequency ($B$1).

Copy to all generations

Expand the size and draw a graph of allele frequency change • Fix the vertical axis to a minimum of 0 and a maximum of 1 • Move the graph to where the input parameters are visible.

Make the allele beneficial for survival • “=IF(AND($B6=1,D$3=1),IF(RAND()<C$5*$H$1,1,0),0)” • Multiply the allele frequency used for the next generation (C$5) by $H$1, which raises the fraction of the allele that reaches reproductive age.
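The whole drift sheet, including the survival factor, can be condensed into a few lines of R. This is a sketch under the same simplifications as the spreadsheet (haploid chromosomes, no mating structure, the frequency simply multiplied by w); the function name `simulate_drift` and the default values are hypothetical, while `af`, `nc`, `ng`, and `w` follow the initial inputs named in the slides.

```r
# Sketch of the spreadsheet drift simulation.
# af: initial allele frequency, nc: no. chromosomes,
# ng: no. generations, w: deviation from neutrality (w = 1 is neutral).
simulate_drift <- function(af = 0.5, nc = 100, ng = 200, w = 1) {
  freq <- numeric(ng)
  # First generation: =IF(RAND()<$B$1,1,0) for each chromosome, then the mean.
  freq[1] <- mean(runif(nc) < af)
  for (g in seq_len(ng)[-1]) {
    # Later generations: RAND() < C$5*$H$1; the probability is capped at 1.
    p <- min(freq[g - 1] * w, 1)
    freq[g] <- mean(runif(nc) < p)
  }
  freq
}

set.seed(3)
neutral <- simulate_drift(af = 0.5, nc = 100, ng = 200, w = 1)
favored <- simulate_drift(af = 0.5, nc = 100, ng = 200, w = 1.05)
# Vertical axis fixed to 0..1, as in the spreadsheet graph.
plot(neutral, type = "l", ylim = c(0, 1),
     xlab = "Generation", ylab = "Allele frequency")
lines(favored, lty = 2)
```

Once the frequency reaches 0 it stays there, which corresponds to loss of the allele; with w = 1 a frequency of 1 is also absorbing (fixation).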

If you have time to spare: try swapping the rows and columns used for chromosomes and generations.

Preparation of your computer • Spreadsheet application such as EXCEL • Haploview: search for “Haploview” and install it • R, a free statistical environment: search for “CRAN” and install it

Make a working folder “TestFolder” • EXCEL • Open a new document with EXCEL • Save it as “test.txt”, tab-delimited, in “TestFolder” • Haploview • Open Haploview • Open the sample files • R • Open R • Change the working directory to “TestFolder”

Allele frequency of SNPs • The vast majority of SNPs have a low allele frequency; the number of SNPs decreases as allele frequency increases. • Above an allele frequency of 10%, SNPs are spread almost evenly across allele frequencies.

Simulate allele frequency • SNP • Diallelic • The frequency of one allele follows the uniform distribution from 0 to 1. • “RAND()” in EXCEL

Simulate allele frequency • Use R • “af<-runif(1000)” • Check that “af” is uniformly distributed • Histogram • “hist(af)” • Sort and plot • “afsort<-sort(af)” • “plot(afsort)”

Hardy-Weinberg Equilibrium (HWE) and Disequilibrium • Diploid organisms carry chromosomes in pairs. When chromosomes pair at random, the population is said to be in HWE. • Allele frequencies are p and q; p+q=1 • Diplotype frequencies • p^2, 2pq, q^2 in HWE • p^2+pq*f, 2pq(1-f), q^2+pq*f in general • f: fixation index • When f=0, HWE • When f=1, no heterozygotes.

HWE and f • Simulate with EXCEL. • Give an allele frequency • Calculate the frequency of the other allele • Place the following in one row: allele freq. 1, allele freq. 2, f, homozygote freq., heterozygote freq., other homozygote freq., sum of the three diplotype freqs., and the homozygote/heterozygote/other homozygote freqs. under HWE.
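The spreadsheet row described above can be mirrored with a small R helper. This is a sketch only; the function name `diplotype_freq` is hypothetical, and it uses the form p^2+pq*f, 2pq(1-f), q^2+pq*f, which keeps the three frequencies summing to 1 for any f.

```r
# Diplotype frequencies for allele frequency p and fixation index f.
# diplotype_freq is an illustrative helper, not part of any package.
diplotype_freq <- function(p, f = 0) {
  q <- 1 - p  # frequency of the other allele
  c(hom1 = p^2 + p * q * f,
    het  = 2 * p * q * (1 - f),
    hom2 = q^2 + p * q * f)
}

hwd <- diplotype_freq(p = 0.3, f = 0.2)  # with deviation from HWE
hwe <- diplotype_freq(p = 0.3, f = 0)    # under HWE: p^2, 2pq, q^2
rbind(hwd, hwe)
sum(hwd)  # the three diplotype frequencies sum to 1
```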

Chi-square test of HWE • Add a column for N, the number of individuals. • Add the following columns. • Multiply the diplotype frequencies under HWD (Hardy-Weinberg disequilibrium) and under HWE by N. • {D1,D2,D3}, {E1,E2,E3}: numbers of samples of each diplotype under HWD and HWE • Chi^2=(D1-E1)^2/E1+(D2-E2)^2/E2+(D3-E3)^2/E3

Scatter plot of f and chi^2 • In the f column, increase f from 0 to 1 in steps of 0.1 (negative values of f can be added as long as all diplotype frequencies stay non-negative) • Copy and paste the other columns • Draw a scatter plot of the f and chi^2 columns • Add a trendline (polynomial) • Chi^2=N*f^2
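The relation Chi^2=N*f^2 can be checked numerically in R. The sketch below assumes the diplotype frequencies p^2+pq*f, 2pq(1-f), q^2+pq*f; the function name `hwe_chisq` and the example values p = 0.3 and N = 100 are arbitrary choices for illustration.

```r
# Chi-square statistic comparing diplotype counts under HWD with those under HWE.
# hwe_chisq is an illustrative helper; p and N below are arbitrary.
hwe_chisq <- function(p, f, N) {
  q <- 1 - p
  observed <- N * c(p^2 + p * q * f, 2 * p * q * (1 - f), q^2 + p * q * f)  # D1..D3
  expected <- N * c(p^2, 2 * p * q, q^2)                                    # E1..E3
  sum((observed - expected)^2 / expected)
}

N <- 100
f_values <- seq(0, 1, by = 0.1)
chisq <- sapply(f_values, function(f) hwe_chisq(p = 0.3, f = f, N = N))

plot(f_values, chisq, xlab = "f", ylab = "Chi^2")
curve(N * x^2, add = TRUE)  # the points fall on the parabola N * f^2
```

The agreement is exact rather than approximate: each term of the sum contributes N*f^2 times q^2, 2pq, and p^2 respectively, and these add up to N*f^2.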

Simulate allele frequency • Simulate diplotype frequencies • p^2+pq*f, 2pq(1-f), q^2+pq*f • Not with runif • Generate random values from a Dirichlet distribution • Install the “MCMCpack” package • Select the install site from the toolbar • Load the package • “library(MCMCpack)”

“af<-rdirichlet(10,c(1,1))” • Ten sets of allele frequencies for two alleles • “af<-rdirichlet(1000,c(1,1))” • “hist(af[,1])” • “hist(af[,2])” • “plot(sort(af[,1]))” • “plot(sort(af[,2]))”

Simulate allele frequencies with a difference between the two alleles • “af<-rdirichlet(1000,c(0.75,0.25))” • “hist(af[,1])” • “mean(af[,1])” • “mean(af[,2])” • “apply(af,2,mean)”

Make the histogram of allele frequency more peaked • Make the variance smaller • “af<-rdirichlet(1000,c(0.75,0.25)*100)” • “hist(af[,1])” • “apply(af,2,mean)”
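The effect of multiplying the Dirichlet parameters by 100 can be seen without installing a package. The sketch below swaps in base R's `rbeta` for `rdirichlet`, relying on the fact that with two alleles the first component of a Dirichlet(a, b) draw follows a Beta(a, b) distribution; the variable names are hypothetical.

```r
# Base-R stand-in for the two-allele Dirichlet draws (an assumption of this
# sketch: rbeta replaces MCMCpack's rdirichlet for the first allele).
set.seed(4)
af_wide   <- rbeta(1000, 0.75, 0.25)  # counterpart of c(0.75, 0.25)
af_narrow <- rbeta(1000, 75, 25)      # counterpart of c(0.75, 0.25) * 100

# The means are both close to 0.75, but the variance shrinks sharply.
c(mean(af_wide), mean(af_narrow))
c(var(af_wide), var(af_narrow))

op <- par(mfrow = c(1, 2))
hist(af_wide, xlim = c(0, 1), main = "c(0.75, 0.25)")
hist(af_narrow, xlim = c(0, 1), main = "c(0.75, 0.25) * 100")
par(op)
```

The ratio of the parameters sets the mean allele frequencies, and their sum controls how tightly the draws concentrate around that mean.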

HWE/HWD→LE/LD • 2x2 table • {p,q} x {p,q} → {p1,q1} x {p2,q2} • f → r • delta=pqf → delta=sqrt(p1p2q1q2)r • Chi^2=N r^2
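The analogy between f and r can be made concrete with a 2x2 table of haplotype counts. This is a sketch under assumptions: the function name `ld_from_haplotypes`, its argument names, and the example counts are hypothetical, and N here is the number of chromosomes (haplotypes).

```r
# LD measures from a 2x2 table of haplotype counts.
# h11: allele 1 at both loci, h12: allele 1 at locus 1 and allele 2 at locus 2,
# h21: allele 2 at locus 1 and allele 1 at locus 2, h22: allele 2 at both loci.
ld_from_haplotypes <- function(h11, h12, h21, h22) {
  N  <- h11 + h12 + h21 + h22
  p1 <- (h11 + h12) / N; q1 <- 1 - p1   # allele frequencies at locus 1
  p2 <- (h11 + h21) / N; q2 <- 1 - p2   # allele frequencies at locus 2
  delta <- h11 / N - p1 * p2            # deviation from linkage equilibrium
  r <- delta / sqrt(p1 * q1 * p2 * q2)  # delta = sqrt(p1 p2 q1 q2) * r
  list(delta = delta, r = r, chisq = N * r^2)
}

ld <- ld_from_haplotypes(40, 10, 10, 40)
ld$r      # 0.6
ld$chisq  # 36

# N * r^2 agrees with the Pearson chi-square of the same table.
pearson <- chisq.test(matrix(c(40, 10, 10, 40), nrow = 2), correct = FALSE)
unname(pearson$statistic)
```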

3 • Check experimental data • What in your data do you check? • After checking, what do you want to do?

Are the data appropriate to be analyzed? • What does “inappropriate for analysis” mean? • Data are analyzed by checking whether they fit the distribution given by a hypothesis/model.

Distribution • A. Ideal statistical distribution given by a hypothesis/model • B. Smooth but non-ideal distribution • C. Non-smooth distribution, or a distribution with outliers

What to do? • A. Ideal → OK • B. Smooth but not ideal → Seek methods that can understand and utilize the distribution. • C. Non-smooth, outliers → Identify the sources of the non-smoothness and of the outliers → Remove the items that have such a “cause” (do not remove items merely because their values are outlying).

Know distributions • Draw a histogram • Draw a cumulative distribution • Mean and variance

Random variables from a normal distribution • “N<-100000” • “data1<-rnorm(N,1)” • “hist(data1)”, “plot(sort(data1))”, “mean(data1)”, “var(data1)”

Random variables from a Poisson distribution • “N<-100000” • “data2<-rpois(N,0.1)” • “hist(data2)”, “plot(sort(data2))”, “mean(data2)”, “var(data2)”

When a Poisson distribution represents defects in the data and a normal distribution is superimposed on it • “sum<-data1*0.1+data2” • “hist(sum)”, “plot(sort(sum))”, “mean(sum)”, “var(sum)” • Learn to save your plots as image files!
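One way to save a plot to a file is to open a file-based graphics device, draw, and then close the device. The sketch below is an illustration under assumptions: it uses the `pdf()` device, a temporary folder, and the hypothetical names `mixed` and `out_file`; `png()` works the same way on systems where that device is available.

```r
# Recreate the mixed data, then write its histogram and sorted plot to a PDF.
set.seed(5)
N <- 100000
data1 <- rnorm(N, 1)     # normal component
data2 <- rpois(N, 0.1)   # Poisson component standing in for data defects
mixed <- data1 * 0.1 + data2

out_file <- file.path(tempdir(), "mixed_distribution.pdf")  # illustrative path
pdf(out_file, width = 8, height = 4)  # open the file device
op <- par(mfrow = c(1, 2))
hist(mixed, breaks = 100)
plot(sort(mixed))
par(op)
dev.off()                             # close the device so the file is written

file.exists(out_file)
```

Replacing `tempdir()` with the working folder keeps the figure next to the data files.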

Your data will come from outside R • EXCEL file → text file (tab-delimited) • A file with multiple columns and many rows of RAND() • “yourdata<-read.table(file="yourfile.txt",header=T)” • Or “yourdata<-read.table(file="yourfile.txt",header=F)”
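The round trip through a tab-delimited text file can be rehearsed entirely inside R before using a real EXCEL export. In this sketch the data frame `example_data`, the column names `col1` and `col2`, and the temporary path are hypothetical stand-ins for the file saved from EXCEL.

```r
# Write a tab-delimited file that looks like an EXCEL export of RAND() columns,
# then read it back with read.table.
set.seed(6)
example_data <- data.frame(col1 = runif(1000), col2 = runif(1000))

path <- file.path(tempdir(), "yourfile.txt")  # illustrative location
write.table(example_data, file = path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# header = TRUE because the first line holds the column names;
# use header = FALSE for a file without a header line.
yourdata <- read.table(file = path, header = TRUE, sep = "\t")

str(yourdata)
hist(yourdata$col1)
```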

Distribution • “hist(yourdata$col1)” • “plot(sort(yourdata$col1))” • “mean(yourdata$col1)” • “var(yourdata$col1)”
