Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

larissa
Presentation Transcript


  1. Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

  2. Learning Objectives
     • Python matplotlib library to visualize data:
       • Scatter plot
       • Histogram
       • Kernel density estimate
       • Box plots
     • Descriptive statistics:
       • Mean and median
       • Standard deviation and interquartile range
       • Central limit theorem

  3. An Example Data Set 0.022 -0.083 0.048 -0.010 -0.125 0.195 -0.071 -0.147 0.033 0.080 0.073 0.016 0.148 0.135 0.006 -0.089 0.165 -0.088 -0.137 0.094

  4. Scatter Plot [Figure: scatter plot of the 20 example measurements. Axes: Measurement vs. Order of Measurement.]

  5. Histogram [Figure: scatter plot of the measurements next to three histograms with bin sizes 0.1, 0.05 and 0.025. Histogram axes: Number of Measurements vs. Measurement.]

  6. Cumulative Distributions [Figure: scatter plot of the measurements next to their cumulative distribution. Axes: Cumulative Frequency vs. Measurement.]

  7. Kernel Density Estimate [Figure: scatter plot of the measurements next to their kernel density estimate. Axes: Number of Measurements vs. Measurement.]

  8. Original Distribution [Figure: four panels: scatter plot of the measurements, histogram (bin size = 0.05), kernel density estimate, and the original distribution (Frequency vs. Measurement).]

  9. More Data [Figure: four panels for a larger data set: scatter plot of the measurements, histogram (bin size = 0.05), kernel density estimate, and the original distribution (Frequency vs. Measurement).]

  10. Exercise 1 Download ibb2015_7_exercise1.py
      (a) Draw 20 points from a normal distribution with mean=0 and standard deviation=0.1.

          import numpy as np
          points=20  # sample size; the later parts of this exercise use this name
          y=0.1*np.random.normal(size=points)
          print(y)

      Example output (the values differ on every run):

          [-0.09946073 -0.19612617 0.03442682 0.02622746 -0.28418124 -0.04245968 0.05922837 0.01199874 0.13454915 -0.07482707 -0.11688758 0.01714036 0.03280043 0.01356022 0.09128649 -0.18923468 0.14536047 -0.07764629 -0.0349553 0.04300367]

  11. Exercise 1 (b) Make scatter plot of the 20 points.

          import matplotlib.pyplot as plt
          x=range(1,points+1)  # measurement order: 1, 2, ..., points
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          ax1.scatter(x,y,color='red',lw=0,s=40)
          ax1.set_xlim([0,points+1])
          ax1.set_ylim([-1,1])
          fig.savefig('ibb2015_7_exercise1_scatter_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
          plt.close(fig)

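  12. Exercise 1 (c) Plot histograms.

          # 20, 40 and 80 bins over [-1,1] give bin sizes 0.1, 0.05 and 0.025
          for nbins in [20,40,80]:  # renamed from bin, which shadows a Python built-in
              fig, (ax1) = plt.subplots(1,figsize=(6,6))
              # density=True replaces normed=True, which newer matplotlib releases no longer accept
              ax1.hist(y,bins=nbins,histtype='step',color='black', range=[-1,1], lw=2, density=True)
              ax1.set_xlim([-1,1])
              fig.savefig('ibb2015_7_exercise1_bin'+str(nbins)+'_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
              plt.close(fig)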

  13. Exercise 1 (d) Plot cumulative distribution.

          y_cumulative=np.linspace(0,1,points)  # cumulative frequency from 0 to 1
          x_cumulative=np.sort(y)               # measurements in increasing order
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          ax1.plot(x_cumulative,y_cumulative,color='black', lw=2)
          ax1.set_xlim([-1,1])
          ax1.set_ylim([0,1])
          fig.savefig('ibb2015_7_exercise1_cumulative_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
          plt.close(fig)

  14. Exercise 1 (e) Plot kernel density estimate.

          import scipy.stats as stats
          kde_points=1000
          kde_x = np.linspace(-1,1,kde_points)  # grid on which the estimate is evaluated
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          kde_y=stats.gaussian_kde(y)           # callable density estimate
          ax1.plot(kde_x,kde_y(kde_x),color='black', lw=2)
          ax1.set_xlim([-1,1])
          fig.savefig('ibb2015_7_exercise1_kde_points'+str(points)+'.png',dpi=300,bbox_inches='tight')
          plt.close(fig)

  15. Comparing Measurements

  16. Comparing Measurements – Cumulative distributions

  17. Systematic Shifts

  18. Exercise 2 Download ibb2015_7_exercise2.py
      (a) Generate 5 data sets with 20 data points each from normal distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard deviation=0.1.

          y=[]
          for j in range(5):
              y.append(0.1*np.random.normal(size=20))
          # shift the last three data sets to means 0.1, 0.5 and 0.3
          y[2]+=0.1
          y[3]+=0.5
          y[4]+=0.3
          print(y)

  19. Exercise 2 (b) Make scatter plots for the 5 data sets.

          sixcolors=['#D4C6DF','#8968AC','#3D6570','#91732B','#963725','#4D0132']
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          for j in range(5):
              # spread the 20 points of data set j over a narrow band around x=j+1
              ax1.scatter(np.linspace(j+1-0.2,j+1+0.2,20), y[j],color=sixcolors[6-(j+1)], lw=0, alpha=1)
          ax1.set_xlim([0,6])
          ax1.set_ylim([-1,1])
          fig.savefig('ibb2015_7_exercise2_scatter_sample'+str(20),dpi=300,bbox_inches='tight')
          plt.close(fig)

  20. Correlation Between Two Variables

  21. Correlation Between Two Variables

  22. Correlation Between Two Variables

  23. Correlation Between Two Variables

  24. Correlation Between Two Variables
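
The correlation slides survive only as titles. As an illustration, and assuming the Pearson correlation coefficient is what the figures showed, the following minimal sketch computes it from its definition for three made-up relationships. The seed, sample size, noise levels and the helper `pearson_r` are all assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 200                         # hypothetical sample size

x = rng.normal(size=n)
noise = rng.normal(size=n)

y_strong = x + 0.1 * noise  # nearly a straight line
y_weak = x + 3.0 * noise    # trend mostly hidden by noise
y_none = noise              # independent of x


def pearson_r(a, b):
    """Pearson correlation coefficient computed from its definition."""
    a = a - a.mean()
    b = b - b.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))


r_strong = pearson_r(x, y_strong)
r_weak = pearson_r(x, y_weak)
r_none = pearson_r(x, y_none)
print(r_strong, r_weak, r_none)
```

`np.corrcoef(x, y_strong)[0, 1]` returns the same quantity and is the usual shortcut.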

  25. Data Visualization http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

  26. Process of Statistical Analysis • A random sample is drawn from the population • Sample statistics describe the sample • Inferences about the population are made from the sample statistics

  27. Distributions [Figure: four example distributions (Normal, Skewed, Long tails, Complex), each shown with samples of size n=3, n=10 and n=100.]

  28. Mean Sample Mean
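
The formula on this slide did not survive in the transcript. Assuming the standard definition, for a sample $x_1, \dots, x_n$ the mean is:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```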

  29. Mean - Sample Size [Figure: sample mean versus sample size (0 to 100) for the normal distribution; the mean axis runs from -0.2 to 0.2.]

  30. Mean – Sample Size [Figure: sample mean versus sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions.]

  31. Mode, Maximum and Minimum • Mode: the most common value in the sample • Maximum: the largest value in the sample • Minimum: the smallest value in the sample

  32. Median, Quartiles and Percentiles • Quartiles: the values below which 25%, 50% (the median) and 75% of the sample lie • Percentiles: the value below which m% of the sample lies
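
As a small illustration of these definitions, the sketch below computes the quartiles and one percentile of the example data set from slide 3 with `np.percentile`. The choice of the 90th percentile is arbitrary, and NumPy's default linear interpolation between neighbouring values is only one of several conventions.

```python
import numpy as np

# The 20 measurements of the example data set on slide 3.
data = np.array([0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071,
                 -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135,
                 0.006, -0.089, 0.165, -0.088, -0.137, 0.094])

# Quartiles: 25%, 50% (median) and 75% of the sample lie below these values.
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Any other percentile works the same way; 90 is an arbitrary example.
p90 = np.percentile(data, 90)

print(q1, median, q3, p90)
```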

  33. Median and Mean – Sample Size [Figure: median (gray) and mean versus sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions.]

  34. Variance Sample Mean Variance
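
The variance formula is also missing from the transcript. A common definition, for a sample $x_1, \dots, x_n$ with mean $\bar{x}$, is given below. Whether the slide divided by $n$ (the default of `np.var` and `np.std`) or by $n-1$ (the unbiased estimator) cannot be recovered, so the divisor here is an assumption.

```latex
s^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The standard deviation is the square root, $s = \sqrt{s^2}$.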

  35. Variance – Sample Size [Figure: sample variance versus sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions.]

  36. Interquartile Range (IQR) • Quartiles: the values below which 25%, 50% (the median) and 75% of the sample lie • Interquartile range: the difference between the 75% and the 25% quartiles

  37. Interquartile Range and Standard Deviation [Figure: standard deviation and IQR/1.349 (gray) versus sample size (up to 100) for the Normal, Skewed, Long tails and Complex distributions.]
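
The factor 1.349 is the interquartile range of a normal distribution with standard deviation 1, so IQR/1.349 estimates the standard deviation when the data are normal. The sketch below checks this on simulated data and then adds a few extreme values to show how differently the two measures react. The seed, sample size and outlier values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
sigma = 0.1                     # same standard deviation as in Exercise 1
sample = sigma * rng.normal(size=10000)


def sd_from_iqr(values):
    """Estimate the standard deviation as IQR / 1.349."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / 1.349


sd = np.std(sample)
sd_iqr = sd_from_iqr(sample)

# Add ten hypothetical extreme values, mimicking a long tail.
contaminated = np.concatenate([sample, np.full(10, 5.0)])
sd_out = np.std(contaminated)
sd_iqr_out = sd_from_iqr(contaminated)

print(sd, sd_iqr)          # both close to sigma
print(sd_out, sd_iqr_out)  # the standard deviation is inflated, IQR/1.349 barely moves
```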

  38. Central Limit Theorem • For many distributions, the sum of a large number of values drawn from the distribution converges to a normal distribution if:
     • The values are drawn independently;
     • The values are all drawn from the same distribution; and
     • The distribution has a finite mean and variance.
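
The theorem can be checked numerically. The sketch below sums independent draws from an exponential distribution, which is strongly skewed but has a finite mean and variance, and compares the skewness of single draws with the skewness of the sums. The choice of distribution, the number of terms and the helper `skewness` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
n_terms = 50                    # values added up in each sum
n_sums = 20000                  # number of independent sums

# Exponential distribution with mean 1 and variance 1.
draws = rng.exponential(scale=1.0, size=(n_sums, n_terms))
sums = draws.sum(axis=1)

# Standardise with the known mean (n_terms * 1) and variance (n_terms * 1).
z = (sums - n_terms) / np.sqrt(n_terms)


def skewness(values):
    """Third standardised moment; 0 for a symmetric distribution."""
    centred = values - values.mean()
    return np.mean(centred**3) / np.mean(centred**2) ** 1.5


skew_single = skewness(draws[:, 0])  # skewness of individual draws
skew_sum = skewness(sums)            # skewness of the sums, much closer to 0

print(skew_single, skew_sum, z.mean(), z.std())
```

A histogram of `z` plotted with `ax1.hist(z, bins=100, density=True)` should look close to a standard normal curve.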

  39. Uncertainty in Determining the Mean [Figure: distributions of the sample mean for the Normal, Skewed, Long tails and Complex distributions at sample sizes from n=3 to n=1000.]

  40. Standard Error of the Mean Sample Mean Variance Standard Error of the Mean
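
The formulas on this slide were lost in the transcript. Assuming the usual definition, the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below compares that estimate with the spread of the means of many repeated samples, the approach Exercise 3 takes. The seed, sample size and number of repeats are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen only for reproducibility
sigma = 0.1                     # true standard deviation of the population
sample_size = 20

# Estimate from a single sample: s / sqrt(n), with s the sample standard deviation.
sample = sigma * rng.normal(size=sample_size)
sem_formula = np.std(sample, ddof=1) / np.sqrt(sample_size)

# Empirical check: standard deviation of the means of many repeated samples.
means = [np.mean(sigma * rng.normal(size=sample_size)) for _ in range(5000)]
sem_empirical = np.std(means)

# Value expected when the true sigma is known.
sem_theory = sigma / np.sqrt(sample_size)

print(sem_formula, sem_empirical, sem_theory)
```

The single-sample estimate fluctuates from sample to sample, while the empirical value settles close to `sem_theory` as the number of repeats grows.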

  41. Exercise 3 Download ibb2015_7_exercise3.py
      (a) Generate skewed data sets.

          sample_size=10
          x_test=np.random.uniform(-1.0,1.0,size=30*sample_size)
          y_test=np.random.uniform(0.0,1.0,size=30*sample_size)
          # skew() is not defined on the slide; presumably it comes from the exercise script
          y_test2=skew(x_test,-0.1,0.2,10)
          y_test2/=max(y_test2)
          x_test2=x_test[y_test<y_test2]
          x_sample=x_test2[:sample_size]

      1. Generate a pair of random numbers within the range.
      2. Assign them to x and y.
      3. Keep x if the point (x,y) is within the distribution.
      4. Repeat 1-3 until the desired sample size is obtained.

      The values x obtained in this way will be distributed according to the original distribution.
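
The accept/reject steps of part (a) can be tried without the exercise script. The sketch below is self-contained and replaces the script's `skew()` with a hypothetical right-skewed density, `target_density`, so it illustrates the sampling method rather than reproducing the exercise. The seed and sample size are assumptions as well.

```python
import numpy as np


def target_density(x):
    """Hypothetical unnormalised, right-skewed density on [-1, 1].

    Stands in for the skew() function of the exercise script.
    """
    t = x + 1.0
    return t * np.exp(-5.0 * t)


rng = np.random.default_rng(4)  # fixed seed, chosen only for reproducibility
sample_size = 2000
n_candidates = 30 * sample_size  # same oversampling factor as on the slide

# Steps 1-2: random (x, y) pairs inside the bounding box.
x_test = rng.uniform(-1.0, 1.0, size=n_candidates)
y_test = rng.uniform(0.0, 1.0, size=n_candidates)

# Scale the density so that its largest sampled value is 1.
height = target_density(x_test)
height /= height.max()

# Step 3: keep x only if the point (x, y) lies under the density curve.
accepted = x_test[y_test < height]

# Step 4: truncate to the desired sample size.
x_sample = accepted[:sample_size]

print(len(accepted), np.mean(x_sample), np.median(x_sample))
```

Drawing all candidates at once, as the slide does, avoids an explicit loop; the factor 30 only needs to be large enough that at least `sample_size` candidates are accepted.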

  42. Exercise 3 (b) Calculate the mean of samples drawn from the skewed data set and the standard error of the mean, and plot the distribution of the averages.

          average=[]  # one sample mean per repeat
          for repeat in range(1000):
              …
              average.append(np.mean(x_sample))
          sem=np.std(average)  # spread of the sample means
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          ax1.set_title('Sample size = '+str(sample_size)+', SEM = '+str(sem))
          # density=True replaces normed=True, which newer matplotlib releases no longer accept
          ax1.hist(average,bins=100,histtype='step',color='red',range=[-0.5,0.5],density=True,lw=2)
          ax1.set_xlim([-0.5,0.5])

  43. Box Plot M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

  44. Box Plots [Figure: box plots of samples of size n=5, n=10 and n=100 from the Normal, Skewed, Long tails and Complex distributions.]

  45. Box Plots with All the Data Points [Figure: box plots overlaid with the individual data points for samples of size n=5, n=10 and n=100 from the Normal, Skewed, Long tails and Complex distributions.]

  46. Box Plots, Scatter Plots and Bar Graphs • Normal Distribution [Figure: box plots, scatter plots and bar graphs of samples from a normal distribution; error bars show either the standard deviation or the standard error.]

  47. Box Plots, Scatter Plots and Bar Graphs • Skewed Distribution [Figure: box plots, scatter plots and bar graphs of samples from a skewed distribution; error bars show either the standard deviation or the standard error.]

  48. Exercise 4 Download ibb2015_7_exercise4.py and plot box plots for a skewed data set.

          # sample_size, x_sample, x_samples and thiscolor are not defined on the slide;
          # presumably they are set in the exercise script
          fig, (ax1) = plt.subplots(1,figsize=(6,6))
          # individual data points as open circles around x=1
          ax1.scatter(np.linspace(1-0.1, 1+0.1,sample_size), x_sample, facecolors='none', edgecolor=thiscolor, lw=1)
          # sym='' hides the outlier markers of the box plot
          bp=ax1.boxplot(x_samples, notch=False, sym='')
          plt.setp(bp['boxes'], color=thiscolor, lw=2)
          plt.setp(bp['whiskers'], color=thiscolor, lw=2)
          plt.setp(bp['medians'], color='black', lw=2)
          plt.setp(bp['caps'], color=thiscolor, lw=2)
          plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)
          fig.savefig(…)

  49. Descriptive Statistics - Summary
     • Example distributions:
       • Normal distribution
       • Skewed distribution
       • Distribution with long tails
       • Complex distribution with several peaks
     • Mean, median, quartiles, percentiles
     • Variance, standard deviation, interquartile range (IQR), error bars
     • Box plots, bar graphs, and scatter plots

  50. Descriptive Statistics – Recommended Reading http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html
