430 likes | 579 Vues
This seminar presented by Edward Boone introduces participants to R, a powerful open-source statistical programming language. Attendees will learn how to start using R, including installation, basic operations, importing data, and utilizing built-in functions. The seminar covers fundamental concepts such as assignment, arithmetic operations, and data manipulation, encompassing statistical functions and graphical representations. Through practical examples, participants will gain insights into summary statistics, hypothesis testing, and regression modeling, equipping them with essential skills for data analysis.
E N D
Introduction to Data Analysis in R Department of Statistical Sciences and Operations Research Computation Seminar Series Speaker: Edward Boone Email: elboone@vcu.edu
What is R? • The R statistical programming language is a free open source package based on the S language developed by Bell Labs. • The language is very powerful for writing programs. • Many statistical functions are already built in. • Contributed packages expand the functionality to cutting edge research. • Since it is a programming language, generating computer code to complete tasks is required.
Getting Started • Where to get R? • Go to www.r-project.org • Downloads: CRAN • Set your Mirror: Anyone in the USA is fine. • Select Windows 95 or later. • Select base. • Select R-2.4.1-win32.exe • The others are if you are a developer and wish to change the source code.
Getting Started • The R GUI?
Getting Started • Opening a script. • This gives you a script window.
Getting Started Submit Selection • Submitting a program: • Use button • Right mouse click and run selection.
Getting Started • Basic assignment and operations. • Arithmetic Operations: • +, -, *, /, ^ are the standard arithmetic operators. • Matrix Arithmetic. • * is element wise multiplication • %*% is matrix multiplication • Assignment • To assign a value to a variable use “<-”
Getting Started • How to use help in R? • R has a very good help system built in. • If you know which function you want help with simply use ?_______ with the function in the blank. • Ex: ?hist. • If you don’t know which function to use, then use help.search(“_______”). • Ex: help.search(“histogram”).
Importing Data • How do we get data into R? • Remember we have no point and click… • First make sure your data is in an easy to read format such as CSV (Comma Separated Values). • Use code: • D <- read.table(“path”,sep=“,”,header=TRUE)
Working with data. • Accessing columns. • D has our data in it…. But you can’t see it directly. • To select a column use D$column.
Working with data. • Subsetting data. • Use a logical operator to do this. • ==, >, <, <=, >=, <> are all logical operators. • Note that the “equals” logical operator is two = signs. • Example: • D[D$Gender == “M”,] • This will return the rows of D where Gender is “M”. • Remember R is case sensitive! • This code does nothing to the original dataset. • D.M <- D[D$Gender == “M”,] gives a dataset with the appropriate rows.
Summary Statistics • Mean • mean(D$wg) • [1] NA • It doesn’t know what to do with the missing values. • mean(D$wg, na.rm=TRUE) • [1] 16.62016
Summary Statistics Median median(D$wg,na.rm=TRUE) [1] 15 Quantiles quantile(D$wg,c(.25,.5,.75),na.rm=TRUE) 25% 50% 75% 8 15 20 Summary summary(D$wg) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.00 8.00 15.00 16.62 20.00 70.00 132.00
Summary Statistics Range range(D$wg,na.rm=TRUE) IQR IQR(D$wg,na.rm=TRUE) Standard Deviation sd(D$wg,na.rm=TRUE)
Basic Testing • One Sample T-Test t.test(D$wg, alternative="greater", mu=10, conf.level=0.95) • Output One Sample t-test data: D$wg t = 8.7429, df = 257, p-value < 2.2e-16 alternative hypothesis: true mean is greater than 10 95 percent confidence interval: 15.37016 Inf sample estimates: mean of x 16.62016
Basic Testing • Comparing Men to Women on weight gain. • Two Sample T-Test wg.M <- D[D$Gender =="M",] wg.F <- D[D$Gender =="F",] t.test(wg.M$wg, wg.F$wg, alternative="two.sided", mu=0, conf.level = 0.95) • Output Welch Two Sample t-test data: wg.M$wg and wg.F$wg t = 1.2645, df = 142.21, p-value = 0.2081 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.132955 5.155447 sample estimates: mean of x mean of y 18.08571 16.07447
Basic Testing • Paired T-Test Pt <- read.csv("H:\\Pairedt.csv",header=TRUE) t.test(Pt$A,Pt$B, alternative="two.sided", mu=0, paired = TRUE, conf.level = 0.95) • Output Paired t-test data: Pt$A and Pt$B t = -0.1064, df = 9, p-value = 0.9176 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.558007 1.418007 sample estimates: mean of the differences -0.07
The lm Function • R organizes many linear models into the lm function. • Regression • ANOVA
Regression • Consider the following dataset. cherry <- read.csv("H:\\cherry.csv",header=TRUE) plot(cherry$Height,cherry$Volume)
Regression • Fit a regression model to the data. lm(Volume ~ Height, data = cherry) • Output: Call: lm(formula = Volume ~ Height, data = cherry) Coefficients: (Intercept) Height -87.124 1.543
Regression • Maybe there is more… cherry.lm <- lm(Volume ~ Height,data = cherry) summary(cherry.lm) • Output: Call: lm(formula = Volume ~ Height, data = cherry) Residuals: Min 1Q Median 3Q Max -21.274 -9.894 -2.894 12.067 29.852 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -87.1236 29.2731 -2.976 0.005835 ** Height 1.5433 0.3839 4.021 0.000378 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 13.4 on 29 degrees of freedom Multiple R-Squared: 0.3579, Adjusted R-squared: 0.3358 F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
Regression • And more… par(mfrow=c(2,2)) plot(cherry.lm)
Regression • What all is there? names(cherry.lm) • Output [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model"
Regression • What all is there? • names(summary(cherry.lm)) • Output [1] "call" "terms" "residuals" "coefficients" [5] "aliased" "sigma" "df" "r.squared" [9] "adj.r.squared" "fstatistic" "cov.unscaled" • And even more…
Regression Plot the line par(mfrow=c(1,1)) plot(cherry$Height, cherry$Volume) abline(cherry.lm)
Regression • Fit confidence intervals to line. predict.lm(cherry.lm, interval="confidence") • Prediction intervals to a new data set can also be generated. • Influence measures can be generated as well. • Influence.measures() function does lots of these.
Multiple Linear Regression Matrix Plot pairs(cherry)
Multiple Linear Regression • To fit a multiple linear regression: cherry.lm <- lm(Volume ~ Height + Diam, data=cherry) summary(cherry.lm) Call: lm(formula = Volume ~ Height + Diam, data = cherry) Residuals: Min 1Q Median 3Q Max -6.4065 -2.6493 -0.2876 2.2003 8.4847 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -57.9877 8.6382 -6.713 2.75e-07 *** Height 0.3393 0.1302 2.607 0.0145 * Diam 4.7082 0.2643 17.816 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.882 on 28 degrees of freedom Multiple R-Squared: 0.948, Adjusted R-squared: 0.9442 F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Regression • ANOVA anova(cherry.lm) Analysis of Variance Table Response: Volume Df Sum Sq Mean Sq F value Pr(>F) Height 1 2901.2 2901.2 192.53 4.503e-14 *** Diam 1 4783.0 4783.0 317.41 < 2.2e-16 *** Residuals 28 421.9 15.1 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 What type of Sums of Squares are these?
Regression • ANOVA cherry.lm <- lm(Volume ~ Diam + Height, data=cherry) anova(cherry.lm) Analysis of Variance Table Response: Volume Df Sum Sq Mean Sq F value Pr(>F) Diam 1 7581.8 7581.8 503.1503 < 2e-16 *** Height 1 102.4 102.4 6.7943 0.01449 * Residuals 28 421.9 15.1 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression • Confidence Intervals confint(cherry.lm, level = 0.97) 1.5 % 98.5 % (Intercept) -77.73792775 -38.2373901 Diam 4.10395112 5.3123699 Height 0.04167615 0.6368263
Model Search • AIC extractAIC(cherry.lm) [1] 3.00000 86.93578 The first value is the effective degrees of freedom and the second value is the AIC • Stepwise. step(cherry.lm) This is based on AIC however can be specified to use BIC. Default is AIC Can do forward, backward and stepwise selection. Default is stepwise.
ANOVA • ANOVA do2 <- read.csv("H:\\do2.csv",header=TRUE) do2.aov <- aov(DO2 ~ Stream, data=do2) summary(do2.aov) • Output Df Sum Sq Mean Sq F value Pr(>F) Stream 2 37.056 18.528 4.5595 0.01898 * Residuals 29 117.844 4.064 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA • You can also use the lm framework do2.lm <- lm(DO2 ~ Stream, data=do2) anova(do2.lm) • Output Analysis of Variance Table Response: DO2 Df Sum Sq Mean Sq F value Pr(>F) Stream 2 37.056 18.528 4.5595 0.01898 * Residuals 29 117.844 4.064 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple Comparisons • Tukey’s HSD Procedure TukeyHSD(do2.aov) • Output Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = DO2 ~ Stream, data = do2) $Stream diff lwr upr p adj Piney-Cedar 0.4966667 -1.634953 2.62828674 0.8342003 South Run-Cedar -2.0533333 -4.184953 0.07828674 0.0607655 South Run-Piney -2.5500000 -4.776405 -0.32359544 0.0221846
Multiple Comparisons • Tukey’s HSD Procedure using the plot command. • do2.HSD <- TukeyHSD(do2.aov) • plot(do2.HSD)
Multiple Comparisons • Tukey’s HSD Procedure do2.HSD <- TukeyHSD(do2.aov, conf.level = 0.97) do2.HSD • Output Tukey multiple comparisons of means 97% family-wise confidence level Fit: aov(formula = DO2 ~ Stream, data = do2) $Stream diff lwr upr p adj Piney-Cedar 0.4966667 -1.832410 2.8257437 0.8342003 South Run-Cedar -2.0533333 -4.382410 0.2757437 0.0607655 South Run-Piney -2.5500000 -4.982642 -0.1173584 0.0221846
Multiple Comparisons • Plot it.
Kruskal Wallis Test • Tukey’s HSD Procedure kruskal.test(DO2 ~ Stream, data=do2) • Output Kruskal-Wallis rank sum test data: DO2 by Stream Kruskal-Wallis chi-squared = 7.5097, df = 2, p-value = 0.02340
Loess Regression • Regression from a non-parametric framework cherry.lo <- loess(Volume ~ Diam, data=cherry) cherry.lo$fitted • Output [1] 9.358813 10.471276 11.213760 17.518128 18.258269 18.629063 19.368892 [8] 19.368892 19.742982 20.117965 20.503236 20.890123 20.890123 21.950202 [15] 22.930251 26.020681 26.020681 27.657015 29.364601 29.799478 30.701120 [22] 31.658354 33.187952 41.983894 43.878359 50.561902 51.968665 54.854374 [29] 55.590857 55.590857 76.749972
Loess Regression cherry.plo <- predict(cherry.lo, cherry$Diam) plot(cherry$Diam,cherry$Volume) lines(cherry$Diam,cherry.plo)
Summary • R is programming environment with many standard statistical procedures already included. • Easy to control the procedure you are interested in. • No support. • Allows user to modify many existing procedures.
Summary • All of the R code and files can be found at: www.people.vcu.edu/~elboone2/CSS.htm