## Introduction to R for Absolute Beginners: Part II


Melinda Fricke, Department of Linguistics, University of California, Berkeley (melindafricke@berkeley.edu)

D-Lab Workshop Series, Spring 2013

**Welcome (back)!**

What we covered last time:

• creating and manipulating objects: variable assignment with =, <-, ->
• types of objects: single values vs. vectors vs. data frames
• types of data: numerical vs. character vs. factor (categorical)
• functions for manipulating data: c(), as.factor(), as.character(), as.numeric(), table(), aggregate()
• functions for getting around: ls(), rm(), read.table(), write.table(), dim(), summary(), head(), tail()
• subscripting a data frame by [row, column], e.g. salary[1,1]

**Today**

• Basic graphing
• Downloading and installing packages
• Basic statistical tests
• … Practice, practice, practice!

**Professors’ salaries, revisited**

http://linguistics.berkeley.edu/~mfricke/R_Workshop

(S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194. Downloaded from http://data.princeton.edu/wws509/datasets/#salary on January 31st, 2013.)

    read.table("salary.txt", header=T) -> salary

**Professors’ salaries**

How many rows and columns are in this dataset?

    dim(salary)
    [1] 52 6

What are the possible values for “rk” (rank)? Convert the column to a factor first if it is not one already, then list its levels.

    as.factor(salary$rk) -> salary$rk
    levels(salary$rk)
    [1] "assistant" "associate" "full"

How many data points are there for each rank?

    table(salary$rk)

assistant = 18, associate = 14, full = 20

How many males vs. females are there at each rank?

    table(salary$rk, salary$sx)

                female male
      assistant      8   10
      associate      2   12
      full           4   16

**Plotting**

Create a plot showing the relationship between years since degree (yd) and salary (sl).

    plot(y ~ x, data=NameOfDataframe)

plot() draws a basic scatterplot; the ~ is read “as a function of”.

    plot(sl ~ yd, data=salary)

Look at the help file for “plot” and try to add the following to your plot:

• a main title
• labels for the x and y axes
• red dots (instead of the default black dots)

    plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary ($)", col="red")

More customizing:

• ylim = c(0, 40000) changes the y range
• xlim = c(0, 20) changes the x range
• pch = 19 specifies the “plotting character”
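
As a rough illustration of how these options combine, the sketch below applies them to a small made-up data frame, so it can be run without downloading anything. The name `toy` and all of its values are hypothetical stand-ins, not the workshop's salary data.

```r
# Hypothetical stand-in for the salary data; the values are invented
toy <- data.frame(
  yd = c(2, 5, 9, 14, 18),                    # years since degree
  sl = c(16000, 18500, 22000, 27000, 31000)   # salary in dollars
)

# Axis ranges, stored once so they can be reused and checked
x_range <- c(0, 20)
y_range <- c(0, 40000)

# Confirm that no point falls outside the chosen ranges
all_visible <- all(toy$yd >= x_range[1] & toy$yd <= x_range[2] &
                   toy$sl >= y_range[1] & toy$sl <= y_range[2])

plot(sl ~ yd, data = toy,
     main = "Toy salaries", xlab = "years since degree", ylab = "salary ($)",
     col = "red", pch = 19,          # red, solid dots
     xlim = x_range, ylim = y_range)
```

Checking the data against xlim and ylim before plotting is optional, but points outside those ranges are silently left off the plot, so the check can save some confusion.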

Putting all of these options together:

    plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary ($)", col="red", ylim=c(0, 40000), xlim=c(0, 20), pch=19)

**Saving a plot**

    jpeg("ProfessorsSalaries.jpeg", width=8, height=5, units="in", res=300)
    plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary ($)", col="red")
    dev.off()

**More plotting**

    plot(salary)
    boxplot(sl ~ sx + rk, data=salary)

In the boxplot() call, data=salary says that these variable names can be found in the data frame called “salary”, and sl ~ sx + rk means “show me the salary (= dependent variable) as a function of both sex and rank”.

**Packages**

A package is basically a set of specialized functions. There are 2 parts to using a package: installing it (which downloads it to your computer) and loading it into your session. When you download R, it comes with certain packages already included.

• library() lists all installed packages
• sessionInfo() lists the packages loaded in the current session

**Packages**

To install and load a new package:

• In R, go to the “Packages & Data” menu.
• Go to the “Package Installer”.
• Search for “animation” and click “Get List”.
• Select the package and click “Install Selected”.
• In the console, type: library(animation).
• If nothing happens, that means everything went smoothly!

**Packages**

We’re going to work with some data sets that come installed with R, in the “datasets” package.

    library(help=datasets)
    head(faithful)

• eruptions = duration of each eruption
• waiting = # minutes since last eruption

**Correlation**

Is there a correlation between the duration of a given eruption and the number of minutes since the preceding eruption? Make a plot that shows the relationship between eruption duration and time since last eruption.
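
Before plotting, a quick look at the data frame can confirm the column names that go into the formula. This is a minimal sketch using only the built-in `faithful` data and base R functions; the variable name `both_numeric` is introduced here for illustration.

```r
# faithful ships with R in the datasets package, so no download is needed
dim(faithful)      # number of rows and columns
names(faithful)    # column names to use in the plotting formula
summary(faithful)  # range and quartiles of each column

# A scatterplot needs numeric values on both axes; check both columns
both_numeric <- all(sapply(faithful, is.numeric))
both_numeric
```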

    plot(y ~ x, data = dataframe)

    plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")

**Correlation**

How can we tell if this correlation is statistically significant? cor.test() runs a correlation test.

    cor.test( ~ eruptions + waiting, data=faithful)

cor = 0.9, p < 0.0001

**Correlation: simple linear regression**

How can we tell if this correlation is statistically significant? lm() fits a linear model (regression).

    lm(eruptions ~ waiting, data = faithful) -> faithful.lm
    summary(faithful.lm)

    > summary(faithful.lm)

    Call:
    lm(formula = eruptions ~ waiting, data = faithful)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.29917 -0.37689  0.03508  0.34909  1.19329

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
    waiting      0.075628   0.002219   34.09   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.4965 on 270 degrees of freedom
    Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108
    F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

**Correlation: simple linear regression**

To add the regression line to the plot, pass the fitted model to abline():

    plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")
    abline(faithful.lm)

**Correlation: “exploratory plotting”**

    plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")
    lines(lowess(faithful$waiting, faithful$eruptions), col="red", lwd=4)

• lowess() computes a smooth line through the points, which lines() then draws
• faithful$waiting gives the x coordinates
• faithful$eruptions gives the y coordinates
• col sets the color of the line
• lwd sets the “line width”

**Practice: correlation**

Remember the professors’ salaries dataset? Is there a correlation between salary and years since degree?

    cor.test( ~ x + y, data=yourdata)
    lm(y ~ x, data=yourdata)
    plot(y ~ x, data=yourdata)

**Practice: correlation**

Is there a correlation between salary and years since degree?

    cor.test( ~ sl + yd, data=salary)

cor = 0.675, p = 4.102e-08

    lm(sl ~ yd, data=salary) -> salary.lm
    summary(salary.lm)

Coefficients: yd = 390.65, p < 0.0001

Good news! You’ll make $391 more for every year after you get your degree!

    plot(sl ~ yd, data = salary, main = "Professors' salaries", xlab = "years since degree", ylab = "salary ($)")
    abline(salary.lm, lty=2)

**t-tests**

A t-test asks the question: Are these two sample distributions drawn from the same underlying population?

Example: test scores. Test scores generally form a normal distribution: a few are excellent, and a few are horrible, but the majority are right around the average.
Given two sets of test scores, we want to know if the students from one school performed significantly better than the students at a different school.

**t-tests**

A few words about normal distributions: A normal distribution can be described by a mean, and some variation around that mean. Look at the help file for the function rnorm(). rnorm() generates random numbers from a normal distribution. Create a distribution of 100 random numbers, with the mean of the distribution centered around 0.

The general form is rnorm(n, mean, sd).

    rnorm(100)                # try this multiple times
    hist(rnorm(100))          # now try this multiple times
    hist(rnorm(1000, 4.5))    # and this!

Each time we generate a set of random numbers, we are “sampling” from a distribution. The more numbers we generate, the better idea we get of the “underlying” distribution.

**t-tests**

A t-test asks the question: Are these two sample distributions drawn from the same underlying population? Example:

    hist(rnorm(10), ylim=c(0,5), xlim=c(-5,5), col="red")
    hist(rnorm(10), add=T, col="blue")

    hist(rnorm(1000), ylim=c(0,500), xlim=c(-5,5), col="red")
    hist(rnorm(1000), add=T, col="blue")

**t-tests**

One more example of the normal distribution:

    library(animation)
    quincunx()

**t-tests**

Example: American vs. Japanese cars

Download the .txt file located at http://linguistics.berkeley.edu/~mfricke/R_Workshop.html, and save it to your working directory. Read it in to R:

    read.table("Cars-MPG.txt", header=T) -> cars

Take a minute to inspect the dataframe.

**t-tests**

    hist(cars$American)           # a basic histogram
    hist(cars$Japanese, add=T)

How can we fix this problem? Set the x range with xlim=c(0,50).

**t-tests**

    hist(cars$Japanese, breaks=10, col="red", main="Fuel Efficiency in American vs. Japanese Cars", xlab="miles per gallon", xlim=c(0,50))
    hist(cars$American, breaks=10, col="blue", add=T)
    legend("topright", legend=c("American", "Japanese"), fill=c("blue", "red"))

**t-tests**

    hist(cars$American, breaks=10, col="blue", main="Fuel Efficiency in American vs. Japanese Cars", xlab="miles per gallon", xlim=c(0,50), prob=T)
    legend("topright", legend=c("American", "Japanese"), fill=c("blue", "red"))
    hist(cars$Japanese, breaks=10, col="red", add=T, prob=T)
    lines(density(cars$American), lwd=4, lty=2)
    lines(density(cars$Japanese), lwd=4, lty=2)

**t-tests**

    t.test(cars$American, cars$Japanese)

        Welch Two Sample t-test

    data:  cars$American and cars$Japanese
    t = -17.3377, df = 138.232, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -16.10429 -12.80710
    sample estimates:
    mean of x mean of y
     16.02532  30.48101

**ANOVA**

An ANOVA (ANalysis Of VAriance) asks the question: Given more than two sample populations, are any of them drawn from different underlying populations? In other words: are any of these groups different?
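
One way such a question might be posed in R is sketched below. It assumes the built-in `PlantGrowth` dataset (plant weights under a control and two treatments), which is not part of the workshop materials, and uses base R's aov(); the names `growth.aov` and `p_value` are introduced here for illustration.

```r
# PlantGrowth ships with R: plant weights for a control and two treatment groups
levels(PlantGrowth$group)

# One-way ANOVA: is the mean weight the same across all three groups?
growth.aov <- aov(weight ~ group, data = PlantGrowth)
summary(growth.aov)

# Pull the p-value for the group effect out of the summary table
p_value <- summary(growth.aov)[[1]][["Pr(>F)"]][1]
p_value
```

A small p-value here would suggest that at least one group differs from the others; it does not say which one.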