Learn about R, a flexible, free software package for data analysis and visualization. Explore basic operations, data types (vectors, matrices, data frames), and visualization techniques. Practice linear regression and data manipulation in R.
Introduction to R Jiang Du Jan 17th 2008
What is R? • A software package for data analysis and graphical representation • Scripting language • Flexible and customizable • Free… • Weaknesses • Not particularly efficient in handling large data sets • Slow in executing big loops
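To make the "slow in executing big loops" point concrete, here is a minimal sketch that computes a sum of squares with an explicit loop and with a vectorized call. The helper names `sum_sq_loop` and `sum_sq_vec` are hypothetical, and the timings depend on the machine and R version, so none are quoted here.

```r
# Hypothetical helpers (not from the slides): same result, two styles.
sum_sq_loop <- function(x) {
  total <- 0
  for (v in x) {
    total <- total + v^2   # one element at a time
  }
  total
}

sum_sq_vec <- function(x) {
  sum(x^2)                 # whole vector at once
}

x <- as.numeric(1:100000)

# Both approaches agree on the value
all.equal(sum_sq_loop(x), sum_sq_vec(x))

# system.time() reports the elapsed time of each version
system.time(sum_sq_loop(x))
system.time(sum_sq_vec(x))
```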
Where to get R? • http://www.r-project.org/
Basic operations > 1+2*3 [1] 7 > log(10) [1] 2.302585 > 4^2 [1] 16 > sqrt(16) [1] 4 > pi [1] 3.141593
Basic operations > x = pi * 2 > x [1] 6.283185 > floor(x) [1] 6 > ceiling(x) [1] 7
Data type: vector > x = c(1,2,3,5,4) > x [1] 1 2 3 5 4 > y = 1:5 > y [1] 1 2 3 4 5 > x + 2 [1] 3 4 5 7 6 > x+y [1] 2 4 6 9 9 > length(x) [1] 5 > sorted_x = sort(x) > sorted_x [1] 1 2 3 4 5
Data type: vector > x [1] 1 2 3 5 4 > x[3] [1] 3 > x[1:2] [1] 1 2 > x[-3] [1] 1 2 5 4 > x[x > 3] [1] 5 4 > x > 3 [1] FALSE FALSE FALSE TRUE TRUE > which(x > 3) [1] 4 5
Data type: matrix > m = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) > m [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 > m[1, 2] [1] 2 > m[1:2, 2:3] [,1] [,2] [1,] 2 3 [2,] 5 6
Data type: matrix > m2 = matrix(c(2,0,0,0,2,0,0,0,2), nrow = 3, byrow = TRUE) > m2 [,1] [,2] [,3] [1,] 2 0 0 [2,] 0 2 0 [3,] 0 0 2 > m * m2 [,1] [,2] [,3] [1,] 2 0 0 [2,] 0 10 0 [3,] 0 0 18 > m %*% m2 [,1] [,2] [,3] [1,] 2 4 6 [2,] 8 10 12 [3,] 14 16 18
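The difference between `*` (element by element) and `%*%` (matrix product) can be checked directly. The sketch below rebuilds `m` and `m2` from the slides; writing `m2` as `2 * diag(3)` is an equivalent construction chosen here for brevity, not the form used on the slide.

```r
m  <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m2 <- 2 * diag(3)            # same values as the m2 on the slide

# Element-wise product: only the diagonal of m survives, doubled
elementwise <- m * m2
all(elementwise == diag(2 * diag(m)))

# Matrix product with 2 * identity: every entry of m is doubled
matrix_product <- m %*% m2
all(matrix_product == 2 * m)
```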
Data type: data frame > a = c(1:5) > b = a^2 > df = data.frame(a,b) > df a b 1 1 1 2 2 4 3 3 9 4 4 16 5 5 25 > df$b [1] 1 4 9 16 25 > df[3, 2] [1] 9
Data type: data frame > dim(df) [1] 5 2 > subset(df, a > 2) a b 3 3 9 4 4 16 5 5 25 > subset(df, a > 2 & b < 10) a b 3 3 9
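The same rows can also be selected with logical indexing, which is what `subset()` does behind the scenes. A short sketch using the `df` from the slides (the names `rows_subset` and `rows_index` are introduced here for illustration):

```r
a  <- 1:5
b  <- a^2
df <- data.frame(a, b)

# subset() evaluates the condition inside the data frame
rows_subset <- subset(df, a > 2 & b < 10)

# Equivalent logical indexing: condition on rows, all columns kept
rows_index <- df[df$a > 2 & df$b < 10, ]

all.equal(rows_subset, rows_index)
```

One difference worth knowing: when the condition evaluates to NA for some row, `subset()` drops that row, while logical indexing returns a row filled with NA.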
Visualization of data > x = 1:10 > y = x^2 > plot(x, y) > z = c(rep(1, 3), rep(5:6, 10), 1:10) > hist(z)
Visualization of data > x = seq(-10, 10, length= 30) > y = x > f = function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r } > z = outer(x, y, f) > persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
Loops, functions, etc. > x = c(1, 2, 3, 4, 5) > y = x > for (i in 1:length(x)) {y[i] = x[i]^2} > y [1] 1 4 9 16 25 > apply(as.array(x), 1, "^", 2) [1] 1 4 9 16 25 > x^2 [1] 1 4 9 16 25
Loops, functions, etc. > x = 1:5 > f3 = function(x) {return(x^3)} > apply(as.array(x), 1, f3) [1] 1 8 27 64 125 > source("~/test.r") [1] -1 -1 9 16 25
One of the most useful commands ? > ?apply
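As a quick illustration of what `?apply` documents, the second argument selects the margin: 1 applies the function to each row, 2 to each column. A minimal sketch on the 3x3 matrix used earlier (variable names are chosen here for illustration):

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

row_sums <- apply(m, 1, sum)   # one value per row:    6 15 24
col_sums <- apply(m, 2, sum)   # one value per column: 12 15 18

# For a plain vector, sapply() avoids the as.array() conversion
cubes <- sapply(1:5, function(v) v^3)   # 1 8 27 64 125
```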
Practice: on Bordeaux wines • Problem • Bordeaux wine vintage quality and the weather • Bordeaux wines in different vintage years have different qualities (reflected in prices) • The older the better? • Weather is an important factor • Hot, dry summer preferred
Practice: the data • WRAIN: Winter (Oct.-March) Rain ML • DEGREES: Average Temperature (Deg Cent.) April-Sept. • HRAIN: Harvest (August and Sept.) ML • TIME_SV: Time since Vintage (Years)
Practice: load the data > wine_data = read.table("~/wine.data", header = TRUE, na.strings = ".");
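The `na.strings = "."` argument tells `read.table` to treat a lone `.` as a missing value. The sketch below shows the effect on a tiny made-up table passed through the `text` argument; the column names `ID` and `SCORE` and all values are invented for illustration and are not part of the wine data.

```r
# Made-up example input, NOT the wine data
toy_text <- "ID SCORE
1 0.5
2 .
3 1.5"

toy <- read.table(text = toy_text, header = TRUE, na.strings = ".")

is.na(toy$SCORE)                 # the "." became NA
mean(toy$SCORE, na.rm = TRUE)    # NA is skipped when na.rm = TRUE
```

Without `na.strings`, the `.` would force the whole column to be read as text instead of numbers.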
Practice: visualization > plot(wine_data$TIME_SV, wine_data$LPRICE2);
Practice: visualization median_price = median(wine_data$LPRICE2, na.rm = TRUE); plot(wine_data$DEGREES, wine_data$HRAIN, type = "n", xlab = "Temperature", ylab = "Harvest rain"); points(wine_data$DEGREES[wine_data$LPRICE2 >= median_price], wine_data$HRAIN[wine_data$LPRICE2 >= median_price], pch = 19, col = "blue"); points(wine_data$DEGREES[wine_data$LPRICE2 < median_price], wine_data$HRAIN[wine_data$LPRICE2 < median_price], pch = 19, col = "red"); legend(15, 250, c(">= median price", "< median price"), pch = 19, col = c("blue", "red"));
Practice: linear regression • Find a set of parameters a, …, e, such that: • LPRICE2 ~ a * WRAIN + b * DEGREES + c * HRAIN + d * TIME_SV + e + error_term • The overall error should be minimized • In this case, the sum/average of squared errors • Sum((prediction - actual_price)^2)
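Minimizing the sum of squared errors has a closed-form solution, the normal equations. The sketch below solves them by hand on synthetic data and compares the result with `lm`; the predictors `x1`, `x2` and the coefficients used to generate `y` are assumptions made up for this example, not the wine data.

```r
set.seed(1)

# Synthetic data: y depends linearly on x1 and x2, plus noise
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 - 1 * x2 + 0.5 + rnorm(n, sd = 0.1)

# Design matrix with a column of ones for the intercept term
X <- cbind(Intercept = 1, x1, x2)

# Normal equations: solve (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() minimizes the same criterion (via a QR decomposition)
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))

# The minimized quantity: sum of squared errors
sse <- sum((X %*% beta_hat - y)^2)
```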
Practice: linear regression > lmfit = lm(LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, wine_data); > lmfit … Coefficients: (Intercept) WRAIN DEGREES HRAIN TIME_SV -12.145334 0.001167 0.616392 -0.003861 0.023847 > cat("RMS: ", sqrt(sum(lmfit$residuals^2)/length(lmfit$residuals)), "\n"); RMS: 0.2586167
Practice: linear regression plot(wine_data$VINT, wine_data$LPRICE2, xlab = "Vintage year", ylab = "log2 rel. price", pch = 19, col = "black"); points(wine_data$VINT[30:38], predict(lmfit, wine_data[30:38,]), pch = 19, col = "red"); legend(1965, -0.2, c("old data", "prediction"), pch = 19, col = c("black", "red"));
Practice: linear regression • Using fewer parameters in the model? • LPRICE2 ~ b * DEGREES + c * HRAIN + d + error_term • lmfit2 = lm(LPRICE2 ~ DEGREES + HRAIN, wine_data); • RMS: 0.349513
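The rise in RMS from 0.2586167 to 0.349513 is expected: on the data used for fitting, dropping predictors from a least-squares model can never lower the RMS. A minimal sketch of the same comparison, using R's built-in `mtcars` data set as a stand-in for the wine data; the helper `rms` and the choice of predictors are assumptions made for this example.

```r
# Hypothetical helper: root mean squared residual of a fitted model,
# the same quantity as sqrt(sum(res^2) / length(res))
rms <- function(fit) {
  sqrt(mean(residuals(fit)^2))
}

# Stand-in data: mtcars ships with R
full_fit    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced_fit <- lm(mpg ~ wt + hp, data = mtcars)

rms(full_fit)
rms(reduced_fit)   # never smaller than rms(full_fit) on the fitting data
```

A lower training RMS does not by itself mean better predictions for new vintages, so comparing the models on held-out years is a useful complement.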
Links • Classesv2: http://classesv2.yale.edu/ • Course wiki: http://lab.zoo.cs.yale.edu/cs445-wiki/ • R: http://www.r-project.org/ • Bordeaux wine analysis: http://www.liquidasset.com/orley.htm