Introduction to Statistical Learning and Practical Applications

Join our course to explore statistical learning concepts, problems, and implementations using R and data analysis tools. Learn supervised and unsupervised learning techniques.

  1. 1 - Machine Learning, 19 August 2015 • Introductions and syllabus • Introduction to Statistical Learning Problems • Lab session: RStudio, R basics, data frames & basic plots

  2. Brief Introduction • Lecturer: Hapnes Toba (8th floor, GWM) • Contact: hapnestoba@gmail.com / 0818.0905.7037 • Research: • Information retrieval • Text analysis • Information extraction • Machine learning • Data mining

  3. Class Schedule • Theory: Wednesday, 07:30-09:45 (15-minute break) • Lab session: Wednesday, 10:00-12:00 • Room: Lab. Adv. 3 • Agreements: • Be on time (15-minute tolerance) • Gadgets switched off: in an emergency, please step out of the classroom first • Please make an appointment if you want a discussion outside the class sessions • Academic honesty: copy-pasting, cheating, and the like must be avoided

  4. Assessment and Syllabus (handed out) • Theory • Lab session: the R statistical language, case-based data analysis (groups of 2-3 people), bonus points: Java / C# (R interfacing)

  5. Other Matters • Each group (2-3 people) must create a dedicated cloud folder (Google Drive, Microsoft OneDrive, GitHub, or similar) for the assignments given. For every submission, the group must share the link to that folder with the lecturer. • Assignments and course materials are available at http://sitoba.itmaranatha.org • Regarding assignment deadlines, please follow the announced schedule. Late submissions are penalized by 25% of the grade per day late (4 days late means a grade of 0). • Copying answers between groups, whether in part or in full, results in a grade of 0 for the assignment concerned, for every group involved. • Quizzes may cover theory as well as programming, and will be announced in class.

  6. Textbook and References • An Introduction to Statistical Learning: With Applications in R (ISLR). (2014). James, G., Witten, D., Hastie, T., & Tibshirani, R. Springer. ISBN: 978-1-4614-7137-0. Website: www.StatLearning.com

  10. Statistical Learning Problems: Identify the risk factors for prostate cancer

  11. Classify a recorded phoneme based on a log-periodogram

  12. Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements

  13. Customize an email spam detection system

  14. Identify the numbers in a handwritten zip code

  15. Classify a tissue sample into one of several cancer classes, based on a gene expression profile.

  16. Establish the relationship between salary and demographic variables in population survey data

  17. Classify the pixels in a LANDSAT image, by usage

  18. The Supervised Learning Problem • Outcome measurement Y (also called dependent variable, response, target). • Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables). • In the regression problem, Y is quantitative (e.g., price, blood pressure). • In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample). • We have training data (x1, y1), …, (xN, yN). These are observations (examples, instances) of these measurements.
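
The two problem types on this slide can be sketched in R as follows. This is a minimal illustration on simulated data, not an example from the course: the variable names (x1, x2, y.num, y.class), the sample size, and the choice of lm() and glm() as the fitting methods are all assumptions made for the sketch.

```r
# Hypothetical simulated training data: N observations, p = 2 predictors
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)

# Regression problem: the outcome Y is quantitative
y.num   <- 2 + 3 * x1 - x2 + rnorm(N, sd = 0.5)
train   <- data.frame(x1, x2, y.num)
reg.fit <- lm(y.num ~ x1 + x2, data = train)
coef(reg.fit)

# Classification problem: Y takes values in a finite, unordered set
train$y.class <- factor(ifelse(x1 + x2 + rnorm(N) > 0, "yes", "no"))
cls.fit <- glm(y.class ~ x1 + x2, data = train, family = binomial)

# Predict an unseen test case with both fitted models
test  <- data.frame(x1 = 0.5, x2 = -1)
y.hat <- predict(reg.fit, newdata = test)                     # predicted quantity
p.hat <- predict(cls.fit, newdata = test, type = "response")  # estimated P(Y = "yes")
y.hat
p.hat
```

The factor levels are sorted alphabetically ("no", "yes"), so the logistic model estimates the probability of the second level, "yes".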

  19. Objectives • On the basis of the training data we would like to: • Accurately predict unseen test cases. • Understand which inputs affect the outcome, and how. • Assess the quality of our predictions and inferences.

  20. Philosophy • It is important to understand the ideas behind the various techniques, in order to know how and when to use them. • One has to understand the simpler methods first, in order to grasp the more sophisticated ones. • It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones!). • This is an exciting research area, having important applications in science, industry and finance. • Statistical learning is a fundamental ingredient in the training of a modern data scientist.

  21. Unsupervised learning • No outcome variable, just a set of predictors (features) measured on a set of samples. • The objective is more fuzzy: • find groups of samples that behave similarly, • find features that behave similarly, • find linear combinations of features with the most variation. • It is difficult to know how well you are doing. • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
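
Two of these objectives can be sketched with functions from base R. The data below are simulated and the feature names (f1, f2) are hypothetical. kmeans() and prcomp() are used here only as one possible choice for "finding groups of samples" and "finding linear combinations of features with the most variation".

```r
set.seed(1)
# Hypothetical data: two groups of 50 samples, 2 features, no outcome variable
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
colnames(X) <- c("f1", "f2")

# Find groups of samples that behave similarly (k-means with 2 clusters)
km <- kmeans(X, centers = 2, nstart = 10)
table(km$cluster)

# Find linear combinations of features with the most variation (PCA)
pc <- prcomp(X, scale. = TRUE)
pc$rotation[, 1]   # loadings of the first principal component
summary(pc)        # proportion of variance explained by each component
```

There is no outcome to compare against, so the quality of the clustering can only be judged indirectly, which is the difficulty the slide points out.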

  22. Statistical Learning versus Machine Learning • Machine learning arose as a subfield of Artificial Intelligence. • Statistical learning arose as a subfield of Statistics. • There is much overlap: both fields focus on supervised and unsupervised problems: • Machine learning has a greater emphasis on large scale applications and prediction accuracy. • Statistical learning emphasizes models and their interpretability, and precision and uncertainty. • But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization". • Machine learning has the upper hand in Marketing!

  23. ISLR Premises • Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. • Statistical learning should not be viewed as a series of black boxes. • While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! • We presume that the reader is interested in applying statistical learning methods to real-world problems.

  24. Notation and Simple Linear Algebra

  25. ISLR and MASS Library / Package

  26. R Introduction • R is "GNU S", a free implementation of the S language. S is a language for statisticians developed at Bell Laboratories by John Chambers et al. • R was created by Ross Ihaka and Robert Gentleman and is now developed by the R Core Team, with support from the R Foundation. • R is a language and environment for statistical computing and graphics. • R is widely used as a de facto standard for developing statistical software. • R implements a variety of statistical and graphical techniques (linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, ...)

  27. R Provides • Effective data handling and storage • Operators for calculations on arrays (matrices) • A large, coherent, integrated collection of intermediate tools for data analysis • Graphical facilities for data analysis and display • Simple and effective programming language (conditionals, loops, user-defined recursive functions) • Extension mechanism with a large collection of packages

  28. Why R • R is open source and free to use (http://cran.r-project.org/) • R has a large and active community • R provides state-of-the-art algorithms (> 3000 extension packages on CRAN, 2011) • R creates beautiful visualizations (as seen in the New York Times and The Economist) • R is used widely in industry (Revolution offers commercial solutions) • R can be easily parallelized • R is getting ready for big data (Revolution Analytics)

  29. RStudio (http://www.rstudio.com/) • Environment / Workspace • Console • Working dir / plots / packages / help / viewer

  30. R Objects • R handles everything as objects. You can store any numbers or characters in an R object under a name of your choice. For example, > a <- 5 • We now have an object called a which contains the number 5. We can display the content of an object simply by typing its name: > a • If we want to use characters, they should be specified using either single or double quotes. For example, > myname <- "Tom.Smith" • Note that R is case sensitive, and you cannot use spaces or special characters in object names. When necessary, it is advisable to use . (period) to connect words. • R stores all objects you create in the workspace (the environment of the current R session). You can remove R objects by using rm(). > rm(myname) • Now, if we call myname, we will get an error message. > myname
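
The steps on this slide can be run as one short script. As a small deviation from the slide, the sketch checks for the removed object with exists() instead of typing its name, because the error from calling a missing object would stop a script.

```r
a <- 5                 # store a number in an object called a
a                      # display its content

myname <- "Tom.Smith"  # characters need single or double quotes
class(a)               # "numeric"
class(myname)          # "character"

ls()                   # list the objects currently in the workspace
rm(myname)             # remove one object
exists("myname")       # FALSE, provided no other object with that name is visible
```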

  31. R Vectors • You can assign a set of numbers or characters (i.e. a vector) to an R object, using the concatenate command, c(). > b <- c(-1, 0, 1.5, 30, -12) > my.team <- c("Laura", "Bob", "Megan") • As before, we can get the content of the object by typing its name b. We can also extract individual elements of a vector, using square brackets to indicate the index that we want. So, x[i] returns the i-th element of vector x: > b[3] > my.team[2] • Note that numbers and characters cannot be mixed in the same vector; if they are, R converts all elements to characters. • If you want to create a numeric vector with a sequence of numbers, use : (colon). > c <- 1:5
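
A short sketch of the vector operations described above. The objects mixed and s are names introduced only for this illustration; s is used instead of c so that the sequence does not share a name with the c() function.

```r
b <- c(-1, 0, 1.5, 30, -12)
my.team <- c("Laura", "Bob", "Megan")

b[3]            # 1.5
my.team[2]      # "Bob"
b[c(1, 5)]      # several elements at once: -1 -12
b[-1]           # everything except the first element
length(b)       # 5

# Mixing numbers and characters converts every element to character
mixed <- c(1, "two", 3)
class(mixed)    # "character"

# A sequence of integers with the colon operator
s <- 1:5
s
```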

  32. Data Frames • R treats data sets as data frames. • A data frame is like a simple spreadsheet in Excel. • Each column consists of the same object type (numeric or character), with the same number of rows.
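
A data frame can also be built directly in R with data.frame(). The table below is hypothetical: the column names and values are invented for illustration and are not the table used in the lab session.

```r
# Hypothetical example table: 5 rows, 4 columns
mydata <- data.frame(
  name   = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  sex    = c("F", "M", "M", "F", "F"),
  height = c(165.2, 178.0, 171.5, 160.3, 158.8),
  score  = c(82, 75, 90, 68, 88)
)

mydata               # print the whole table
str(mydata)          # one type per column, same number of rows in each
dim(mydata)          # 5 4
```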

  33. Importing Data (1) • Before importing data into the workspace, let’s remove all objects in the workspace. • To list all objects, type ls(). • To remove all objects, type rm(list = ls()). • Even after you removed all objects, R console still remembers all commands you typed during the session until you shut down R. Press the up arrow key to see the previous commands.

  34. Importing Data (2) • First, let's create a data file. You can type in data and create data frames in the R console, but it is much easier to input data in Excel, save it as a comma-separated file (.csv), and then load it in the R console. • R reads csv files as well as simple text files (.txt), but base R does not read Excel files (.xls, .xlsx) without an add-on package. • Let's create the following table in Excel (including column names) and save it as a csv file (you can name it whatever you want).

  35. Create Data *.csv

  36. Import Data • To import the csv file you have just created into the workspace, type: > mydata <- read.csv("directory/to/your/file", header=TRUE) • Or, if you want to select your file by using Windows Explorer (on Windows) or Finder (on Mac), > mydata <- read.csv(file.choose(), header=TRUE) • Now the content of your file is assigned to an object named mydata. The header argument (TRUE/FALSE, or T/F for short) indicates whether the first line of the file contains the column names. The default in read.csv() is header=TRUE, so it can be left out when that is the case.
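
When no Excel file is at hand, the round trip can be tried entirely from R. The sketch below writes a hypothetical table (invented values, standing in for the one typed into Excel) to a temporary csv file and reads it back; the object name scores and the file name mydata.csv are assumptions of this example.

```r
# Hypothetical table standing in for the one created in Excel
scores <- data.frame(
  name   = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  sex    = c("F", "M", "M", "F", "F"),
  height = c(165.2, 178.0, 171.5, 160.3, 158.8),
  score  = c(82, 75, 90, 68, 88)
)

# Save it as a csv file in the session's temporary directory
csv.path <- file.path(tempdir(), "mydata.csv")
write.csv(scores, csv.path, row.names = FALSE)

# Import it again; the first line of the file holds the column names
mydata <- read.csv(csv.path, header = TRUE)
head(mydata)
names(mydata)
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.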

  37. Data Frame Basics (1) • We can select a certain part of a data frame. For example, type: > mydata$score > mydata[1,2] > mydata[1:3,2] > mydata[,3] • $ (dollar) specifies a component (column name) of the data frame. • [1,2] specifies the element in the 1st row and 2nd column. • Similarly, [1:3,2] specifies the elements from the 1st to the 3rd row in the 2nd column, and [,3] specifies all rows in the 3rd column.

  38. Data Frame Basics (2) • You can extract subsets of data frames that meet conditions using logical operators. > subdata <- mydata[mydata$sex == "M", ] > subdata <- mydata[mydata$height > 170, ] • Here, we gave a new object name to the subset of the data, so that it can be used as an object. • To specify the condition, we use logical operators, such as == (equal to), < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), != (not equal to). • You can delete rows and columns from a data frame. You can overwrite the same data frame by assigning the same name, but it is wiser to create a new data frame when you modify elements. > subdata <- mydata[-2,] # delete the 2nd row
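
The selection and subsetting commands can be tried on a small table. The data frame below is hypothetical (invented names and values); only the column names sex, height and score follow the slides. Combining two conditions with & is an addition of this sketch.

```r
# Hypothetical example table
mydata <- data.frame(
  name   = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  sex    = c("F", "M", "M", "F", "F"),
  height = c(165.2, 178.0, 171.5, 160.3, 158.8),
  score  = c(82, 75, 90, 68, 88)
)

# Selecting by name and by position
mydata$score       # the whole score column
mydata[1, 2]       # 1st row, 2nd column
mydata[1:3, 2]     # rows 1 to 3 of the 2nd column
mydata[, 3]        # all rows of the 3rd column

# Subsets that meet a condition
males  <- mydata[mydata$sex == "M", ]
tall   <- mydata[mydata$height > 170, ]
tall.f <- mydata[mydata$sex == "F" & mydata$height > 160, ]  # both conditions

# Deleting rows or columns with a negative index
no.row2 <- mydata[-2, ]   # without the 2nd row
no.col1 <- mydata[, -1]   # without the 1st column

nrow(males)    # 2
nrow(tall)     # 2
nrow(tall.f)   # 2
```

The comma inside the brackets matters: the condition goes before it to pick rows, and leaving the part after it empty keeps all columns.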

  39. Data Frame Basics (3) • You can change elements in data frames. For example, > newdata <- mydata > newdata[5,3] <- 153.1 • will overwrite the element in the 5th row, 3rd column of data frame newdata. • In the same way, you can add columns to a data frame, using $ (dollar) with a new column name. > newdata <- mydata > newdata$age <- c(19, 20, 20, 18, 25)
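
A runnable version of these modifications, again on a hypothetical table with invented values. Adding a row with rbind() is not on the slide and is shown only as one possible way to do it.

```r
# Hypothetical example table
mydata <- data.frame(
  name   = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  sex    = c("F", "M", "M", "F", "F"),
  height = c(165.2, 178.0, 171.5, 160.3, 158.8),
  score  = c(82, 75, 90, 68, 88)
)

# Work on a copy so that the original stays untouched
newdata <- mydata
newdata[5, 3] <- 153.1                     # overwrite row 5, column 3 (height)
newdata$age <- c(19, 20, 20, 18, 25)       # add a column; one value per row

# Add a row: the new row needs the same column names
extra <- data.frame(name = "Fay", sex = "F", height = 162.0,
                    score = 79, age = 21)
newdata <- rbind(newdata, extra)

mydata[5, 3]    # still 158.8: the original was not changed
dim(newdata)    # 6 5
```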

  40. First look at data • This data set contains morphological measurements on 200 purple rock crabs (Campbell & Mahon 1974, Aus J Zool 22; 417-). The data contain the following columns: • colour: body colour types ("blue" or "orange") • sex: male ("M") or female ("F") • FL: frontal lobe length (mm) • RW: rear width (mm) • CL: carapace length (mm) • CW: carapace width (mm) • BD: body depth (mm)

  41. Try and record your results • mydata <- read.csv(file.choose(), header=TRUE) • head(mydata) • names(mydata) • nrow(mydata) • ncol(mydata) • str(mydata) • summary(mydata) • mean(mydata$FL) • var(mydata$FL) • sd(mydata$FL) • length(mydata$FL) • quantile(mydata$FL, c(0.05, 0.95)) • sum() • max() • min() • median() • range() • If you don't know how to use a function, for example mean(), you can call help by typing help(mean), or simply ?mean.
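
The summary functions listed above can be tried without the data file. In the sketch below FL is a simulated vector of 200 values, a stand-in for the frontal lobe column and not the real crab measurements, so the numbers will differ from those obtained in the lab session.

```r
set.seed(42)
# Simulated stand-in for mydata$FL (hypothetical mean and spread)
FL <- rnorm(200, mean = 15.6, sd = 3.5)

mean(FL)
var(FL)
sd(FL)
all.equal(sd(FL), sqrt(var(FL)))   # TRUE: sd is the square root of the variance
length(FL)                         # 200
quantile(FL, c(0.05, 0.95))        # 5th and 95th percentiles
median(FL)
range(FL)                          # the same values as c(min(FL), max(FL))
summary(FL)                        # minimum, quartiles, mean and maximum at once
```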

  42. Histogram • Visualising data is the best way to start investigating them. • First, let's see the distribution of Frontal Lobe Lengths by creating a histogram. > hist(mydata$FL) • R automatically chooses the number of intervals (bins), but you can request a different number. > hist(mydata$FL, breaks=50) • You can plot a smooth distribution of the variable (kernel density estimate) using the function density(). > plot(density(mydata$FL))
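
What hist() and density() compute can be inspected without drawing anything. The sketch uses a simulated vector FL (a stand-in, not the real crab data) and the plot = FALSE argument of hist(), which returns the bins instead of plotting them.

```r
set.seed(42)
# Simulated stand-in for mydata$FL
FL <- rnorm(200, mean = 15.6, sd = 3.5)

h.default <- hist(FL, plot = FALSE)               # R chooses the bins
h.fine    <- hist(FL, breaks = 50, plot = FALSE)  # ask for about 50 bins

length(h.default$counts)   # number of bins chosen by default
length(h.fine$counts)      # close to 50, but not necessarily exactly 50
sum(h.fine$counts)         # 200: every observation falls in one bin

d <- density(FL)           # kernel density estimate
d$bw                       # the bandwidth that was selected automatically
```

A single number given to breaks is treated as a suggestion: R rounds the break points to convenient values, so the number of bins can differ from the number requested.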

  43. Box plot • Now let's see the Rear Widths of male and female crabs using a box plot. > boxplot(RW ~ sex, data = mydata) • The box is constructed using the interquartile range (25-75%), and the thick line in the middle is the median value. • By default, each whisker extends to the most extreme value that lies within 1.5 times the interquartile range from the box, and open circles indicate extreme values beyond the whiskers (no extreme values in this data set).
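
The numbers behind a box plot can be checked by hand. The sketch below uses simulated rear widths (hypothetical values in a data frame called crabs.sim, not the real data) and recomputes the upper whisker for the female group with the default 1.5 x IQR rule.

```r
set.seed(7)
# Simulated stand-in for the crab data: 100 males and 100 females
crabs.sim <- data.frame(
  sex = rep(c("M", "F"), each = 100),
  RW  = c(rnorm(100, mean = 12, sd = 2), rnorm(100, mean = 13.5, sd = 2.5))
)

# plot = FALSE returns the statistics instead of drawing the plot
bp <- boxplot(RW ~ sex, data = crabs.sim, plot = FALSE)
bp$names   # group order: "F" "M"
bp$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values that would be drawn as open circles

# Recompute the upper whisker for the female group
rw.f <- crabs.sim$RW[crabs.sim$sex == "F"]
q <- fivenum(rw.f)                          # min, lower hinge, median, upper hinge, max
upper.fence   <- q[4] + 1.5 * (q[4] - q[2])
upper.whisker <- max(rw.f[rw.f <= upper.fence])
all.equal(upper.whisker, bp$stats[5, 1])    # TRUE
```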

  44. Scatter Plot • A scatter plot is a convenient way to see the relationship between two variables. • For example, to plot the Carapace Width on the x-axis and the corresponding Body Depth on the y-axis, type > plot(BD ~ CW, data = mydata)

  45. Changing appearances • You can change appearances of the graph by adding arguments in a function plot(). • Here are some useful arguments. • xlim, ylim: range of the x-axis and the y-axis • xlab, ylab: labels for the x-axis and the y-axis • main: main title of the graph • pch: symbol type. See ?points. • cex: symbol size • col: symbol colour by either colour name or index. See colors().

  46. Plot Function > plot(BD ~ CW, data= mydata, main= "Purple rock crabs", xlim=c(20,40), xlab = "Carapace width (mm)", ylab= "Length (mm)", pch=19, col= "blue", cex=0.7) • You can add points and a legend to the existing plot. > points(RW ~ CW, data=mydata, pch=24, col="red", cex=0.7) > legend(20, 20, legend=c("Body depth", "Rear width"), pch=c(19, 24), col=c("blue","red")) • The first two arguments of the function legend() are the x and y positions of the legend.
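
The whole sequence can be tried on simulated measurements and saved to a file. Everything specific in this sketch is an assumption: the data frame sim holds invented values, the figure is written to a temporary PDF with pdf() and dev.off(), ylim is set so that both series fit, and the legend is placed with the keyword "topleft" instead of x and y coordinates.

```r
set.seed(3)
# Simulated stand-in for the crab measurements (hypothetical relationships)
CW  <- runif(100, min = 20, max = 50)
sim <- data.frame(
  CW = CW,
  BD = 0.4 * CW + rnorm(100, sd = 1),
  RW = 0.3 * CW + 2 + rnorm(100, sd = 1)
)

# Open a PDF device, draw, then close the device to write the file
fig.path <- file.path(tempdir(), "crabs.pdf")
pdf(fig.path, width = 6, height = 5)

plot(BD ~ CW, data = sim, main = "Simulated crab measurements",
     xlim = c(20, 50), ylim = range(sim$BD, sim$RW),
     xlab = "Carapace width (mm)", ylab = "Length (mm)",
     pch = 19, col = "blue", cex = 0.7)
points(RW ~ CW, data = sim, pch = 24, col = "red", cex = 0.7)
legend("topleft", legend = c("Body depth", "Rear width"),
       pch = c(19, 24), col = c("blue", "red"))

dev.off()
file.exists(fig.path)
```

Setting ylim from both variables avoids cutting off the points added later, since the axes are fixed by the first plot() call.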

