
A Core Curriculum for Undergraduate Data Science

A Core Curriculum for Undergraduate Data Science. Chris Malone. Tisha Hooks. Todd Iverson. Brant Deppa. Silas Bergen. April Kerby. Winona State University. Block 4 – Statistical Learning Theory Thursday, 8:30-12:30  . Data Management. Visualization. Statistical Learning.



Presentation Transcript


  1. A Core Curriculum for Undergraduate Data Science Chris Malone Tisha Hooks Todd Iverson Brant Deppa Silas Bergen April Kerby Winona State University

  2. Block 4 – Statistical Learning Theory, Thursday, 8:30-12:30. Data Management, Visualization, Statistical Learning. DSCI 210 Survey, DSCI 210 Survey, DSCI 310 Intro to Viz, DSCI 210 Survey, STAT 210/310 Statistics, DSCI 325 Structured, DSCI 330 Unstructured, STAT 360 Regression, DSCI 415 Unsupervised, DSCI 425 Supervised, DSCI 430 DS @ Scale

  3. Unsupervised Learning, DSCI 415. Brant Deppa, Ph.D., Professor of Statistics & Data Science, Winona State University, bdeppa@winona.edu

  4. Course Topics • Introduction to Unsupervised Learning • Review of basic and advanced graphics in R • Measuring distance • Multidimensional scaling (MDS) • Principal component analysis (PCA) • Cluster analysis and cluster interpretation • Correspondence analysis • Recommender Systems • Text mining and sentiment analysis  VERY IMPORTANT!!!  “Baby NLP”

  5. Unsupervised Learning vs. Supervised Learning Supervised Learning - 1st Day of Class • Say “Hello” • Show them course website • Begin reviewing regression with emphasis on prediction Unsupervised Learning - 1st Day of Class • Say “Hello” • Show them course website • Cancel class for remainder of semester • Leave room and get a coffee

  6. Unsupervised Learning vs. Supervised Learning Supervised Learning • Response of interest • Set of predictors • Build models using terms based on the predictors to predict the response Unsupervised Learning • No response of interest • Set of observations on variables • Uncover structure in these data to gain insights • Unstructured data -

  7. Measuring Distance • Distance between observations - numeric variables, scaling, metrics - non-numeric variables (ordinal/nominal) - mixture of variable types • Dissimilarity/Similarity between variables • Distance/Dissimilarity Matrices - MDS, PCA, Clustering, Recommender Systems, Correspondence Analysis
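
The role of scaling and of the variable type in a distance can be sketched with a few lines of code. This is only an illustration in plain Python (the course itself works in R); the data values and the function names are hypothetical, not taken from the slides.

```python
import math

def standardize(rows):
    """Scale each numeric column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def simple_matching(a, b):
    """Proportion of nominal variables on which two observations disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(rows, metric):
    return [[metric(a, b) for b in rows] for a in rows]

# Hypothetical observations: (height in cm, income in dollars)
raw = [[170.0, 30000.0], [180.0, 30500.0], [171.0, 60000.0]]
raw_d = distance_matrix(raw, euclidean)
std_d = distance_matrix(standardize(raw), euclidean)

# Unscaled, income dominates and observation 0 looks closest to observation 1;
# after standardizing, height matters too and observation 2 becomes the nearer one.
print(raw_d[0][1] < raw_d[0][2], std_d[0][2] < std_d[0][1])
print(simple_matching(["red", "bell", "yes"], ["red", "flat", "no"]))
```

For a mixture of variable types, one common choice is a Gower-style average of per-variable dissimilarities; that is not shown here.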

  8. Multidimensional Scaling (MDS) • Start with distance or dissimilarity matrix • Metric MDS – find lower dimensional representation that preserves actual pairwise distances. • Non-Metric MDS – find a lower dimensional representation that maximizes rank correlation between distances.
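
A minimal sketch of classical (metric) MDS, assuming Euclidean input distances and using NumPy. The function name `classical_mds` and the four test points are hypothetical; the course itself uses R, so this is only meant to show the double-centering and eigendecomposition steps.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: coordinates in k dimensions from a distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical points in the plane; MDS sees only their pairwise distances.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

coords = classical_mds(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.allclose(D, D_hat))  # the 2-D configuration reproduces the distances
```

The recovered coordinates match the originals only up to rotation, reflection, and translation, which is why the check compares distances rather than coordinates. Non-metric MDS replaces this closed-form step with an iterative fit to the rank order of the distances.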

  9. Example: Mushroom Characteristics (n = 4,062)

  10. Each level of every variable is turned into a dummy variable, 1 if mushroom has trait (red), 0 if not (white).
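
The dummy coding described above, followed by one possible dissimilarity for the resulting 0/1 rows, might look like the sketch below. The variable names (`cap_shape`, `odor`) and the three records are made up for illustration, and Jaccard dissimilarity is an assumed choice; the slides do not say which measure was used for the mushroom data.

```python
def dummy_code(records):
    """Turn each level of every categorical variable into a 0/1 column."""
    columns = sorted({(var, level) for rec in records for var, level in rec.items()})
    matrix = [[1 if rec.get(var) == level else 0 for var, level in columns]
              for rec in records]
    return columns, matrix

def jaccard_dissimilarity(a, b):
    """1 - (traits shared by both) / (traits present in at least one)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 1 - both / either if either else 0.0

# Hypothetical records, not the actual mushroom data
mushrooms = [
    {"cap_shape": "bell", "odor": "almond"},
    {"cap_shape": "bell", "odor": "foul"},
    {"cap_shape": "flat", "odor": "foul"},
]
columns, X = dummy_code(mushrooms)
dissim = [[jaccard_dissimilarity(a, b) for b in X] for a in X]

for col in columns:
    print(col)
for row in X:
    print(row)
print(dissim[0][1], dissim[0][2])
```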

  11. Distance/Dissimilarity Matrix

  12. Ecological Field Studies

  13. Principal Components Analysis (PCA) Dimension reduction method - Eigenanalysis/spectral decomposition or - SVD of data matrix
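
The two routes named on this slide give the same components, which a short NumPy sketch can confirm. The simulated data and variable names are assumptions made for the illustration.

```python
import numpy as np

# Simulated data: 100 observations on 3 correlated variables (hypothetical)
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(100, 3)) @ mix
Xc = X - X.mean(axis=0)            # PCA works on column-centered data
n = Xc.shape[0]

# Route 1: eigenanalysis (spectral decomposition) of the covariance matrix
S = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # sort from largest to smallest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = s ** 2 / (n - 1)        # component variances from singular values
scores = Xc @ Vt.T                 # principal component scores (equal to U * s)

print(np.allclose(eigvals, svd_vars))
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))  # loadings agree up to sign
print(np.round(svd_vars / svd_vars.sum(), 3))      # proportion of variance
```

The sign of each component is arbitrary, so the loadings are compared in absolute value.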

  14. Example: Image Compression SVD of the pixel matrix allows for lower dimensional representation of the image. From top left to bottom right:
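
A rank-k approximation from the SVD can be sketched as follows. The 64 x 64 "image" here is synthetic (a smooth pattern plus noise), not the image from the slide, and `low_rank` is a hypothetical helper name.

```python
import numpy as np

def low_rank(image, k):
    """Best rank-k approximation of a matrix (in the least-squares sense) via SVD."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic grayscale pixel matrix standing in for a real image
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64)
image = (np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))
         + 0.5 * np.outer(x, x)
         + 0.05 * rng.normal(size=(64, 64)))

m, n = image.shape
errors = {}
for k in (1, 2, 8, 64):
    approx = low_rank(image, k)
    errors[k] = np.linalg.norm(image - approx) / np.linalg.norm(image)
    stored = k * (m + n + 1)       # numbers kept: k columns of U, k of V, k values
    print(k, round(errors[k], 4), round(stored / (m * n), 3))
```

Each additional component lowers the relative reconstruction error, at the cost of storing m + n + 1 more numbers.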

  15. Genetic Analysis of Europeans Genes mirror geography within Europe (Novembre, et al., Nature 2008) Sample of approximately 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome. (Love to have the raw data from this one, cannot find it…)

  16. Activity: Liver Toxicity Study • The data come from a liver toxicity study (Bushel et al., 2007) in which male rats of the inbred strain Fisher 344 were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg), or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment. Necropsies were performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. Ten clinical chemistry measurements of variables containing markers for liver injury are available for each subject and the serum enzyme levels are measured numerically. The data were further normalized and pre-processed by Bushel et al. (2007). • Variables • Animal ID – ID number for rat • Treatment.Group – conveys both dose level and time animal was sacrificed • Dose.Group – 50, 150, 1500, or 2000 mg/kg • Time.Group – time sacrificed (6, 18, 24, or 48 hrs.) • 10 clinical markers for liver injury and serum enzyme levels • BUN.mg.dL, Creat.mg.dL, TP.g.dL, ALB.g.dL, ALT.IU.L, SDH.IU.L, AST.IU.I, ALP.IU.L, TBA.umol.L, Cholesterol.mg.dL • 3,116 gene expression levels • A_43_P14555, A_45_P2220, … , A_42_P720241 • Clearly these data are high dimensional and therefore a good candidate for employing dimension reduction methods.

  17. Activity: Liver Toxicity Study Use principal component analysis to see if the measurements are associated with dose received and time of exposure to acetaminophen.

  18. Independent Component Analysis (ICA) Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. It finds independent sub-parts of a complex dataset. ICA is superficially related to PCA and/or factor analysis. It is useful in separating mixed signals into independent sources. http://research.ics.aalto.fi/ica/icademo/
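
To make the idea of separating mixed signals concrete, here is a small sketch of a symmetric FastICA iteration with a tanh nonlinearity, written from scratch in NumPy. It is an assumed stand-in, not the course's R script: the two source signals, the mixing matrix, and the function names are all hypothetical.

```python
import numpy as np

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    d, E = np.linalg.eigh(W @ W.T)
    return (E / np.sqrt(d)) @ E.T @ W

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """X has one mixed signal per row; returns the estimated sources."""
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E / np.sqrt(d)).T @ X                 # whitened data, identity covariance
    m, n = Z.shape
    W = sym_decorrelate(np.random.default_rng(seed).normal(size=(m, m)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w'z)] - E[g'(w'z)] w, one row per component
        W_new = G @ Z.T / n - (1 - G ** 2).mean(axis=1)[:, None] * W
        W_new = sym_decorrelate(W_new)
        done = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1)) < tol
        W = W_new
        if done:
            break
    return W @ Z

# Two hypothetical independent sources: a sine wave and uniform noise
rng = np.random.default_rng(1)
t = np.linspace(0.0, 8.0, 2000)
S = np.vstack([np.sin(2 * np.pi * t), rng.uniform(-1.0, 1.0, size=t.size)])
A = np.array([[1.0, 0.5], [0.6, 1.0]])         # assumed mixing matrix
X = A @ S                                      # what we actually observe

estimated = fast_ica(X)
# Correlation of each true source with each estimated component
corr = np.abs(np.corrcoef(np.vstack([S, estimated]))[:2, 2:])
print(np.round(corr, 3))
```

ICA recovers sources only up to order, sign, and scale, so the check looks at absolute correlations rather than comparing the signals directly.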

  19. Example: PCA of Brazilian Facial Images First four eigenfaces

  20. Example: ICA of Brazilian Facial Images 200 Independent Components https://mediaspace.minnstate.edu/media/Face+28+ICA+2/0_zmbc0iid

  21. Activity: Cocktail Party Problem with Images The other night my TV was acting up and the entire screen turned to static. While the video was completely scrambled, I swore I could hear maniacal laughing amongst the accompanying audio noise, which was also pretty much static. I took several burst pictures (actually 1,000) of the screen with my iPhone and converted these pictures to pixel grayscale JPEG image files. Below is a sample of four of these images.

  22. Activity: Cocktail Party Problem with Images I suspect that multiple channels of input were coming through my cable box at the same time. Combine this with the fact that my HDMI cable was loose, which I am sure added considerable noise, and you have the static-filled images to the right. If I am correct about the multiple input sources, independent component analysis (ICA) should prove useful in detangling these source images and finding the underlying independent source images. (Run R Script File)

  23. Cluster Analysis Given a set of variables measured on each object/subject, cluster analysis seeks to find groups, or clusters, of similar observations.

  24. Cluster Analysis Hierarchical cluster analysis (HCA) • distance matrix – many possible metrics • observations are fused successively into clusters until all observations are in one cluster • linkage is how the distance between clusters is determined. HCA is an example of an agglomerative method: points start as individuals and then are joined to form clusters.
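
The successive fusing of observations can be sketched with a deliberately naive agglomerative routine in plain Python. The five points and the function name `agglomerate` are hypothetical, and only single and complete linkage are shown; the course itself would use R's clustering functions.

```python
import math

def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering; returns the merge history."""
    # Single linkage uses the smallest pairwise distance, complete the largest.
    link = {"single": min, "complete": max}[linkage]
    clusters = [(i,) for i in range(len(points))]   # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = link(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical observations in two dimensions
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points, linkage="single")
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The merge heights are what a dendrogram plots on its vertical axis; cutting the tree at a chosen height yields the clusters.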

  25. Cluster Analysis clusters

  26. Cluster Analysis

  27. Partitioning Methods K-means and K-medians clustering
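
A minimal sketch of the alternating assign-and-update loop behind both methods, in plain Python. Pairing means with squared Euclidean distance and medians with Manhattan distance is the usual convention and is assumed here; the six points and the function name `partition` are hypothetical.

```python
import random
from statistics import mean, median

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def partition(points, k, center=mean, dist=sq_euclidean, n_iter=100, seed=0):
    """K-means (center=mean) or K-medians (center=median, dist=manhattan)."""
    centers = random.Random(seed).sample(points, k)  # random starting centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:                     # no change: converged
            break
        labels = new_labels
        # Update step: recompute each center coordinate-wise
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                              # keep old center if empty
                centers[c] = tuple(center(col) for col in zip(*members))
    return labels, centers

# Hypothetical data with two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
km_labels, km_centers = partition(points, k=2)
kmed_labels, kmed_centers = partition(points, k=2, center=median, dist=manhattan)
print(km_labels, km_centers)
print(kmed_labels, kmed_centers)
```

Because the result can depend on the starting centers, practical use typically runs several random starts and keeps the best solution.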

  28. Partitioning Methods

  29. Density-Based Clustering (DBSCAN) Great visualization of this and related methods: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 dbscan (package from CRAN) can be hard to tune but can work well in cases where other methods produce poor results.
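
The two tuning parameters that make the method hard to tune, a neighborhood radius and a minimum neighbor count, show up directly in a from-scratch sketch. This plain-Python version is only an illustration of the algorithm, not the CRAN dbscan package; the points and the parameter names `eps` and `min_pts` are assumptions.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: keep growing
                queue.extend(j_nbrs)
    return labels

# Hypothetical data: two tight squares and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike K-means, the number of clusters is not specified in advance, and isolated observations are left unassigned as noise.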

  30. Two-way Clustering As we can measure distance between both observations and variables, we can perform cluster analysis for each and display the results using a cell plot.

  31. Activity: Sports Difficulty Sports experts were asked to rate sports on a variety of attributes required to excel in them. Athlete attributes: power, strength, agility, flexibility, hand-eye coordination, nerves, analytic aptitude, endurance, and durability.

  32. Activity: Sports Difficulty Which sports are clustered together on the basis of the athlete attributes? Which attributes are most associated with each other? Which sports appear to require the most attributes? Do the attribute/sport associations make sense given your prior knowledge?

  33. Correspondence Analysis (CA/MCA) Similar to PCA, but is generally applied to categorical or ordinal data. Simple CA is used to visualize the relationship between two categorical variables, each with more than two levels. Multiple CA extends to larger, multivariate datasets with numerous categorical/ordinal variables.
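
The link to PCA can be seen in a sketch of simple CA, which is an SVD of the standardized residuals from a two-way table. The counts below are invented for illustration, and the function name `correspondence_analysis` is hypothetical; the course itself works in R.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA of a two-way contingency table via SVD."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)                # proportions under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]     # principal row coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # principal column coordinates
    return row_coords, col_coords, s ** 2          # s**2 = inertia per dimension

# Hypothetical counts: 3 row categories by 4 column categories
table = np.array([[20, 10, 5, 2],
                  [8, 15, 12, 6],
                  [3, 7, 14, 18]])
row_coords, col_coords, inertia = correspondence_analysis(table)

# Total inertia equals the chi-square statistic divided by the sample size
n = table.sum()
expected_counts = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
chi2 = ((table - expected_counts) ** 2 / expected_counts).sum()
print(np.isclose(inertia.sum(), chi2 / n))
print(np.round(inertia / inertia.sum(), 3))  # share of inertia per dimension
```

Plotting the first two columns of `row_coords` and `col_coords` together gives the usual CA map.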

  34. Example 1: Suicides in former W. Germany (1974 – 1979)

  35. Example 1: Suicides in former W. Germany (1974 – 1979)

  36. Example 2: Bird Counts and Sampling Sites

  37. Multiple Correspondence Analysis Multiple correspondence analysis is commonly used to examine structure in the results from a survey. • Survey of individuals – explain variation between individuals. Understand relationship between survey items and subject demographics. • Ecological surveys – look at species abundance at several different sample sites. Understand variation between sites, relationship between species, and relationships to site characteristics.

  38. Example 3: Hobby Survey (n )

  39. Example 3: Hobby Survey (n )

  40. Recommender Systems • Content-based Recommenders. In content-based recommendation systems we determine what it is exactly a user likes about product X and recommend products that have similar properties to product X. Kyle MacLachlan IMDB and Amazon both recommend David Lynch directed films only.
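
A content-based recommender reduces to a similarity search over item properties, as in this plain-Python sketch. The four films and their 0/1 features are hypothetical placeholders, and cosine similarity is one assumed choice of measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_similar(liked, items, top_n=2):
    """Rank the other items by how similar their properties are to `liked`."""
    scores = {name: cosine(items[liked], feats)
              for name, feats in items.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical items; features are
# [same director, same lead actor, mystery genre, comedy genre]
items = {
    "film_a": [1, 1, 1, 0],
    "film_b": [1, 1, 0, 0],
    "film_c": [0, 1, 0, 1],
    "film_d": [0, 0, 0, 1],
}
print(recommend_similar("film_a", items))
```

The recommendations depend entirely on which properties are encoded; if director dominates the features, the list will be dominated by films from that director.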

  41. Recommender Systems Collaborative-Filtering In collaborative-filtering, we find other individuals with similar tastes and make recommendations based upon what these other individuals liked. John Sally Ernesto My reviews Based upon movies we have all reviewed Highest rated movies by John, Sally, and Ernesto that I have not seen (or rated or purchased)
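
User-based collaborative filtering can be sketched as a similarity-weighted average of other people's ratings. The ratings below are invented, the film names are placeholders, and cosine similarity over co-rated movies is an assumed choice (Pearson correlation is a common alternative).

```python
import math

def cosine_on_shared(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Predict scores for movies `user` has not rated, best first."""
    mine = ratings[user]
    totals, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine_on_shared(mine, theirs)
        for movie, score in theirs.items():
            if movie in mine:
                continue                      # only movies not yet rated by user
            totals[movie] = totals.get(movie, 0.0) + sim * score
            weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: totals[m] / weights[m] for m in totals if weights[m] > 0}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical ratings on a 1-5 scale
ratings = {
    "me":      {"film_1": 5, "film_2": 4, "film_3": 1},
    "John":    {"film_1": 5, "film_2": 5, "film_3": 1, "film_4": 5, "film_5": 2},
    "Sally":   {"film_1": 4, "film_2": 5, "film_3": 2, "film_4": 4},
    "Ernesto": {"film_1": 1, "film_2": 2, "film_3": 5, "film_4": 1, "film_5": 5},
}
recs = recommend("me", ratings)
print(recs)
```

Here John and Sally rate the shared movies much as "me" does, so their high ratings of film_4 carry more weight than Ernesto's low one.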

  42. Activity: Bob Ross Painting Recommender Suppose that I like the Bob Ross painting Mt. McKinley (Season 1, Episode 2). Can you recommend another painting/episode I might also like?

  43. To do list • More image stuff • Improved NLP • Social Network Analysis • Others?
