A Core Curriculum for Undergraduate Data Science

Chris Malone, Tisha Hooks, Todd Iverson, Brant Deppa, Silas Bergen, April Kerby. Winona State University

Block 4 – Statistical Learning Theory. Thursday, 8:30-12:30. Curriculum tracks: Data Management, Visualization, Statistical Learning. Courses: DSCI 210 Survey, DSCI 310 Intro to Viz, STAT 210/310 Statistics, DSCI 325 Structured, DSCI 330 Unstructured, STAT 360 Regression, DSCI 415 Unsupervised, DSCI 425 Supervised, DSCI 430 DS @ Scale

Unsupervised Learning (DSCI 415) Brant Deppa, Ph.D. Professor of Statistics & Data Science Winona State University bdeppa@winona.edu

Course Topics • Introduction to Unsupervised Learning • Review of basic and advanced graphics in R • Measuring distance • Multidimensional scaling (MDS) • Principal component analysis (PCA) • Cluster analysis and cluster interpretation • Correspondence analysis • Recommender Systems • Text mining and sentiment analysis (VERY IMPORTANT!!! “Baby NLP”)

Unsupervised Learning vs. Supervised Learning Supervised Learning - 1st Day of Class • Say “Hello” • Show them course website • Begin reviewing regression with emphasis on prediction Unsupervised Learning - 1st Day of Class • Say “Hello” • Show them course website • Cancel class for remainder of semester • Leave room and get a coffee

Unsupervised Learning vs. Supervised Learning Supervised Learning • Response of interest • Set of predictors • Build models using terms based on the predictors to predict the response Unsupervised Learning • No response of interest • Set of observations on variables • Uncover structure in these data to gain insights • Unstructured data

Measuring Distance • Distance between observations - numeric variables, scaling, metrics - non-numeric variables (ordinal/nominal) - mixture of variable types • Dissimilarity/Similarity between variables • Distance/Dissimilarity Matrices - MDS, PCA, Clustering, Recommender Systems, Correspondence Analysis
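The effect of scaling and the choice of metric on distances between observations can be sketched as follows. This is a minimal Python/NumPy illustration (the course itself works in R); the function names `standardize` and `distance_matrix` and the three-person data set are assumptions made up for the example, not course material.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit (sample) standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def distance_matrix(X, metric="euclidean"):
    """Return the n x n matrix of pairwise distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, p)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")

# Hypothetical data: height (cm) and income (dollars) for three people.
X = np.array([[170.0, 40000.0],
              [180.0, 40500.0],
              [171.0, 60000.0]])

D_raw = distance_matrix(X)                    # dominated by income
D_scaled = distance_matrix(standardize(X))    # both variables contribute
```

On the raw scale the income column swamps height, so which observations look "close" depends almost entirely on income; standardizing first puts the variables on a comparable footing. Non-numeric and mixed-type variables need other dissimilarity measures that this sketch does not cover.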

Multidimensional Scaling (MDS) • Start with distance or dissimilarity matrix • Metric MDS – find a lower dimensional representation that preserves the actual pairwise distances. • Non-Metric MDS – find a lower dimensional representation that maximizes the rank correlation between the original distances and the distances in the lower dimensional space.
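A minimal sketch of metric MDS in its classical (Torgerson) form, written in Python/NumPy rather than the R used in the course. It starts from a distance matrix, double-centers the squared distances, and uses the leading eigenvectors as coordinates. The function name `classical_mds` and the rectangle example are assumptions for illustration; non-metric MDS requires an iterative fit and is not shown.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n objects in k dimensions from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered inner products
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical example: four corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))

coords = classical_mds(D, k=2)
D_hat = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
```

Because the input distances here are Euclidean distances between points in two dimensions, the two-dimensional embedding reproduces them up to rotation and reflection; with real dissimilarities the fit is only approximate.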

Example: Mushroom Characteristics (n = 4,062)

Each level of every variable is turned into a dummy variable, 1 if mushroom has trait (red), 0 if not (white).
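The dummy coding described above can be sketched in Python/NumPy as follows. The three mushrooms and their traits are made up for illustration, and the Jaccard dissimilarity is only one possible choice for 0/1 data; the slides do not say which measure was used for the mushroom example.

```python
import numpy as np

def dummy_code(records):
    """Turn categorical records into a 0/1 matrix, one column per (variable, level)."""
    n_vars = len(records[0])
    columns = [(j, level)
               for j in range(n_vars)
               for level in sorted({row[j] for row in records})]
    Z = np.array([[1 if row[j] == level else 0 for (j, level) in columns]
                  for row in records])
    return Z, columns

def jaccard_distance(Z):
    """Pairwise Jaccard dissimilarity: 1 - (shared traits / traits in either)."""
    Z = np.asarray(Z, dtype=bool)
    inter = (Z[:, None, :] & Z[None, :, :]).sum(axis=2)
    union = (Z[:, None, :] | Z[None, :, :]).sum(axis=2)
    return 1.0 - inter / union

# Hypothetical mushrooms: (cap color, odor, habitat).
mushrooms = [("red", "none", "woods"),
             ("red", "foul", "woods"),
             ("brown", "none", "grass")]

Z, columns = dummy_code(mushrooms)
D = jaccard_distance(Z)
```

The first two mushrooms share two of the four traits present in either, giving a dissimilarity of 0.5, while the second and third share none and are at the maximum dissimilarity of 1.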

Distance/Dissimilarity Matrix

Ecological Field Studies

Principal Components Analysis (PCA) Dimension reduction method - Eigenanalysis/spectral decomposition or - SVD of data matrix
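The two routes to PCA named above give the same answer, which the following Python/NumPy sketch checks on simulated data (the data and the function name `pca_svd` are assumptions for illustration). The squared singular values of the centered data matrix, divided by n - 1, equal the eigenvalues of the sample covariance matrix.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the column-centered data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                  # observations in PC coordinates
    loadings = Vt[:k].T                        # variable weights for each PC
    variances = s ** 2 / (X.shape[0] - 1)      # variance along every PC
    return scores, loadings, variances

# Simulated correlated data: 200 observations on 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

scores, loadings, variances = pca_svd(X, k=2)

# Eigenanalysis route: eigenvalues of the sample covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
```

Plotting the first two columns of `scores`, colored by a grouping variable, is the usual way to look for structure after the reduction.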

Example: Image Compression SVD of the pixel matrix allows for lower dimensional representation of the image. From top left to bottom right:
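A rank-k approximation from the SVD can be sketched as below in Python/NumPy. The 64 x 64 "image" is synthetic (a smooth pattern plus a little noise), chosen only so the example is self-contained; it is not the image from the slide.

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A in the least-squares sense (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Synthetic grayscale image: smooth structure plus mild noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 64)
image = (np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))
         + 0.5 * np.outer(x, x)
         + 0.01 * rng.normal(size=(64, 64)))

# Relative reconstruction error for several ranks.
errors = {k: np.linalg.norm(image - low_rank_approx(image, k)) / np.linalg.norm(image)
          for k in (1, 2, 8)}

# Numbers stored for a rank-k version: k * (rows + cols + 1) instead of rows * cols.
storage = {k: k * (image.shape[0] + image.shape[1] + 1) for k in errors}
```

The error shrinks as k grows while the storage cost grows linearly in k, which is the trade-off behind using the SVD for compression.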

Genetic Analysis of Europeans Genes mirror geography within Europe (Novembre, et al., Nature 2008) Sample of approximately 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome. (Love to have the raw data from this one, cannot find it…)

Activity: Liver Toxicity Study
• The data come from a liver toxicity study (Bushel et al., 2007) in which male rats of the inbred strain Fischer 344 were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg), or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment. Necropsies were performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. Ten clinical chemistry measurements, including markers for liver injury, are available for each subject, and the serum enzyme levels are measured numerically. The data were further normalized and pre-processed by Bushel et al. (2007).
• Variables
- Animal ID – ID number for rat
- Treatment.Group – conveys both dose level and time animal was sacrificed
- Dose.Group – 50, 150, 1500, or 2000 mg/kg
- Time.Group – time sacrificed (6, 18, 24, or 48 hrs.)
- 10 clinical markers for liver injury and serum enzyme levels: BUN.mg.dL, Creat.mg.dL, TP.g.dL, ALB.g.dL, ALT.IU.L, SDH.IU.L, AST.IU.I, ALP.IU.L, TBA.umol.L, Cholesterol.mg.dL
- 3,116 gene expression levels: A_43_P14555, A_45_P2220, … , A_42_P720241
• Clearly these data are high dimensional and therefore a good candidate for dimension reduction methods.

Activity: Liver Toxicity Study Use principal component analysis to see if the measurements are associated with dose received and time of exposure to acetaminophen.

Independent Component Analysis (ICA) Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. • Finds independent sub-parts of a complex dataset. • ICA is superficially related to PCA and/or Factor Analysis. • Useful in separating mixed signals into independent sources. http://research.ics.aalto.fi/ica/icademo/
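Separating mixed signals can be sketched with a small FastICA-style iteration (tanh nonlinearity, symmetric decorrelation) in Python/NumPy. This is one common ICA algorithm, swapped in here for illustration; the slides do not say which algorithm or R package the course uses. The two sources (a sine wave and uniform noise) and the mixing matrix are made up.

```python
import numpy as np

def whiten(X):
    """Center the rows of X (signals x samples) and give them identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    K = (eigvecs / np.sqrt(eigvals)).T         # whitening matrix
    return K @ Xc

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ W

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Estimate independent sources from mixed signals X (signals x samples)."""
    Z = whiten(X)
    m, n = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(m, m)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w'z)] - E[g'(w'z)] w for every row w of W.
        W_new = G @ Z.T / n - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    return W @ Z

# Hypothetical sources and mixing matrix.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 2000)
S_true = np.vstack([np.sin(2 * np.pi * 5 * t), rng.uniform(-1.0, 1.0, size=t.size)])
A = np.array([[1.0, 0.5], [0.4, 1.0]])
X_mixed = A @ S_true

S_hat = fast_ica(X_mixed)
# Absolute correlation between each true source and each recovered component.
match = np.abs(np.corrcoef(S_true, S_hat)[:2, 2:])
```

ICA recovers sources only up to order, sign, and scale, so the check is whether each true source is strongly correlated with some recovered component, not whether the rows line up one-to-one.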

Example: PCA of Brazilian Facial Images First four eigenfaces

Example: ICA of Brazilian Facial Images 200 Independent Components https://mediaspace.minnstate.edu/media/Face+28+ICA+2/0_zmbc0iid

Activity: Cocktail Party Problem with Images The other night my TV was acting up and the entire screen turned to static. While the video was completely scrambled, I swore I could hear maniacal laughing amongst the accompanying audio noise, which was pretty much static as well. I took several burst pictures (actually 1,000) of the screen with my iPhone and converted these pictures to grayscale JPEG image files. Below is a sample of four of these images.

Activity: Cocktail Party Problem with Images I suspect that multiple channels of input were coming through my cable box at the same time. Combine this with the fact that my HDMI cable was loose, which I am sure added considerable noise, and you have the static-filled images to the right. If I am correct about the multiple input sources, independent component analysis (ICA) should prove useful in detangling these source images and finding the underlying independent source images. (Run R Script File)

Cluster Analysis Given a set of variables measured on each object/subject, cluster analysis seeks to find groups, or clusters, of similar observations.

Cluster Analysis Hierarchical cluster analysis (HCA) • distance matrix – many possible metrics • observations are fused successively into clusters until all observations are in one cluster • linkage is how the distance between clusters is determined. HCA as described here is an agglomerative method: points start as individual clusters and are then joined to form larger clusters.
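The agglomerative procedure can be sketched naively in Python/NumPy: start with every observation as its own cluster and repeatedly fuse the two clusters with the smallest linkage distance. The function name `agglomerative`, the stopping rule (a target number of clusters instead of a full dendrogram), and the six one-dimensional points are assumptions for illustration; real implementations are far more efficient.

```python
import numpy as np

def agglomerative(D, n_clusters, linkage="average"):
    """Fuse clusters bottom-up using distance matrix D until n_clusters remain."""
    D = np.asarray(D, dtype=float)
    link = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(D.shape[0])]     # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage: summarize all point-to-point distances between clusters.
                d = link(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]     # fuse the closest pair
        del clusters[b]
    return clusters

# Hypothetical one-dimensional data with two obvious groups.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])

clusters = agglomerative(D, n_clusters=2, linkage="average")
```

Changing `linkage` changes only how the distance between two clusters is summarized, which is why different linkages can produce different trees from the same distance matrix.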

Partitioning Methods K-means and K-medians clustering
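K-means can be sketched with Lloyd's algorithm in Python/NumPy: assign each observation to its nearest center, recompute each center as the mean of its observations, and repeat until the centers stop moving. The function name `kmeans`, the random initialization from the data, and the two simulated blobs are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # start at k data points
    for _ in range(n_iter):
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Simulated data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])

labels, centers = kmeans(X, k=2)
```

A K-medians variant would replace the mean with the coordinate-wise median and typically use Manhattan distance in the assignment step, which makes it less sensitive to outliers.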

Density-Based Clustering (DBSCAN) Great visualization of this and related methods: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 The dbscan package from CRAN can be hard to tune, but it can work well in cases where other methods produce poor results.
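The idea behind DBSCAN can be sketched in Python/NumPy as below. This is a simplified illustration written for this transcript, not the code of the CRAN dbscan package; the parameter names `eps` and `min_pts` and the nine example points are assumptions. A point is a core point if at least `min_pts` points (itself included) lie within `eps`; clusters grow outward from core points, and anything unreachable is labeled noise (-1).

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return a cluster label per row of X; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue                       # already assigned, or not a seed
        labels[i] = cluster
        stack = [i]
        while stack:                       # expand the cluster from core points
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:         # border points join but do not expand
                        stack.append(q)
        cluster += 1
    return labels

# Hypothetical data: two tight groups and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],
              [5.0, 5.0], [5.0, 5.5], [5.5, 5.0], [5.5, 5.5],
              [20.0, 20.0]])

labels = dbscan(X, eps=1.0, min_pts=3)
```

The tuning difficulty mentioned above shows up directly here: results depend on both `eps` and `min_pts`, and a poor choice either merges everything into one cluster or labels most points as noise.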

Two-way Clustering As we can measure distance between both observations and variables, we can perform cluster analysis for each and display the results using a cell plot.

Activity: Sports Difficulty Sports experts were asked to rate sports on a variety of attributes required to excel in them. Athlete attributes: power, strength, agility, flexibility, hand-eye coordination, nerves, analytic aptitude, endurance, and durability.

Activity: Sports Difficulty Which sports are clustered together on the basis of the athlete attributes? Which attributes are most associated with each other? Which sports appear to require the most attributes? Do the attribute/sport associations make sense given your prior knowledge?

Correspondence Analysis (CA/MCA) Similar to PCA, but is generally applied to categorical or ordinal data. Simple CA is used to visualize the relationship between two categorical variables, each with more than two levels. Multiple CA extends to larger, multivariate datasets with numerous categorical/ordinal variables.
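Simple CA for a two-way table can be sketched in Python/NumPy as an SVD of the standardized residuals from the independence model. The function name `simple_ca` and the 3 x 3 table of counts are made up for illustration and do not come from the examples in these slides.

```python
import numpy as np

def simple_ca(N):
    """Simple correspondence analysis of a two-way contingency table N."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                              # correspondence matrix
    r = P.sum(axis=1)                            # row masses
    c = P.sum(axis=0)                            # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]   # principal coordinates of rows
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    inertia = s ** 2                             # inertia explained by each axis
    return row_coords, col_coords, inertia

# Hypothetical counts: 3 row categories by 3 column categories.
N = np.array([[30, 10, 5],
              [10, 25, 10],
              [5, 10, 35]])

row_coords, col_coords, inertia = simple_ca(N)

# Total inertia equals the chi-square statistic divided by the table total.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
chi_square = ((N - expected) ** 2 / expected).sum()
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual CA map, where rows and columns that deviate from independence in the same direction appear near each other.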

Example 1: Suicides in former W. Germany (1974 – 1979)

Example 2: Bird Counts and Sampling Sites

Multiple Correspondence Analysis Multiple correspondence analysis is commonly used to examine structure in the results from a survey. • Survey of individuals – explain variation between individuals. Understand relationship between survey items and subject demographics. • Ecological surveys – look at species abundance at several different sample sites. Understand variation between sites, relationship between species, and relationships to site characteristics.

Example 3: Hobby Survey (n )

Recommender Systems • Content-based Recommenders: In content-based recommendation systems we determine what exactly a user likes about product X and recommend products that have similar properties to product X. Example: Kyle MacLachlan. IMDB and Amazon both recommend David Lynch directed films only.
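A content-based recommender can be sketched in Python/NumPy by describing each product with a vector of properties and recommending the products most similar to one the user liked. The four films, their 0/1 features, and the use of cosine similarity are all assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_similar(features, names, liked, top_n=2):
    """Return the top_n items whose features are most similar to the liked item."""
    i = names.index(liked)
    sims = [(cosine_similarity(features[i], features[j]), names[j])
            for j in range(len(names)) if j != i]
    return [name for _, name in sorted(sims, reverse=True)[:top_n]]

# Hypothetical films; columns: same director, same lead actor, mystery, comedy.
names = ["Film A", "Film B", "Film C", "Film D"]
features = np.array([[1, 1, 1, 0],
                     [1, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 1, 1, 1]], dtype=float)

recommendations = recommend_similar(features, names, liked="Film A")
```

Because the recommendations depend only on item properties, a user who liked one film by a given director tends to be shown more of the same, which is the narrowness illustrated by the David Lynch example.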

Recommender Systems • Collaborative-Filtering: In collaborative-filtering, we find other individuals with similar tastes and make recommendations based upon what these other individuals liked. Diagram: my reviews are compared with those of John, Sally, and Ernesto, based upon movies we have all reviewed. Recommendations: the highest rated movies by John, Sally, and Ernesto that I have not seen (or rated or purchased).
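User-based collaborative filtering can be sketched in Python/NumPy as follows. The ratings matrix is made up (the names follow the slide), and cosine similarity on co-rated movies with a similarity-weighted average is just one simple choice among many.

```python
import numpy as np

def predict_ratings(R, user, k=2):
    """Predict the user's unrated items from the k most similar other users.

    R is a users x items array of ratings with np.nan for unrated items.
    """
    rated = ~np.isnan(R)
    sims = np.zeros(R.shape[0])
    for u in range(R.shape[0]):
        common = rated[user] & rated[u]          # items both users have rated
        if u == user or common.sum() < 2:
            continue
        a, b = R[user, common], R[u, common]
        sims[u] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    neighbors = np.argsort(sims)[::-1][:k]       # k most similar users
    preds = {}
    for item in np.flatnonzero(~rated[user]):
        raters = [u for u in neighbors if rated[u, item] and sims[u] > 0]
        if raters:
            w = sims[raters]
            preds[int(item)] = float(w @ R[raters, item] / w.sum())
    return preds

# Hypothetical ratings (1-5) of five movies; rows: me, John, Sally, Ernesto.
nan = np.nan
R = np.array([[5, 4, 1, nan, nan],
              [5, 5, 1, 4, 2],
              [1, 2, 5, 2, 5],
              [4, 4, 2, 5, nan]], dtype=float)

preds = predict_ratings(R, user=0)
best_item = max(preds, key=preds.get)
```

Here John and Ernesto rate the shared movies most like "me", so their ratings drive the predictions for the two movies I have not seen, and the one they both rated highly is recommended.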

Activity: Bob Ross Painting Recommender Suppose that I like the Bob Ross painting Mt. McKinley (Season 1, Episode 2). Can you recommend another painting/episode I might also like?

To do list • More image stuff • Improved NLP • Social Network Analysis • Others?
