1 / 20

Using summary statistics to explore data Exploring data using visualization

서울시립대학교 전기전자컴퓨터공학과 G201449015 이가희. 고급컴퓨터알고리듬. 3 Exploring data. Using summary statistics to explore data Exploring data using visualization Finding problems and issues during data exploration. Using summary statistics to exploring data.

Télécharger la présentation

Using summary statistics to explore data Exploring data using visualization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 서울시립대학교 전기전자컴퓨터공학과 G201449015 이가희 고급컴퓨터알고리듬 3 Exploring data • Using summarystatistics to explore data • Exploring data using visualization • Finding problems and issues during data exploration

  2. Using summary statistics to exploring data ? Summary(data): data의 전반적인 형태를 보여준다. data type - numeric : variety of summary statistics - categorical data(factor & logical) : count statistics custdata <- read.table('custdata.tsv', header=T, sep='\t') str(custdata) summary(custdata) https://github.com/WinVector -> zmPDSwR.zip

  3. Using summary statistics to exploring data ? Summary(data): data의 전반적인 형태를 보여준다. data type - numeric : variety of summary statistics - categorical data(factor & logical) : count statistics custdata <- read.table('custdata.tsv', header=T, sep='\t') str(custdata) summary(custdata) Missing value Invalid value and Outliers Datarange Units

  4. Typical problems reveal by summaries - Missing value!!! MISSING VALUES : 값이 없다. (≠0) drop rows 만이 해결 방법일까? 왜 missing values가 있고, 이것들이 사용할 가치가 있는지 판단할 필요가 있다. “not in the active workforce” (student or stay-at-home partners) only missing a few values -> drop rows!

  5. Typical problems reveal by summaries - Invalid value and Outliers - Datarange INVALID VALUE : 의미 없는 값, missing value -> invalid value ex) non-negative value여야 하는 numeric data (age, income) - negative values DATA RANGE : wide range? narrow range? 무엇을 분석하느냐에 따라 필요한 데이터 범위도 달라진다. ex. 5세에서 10세 사이의 어린이를 위한 읽기능력을 예측 : 유용한 변수 – 연령 20대 이상 -> 데이터 변환 or 빈 연령대로 변환 만약 예측해야 할 문제에 비해 데이터 범위가 좁다면, a rough rule of thumb (평균에 대한 표준편차의 비율) 활용 summary(custdata$income) 0~615,000 : very wide range “amount of debt”-> bad data summary(custdata$age) “age unknown” or “refuse to state”

  6. Typical problems reveal by summaries - Units UNITS : 어떤 단위로 구성되어 있는지 확인해야 한다. days, hours, minutes, kilometers per second, … summary(custdata$income) Income <- custdata$income/1000 summary(Income) 범위 축소 “hourly wage” or “yearly income in units of $1,000”

  7. Spotting problem using graphic and visualization ggplot2() : R에서 기본으로 제공하는 plot()과 유사한 인터페이스를 제공하는 시각화 툴 레이어(layer)를 잘 활용해야 한다. only data.frame 플로팅할 데이터의 column name ggplot(data, aes(x=column, y=column), FUN…) + geometric_object() + FUN… geom_point() (scatter plot) geom_line() (line plot) geom_bar() (bar chart) geom_density (density plot) geom_histogram (histogram) … aesthetic mapping : 데이터를 플로팅할때 쓴다. • ggplot(custdata, aes(x=age)) + • geom_density() • ggplot(custdata) + • geom_density(aes(x=age)) outliers invalid values? http://ggplot2.org

  8. Spotting problem using graphic and visualization - A single variable 1 HISTOGRAM : bin을 기준으로 데이터의 분포를 보여준다. examines data range check number of modes checks if distribution is normal/lognormal checks for anomalies and outliers • ggplot(custdata) + • geom_histogram(aes(x=age), binwidth=5, fill='gray') invalid values outliers

  9. Spotting problem using graphic and visualization - A single variable 2 DENSITY PLOT : bin에 따라 그래프의 모양이 변하는 히스토그램에 비해 그래프 모양이 변하지 않는다. bin의 경계에서 분포가 확연히 달라지지 않는다. (곡선형태) examines data range check number of modes checks if distribution is normal/lognormal checks for anomalies and outliers • ggplot(custdata) + • geom_density(aes(x=income)) + • scale_x_continuous(labels=dollar) continuous position scales

  10. Spotting problem using graphic and visualization - A single variable 3 LOG-SCALED DENSITY PLOT : 로그 밀도 그래프 • ggplot(custdata) + • geom_density(aes(x=income)) + • scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) + • annotation_logticks(sides='bt') log tick on bottom and top (default) annotation: log tick marks

  11. Spotting problem using graphic and visualization - A single variable 4 BAR CHART : compares relative or absolute frequencies of the values of a categorical variable • ggplot(custdata) + • geom_bar(aes(x=marital.stat), fill='gray')

  12. Spotting problem using graphic and visualization - A single variable 5 HORIZONTAL BAR CHART • ggplot(custdata) + • geom_bar(aes(x=state.of.res), fill='gray') + • coord_flip() + • theme(axis.text.y=element_text(size=rel(0.8))) relative sizing for theme elements to modify theme settings flipped cartesian coordinates

  13. Spotting problem using graphic and visualization - A single variable 5 HORIZONTAL BAR CHART • statesums <- table(custdata$state.of.res) • statef <- as.data.frame(statesums) • colnames(statef) <- c('state.of.res', 'count') • statef <- transform(statef, state.of.res=reorder(state.of.res, count)) • ggplot(statef) + • geom_bar(aes(x=state.of.res, y=count), stat='identity', fill='gray') + • coord_flip() + • theme(axis.text.y=element_text(size=rel(0.8))) reorder levels of a factor

  14. Spotting problem using graphic and visualization - Relationship two variables 6 STACKED BAR CHART : var1값 안에서의 var2값의 분포를 보여준다. 7 SIDE-BY-SIDE BAR CHART : 각각의 var1에 대한 var2값을 나란히 배치 8 FILLED BAR CHART : 일정한 틀 안에서 var2의 상대적인 비율을 보여준다. • ggplot(custdata) + • geom_bar(aes(x=marital.stat, fill=health.ins), ) , position=‘dodge' , position=‘fill'

  15. Spotting problem using graphic and visualization - Relationship two variables 9 BAR CHART WITH FACETING : a large number of categories를 가진 column들을 차트로 나타냈을 때, 각각의 항목에 대해 나눠서 보자 • custdata2 <- subset(custdata, • (custdata$age>0 & custdata$age<100 & custdata$income>0)) • ggplot(custdata2) + • geom_bar(aes(x=housing.type, fill=marital.stat), position='dodge') + • theme(axis.text.x=element_text(angle=45, hjust=1)) • ggplot(custdata2) + • geom_bar(aes(x=marital.stat), position='dodge', fill='darkgray') + • facet_wrap(~housing.type, scales='free_y') + • theme(axis.text.x=element_text(angle=45, hjust=1)) horizontal justification should scales be free in one dimension default(fixed) 분포를 거의 알아보기 힘들다.

  16. Spotting problem using graphic and visualization - Relationship two variables 10 LINE PLOT : 두 변수간의 연관성을 볼 수 있다. 하지만, 데이터가 서로 관련이 없으면 유용하지 않다. • x <- runif(100) • y <- x^2 + 0.2*x • ggplot(data.frame(x=x, y=y), aes(x=x, y=y)) + • geom_line()

  17. Spotting problem using graphic and visualization - Relationship two variables 11 SCATTER PLOT + α : two numeric variables relationship! Q. age, income … relationship? 연관관계를 알아보기 힘들다 • cor(custdata2$age, custdata2$income) • ggplot(custdata2, aes(x=age, y=income)) + • geom_point() + • ylim(0, 200000) • ggplot(custdata2, aes(x=age, y=income)) + • geom_point() + • stat_smooth(method='lm') + • ylim(0, 200000) correlation smoothing method 선 그리기 ??? * se (default) = true

  18. Spotting problem using graphic and visualization - Relationship two variables 12 SMOOTHING CURVE • ggplot(custdata2, aes(x=age, y=income)) + • geom_point() + • geom_smooth() + • ylim(0, 200000) • ggplot(custdata2, aes(x=age, y=as.numeric(health.ins))) + • geom_point(position=position_jitter(w=0.05, h=0.05)) + • geom_smooth() a smoothed conditional mean continuous + a boolean ~ 40 : increase 55 ~ : decrease

  19. Spotting problem using graphic and visualization - Relationship two variables 13 HEXBIN PLOT : 2-dimensional histogram ggplot(custdata2, aes(x=age, y=income)) + geom_hex(binwidth=c(5, 10000)) + geom_smooth(color='white', se=F) + ylim(0, 200000)

  20. Key point! • 모델링 하기 전에 데이터를 살펴보는 시간을 갖자. • Summary() : helps you spot issues • with data range, units, data type, and missing or invalid values. • Visualization : 변수 사이의 데이터 분포와 이들 간의 관계성을 보는데 도움을 준다.

More Related