Summarizing Data

Statistics

[Diagram: probability vs. statistics, with labels "probability", "sampling", "inference", and "statistics"]

Distribution ?

Distribution: a mathematical way to represent the diversity of characteristics within a group. The group may be a sample or a population.
• population distribution
• distribution of a sample

[Diagram: statistics connects the population distribution and the distribution of a sample; labels: "realistic", "imaginary", "data", "Theory (model)"]

Statistics starts from data.

Data are not just sets of numbers. Data are clues to the truth and tell us something about it.

The 1st principle of statistics: the sample is not the same as the population, but it represents the population sufficiently well.

Datawork

Woodwork & Datawork

Woodwork                      Datawork
• From forest                 • From real world
• Making timber               • Data collecting
• Inspecting wood grain       • Exploring data
• Cutting                     • Reducing data
• Structuring                 • Modeling
• Finishing                   • Evaluating

Craft & Endeavor

Tools & Skills

Statistical tools
• Paper, pencil & calculator
• Spreadsheet SW (Excel)
• Minitab, SPSS, SAS, R
• DBMS (Access, Oracle, …)
• C/C++, Java, Python, …
You need skill to use these.

You also need craft and experience. The more important point in datawork, however, is trying to gain perspective on the data at hand.

There is no standard recipe for good datawork. Think, think and think! That's the only way.

Salagadoola mechicka boola, bibbidi-bobbidi-boo! Datawork is not magic. It's a hard job.

Wood grain ?

Grain of data ?

Seeing the grain of data ≈ Exploratory Data Analysis

Exploratory Data Analysis (EDA): the step that checks the basic properties of the data using basic statistical methods. With EDA, we aim to develop insight into the data as a first step toward more specific analysis.

Basic Statistical Methods: Qualitative variable
• frequency table
• crosstabulation (contingency table)
• bar chart, pie chart, …

Basic Statistical Methods: Quantitative variable
• (cumulative) frequency distribution
• histogram
• dot plot
• stem & leaf diagram
• scatter plot
• box plot, …

Example Data
Credit_Card_Bank: p22 of SVV
• 12 variables & 100 observations
• Many types of ‘offer’ to cardholders
• Goal: find the type of ‘offer’ that increases cardholders’ usage the most.

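Variables:
 [1] "Offer.Status"               (Categorical)
 [2] "Charges.Aug.2008"           (Quantitative)
 [3] "Charges.Sept.2008"          (Quantitative)
 [4] "Charges.Oct.2008"           (Quantitative)
 [5] "Marketing.Segment"          (Categorical)
 [6] "Industry.Segment"           (Categorical)
 [7] "Spendlift.After.Promotion"  (Quantitative)
 [8] "Pre.Promotion.Avg.Spend"    (Quantitative)
 [9] "Post.Promotion.Avg.Spend"   (Quantitative)
[10] "Retail.Customer"            (Yes, No)
[11] "Enrolled.in.Program"        (Yes, No)
[12] "Spendlift.Positive"         (Yes, No)

Working variables: oct08 is variable [4], loct08 = log(oct08), mseg is variable [5], iseg is variable [6].

    # List the data files in the folder and build their full paths
    data.svv <- dir("c:/temp/text")
    dfile.svv <- paste("c:/temp/text/", data.svv, sep = "")
    # Read the first tab-separated file; stringsAsFactors = TRUE keeps
    # text columns as factors (no longer the default from R 4.0 on)
    dsv <- read.table(dfile.svv[1], header = TRUE, sep = "\t", stringsAsFactors = TRUE)
    names(dsv)
    oct08 <- dsv[, 4]
    loct08 <- log(oct08)
    xoct08 <- loct08[oct08 > 0]   # drop zero charges, whose log is -Inf
    mseg <- dsv[, 5]; iseg <- dsv[, 6]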

log(oct08), rounded to the 2nd decimal place (log(0) = -Inf):

    round(loct08, 2)

 [1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97
[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85
[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46
[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03
[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93
[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05
[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63
[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61
[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65
[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

    round(sort(xoct08), 2)

 [1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42
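The -Inf entries arise because log(0) evaluates to -Inf in R, so they have to be dropped before sorting or summarizing. A minimal sketch of that step, using a small made-up vector (`charges` and its values are hypothetical, not taken from the data set):

```r
# Hypothetical charges; zeros stand for cardholders with no spending
charges <- c(0, 500, 52.5, 0, 1050)

lcharges <- log(charges)                 # log(0) evaluates to -Inf
n_dropped <- sum(!is.finite(lcharges))   # number of cases to be removed

# Keep only the finite logs, then sort them in increasing order
xcharges <- sort(lcharges[is.finite(lcharges)])

n_dropped
round(xcharges, 2)
```

Filtering with is.finite() on the log values is equivalent here to the condition oct08 > 0 used on the raw charges.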

iseg
 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T
The meaning of the levels is not known.

mseg
 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
L: low, B: below medium, M: medium, A: above medium, H: high

    # Rename the five original levels, in their stored order, with one-letter codes
    levels(mseg) <- c("M", "H", "L", "A", "B")
    # Reorder from low to high; ordered = TRUE gives the "L < B < M < A < H" display
    mseg <- factor(mseg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
    mseg

[Figure: Histogram of loct08; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]

    hist(xoct08, col = "grey")

Stem and leaf display: leaf unit = 0.1

    stem(xoct08)

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

Each row has a stem (left of the bar) and leaves (right of the bar); for example, 2 | 5 reads as 2.5.

Stem and leaf display: leaf unit = 1

    stem(10*xoct08)

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

Here 2 | 5 reads as 25.

Five-number summary of log(oct08):

    summary(xoct08)

  Min.     Q1       Median   Q3       Max.
  2.509    5.563    6.864    7.682    10.420

IQR = Q3 - Q1 = 7.682 - 5.563 = 2.119

Quartiles: Q1, Q2, Q3
Q1: the value ranked at 25% from the lowest
Q2: the value ranked at 50% from the lowest
Q3: the value ranked at 75% from the lowest
Median = Q2
IQR (Inter-Quartile Range) = Q3 - Q1

How to find Q1, Q2, Q3
For n values sorted in increasing order, x[1], ..., x[n], compute the rank position c:
Q1: c = 0.25*(n+1)
Q2: c = 0.5*(n+1)
Q3: c = 0.75*(n+1)
If c is an integer, take the c-th ranked value x[c].
If c is not an integer, take (x[c-] + x[c+])/2, where
c- : the largest integer below c
c+ : the smallest integer above c

Example: for the sorted values of log(oct08) listed earlier (after deleting the 7 cases of -Inf), n = 93, so 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5.
Q1 = (x[23] + x[24])/2 = (5.56 + 5.56)/2 = 5.56
Q2 = x[47] = 6.86
Q3 = (x[70] + x[71])/2 = (7.68 + 7.85)/2 = 7.765
Q1 and Q2 agree with summary(xoct08). Q3 differs from the 7.682 reported there because R's default quantile definition uses a different rank position, 1 + 0.75*(n-1) = 70.
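The (n+1) rank rule can be sketched in R as follows. This is only an illustration under stated assumptions: `slide_quantile` is a hypothetical helper name, not a base R function, and the data vector is made up.

```r
# Quartile by the (n + 1) rank rule; `slide_quantile` is a hypothetical
# helper written for illustration, not part of base R.
slide_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  pos <- p * (n + 1)            # rank position c
  lo <- max(floor(pos), 1)      # c-, clamped to a valid index
  hi <- min(ceiling(pos), n)    # c+, clamped to a valid index
  (x[lo] + x[hi]) / 2           # equals x[c] when c is an integer
}

x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15)   # made-up data, n = 10
q <- sapply(c(0.25, 0.5, 0.75), slide_quantile, x = x)
q

# R's built-in default (type 7) interpolates and may give different values
quantile(x, c(0.25, 0.5, 0.75))
```

For this made-up vector the rank positions are 2.75, 5.5 and 8.25, so the rule returns 4, 7.5 and 11.5. Textbooks and software use several quantile definitions, so small differences between this rule and quantile() or summary() are expected.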

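[Figure: Dot plot of loct08; axis from 2 to 12]

[Figure: Boxplot of oct08 (axis from 0 to 30000) beside Boxplot of log(oct08) (axis from 4 to 10)]

    boxplot(oct08)
    boxplot(xoct08)

[Diagram: anatomy of a boxplot. The box runs from Q1 to Q3 (its length is the IQR) with a line at Q2. The whiskers extend to min(non-outlier) and max(non-outlier), reaching at most 1.5 IQR beyond the box. Points farther out are marked with * as mild outliers or extreme outliers.]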
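A rough sketch of how points could be classified against the 1.5 IQR limit follows. Several things here are assumptions rather than part of the source: `classify_outliers` is a hypothetical helper, the 3 IQR cutoff for "extreme" is a common convention that the diagram does not state, and the data are made up.

```r
# Classify points by how far they lie outside the box, in IQR units.
# `classify_outliers` is a hypothetical helper. The 1.5 * IQR limit follows
# the boxplot diagram; the 3 * IQR limit for "extreme" is an assumed convention.
classify_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  dist <- pmax(q[1] - x, x - q[2], 0)   # distance outside the box (0 if inside)
  cut(dist,
      breaks = c(-Inf, 1.5 * iqr, 3 * iqr, Inf),
      labels = c("regular", "mild", "extreme"))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30)   # made-up data
status <- classify_outliers(x)
data.frame(x, status)
```

R's own boxplot() draws every point beyond 1.5 IQR with the same symbol and computes the box from hinges, so its picture can differ slightly from this classification. The sketch also assumes the IQR is positive; with IQR = 0 the break points coincide and cut() fails.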

Frequency table

                     freq    rel. freq    cum. freq    cum. rel. freq
Low Spender            26       0.26          26           0.26
Med Low Spender        20       0.20          46           0.46
Average Spender        11       0.11          57           0.57
Med High Spender       25       0.25          82           0.82
High Spender           18       0.18         100           1.00
------------------------------------------------------------
Total                 100       1.00

    table(mseg)                          # freq
    table(mseg)/length(mseg)             # relative freq
    cumsum(table(mseg))                  # cumulative freq
    cumsum(table(mseg))/length(mseg)     # cumulative relative freq
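The four columns can also be assembled into one data frame. The sketch below types in the counts from the frequency table by hand so that it runs without the data file; the column names are illustrative choices.

```r
# Counts copied from the frequency table (ordered from low to high spender)
freq <- c("Low Spender" = 26, "Med Low Spender" = 20, "Average Spender" = 11,
          "Med High Spender" = 25, "High Spender" = 18)

freq_table <- data.frame(
  freq         = freq,                       # frequency
  rel_freq     = freq / sum(freq),           # relative frequency
  cum_freq     = cumsum(freq),               # cumulative frequency
  cum_rel_freq = cumsum(freq) / sum(freq)    # cumulative relative frequency
)
freq_table
```

Cumulative columns are meaningful here only because the categories have a natural order; for an unordered variable such as iseg they would depend on an arbitrary ordering of the levels.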

[Figure: Bar chart of log(oct08) grouped into the intervals (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; vertical axis from 0 to 20]

Histogram & Bar chart
Histogram: for quantitative variables; the bars are connected.
Bar chart: for categorical variables; the bars are disconnected.
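One way to see the relationship between the two displays is to bin a quantitative variable into intervals and treat the intervals as categories. The sketch below uses a handful of values picked from the sorted log(oct08) listing, not the full data; the choice of unit-width bins is an assumption made to match the interval labels (2,3], (3,4], and so on.

```r
# A few values of log(oct08) picked from the sorted listing
x <- c(2.51, 3.11, 3.96, 4.33, 5.25, 5.56, 6.12, 6.86, 7.85, 9.08, 10.42)

# Bin into right-closed unit intervals (2,3], (3,4], ..., (10,11]
bins <- cut(x, breaks = 2:11)
counts <- table(bins)      # the intervals are now categories
counts

# The histogram computes the same counts on a continuous axis
h <- hist(x, breaks = 2:11, plot = FALSE)
h$counts
```

Passing `counts` to barplot() draws separated bars, one per interval label, while plot(h) draws adjoining bars over the numeric axis. The heights are the same; only the treatment of the horizontal axis differs.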

Contingency table of mseg and iseg

                  iseg
mseg       A     B     R     T    Total
L          5    13     0     8     26
B         11     8     0     1     20
M          2     4     2     3     11
A          8     7     2     8     25
H          5     0     6     7     18
Total     31    32    10    27    100

    table(mseg, iseg)
    apply(table(mseg, iseg), 1, sum)   # row totals
    apply(table(mseg, iseg), 2, sum)   # column totals
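The margins can also be attached to the table directly, and row proportions make the segments easier to compare. The sketch below types in the counts from the contingency table so that it runs without the data file; with the real data, `tab` would simply be table(mseg, iseg).

```r
# Counts copied from the contingency table (rows: mseg, columns: iseg)
tab <- matrix(c( 5, 13, 0, 8,
                11,  8, 0, 1,
                 2,  4, 2, 3,
                 8,  7, 2, 8,
                 5,  0, 6, 7),
              nrow = 5, byrow = TRUE,
              dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                              iseg = c("A", "B", "R", "T")))
tab <- as.table(tab)

with_totals <- addmargins(tab)             # adds a "Sum" row and column
with_totals

row_prop <- prop.table(tab, margin = 1)    # share of each iseg level within an mseg row
round(row_prop, 2)
```

Row proportions answer questions such as "within the high-spending segment, how are the industry segments distributed?", which raw counts obscure when the row totals differ.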

[Figure: Pie chart of iseg; A: 31, B: 32, R: 10, T: 27]

    pie(table(iseg), col = c("red", "light green", "green", "blue"))

[Figure: Segmented bar chart of (mseg, iseg), serial (segments stacked); x-axis: iseg levels A, B, R, T; vertical axis from 0 to 30]

    barplot(table(mseg, iseg), col = c("red", "light green", "green", "blue", "purple"))

[Figure: Segmented bar chart of (mseg, iseg), parallel (bars side by side); x-axis: iseg levels A, B, R, T; vertical axis from 0 to 12]

    barplot(table(mseg, iseg), col = c("red", "light green", "green", "blue", "purple"), beside = TRUE)

[Figure: Mosaic Plot; horizontal axis: iseg (A, B, R, T), vertical axis: mseg (L, B, M, A, H)]

    mosaicplot(~iseg + mseg, col = rainbow(5))
