
Summarizing Data



  1. Summarizing Data

  2. Statistics

  3. Probability vs. statistics [diagram labels: probability, sampling, inference, statistics]

  4. Distribution ?

  5. Distribution: a mathematical way to represent the diversity of characteristics of a group. The group may be a sample or a population. • population distribution • distribution of a sample

  6. [Diagram labels: statistics; population distribution; distribution of a sample; realistic; imaginary; data; theory (model)]

  7. Statistics starts from data.

  8. Data are not just sets of numbers. Data are clues to the truth, and they tell us something about it.

  9. The 1st principle of statistics: the sample is not the same as the population, but it represents the population sufficiently well.

  10. Datawork

  11. Woodwork & Datawork
      Woodwork: From forest • Making timber • Inspecting wood grain • Cutting • Structuring • Finishing
      Datawork: From real world • Data collecting • Exploring data • Reducing data • Modeling • Evaluating

  12. Craft & Endeavor

  13. Tools & Skills

  14. Statistical tools • Paper, pencil & calculator • Spreadsheet software (Excel) • Minitab, SPSS, SAS, R • DBMS (Access, Oracle, …) • C/C++, Java, Python, … You need skill to use these tools.

  15. You also need craft and experience. However, the more important point in datawork is trying to gain perspective on the data at hand.

  16. There is no standard recipe for good datawork. Think, think, and think! That's the only way.

  17. Salagadoola mechicka boola, bibbidi-bobbidi-boo: datawork is not magic. It's hard work.

  18. Wood grain ?

  19. Grain of data ?

  20. Seeing the grain of data ≈ Exploratory Data Analysis

  21. Exploratory Data Analysis (EDA): the step in which we check the basic properties of the data using basic statistical methods. With EDA we aim to develop insight into the data, as a first step toward more specific analysis.

  22. Basic Statistical Methods: qualitative variable • frequency table • cross-tabulation (contingency table) • bar chart, pie chart, …

  23. Basic Statistical Methods: quantitative variable • (cumulative) frequency distribution • histogram • dot plot • stem-and-leaf diagram • scatter plot • box plot, …

  24. Example data: Credit_Card_Bank (p. 22 of SVV) • 12 variables & 100 observations • Many types of 'offer' to cardholders • Goal: find the type of 'offer' that increases cardholders' usage the most.

  25. Variables in the data set:
      [1] "Offer.Status" (Categorical)
      [2] "Charges.Aug.2008" (Quantitative)
      [3] "Charges.Sept.2008" (Quantitative)
      [4] "Charges.Oct.2008" (Quantitative): oct08, with loct08 = log(oct08)
      [5] "Marketing.Segment" (Categorical): mseg
      [6] "Industry.Segment" (Categorical): iseg
      [7] "Spendlift.After.Promotion" (Quantitative)
      [8] "Pre.Promotion.Avg.Spend" (Quantitative)
      [9] "Post.Promotion.Avg.Spend" (Quantitative)
      [10] "Retail.Customer" (Yes, No)
      [11] "Enrolled.in.Program" (Yes, No)
      [12] "Spendlift.Positive" (Yes, No)
      R code:
      # list the data files and build their full paths
      data.svv <- dir("c:/temp/text")
      dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
      # read the first file (tab-separated, with a header row)
      dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t")
      names(dsv)
      # October charges, their logs, and the logs of the positive charges only
      oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]
      mseg <- dsv[,5]; iseg <- dsv[,6]

  26. log(oct08), rounded to 2 decimal places; log(0) = -Inf
       [1]  -Inf  6.21  3.96  3.84  6.96  6.95  7.89  7.35  3.97  5.97
      [11]  5.50  8.00  6.30  3.13  4.58  8.89  3.81  7.00  8.37  7.85
      [21]  7.42  6.86  8.45  6.12  5.62  8.21  6.91  6.87  7.15  5.46
      [31]  6.71  6.12  -Inf  7.68  9.08  5.91  3.42  6.12  8.05  7.03
      [41]  6.02  2.51  7.20  3.29  7.44  5.88  6.33  6.24  4.33  5.93
      [51]  5.25  7.85  8.76  7.15  7.95  7.13  -Inf  7.13  8.11  8.05
      [61]  9.11  5.56  8.24  -Inf  7.47  6.70  7.52  6.53  8.33  4.63
      [71]  6.80  5.72  7.54  3.48  7.57  8.42  8.16  4.67  7.16  5.61
      [81]  -Inf 10.42  8.73  4.85  -Inf  6.63  5.48  4.89  8.35  4.65
      [91]  5.56  7.39  3.11  3.90  5.72  7.10  -Inf  7.58  8.15  6.30
      R code: round(loct08,2)
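The -Inf entries above come from taking the log of a zero charge. A minimal sketch of how such values can be counted and dropped before further analysis, using a small hypothetical vector `charges` (not the slide's data):

```r
# Hypothetical charges; the zeros stand for months with no card usage.
charges <- c(0, 500, 52.3, 0, 1050)

lcharges <- log(charges)               # log(0) evaluates to -Inf in R
n_inf <- sum(is.infinite(lcharges))    # how many cases will be lost

# Keep only the finite logs; here this matches lcharges[charges > 0].
xcharges <- lcharges[is.finite(lcharges)]

print(round(lcharges, 2))
print(n_inf)
print(length(xcharges))
```

Filtering on `is.finite()` also drops `NA` and `NaN`, which may or may not be what is wanted for a given data set.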

  27. Sorted values of log(oct08), after deleting the 7 cases of -Inf:
       [1]  2.51  3.11  3.13  3.29  3.42  3.48  3.81  3.84  3.90  3.96
      [11]  3.97  4.33  4.58  4.63  4.65  4.67  4.85  4.89  5.25  5.46
      [21]  5.48  5.50  5.56  5.56  5.61  5.62  5.72  5.72  5.88  5.91
      [31]  5.93  5.97  6.02  6.12  6.12  6.12  6.21  6.24  6.30  6.30
      [41]  6.33  6.53  6.63  6.70  6.71  6.80  6.86  6.87  6.91  6.95
      [51]  6.96  7.00  7.03  7.10  7.13  7.13  7.15  7.15  7.16  7.20
      [61]  7.35  7.39  7.42  7.44  7.47  7.52  7.54  7.57  7.58  7.68
      [71]  7.85  7.85  7.89  7.95  8.00  8.05  8.05  8.11  8.15  8.16
      [81]  8.21  8.24  8.33  8.35  8.37  8.42  8.45  8.73  8.76  8.89
      [91]  9.08  9.11 10.42
      R code: round(sort(xoct08),2)

  28. iseg
       [1] B B A T A T B A A T B T T B B B B B T B R A T A A
      [26] R B B R T T T A A B T B A R B B A T B B R T T A A
      [51] B A B B T A A T B A B A B R B A A R A T B T T B R
      [76] T A T A A B B B T R T T R T B A A A A A A B T A T
      Levels: A B R T
      The meaning of the levels is not known.

  29. mseg
       [1] M L L M B A L A M H M L A M M B L B H L
      [21] H B L H H M A B H L A H A B L H L B A A
      [41] A H A L L H L A B A A B A B B A M A B L
      [61] L B B H B A B A B L B A H L M L L M A B
      [81] A L L M H A H H L A H L B A H A L L L H
      Levels: L < B < M < A < H
      L: low, B: below medium, M: medium, A: above medium, H: high
      R code:
      # relabel the existing levels (in their current order) with short codes
      levels(mseg) <- c("M","H","L","A","B")
      # reorder the levels from low to high; ordered=TRUE gives the "L < B < ..." display
      mseg <- factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)
      mseg
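The recoding of mseg works in two steps: first the labels are replaced, then the levels are put in a meaningful order. The sketch below illustrates this on a hypothetical vector `raw`; it assumes the raw column holds full segment names such as "Low Spender", which the slide does not show at this point.

```r
# Hypothetical raw values; the full segment names are an assumption.
raw <- factor(c("Low Spender", "High Spender", "Average Spender",
                "Med Low Spender", "Med High Spender", "Low Spender"))

# factor() sorts levels alphabetically:
# Average, High, Low, Med High, Med Low
print(levels(raw))

# Step 1: relabel, giving the new codes in that same alphabetical order.
seg <- raw
levels(seg) <- c("M", "H", "L", "A", "B")

# Step 2: impose the order L < B < M < A < H.
seg <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)

print(seg)
print(table(seg))
```

Because step 1 depends on the alphabetical order of the existing levels, it is worth printing `levels()` before relabeling; a mismatch silently assigns codes to the wrong segments.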

  30. Histogram of loct08 [figure: x-axis loct08, from 2 to 10; y-axis Frequency, from 0 to 20]
      R code: hist(xoct08,col="grey")

  31. Stem and leaf display: leaf unit = 0.1
       2 | 5
       3 | 11345889
       4 | 003667789
       5 | 3555666677999
       6 | 0011122333567789999
       7 | 000111122244445556678999
       8 | 0001222234444789
       9 | 11
      10 | 4
      Reading: "2 | 5" stands for 2.5 (2 is the stem, 5 is the leaf).
      R code: stem(xoct08)

  32. Stem and leaf display: leaf unit = 1
       2 | 5
       3 | 11345889
       4 | 003667789
       5 | 3555666677999
       6 | 0011122333567789999
       7 | 000111122244445556678999
       8 | 0001222234444789
       9 | 11
      10 | 4
      Reading: "2 | 5" stands for 25.
      R code: stem(10*xoct08)

  33. Five-number summary of log(oct08):
        Min.      Q1   Median      Q3     Max.
       2.509   5.563    6.864   7.682   10.420
      IQR = 2.119
      R code: summary(xoct08)
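A minimal sketch of how the five numbers and the IQR can be obtained in R. The vector `x` is hypothetical and only stands in for the log charges.

```r
# Hypothetical values, listed in sorted order for readability.
x <- c(2.5, 3.1, 4.6, 5.6, 6.1, 6.9, 7.2, 7.7, 8.4, 9.1, 10.4)

# Min, Q1, Median, Q3, Max
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Inter-quartile range, Q3 - Q1
iqr <- IQR(x)

print(summary(x))   # summary() reports the same five numbers plus the mean
print(five)
print(iqr)
```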

  34. Quartiles: Q1, Q2, Q3
      Q1: the value ranked 25% of the way up from the lowest
      Q2: the value ranked 50% of the way up from the lowest
      Q3: the value ranked 75% of the way up from the lowest
      Median = Q2
      IQR (inter-quartile range) = Q3 - Q1

  35. How to find Q1, Q2, Q3: compute the rank position c.
      Q1: c = 0.25*(n+1)
      Q2: c = 0.5*(n+1)
      Q3: c = 0.75*(n+1)
      If c is an integer, take the c-th ranked value x[c].
      If c is not an integer, take (x[c-] + x[c+])/2, where
      c- is the largest integer below c and
      c+ is the smallest integer above c.

  36. Sorted values of log(oct08), after deleting the 7 cases of -Inf:
       [1]  2.51  3.11  3.13  3.29  3.42  3.48  3.81  3.84  3.90  3.96
      [11]  3.97  4.33  4.58  4.63  4.65  4.67  4.85  4.89  5.25  5.46
      [21]  5.48  5.50  5.56  5.56  5.61  5.62  5.72  5.72  5.88  5.91
      [31]  5.93  5.97  6.02  6.12  6.12  6.12  6.21  6.24  6.30  6.30
      [41]  6.33  6.53  6.63  6.70  6.71  6.80  6.86  6.87  6.91  6.95
      [51]  6.96  7.00  7.03  7.10  7.13  7.13  7.15  7.15  7.16  7.20
      [61]  7.35  7.39  7.42  7.44  7.47  7.52  7.54  7.57  7.58  7.68
      [71]  7.85  7.85  7.89  7.95  8.00  8.05  8.05  8.11  8.15  8.16
      [81]  8.21  8.24  8.33  8.35  8.37  8.42  8.45  8.73  8.76  8.89
      [91]  9.08  9.11 10.42
      n = 93: 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
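The rank-position rule can be sketched as a small function and applied to the 93 sorted values printed above. The helper name `slide_quantile` is hypothetical, and the inputs are the rounded values from the slide, so results are accurate to two decimals only.

```r
# Sorted log(oct08) values as printed on the slide (rounded to 2 decimals).
x <- c(2.51, 3.11, 3.13, 3.29, 3.42, 3.48, 3.81, 3.84, 3.90, 3.96,
       3.97, 4.33, 4.58, 4.63, 4.65, 4.67, 4.85, 4.89, 5.25, 5.46,
       5.48, 5.50, 5.56, 5.56, 5.61, 5.62, 5.72, 5.72, 5.88, 5.91,
       5.93, 5.97, 6.02, 6.12, 6.12, 6.12, 6.21, 6.24, 6.30, 6.30,
       6.33, 6.53, 6.63, 6.70, 6.71, 6.80, 6.86, 6.87, 6.91, 6.95,
       6.96, 7.00, 7.03, 7.10, 7.13, 7.13, 7.15, 7.15, 7.16, 7.20,
       7.35, 7.39, 7.42, 7.44, 7.47, 7.52, 7.54, 7.57, 7.58, 7.68,
       7.85, 7.85, 7.89, 7.95, 8.00, 8.05, 8.05, 8.11, 8.15, 8.16,
       8.21, 8.24, 8.33, 8.35, 8.37, 8.42, 8.45, 8.73, 8.76, 8.89,
       9.08, 9.11, 10.42)

# Quartile by the rank-position rule: c = p * (n + 1).
# If c is an integer, return x[c]; otherwise average the two
# neighbouring ranks. Positions below 1 or above n are not handled.
slide_quantile <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)
  lo <- floor(pos)
  hi <- ceiling(pos)
  (x[lo] + x[hi]) / 2   # lo equals hi when pos is an integer
}

q <- sapply(c(0.25, 0.5, 0.75), slide_quantile, x = x)
print(q)   # Q1, Q2, Q3
```

With positions 23.5, 47 and 70.5, this gives Q1 = (x[23] + x[24])/2 = 5.56, Q2 = x[47] = 6.86 and Q3 = (x[70] + x[71])/2 = 7.765. R's `summary()` and `quantile()` use a different default interpolation rule, which places Q3 at x[70] here; that is why the five-number summary reports 7.682 rather than 7.765.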

  37. Dot plot of loct08 [figure: axis from 2 to 12]

  38. Boxplot of oct08 and boxplot of log(oct08) [figures: oct08 axis from 0 to 30000; log(oct08) axis from 4 to 10]
      R code: boxplot(oct08); boxplot(xoct08)

  39. Anatomy of a box plot [figure labels: mild outlier (*), extreme outlier (*), Q1, Q2, Q3, IQR, min(non-outlier), min(non-outlier), 1.5 IQR]
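The outlier labels in the box plot can be made concrete with fences computed from the quartiles. The sketch below uses Q1 and Q3 from the five-number summary of log(oct08). It assumes the common convention that mild outliers lie more than 1.5 IQR beyond the box and extreme outliers more than 3 IQR beyond it; the slide shows only the 1.5 IQR label, so the factor 3 is an assumption. The helper `classify` and the test values 11.5 and 15 are hypothetical.

```r
q1 <- 5.563            # Q1 of log(oct08), from the five-number summary
q3 <- 7.682            # Q3 of log(oct08)
iqr <- q3 - q1         # 2.119

inner <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # beyond these: at least mild
outer <- c(q1 - 3 * iqr,   q3 + 3 * iqr)     # beyond these: extreme (assumed factor)

# Label each value as "none", "mild" or "extreme".
classify <- function(v) {
  ifelse(v < outer[1] | v > outer[2], "extreme",
         ifelse(v < inner[1] | v > inner[2], "mild", "none"))
}

print(round(inner, 3))
print(round(outer, 3))
# Observed min and max of log(oct08), plus two hypothetical values.
print(classify(c(2.509, 10.420, 11.5, 15)))
```

Note that R's `boxplot()` computes its box from hinges rather than from these quartiles, so its whiskers may differ slightly from the fences above.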

  40. Frequency table
                          freq   rel. freq   cum. freq   cum. rel. freq
      Low Spender           26      0.26         26          0.26
      Med Low Spender       20      0.20         46          0.46
      Average Spender       11      0.11         57          0.57
      Med High Spender      25      0.25         82          0.82
      High Spender          18      0.18        100          1.00
      ------------------------------------------------------------
      Total                100      1.00
      R code:
      table(mseg)
      table(mseg)/length(mseg)
      cumsum(table(mseg))
      cumsum(table(mseg))/length(mseg)
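The four columns of the frequency table can be assembled into one data frame. This is a sketch: the factor `mseg` is rebuilt here from the counts in the table rather than read from the data file, and the column names are illustrative.

```r
# Rebuild a factor with the segment counts shown in the frequency table.
lev <- c("L", "B", "M", "A", "H")
mseg <- factor(rep(lev, times = c(26, 20, 11, 25, 18)), levels = lev)

freq <- table(mseg)
rel  <- prop.table(freq)      # same as table(mseg) / length(mseg)

ftab <- data.frame(
  freq         = as.vector(freq),
  rel.freq     = as.vector(rel),
  cum.freq     = cumsum(as.vector(freq)),
  cum.rel.freq = cumsum(as.vector(rel)),
  row.names    = names(freq)
)
print(ftab)
```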

  41. Bar chart of log(oct08) [figure: bars for the classes (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; frequency axis from 0 to 20]

  42. Histogram & bar chart
      Histogram: for quantitative variables; the bars are connected.
      Bar chart: for categorical variables; the bars are disconnected.
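A bar chart of a quantitative variable, with class labels such as (2,3], can be produced by first binning the values into classes. The sketch below uses a hypothetical vector `x`; it shows that the binned counts match the histogram counts, so the two displays differ only in how the bars are drawn.

```r
# Hypothetical values standing in for the log charges.
x <- c(2.5, 3.1, 3.9, 4.6, 5.5, 5.6, 6.1, 6.9, 7.0, 7.2, 7.7, 8.4, 9.1, 10.4)

# Bin into the classes (2,3], (3,4], ..., (10,11] and count.
classes <- cut(x, breaks = 2:11)
counts <- table(classes)

# The histogram with the same breaks has the same counts.
h <- hist(x, breaks = 2:11, plot = FALSE)

print(counts)
print(h$counts)

# barplot(counts)        # categorical display: separated bars
# hist(x, breaks = 2:11) # quantitative display: touching bars
```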

  43. Contingency table of mseg and iseg
                        iseg
      mseg        A     B     R     T   Total
      L           5    13     0     8     26
      B          11     8     0     1     20
      M           2     4     2     3     11
      A           8     7     2     8     25
      H           5     0     6     7     18
      Total      31    32    10    27    100
      R code:
      table(mseg,iseg)
      apply(table(mseg,iseg),1,sum)
      apply(table(mseg,iseg),2,sum)
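Row and column totals can also be attached to the table in a single step, and row proportions make the segments easier to compare. A sketch, with the counts typed in from the contingency table instead of being computed from the data file:

```r
# Counts from the contingency table (rows: mseg, columns: iseg).
counts <- matrix(c( 5, 13, 0, 8,
                   11,  8, 0, 1,
                    2,  4, 2, 3,
                    8,  7, 2, 8,
                    5,  0, 6, 7),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                                 iseg = c("A", "B", "R", "T")))
tab <- as.table(counts)

# Append a "Sum" row and column.
with_totals <- addmargins(tab)

# Share of each industry segment within each marketing segment.
row_share <- prop.table(tab, margin = 1)

print(with_totals)
print(round(row_share, 2))
```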

  44. Pie chart of iseg [figure: A 31, B 32, R 10, T 27]
      R code: pie(table(iseg),col=c("red","light green","green","blue"))

  45. Segmented bar chart of (mseg, iseg), serial (stacked) [figure: x-axis iseg with A, B, R, T; count axis from 0 to 30]
      R code: barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

  46. Segmented bar chart of (mseg, iseg), parallel (side by side) [figure: x-axis iseg with A, B, R, T; count axis from 0 to 12]
      R code: barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

  47. Mosaic plot [figure: iseg (A, B, R, T) against mseg (L, B, M, A, H)]
      R code: mosaicplot(~iseg+mseg,col=rainbow(5))
