Statistical Methods in Computer Science

Presentation Transcript


  1. Descriptive Statistics
     Data 1: Frequency Distributions
     Ido Dagan
     Statistical Methods in Computer Science

  2. Concrete Theory: Relates Variables to Each Other
     • Examples:
       • Mathematically accurate
         • Memory = 2*sizeof(input) + 3
         • Runtime = 500 + 30*sizeof(input) + 20
       • Asymptotically correct
         • Memory = O(sizeof(input)) in worst case
         • Runtime = O(log(sizeof(input))) in best case
         • Accuracy is proportional to run-time
       • Qualitative
         • User performance is increased with reduced cognitive load
         • Number of bugs discovered is monotonically decreasing, but positive, if the same programmer is used; otherwise, it increases

  3. Behavior Parameters/Variables (typical of Computer Science)
     • Hardware parameters
       • CPU model and organization, cache organization, latencies in the system
     • System parameters
       • Memory availability, usage
       • CPU running time (sometimes approximated by wall-clock time)
       • Communication bandwidth, usage
     • Program characteristics
       • requires floating-point, heavy disk usage, integer math, graphics
       • large heap, large stack, uses non-local information, ...

  4. Scales of Measurements
     • Nominal (also called categorical): No order, just labels
       • e.g., “Algorithm Name”
     • Ordinal (also called rank): Order, but not numerical
       • Difference between ranks is not necessarily the same
       • e.g., ranks in (hierarchical/military) organization
     • Interval: Difference between values has same meaning everywhere
       • e.g., temperature in Celsius (rise of 10 degrees is the same everywhere)
       • But 100C is not twice as hot as 50C, and 0C is not lack of heat
     • Ratio: Interval + Fixed zero point
       • e.g., robot position, memory usage, run-time

  5. Scale Hierarchy
     • Nominal < Ordinal < Interval < Ratio
     • Propositions that are true at some level are also true at the levels above it
       • But not necessarily the other way around
     • e.g., we can calculate the mean (average) value for numerical variables
       • But not for nominal and ordinal
     • e.g., we can calculate the most frequent value for all variables
     • “Numerical”: the Interval and Ratio scales
     • http://en.wikipedia.org/wiki/Levels_of_measurement
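
The hierarchy above can be illustrated with a minimal Python sketch. The function name `summarize`, the scale labels, and the sample values are assumptions made for illustration; they are not part of the slides. The most frequent value (mode) is computed for every scale, while the mean is computed only for numerical (interval or ratio) data.

```python
from statistics import mean, mode

# Assumption: scales are passed as lowercase strings; only these two
# count as "numerical" in the sense used on the slide.
NUMERICAL_SCALES = {"interval", "ratio"}

def summarize(values, scale):
    """Return the statistics that are meaningful for the given scale."""
    summary = {"mode": mode(values)}      # valid for all four scales
    if scale in NUMERICAL_SCALES:
        summary["mean"] = mean(values)    # valid only for numerical scales
    return summary

# Hypothetical data: algorithm names are nominal, run-times are ratio.
print(summarize(["quicksort", "mergesort", "quicksort"], "nominal"))
print(summarize([2, 4, 4, 6], "ratio"))
```

Gating the mean on the scale, rather than on the Python type, reflects that ordinal ranks may be stored as numbers without their mean being meaningful.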

  6. Variables
     • Discrete:
       • Can take on only certain values: symbols, exact numbers
       • For ordinal, interval and ratio scales, this means there will be gaps
       • e.g., User satisfaction surveys, memory usage
     • Continuous:
       • Can take on any value within its range: no gaps
       • e.g., run-time, CPU temperature, robot velocity and position
       • In practice: limited by measurement accuracy
       • Up to researcher to determine needed accuracy

  7. Data • The collection of values that a variable X took during the measurement

  8. Describing Data
     • Our task:
       • Describe the data we have collected
       • Find ways to characterize it, represent it
       • Find properties that are true of the data

  9. Data Distribution
     • The collection of data is called the sample distribution
     • We will investigate distributions:
       • Find values that “best” represent a distribution
       • Measure their dispersion, range, shape
       • Identify extraordinary values in a distribution
       • Find visual representations for a distribution
     • Remember hierarchy: Nominal < Ordinal < Interval < Ratio
       • Think about how the following techniques apply

  10. Frequency Distribution
      • Examine the frequency of values
      • f(x) = # of times variable took on value x.

  12. Frequency Distribution
      • Examine the frequency of values
      • f(x) = # of times variable took on value x.
      • Convention (Ordinal/Numerical): Sort by value
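
As a rough sketch of the definition f(x), the following Python snippet counts how often each value occurs and sorts the result by value, following the convention for ordinal/numerical data. The function name and the sample scores are hypothetical.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, f) pairs sorted by value, where f is the
    number of times the variable took on that value."""
    counts = Counter(values)
    return sorted(counts.items())

# Hypothetical sample of six scores.
scores = [85, 90, 85, 70, 90, 85]
for value, f in frequency_distribution(scores):
    print(value, f)
```

For nominal data the sort order carries no meaning, so the sort step could be dropped or replaced by sorting on frequency.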

  13. Grouped Frequency Distributions
      • In ordinal/numerical variables, possible to group values together
      • Create Grouped Frequency Distributions

  14. Grouped Frequency Distributions
      • In ordinal/numerical variables, possible to group values together
      • Create Grouped Frequency Distributions
      • Warning: Loss of Information
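
A minimal sketch of grouping, assuming integer scores and intervals of equal width aligned at multiples of the width (for example 85-89 and 90-94 for a width of 5). The function name, the alignment rule, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def grouped_frequency(values, width=5):
    """Count integer scores per interval [low, low + width - 1],
    where low is a multiple of width (an assumed alignment)."""
    counts = Counter((v // width) * width for v in values)
    return [((low, low + width - 1), f) for low, f in sorted(counts.items())]

# Hypothetical scores: 85, 86 and 89 all collapse into the interval 85-89.
print(grouped_frequency([85, 86, 89, 90, 94, 70]))
```

The output no longer distinguishes 85 from 89, which is the loss of information the slide warns about.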

  15. Real and Apparent Limits
      • Continuous values are more difficult to divide into intervals
        • Score of 95 falls within 95-99, not within 90-94
        • But what about a temperature of 94.87? 94 < 94.87 < 95!
      • By convention, the real limits of a score are within ½ the measurement resolution
        • If our resolution is 0.1, then limits are within 0.05
        • If our resolution is 100, then limits are within 50
      • Note: we break convention only for exceptional cases
        • e.g., age: “I am 35” is true of [35.0 .. 36.0)

  16. Real/Apparent Limits
      • For example:
        • Resolution of 0.01: interval 95..99 really covers values 94.995 to 99.005
          • Apparent limits: 95..99
          • Real limits: 94.995 to 99.005
        • Resolution of 10: interval 740-800 really covers values 735 to 805
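
The half-resolution convention can be written as a short Python sketch. The function name `real_limits` is an assumption; the two calls reproduce the examples given on the slide.

```python
def real_limits(apparent_low, apparent_high, resolution):
    """Widen the apparent limits by half the measurement resolution
    on each side, following the convention described above."""
    half = resolution / 2
    return apparent_low - half, apparent_high + half

print(real_limits(95, 99, 0.01))   # approximately (94.995, 99.005)
print(real_limits(740, 800, 10))   # (735.0, 805.0)
```

The exceptional cases mentioned on the previous slide, such as age, would need a different rule (for example, extending only the upper limit).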

  17. Relative Frequency Distributions
      • A frequency count can be misleading
        • Algorithm X was fastest on 60,000 trials: Is this good?
        • 100,000 people voted for candidate A: Is she the winner?
      • Relative frequency distributions: translate f into percentage or ratio
        • rel f (proportion) = f/N
        • rel f (%) = 100 * f/N
      • Warning: Can be misleading, if ignoring count magnitude
        • 50% of all test cases succeeded (with only two cases…)
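
A small sketch of the two formulas, rel f = f/N and rel f (%) = 100 * f/N. The function name and the input dictionary are hypothetical; the example reuses the slide's two-test-case scenario to show why N should be reported alongside the percentage.

```python
def relative_frequencies(freqs):
    """Map each value to (proportion, percent), given a dict of counts f.
    N is the total number of observations."""
    n = sum(freqs.values())
    return {value: (f / n, 100 * f / n) for value, f in freqs.items()}

# Hypothetical outcome counts for only two test cases.
outcomes = {"succeeded": 1, "failed": 1}
print(relative_frequencies(outcomes), "N =", sum(outcomes.values()))
```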

  18. Relative Frequency Distributions • Example: f/N

  19. Cumulative Frequency Distribution
      • For ordinal/numerical variables
      • Where values are with respect to others: how many are below or above
      • [Table: Cumulative frequency distribution]

  20. Cumulative Frequency Distribution
      • Based on the cumulative distribution, we can answer questions such as:
        • What percentage of scores fall below 80?
        • How many scores are below 95?
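
As an illustrative sketch, a cumulative frequency column can be built with a running sum over a frequency distribution that is already sorted by value. The function name, the tuple layout, and the sample data are assumptions; here cum f counts the observations at or below each value.

```python
from itertools import accumulate

def cumulative_distribution(sorted_freqs):
    """Given (value, f) pairs sorted by value, return tuples
    (value, f, cum_f, cum_percent), where cum_f counts the
    observations at or below the value."""
    n = sum(f for _, f in sorted_freqs)
    running = accumulate(f for _, f in sorted_freqs)
    return [(v, f, c, 100 * c / n) for (v, f), c in zip(sorted_freqs, running)]

# Hypothetical frequency distribution of six scores.
for row in cumulative_distribution([(70, 1), (85, 3), (90, 2)]):
    print(row)
```

With grouped data, the same running sum is taken over intervals, and cum f is read against the upper real limit of each interval.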

  21. Percentiles, Percentile Ranks
      • (Value of) Percentile X: Value for which X percent of values are lower
        • e.g., baby height
      • We use Px to denote the Xth percentile, e.g., P98 is in range 90-94.
      • Percentile rank of value X: the percent of values that fall below X.
        • e.g., percentile rank of the interval 65-69 is 12.

  22. Computing Percentiles, P. Ranks
      • How do we compute percentiles and percentile ranks from grouped data?
      • What is the score which defines the top 20% of scores?
        • Is it between 84 and 85?

  23. Computing Percentiles
      • We want to compute P80. 80% of 50 cases = 40 cases.
      • We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).

  25. Computing Percentiles
      • We want to compute P80. 80% of 50 cases = 40 cases.
      • We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).
      • We need 8 more.

  26. Computing Percentiles
      • We want to compute P80. 80% of 50 cases = 40 cases.
      • We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).
      • We need 8 more.
      • The interval 85-89 contains 47-32 = 15 cases.
        • real limit 84.5
        • These are spread over width of 5 (= 89.5-84.5).
      • Assume scores are evenly distributed within interval
        • 8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation)
        • P80 = 84.5 + 2.67 = 87.17
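
The interpolation steps above can be sketched in Python. The function and parameter names are assumptions; the numbers in the call (N = 50, lower real limit 84.5, width 5, 32 cases below the interval, 15 cases inside it) are the ones quoted on the slide. The sketch assumes, as the slide does, that scores are evenly spread within the interval.

```python
def percentile_from_grouped(p, n, lower_real_limit, width, cum_below, f_interval):
    """Linear interpolation of the p-th percentile inside the interval
    that contains it, assuming an even spread of scores in the interval."""
    target = p * n / 100                # cases that must lie below Pp (40)
    needed = target - cum_below         # cases still needed in the interval (8)
    return lower_real_limit + needed / f_interval * width

p80 = percentile_from_grouped(80, 50, 84.5, 5, 32, 15)
print(round(p80, 2))  # 87.17
```

The caller is responsible for first locating the interval whose cumulative frequency range contains the target count; this sketch only performs the final interpolation.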

  27. Computing Percentile Ranks
      • We want to compute the percentile rank of 86
      • Lies in the interval 85-89, real limits 84.5 – 89.5.
      • 86-84.5 = 1.5 score points.
      • Width of interval = 5. Assuming uniform spread of scores in interval: 1.5/5 = 0.3 ==> 30% of scores in interval (0.3*15 = 4.5)

  28. Computing Percentile Ranks
      • We want to compute the percentile rank of 86
      • Lies in the interval 85-89, real limits 84.5 – 89.5.
      • 86-84.5 = 1.5 score points.
      • Width of interval = 5. 1.5/5 = 0.3 ==> 30% of scores in interval (0.3*15 = 4.5)
      • So we have 32 scores up to 84.5
      • 4.5 scores from 84.5 to 86.
      • Total: 4.5 + 32 = 36.5 scores.
      • 36.5/50 = 73%. This is the percentile rank of 86.
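
The inverse computation can be sketched the same way. Again, the function and parameter names are assumptions, the values in the call are those quoted on the slide, and scores are assumed to be uniformly spread within the interval.

```python
def percentile_rank_from_grouped(x, n, lower_real_limit, width, cum_below, f_interval):
    """Percent of cases below x, interpolating linearly inside the
    interval that contains x (uniform spread assumed)."""
    fraction = (x - lower_real_limit) / width          # 1.5 / 5 = 0.3
    cases_below = cum_below + fraction * f_interval    # 32 + 4.5 = 36.5
    return 100 * cases_below / n

rank = percentile_rank_from_grouped(86, 50, 84.5, 5, 32, 15)
print(round(rank, 1))  # 73.0
```

Feeding the P80 value from the earlier percentile computation (about 87.17) back into this function returns roughly 80, which is a quick consistency check between the two procedures.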

  29. Frequency Distributions and Scales

  30. Displaying Frequency Distributions: Nominal Data

  31. Displaying Frequency Distributions: Ordinal/Numerical Data
      • Histogram

  32. Displaying Frequency Distributions: Ordinal/Numerical Data
      • Histogram: Different Grouping

  33. Lying with Visuals

  34. Characteristics of Distributions
      • Shape, Central Tendency, Variability
      • [Figure panels: Different Central Tendency; Different Variability]
