
Linear Search Efficiency Assessment

Presentation Transcript


  1. Linear Search Efficiency Assessment P. Pete Chong Gonzaga University Spokane, WA 99258-0009 chong@gonzaga.edu

  2. Why Search Efficiency? • Information systems help users obtain the “right” information for better decision making, which requires searching • Better organization reduces the time needed to find this “right” information, which requires sorting • Are the savings in search worth the cost of the sort?

  3. Search Cost • If we assume every record is equally likely to be requested, the average cost of a linear search is (n+1)/2 comparisons. That is, for 1000 records, the average search cost is approximately 500. • For large n, we may use n/2 to simplify the calculation
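
To make the (n+1)/2 figure concrete, here is a minimal simulation sketch. It is not part of the original presentation; the record count and trial count are illustrative choices, and a uniformly random access pattern is assumed:

    import random

    # Average number of comparisons a linear search makes when every one of
    # the n records is equally likely to be requested.
    def average_linear_search_cost(n, trials=100_000):
        total = 0
        for _ in range(trials):
            target = random.randint(1, n)  # position of the requested record
            total += target                # linear search inspects `target` records
        return total / trials

    print(average_linear_search_cost(1000))  # ~500.5, i.e. (n + 1) / 2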

  4. Search Cost • In reality, the usage pattern is not random • For payroll, for example, every record is accessed once and only once; in this case sorting has no effect on search efficiency • Most of the time the usage distribution follows the 80/20 Rule (the Pareto Principle)

  5. The Pareto Principle

  6. Author/paper distribution (Group/Author Index i; Number of Papers ni; Number of Authors f(ni); Paper Subtotal ni·f(ni); Cumulative Authors; Cumulative Papers; Author Proportion xi; Paper Proportion qi)

      i    ni  f(ni)  ni·f(ni)  Cum. Authors  Cum. Papers     xi      qi
     26   242     1       242          1           242      0.003   0.137
     25   114     1       114          2           356      0.005   0.202
     24   102     1       102          3           458      0.008   0.260
     23    95     1        95          4           553      0.011   0.314
     22    58     1        58          5           611      0.014   0.347
     21    49     1        49          6           660      0.016   0.374
     20    34     1        34          7           694      0.019   0.394
     19    22     2        44          9           738      0.024   0.419
     18    21     2        42         11           780      0.030   0.442
     17    20     2        40         13           820      0.035   0.465
     16    18     1        18         14           838      0.038   0.475
     15    16     4        64         18           902      0.049   0.512
     14    15     2        30         20           932      0.054   0.529
     13    14     1        14         21           946      0.057   0.537
     12    12     2        24         23           970      0.062   0.550
     11    11     5        55         28          1025      0.076   0.581
     10    10     3        30         31          1055      0.084   0.598
      9     9     4        36         35          1091      0.095   0.619
      8     8     8        64         43          1155      0.116   0.655
      7     7     8        56         51          1211      0.138   0.687
      6     6     6        36         57          1247      0.154   0.707
      5     5    10        50         67          1297      0.181   0.736
      4     4    17        68         84          1365      0.227   0.774
      3     3    29        87        113          1452      0.305   0.824
      2     2    54       108        167          1560      0.451   0.885
      1     1   203       203        370          1763      1.000   1.000

     Total number of groups: 26
     Average number of papers per author (m = R/T): 4.7649

  7. A Typical Pareto Curve

  8. Formulate the Pareto Curve Chen et al. (1994) define f(ni) = the number of authors with ni papers, T = Σ f(ni) = the total number of authors, R = Σ ni·f(ni) = the total number of papers, and m = R/T = the average number of published papers per author

  9. Formulate the Pareto Curve For each index level i, let xi be the cumulative fraction of the total number of authors and qi be the cumulative fraction of the total papers published; then xi = (Σj≥i f(nj))/T and qi = (Σj≥i nj·f(nj))/R.

  10. Formulate the Pareto Curve Plugging the values above into (qi − qi+1)/(xi − xi+1), Chen et al. derive the slope formula: si = ni/m. When ni = 1, si = 1/m = T/R; let’s call this particular slope a.

  11. Revisit the Pareto Curve a = 370/1763 = 0.21
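
The slope a can be reproduced directly from the grouped data on slide 6. A minimal sketch; the variable names are illustrative, not from Chen et al.:

    # (papers per author n_i, number of authors f(n_i)) pairs from slide 6
    groups = [
        (242, 1), (114, 1), (102, 1), (95, 1), (58, 1), (49, 1), (34, 1),
        (22, 2), (21, 2), (20, 2), (18, 1), (16, 4), (15, 2), (14, 1),
        (12, 2), (11, 5), (10, 3), (9, 4), (8, 8), (7, 8), (6, 6),
        (5, 10), (4, 17), (3, 29), (2, 54), (1, 203),
    ]

    T = sum(f for _, f in groups)      # total authors: 370
    R = sum(n * f for n, f in groups)  # total papers: 1763
    m = R / T                          # average papers per author: 4.7649
    a = T / R                          # slope at n_i = 1: 0.21

    print(T, R, round(m, 4), round(a, 2))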

  12. The Significance • We now have a quick way to quantify different degrees of usage concentration • Simulation shows that in most situations a moderate sample size is sufficient to assess the usage concentration • a, the inverse of the average usage, is easy to calculate

  13. Search Cost Calculation • The average search cost for a randomly distributed list is n/2. Thus, for 1000 records, the search cost is 500. • For a list sorted by usage with an 80/20 distribution, the search cost is (200/2)(80%) + [(200+1000)/2](20%) = 200, a saving of 60%

  14. Search Cost Calculation Let the first number in the 80/20 rule be a and the second be b. Since these two numbers are actually percentages, we have a + b = 1. Thus, the expected search cost for a list of n records is the weighted average: (bn/2)(a) + [(bn+n)/2](b) = (bn/2)(a + b + 1) = (bn/2)(2) = bn
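
The weighted average above is easy to verify numerically. A minimal sketch, assuming (as on slide 13) that the frequently used b·n records occupy the front of the list:

    # Expected linear-search cost under an a/b usage split (a + b = 1):
    # a fraction `a` of accesses scan only the hot b*n records at the front;
    # the remaining fraction `b` scan (b*n + n)/2 records on average.
    def expected_search_cost(b, n):
        a = 1 - b
        hot = (b * n / 2) * a
        cold = ((b * n + n) / 2) * b
        return hot + cold  # algebraically equal to b * n

    print(expected_search_cost(0.2, 1000))  # 200.0 — the 80/20 example, 60% below 500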

  15. Search Cost Calculation • Thus, b indicates the cost of search as a percentage of the records in the list, and bn represents an upper bound on the number of comparisons. • For a list fully sorted by usage with an 80/20 distribution, Knuth (1973) has shown that the average search cost C(n) is only 0.122n.

  16. Search Cost Simulation

  17. Search Cost Simulation

  18. Search Cost Estimate Regression analyses yield: b = 0.15 + 0.359a, for 0.2 < a < 1.0; b = 0.034 + 0.984a, for 0 < a < 0.2; and C(n)/n = 0.02 + 0.49a.
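
These regression fits are straightforward to encode. A minimal sketch (the function names are illustrative, not from the presentation):

    # Regression estimates from slide 18, both driven by the slope a.
    def b_estimate(a):
        """Upper-bound cost fraction b, as a function of the slope a."""
        return 0.034 + 0.984 * a if a < 0.2 else 0.15 + 0.359 * a

    def sorted_cost_fraction(a):
        """Average cost C(n)/n for a list fully sorted by usage."""
        return 0.02 + 0.49 * a

    a, n = 0.21, 1000  # slope from the author/paper data on slide 11
    print(b_estimate(a) * n)            # ~225 records: upper-bound estimate
    print(sorted_cost_fraction(a) * n)  # ~123 records: close to Knuth's 0.122n

For a = 0.21 and n = 1000 this brackets the cost between roughly 123 and 225 comparisons per lookup, consistent with the conclusion that follows.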

  19. Conclusion • The true search cost lies between the estimates bn and C(n) • We may use C(n) ≈ 0.5an as a quick estimate of the search cost of a list fully sorted by usage. • That is, take a moderate sample of usage; the search cost will be roughly half the inverse of the average usage, times the total number of records.

  20. “Far-fetched” (?) Applications • Define and assess the degree of monopoly? What is the effect of monopoly? Note the gap between b and C(n) (ideal). • Gini Index?
