
Computer Science and the Socio-Economic Sciences


Presentation Transcript


  1. Computer Science and the Socio-Economic Sciences Fred Roberts, Rutgers University

  2. CS and SS • Many recent applications in CS involve issues/problems of long interest to social scientists: • preference, utility • conflict and cooperation • allocation • incentives • consensus • social choice • measurement • Methods developed in SS beginning to be used in CS

  3. CS and SS • CS applications place great strain on SS methods • Sheer size of problems addressed • Computational power of agents an issue • Limitations on information possessed by players • Sequential nature of repeated applications • Thus: Need for new generation of SS methods • Also: These new methods will provide powerful tools to social scientists

  4. CS and SS: Outline • 1. CS and Consensus/Social Choice • 2. CS and Game Theory • 3. Algorithmic Decision Theory

  5. CS and SS: Outline • 1. CS and Consensus/Social Choice • 2. CS and Game Theory • 3. Algorithmic Decision Theory

  6. CS and Consensus/Social Choice • Relevant social science problems: voting, group decision making • Goal: based on everyone’s opinions, reach a “consensus” • Typical opinions: • “first choice” • ranking of all alternatives • scores • classifications • Long history of research on such problems.

  7. CS and Consensus/Social Choice Background: Arrow’s Impossibility Theorem: There is no “consensus method” that satisfies certain reasonable axioms about how societies should reach decisions. Input: rankings of alternatives. Output: consensus ranking. Kenneth Arrow Nobel prize winner

  8. CS and Consensus/Social Choice There are widely studied and widely used consensus methods. One well-known consensus method: “Kemeny-Snell medians”: Given a set of rankings, find the ranking minimizing the sum of distances to the other rankings. Kemeny-Snell medians are finding surprising new applications in CS. John Kemeny, pioneer in time sharing in CS

  9. CS and Consensus/Social Choice Kemeny-Snell distance between rankings: twice the number of pairs of candidates i and j for which i is ranked above j in one ranking and below j in the other, plus the number of pairs that are strictly ranked in one ranking and tied in the other. Kemeny-Snell median: Given rankings a1, a2, …, ap, find a ranking x so that d(a1,x) + d(a2,x) + … + d(ap,x) is minimized. Sometimes just called the Kemeny median.

  10. CS and Consensus/Social Choice Example: a1: Fish > Chicken > Beef; a2: Fish > Chicken > Beef; a3: Chicken > Fish > Beef. Median = a1. If x = a1: d(a1,x) + d(a2,x) + d(a3,x) = 0 + 0 + 2 is minimized. If x = a3, the sum is 4. For any other x, the sum is at least 1 + 1 + 1 = 3.
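
As an aside to slides 9–10, here is a minimal brute-force sketch (my own illustration, not code from the talk) of the Kemeny-Snell distance and median; the ranking encoding (candidate → rank level, ties sharing a level) and all names are assumptions made for this example.

```python
# Illustrative sketch only: Kemeny-Snell distance between two rankings, and a
# brute-force median search over all weak orders (rankings with ties) of a small
# candidate set. Encoding: dict mapping candidate -> rank level (smaller = better).
from itertools import product

def sgn(x):
    return (x > 0) - (x < 0)

def ks_distance(r1, r2):
    """2 per pair ordered oppositely in the two rankings, 1 per pair tied in
    exactly one of them, 0 per pair treated the same way in both."""
    cands = sorted(r1)
    return sum(abs(sgn(r1[i] - r1[j]) - sgn(r2[i] - r2[j]))
               for a, i in enumerate(cands) for j in cands[a + 1:])

def ks_median(rankings):
    """Exhaustively try every assignment of rank levels (hence every weak order)
    and keep one minimizing the sum of distances to the input rankings."""
    cands = sorted(rankings[0])
    best = min((dict(zip(cands, levels))
                for levels in product(range(len(cands)), repeat=len(cands))),
               key=lambda x: sum(ks_distance(a, x) for a in rankings))
    return best, sum(ks_distance(a, best) for a in rankings)

# Slide 10 profile: a1 = a2 = Fish > Chicken > Beef, a3 = Chicken > Fish > Beef.
a1 = {"Fish": 0, "Chicken": 1, "Beef": 2}
a2 = {"Fish": 0, "Chicken": 1, "Beef": 2}
a3 = {"Chicken": 0, "Fish": 1, "Beef": 2}
print(ks_median([a1, a2, a3]))   # recovers a1 with total distance 0 + 0 + 2 = 2
```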

  11. CS and Consensus/Social Choice Example: a1: Fish > Chicken > Beef; a2: Chicken > Beef > Fish; a3: Beef > Fish > Chicken. Three medians = a1, a2, a3. This is the “voter’s paradox” situation.

  12. CS and Consensus/Social Choice Same rankings: a1: Fish > Chicken > Beef; a2: Chicken > Beef > Fish; a3: Beef > Fish > Chicken. Note that sometimes we wish to minimize d(a1,x)² + d(a2,x)² + … + d(ap,x)². A ranking x that minimizes this is called a Kemeny-Snell mean. In this example, there is one mean: the ranking declaring all three alternatives tied.

  13. CS and Consensus/Social Choice Same rankings: a1: Fish > Chicken > Beef; a2: Chicken > Beef > Fish; a3: Beef > Fish > Chicken. If x is the ranking declaring Fish, Chicken, and Beef tied, then d(a1,x)² + d(a2,x)² + d(a3,x)² = 3² + 3² + 3² = 27. It is not hard to show this is the minimum.
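
Likewise, a short self-contained check of slides 12–13 (again my own sketch, not from the talk): searching all weak orders for the one minimizing the sum of squared Kemeny-Snell distances in the voter's-paradox profile returns the all-tied ranking.

```python
# Illustrative sketch: Kemeny-Snell mean (minimize the sum of SQUARED distances)
# for the voter's-paradox profile on slide 13.
from itertools import product

def ks_distance(r1, r2):
    sgn = lambda v: (v > 0) - (v < 0)
    c = sorted(r1)
    return sum(abs(sgn(r1[i] - r1[j]) - sgn(r2[i] - r2[j]))
               for a, i in enumerate(c) for j in c[a + 1:])

profile = [{"Fish": 0, "Chicken": 1, "Beef": 2},   # a1
           {"Chicken": 0, "Beef": 1, "Fish": 2},   # a2
           {"Beef": 0, "Fish": 1, "Chicken": 2}]   # a3

cands = sorted(profile[0])
mean = min((dict(zip(cands, lv)) for lv in product(range(3), repeat=3)),
           key=lambda x: sum(ks_distance(a, x) ** 2 for a in profile))
print(mean, sum(ks_distance(a, mean) ** 2 for a in profile))
# all three alternatives end up on the same level, with cost 3^2 + 3^2 + 3^2 = 27
```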

  14. CS and Consensus/Social Choice • Theorem (Bartholdi, Tovey, and Trick, 1989; Wakabayashi, 1986): Computing the Kemeny median of a set of rankings is an NP-complete problem.

  15. Meta-search and Collaborative Filtering • Meta-search • A consensus problem • Combine page rankings from several search engines • Dwork, Kumar, Naor, Sivakumar (2000): Kemeny-Snell medians provide good spam resistance in meta-search (a page “spams” if it causes the meta-search to rank it too highly) • Approximation methods make this computationally tractable
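
One way to see why cheap approximations matter here: exact Kemeny aggregation is NP-hard, while simple positional heuristics run in near-linear time. The sketch below is purely illustrative, a Borda-style aggregation with made-up page names; it is not the specific method of Dwork et al.

```python
# Illustrative Borda-style rank aggregation for meta-search (not the Dwork et al.
# algorithm): each engine's ranking awards positional points, and pages are sorted
# by total points to form one consensus list.
from collections import defaultdict

def borda_aggregate(rankings):
    """Each ranking is a list of page ids, best first; a page absent from a
    ranking simply gets no points from it."""
    scores = defaultdict(float)
    for r in rankings:
        for pos, page in enumerate(r):
            scores[page] += len(r) - pos          # higher total = better
    return sorted(scores, key=scores.get, reverse=True)

engines = [["a.com", "b.com", "c.com"],
           ["b.com", "a.com", "d.com"],
           ["a.com", "c.com", "b.com"]]
print(borda_aggregate(engines))   # ['a.com', 'b.com', 'c.com', 'd.com']
```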

  16. Meta-search and Collaborative Filtering • Collaborative Filtering • Recommending books or movies • Combine book or movie ratings • Produce ordered list of books or movies to recommend • Freund, Iyer, Schapire, Singer (2003): “Boosting” algorithm for combining rankings. • Related topic: Recommender Systems

  17. Meta-search and Collaborative Filtering • A major difference from SS applications: • In SS applications, number of voters is large, number of candidates is small. • In CS applications, number of voters (search engines) is small, number of candidates (pages) is large. • This makes for major new complications and research challenges.

  18. Large Databases and Inference • Real data often in form of sequences • GenBank has over 7 million sequences comprising 8.6 billion bases. • The search for similarity or patterns has extended from pairs of sequences to finding patterns that appear in common in a large number of sequences or throughout the database: “consensus sequences”. • Emerging field of “Bioconsensus”: applies SS consensus methods to biological databases.

  19. Large Databases and Inference Why look for such patterns? Similarities between sequences or parts of sequences lead to the discovery of shared phenomena. For example, it was discovered that the sequence for platelet-derived growth factor, which causes growth in the body, is 87% identical to the sequence for v-sis, a cancer-causing gene. This led to the discovery that v-sis works by stimulating growth.

  20. Large Databases and Inference Example Bacterial Promoter Sequences studied by Waterman (1989): RRNABP1: ACTCCCTATAATGCGCCA TNAA: GAGTGTAATAATGTAGCC UVRBP2: TTATCCAGTATAATTTGT SFC: AAGCGGTGTTATAATGCC Notice that if we are looking for patterns of length 4, each sequence has the pattern TAAT.

  21. Large Databases and Inference Example Bacterial Promoter Sequences studied by Waterman (1989): RRNABP1: ACTCCCTATAATGCGCCA TNAA: GAGTGTAATAATGTAGCC UVRBP2: TTATCCAGTATAATTTGT SFC: AAGCGGTGTTATAATGCC Notice that if we are looking for patterns of length 4, each sequence has the pattern TAAT.

  22. Large Databases and Inference Example However, suppose that we add another sequence: M1 RNA: AACCCTCTATACTGCGCG The pattern TAAT does not appear here. However, it almost appears, since the pattern TACT appears, and this has only one mismatch from the pattern TAAT.

  23. Large Databases and Inference Example However, suppose that we add another sequence: M1 RNA: AACCCTCTATACTGCGCG The pattern TAAT does not appear here. However, it almost appears, since the pattern TACT appears, and this has only one mismatch from the pattern TAAT. So, in some sense, the pattern TAAT is a good consensus pattern.

  24. Large Databases and Inference Example We make this precise using best mismatch distance. Consider two sequences a and b with b longer than a. Then d(a,b) is the smallest number of mismatches in all possible alignments of a as a consecutive subsequence of b.

  25. Large Databases and Inference Example a = 0011, b = 111010. Possible alignments of a as a consecutive subsequence of b: against 1110 (3 mismatches), against 1101 (3 mismatches), against 1010 (2 mismatches). The best-mismatch distance is 2, which is achieved in the third alignment.
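
A small sketch of the best-mismatch distance of slide 24 (an assumed implementation, not code from the talk): slide the shorter string along the longer one and take the fewest mismatches over all alignments.

```python
# Illustrative best-mismatch distance: minimum number of mismatches over all
# alignments of the shorter sequence as a consecutive subsequence of the longer.
def best_mismatch(a, b):
    if len(a) > len(b):
        a, b = b, a
    return min(sum(x != y for x, y in zip(a, b[i:i + len(a)]))
               for i in range(len(b) - len(a) + 1))

print(best_mismatch("0011", "111010"))   # 2, from the third alignment on slide 25
```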

  26. Large Databases and Inference Example Now given a database of sequences a1, a2, …, an. Look for a pattern of length k. One standard method (Smith-Waterman): look for a consensus sequence b that minimizes Σi d(b,ai) (equivalently, maximizes Σi [k − d(b,ai)]), where d is best-mismatch distance. In fact, this turns out to be equivalent to calculating medians like Kemeny-Snell medians. Algorithms for computing consensus sequences are important in modern molecular biology.
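
For small pattern lengths the consensus idea can also be checked exhaustively. The sketch below (my own illustration, not Waterman's algorithm) scores every length-4 window occurring in the five promoter sequences by its total best-mismatch distance to the database, confirming that TAAT is among the best consensus patterns, with total distance 1.

```python
# Illustrative exhaustive consensus-pattern search: try every length-k window that
# occurs somewhere in the database and keep the patterns with the smallest total
# best-mismatch distance to all sequences.
def best_mismatch(a, b):
    return min(sum(x != y for x, y in zip(a, b[i:i + len(a)]))
               for i in range(len(b) - len(a) + 1))

def consensus_patterns(seqs, k):
    cands = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    cost = lambda p: sum(best_mismatch(p, s) for s in seqs)
    best = min(map(cost, cands))
    return best, sorted(p for p in cands if cost(p) == best)

seqs = ["ACTCCCTATAATGCGCCA",   # RRNABP1
        "GAGTGTAATAATGTAGCC",   # TNAA
        "TTATCCAGTATAATTTGT",   # UVRBP2
        "AAGCGGTGTTATAATGCC",   # SFC
        "AACCCTCTATACTGCGCG"]   # M1 RNA
print(consensus_patterns(seqs, 4))   # total distance 1; TAAT is among the winners
```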

  27. Large Databases and Inference • Preferential Queries • Look for flight from New York to Beijing • Have preferences for • airline • itinerary • type of ticket • Try to combine responses from multiple travel-related websites • Sequential decision making: Next query or information access depends on prior responses.

  28. Consensus Computing, Image Processing • Old SS problem: Dynamic modeling of how individuals change opinions over time, eventually reaching consensus. • Often use dynamic models on graphs • Related to neural nets. • CS applications: distributed computing. • Values of processors in a network are updated until all have same value.
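
A tiny sketch of that distributed-averaging idea (illustrative only; the graph, values, and update rule are assumptions, not from the talk): each node repeatedly replaces its value by the average of its own value and its neighbors' values, and on a connected graph the values approach a common consensus value.

```python
# Illustrative DeGroot-style consensus iteration on a small network: every node
# averages its value with its neighbours' values each round.
def iterate_consensus(values, neighbours, rounds=100):
    v = dict(values)
    for _ in range(rounds):
        v = {i: (v[i] + sum(v[j] for j in neighbours[i])) / (1 + len(neighbours[i]))
             for i in v}
    return v

neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # a path on 4 processors
print(iterate_consensus({0: 0.0, 1: 10.0, 2: 2.0, 3: 8.0}, neighbours))
# after enough rounds, all four values are (approximately) equal
```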

  29. Consensus Computing, Image Processing • CS application: Noise removal in digital images • Does a pixel level represent noise? • Compare neighboring pixels. • If the values differ beyond a threshold, replace the pixel value with the mean or median of the values of its neighbors. • Related application in distributed computing. • Values of faulty processors are replaced by those of neighboring non-faulty ones. • Berman and Garay (1993) use a “parliamentary procedure” called cloture
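
A sketch of the thresholded noise-removal step described above (illustrative; the 3×3 window and threshold are assumed parameters, not from the talk): if a pixel differs from the median of its neighborhood by more than a threshold, it is replaced by that median.

```python
# Illustrative thresholded median noise removal: replace a pixel only when it
# deviates from its 3x3 neighbourhood median by more than a threshold.
import numpy as np

def denoise(img, threshold=50):
    out = img.copy()
    h, w = img.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            med = np.median(img[y - 1:y + 2, x - 1:x + 2])
            if abs(float(img[y, x]) - med) > threshold:
                out[y, x] = med
    return out

img = np.full((5, 5), 100, dtype=np.uint8)
img[2, 2] = 255                     # a single "noisy" pixel
print(denoise(img)[2, 2])           # restored to 100
```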

  30. Computational Intractability of Consensus Functions • Bartholdi, Tovey and Trick: There are voting schemes where it can be computationally intractable to determine who won an election. • Computational intractability can be a good thing in an election: Designing voting systems where it is computationally intractable to “manipulate” the outcome of an election by “insincere voting”: • Adding voters • Declaring voters ineligible • Adding candidates • Declaring candidates ineligible

  31. Electronic Voting • Issues: • Correctness • Anonymity • Availability • Security • Privacy

  32. Electronic Voting • Security Risks in Electronic Voting • Threat of “denial of service” attacks • Threat of penetration attacks involving a delivery mechanism to transport a malicious payload to the target host (through a Trojan horse or remote-control program) • Private and correct counting of votes • Cryptographic challenges to keep votes private • Relevance of work on secure multiparty computation

  33. Electronic Voting • Other CS Challenges: • Resistance to “vote buying” • Development of user-friendly interfaces • Vulnerabilities of communication path between the voting client (where you vote) and the server (where votes are counted) • Reliability issues: random hardware and software failures

  34. Software & Hardware Measurement • Theory of measurement developed by mathematical social scientists • Measurement theory studies ways to combine scores obtained on different criteria. • A statement involving scales of measurement is considered meaningful if its truth or falsity is unchanged under acceptable transformations of all scales involved. • Example: It is meaningful to say that I weigh more than my daughter. • That is because if it is true in kilograms, then it is also true in pounds, in grams, etc.

  35. Software & Hardware Measurement • Measurement theory has studied what statements you can make after averaging scores. • Think of averaging as a consensus method. • One general principle: To say that the average score of one set of tests is greater than the average score of another set of tests is not meaningful (it is meaningless) under certain conditions. • This is often the case if the averaging procedure is to take the arithmetic mean: If s(xi) is the score of xi, i = 1, 2, …, n, then the arithmetic mean is Σi s(xi)/n. • Long literature on what averaging methods lead to meaningful conclusions.

  36. Software & Hardware Measurement • A widely used method in hardware measurement: • Score a computer system on different benchmarks. • Normalize score relative to performance of one base system • Average normalized scores • Pick system with highest average. • Fleming and Wallace (1986): Outcome can depend on choice of base system. • Meaningless in sense of measurement theory • Leads to theory of merging normalized scores
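
The base-dependence problem on slides 36–43 is easy to reproduce. The sketch below uses hypothetical benchmark scores (not the Heath data, which is not reproduced in this transcript) to show that the winner under “arithmetic mean of normalized scores” can flip when the normalization base changes.

```python
# Hypothetical scores (higher = better, assumed) for three machines on two
# benchmarks; the machine with the highest arithmetic mean of normalized scores
# depends on which machine we normalize against.
scores = {"R": [1.0, 10.0],
          "M": [2.0, 5.0],
          "Z": [3.0, 4.0]}

def arithmetic_mean_winner(scores, base):
    means = {m: sum(s / b for s, b in zip(v, scores[base])) / len(v)
             for m, v in scores.items()}
    return max(means, key=means.get), means

print(arithmetic_mean_winner(scores, "R"))   # Z wins when normalizing to R
print(arithmetic_mean_winner(scores, "M"))   # R wins when normalizing to M
```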

  37. Software & Hardware Measurement • Hardware Measurement: [table of raw scores of processors R, M, and Z on benchmarks E, F, G, H, I] Data from Heath, Comput. Archit. News (1984)

  38. Software & Hardware Measurement • Normalize Relative to Processor R: [table of the scores of processors R, M, and Z on benchmarks E–I, each divided by processor R’s score]

  39. Software & Hardware Measurement • Take Arithmetic Mean of Normalized Scores (base R): R = 1.00, M = 1.01, Z = 1.07

  40. Software & Hardware Measurement • Take Arithmetic Mean of Normalized Scores (base R): R = 1.00, M = 1.01, Z = 1.07 • Conclude that machine Z is best

  41. Software & Hardware Measurement • Now Normalize Relative to Processor M: [table of the scores of processors R, M, and Z on benchmarks E–I, each divided by processor M’s score]

  42. Software & Hardware Measurement • Take Arithmetic Mean of Normalized Scores (base M): R = 1.32, M = 1.00, Z = 1.08

  43. Software & Hardware Measurement • Take Arithmetic Mean of Normalized Scores (base M): R = 1.32, M = 1.00, Z = 1.08 • Conclude that machine R is best

  44. Software and Hardware Measurement • So, the conclusion that a given machine is best by taking the arithmetic mean of normalized scores is meaningless in this case. • Above example from Fleming and Wallace (1986), data from Heath (1984) • Sometimes, the geometric mean is helpful. • The geometric mean is (Πi s(xi))^(1/n), the nth root of the product of the scores.
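
With the same hypothetical scores as in the earlier sketch, the geometric mean behaves differently: changing the normalization base multiplies every machine's geometric mean by the same constant, so the ordering cannot change.

```python
# Hypothetical scores as before (illustrative only): the geometric mean of
# normalized scores ranks the machines the same way for every choice of base.
from math import prod

scores = {"R": [1.0, 10.0], "M": [2.0, 5.0], "Z": [3.0, 4.0]}

def geometric_mean_winner(scores, base):
    gms = {m: prod(s / b for s, b in zip(v, scores[base])) ** (1 / len(v))
           for m, v in scores.items()}
    return max(gms, key=gms.get), gms

print(geometric_mean_winner(scores, "R"))   # Z has the highest geometric mean
print(geometric_mean_winner(scores, "M"))   # Z again: the ordering is base-independent
```

(With the actual Heath data on slides 45–46 the geometric-mean winner happens to be R; the point of this hypothetical sketch is only that the ordering does not depend on the base.)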

  45. Software & Hardware Measurement • Normalize Relative to Processor R, Take Geometric Mean of Normalized Scores: R = 1.00, M = .86, Z = .84 • Conclude that machine R is best

  46. Software & Hardware Measurement • Now Normalize Relative to Processor M, Take Geometric Mean of Normalized Scores: R = 1.17, M = 1.00, Z = .99 • Still conclude that machine R is best

  47. Software and Hardware Measurement • In this situation, it is easy to show that the conclusion that a given machine has the highest geometric mean normalized score is a meaningful conclusion. • It is even meaningful to say that a given machine’s geometric mean normalized score is 20% higher than another machine’s. • Fleming and Wallace give general conditions under which comparing geometric means of normalized scores is meaningful. • Research area: what averaging procedures make sense in what situations? Large literature. • Note: There are situations where comparing arithmetic means is meaningful but comparing geometric means is not.

  48. Software and Hardware Measurement • Message from measurement theory to computer science: • Do not perform arithmetic operations on data without paying attention to whether the conclusions you get are meaningful.

  49. CS and SS: Outline • 1. CS and Consensus/Social Choice • 2. CS and Game Theory • 3. Algorithmic Decision Theory

  50. CS and Game Theory • Game theory has a long history in economics; also in operations research and mathematics • Recently, computer scientists have been discovering its relevance to their problems • Increasingly complex games arise in practical applications: auctions, the Internet • Need new game-theoretic methods for CS problems. • Need new CS methods to solve modern game theory problems.
