
Social Choice and Computer Science


Presentation Transcript


  1. Social Choice and Computer Science Fred Roberts, Rutgers University

  2. Thank you to our hosts!

  3. I’m sorry if you’d rather be watching futbol.


  5. Computer Science and the Social Sciences • Many recent applications in CS involve issues/problems of long interest to social scientists (and operations researchers): • preference, utility • conflict and cooperation • allocation • incentives • measurement • social choice or consensus • Methods developed in SS are beginning to be used in CS

  6. CS and SS • CS applications place great strain on SS methods • Sheer size of problems addressed • Computational power of agents an issue • Limitations on information possessed by players • Sequential nature of repeated applications • Thus: Need for new generation of SS methods • Also: These new methods will provide powerful tools to social scientists

  7. Social Choice • Relevant social science problems: voting, group decision making • Goal: based on everyone’s opinions, reach a “consensus” • Typical opinions expressed as: • “first choice” • ranking of all alternatives • scores • classifications • Long history of research on such problems.

  8. Social Choice and CS: Outline • Consensus Rankings • Meta-search and Collaborative Filtering • Computational Approaches to Information Management in Group Decision Making • Large Databases and Inference • Consensus Computing, Image Processing • Computational Intractability of Consensus Functions • Electronic Voting • Software and Hardware Measurement • Power of a Voter


  10. Consensus Rankings • Background: Arrow’s Impossibility Theorem: • There is no “consensus method” that satisfies certain reasonable axioms about how societies should reach decisions. • Input to Arrow’s Theorem: rankings of alternatives (ties allowed). • Output: a consensus ranking. Kenneth Arrow, Nobel prize winner

  11. Consensus Rankings • There are widely studied and widely used consensus methods (that violate one or more of Arrow’s conditions). • One well-known consensus method: “Kemeny-Snell medians”: Given a set of rankings, find a ranking minimizing the sum of distances to the other rankings. • Kemeny-Snell medians are finding surprising new applications in CS. John Kemeny, pioneer in time-sharing in CS

  12. Consensus Rankings • Kemeny-Snell distance between rankings: twice the number of pairs of candidates i and j with i ranked above j in one ranking and below j in the other, plus the number of pairs ranked strictly in one ranking and tied in the other. • Example: ranking a: x > y > z; ranking b: y, z (tied) > x. • On {x,y}: +2 • On {x,z}: +2 • On {y,z}: +1 • d(a,b) = 5.
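The distance is easy to compute directly from this definition. The sketch below is my own illustration, not from the talk; it represents a ranking-with-ties as an ordered list of tiers, each tier a set of tied candidates:

```python
from itertools import combinations

def tier_index(ranking, item):
    # index of the tier (group of tied candidates) containing item
    for i, tier in enumerate(ranking):
        if item in tier:
            return i
    raise ValueError(f"{item!r} not in ranking")

def kemeny_snell(a, b):
    """Kemeny-Snell distance between two rankings-with-ties,
    each given as an ordered list of sets of tied candidates."""
    candidates = [x for tier in a for x in tier]
    d = 0
    for i, j in combinations(candidates, 2):
        ai, aj = tier_index(a, i), tier_index(a, j)
        bi, bj = tier_index(b, i), tier_index(b, j)
        # sign of the comparison in each ranking: -1, 0 (tied), or +1
        sa = (ai > aj) - (ai < aj)
        sb = (bi > bj) - (bi < bj)
        if sa != sb:
            d += 2 if sa == -sb else 1  # opposite strict orders: +2; strict vs. tied: +1
    return d

# The slide's example: a ranks x > y > z; b ties y and z above x.
a = [{"x"}, {"y"}, {"z"}]
b = [{"y", "z"}, {"x"}]
print(kemeny_snell(a, b))  # 5
```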

  13. Consensus Rankings • Kemeny-Snell median: Given rankings a1, a2, …, ap, find a ranking x so that • d(a1,x) + d(a2,x) + … + d(ap,x) • is minimized. • x can be a ranking other than a1, a2, …, ap. • Sometimes just called the Kemeny median.

  14. Consensus Rankings • a1: Fish > Chicken > Beef • a2: Fish > Chicken > Beef • a3: Chicken > Fish > Beef • Median = a1. • If x = a1: • d(a1,x) + d(a2,x) + d(a3,x) = 0 + 0 + 2 = 2 • is minimized. • If x = a3, the sum is 4. • For any other x, the sum is at least 1 + 1 + 1 = 3.
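For three alternatives there are only 13 rankings-with-ties, so the slide's claim can be checked by exhaustive search. A hedged sketch (the dict representation, candidate-to-tier-number, and helper names are my own):

```python
from itertools import combinations

def ks_dist(a, b):
    # Kemeny-Snell distance; a ranking is a dict: candidate -> tier number
    d = 0
    for i, j in combinations(a, 2):
        sa = (a[i] > a[j]) - (a[i] < a[j])
        sb = (b[i] > b[j]) - (b[i] < b[j])
        if sa != sb:
            d += 2 if sa == -sb else 1
    return d

def weak_orders(items, tier=0):
    # enumerate every ranking-with-ties (ordered set partition) of items
    items = list(items)
    if not items:
        yield {}
        return
    n = len(items)
    for mask in range(1, 1 << n):          # choose a nonempty top tier
        top = [items[i] for i in range(n) if (mask >> i) & 1]
        rest = [x for x in items if x not in top]
        for tail in weak_orders(rest, tier + 1):
            yield {**{x: tier for x in top}, **tail}

voters = [
    {"Fish": 0, "Chicken": 1, "Beef": 2},   # a1
    {"Fish": 0, "Chicken": 1, "Beef": 2},   # a2
    {"Chicken": 0, "Fish": 1, "Beef": 2},   # a3
]
median = min(weak_orders(["Fish", "Chicken", "Beef"]),
             key=lambda x: sum(ks_dist(a, x) for a in voters))
print(median)  # {'Fish': 0, 'Chicken': 1, 'Beef': 2}  -- i.e., a1
```

The search confirms the slide: a1 is the unique median, with total distance 2.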

  15. Consensus Rankings • a1: Fish > Chicken > Beef • a2: Chicken > Beef > Fish • a3: Beef > Fish > Chicken • Three medians: a1, a2, a3. • This is the “voter’s paradox” situation.

  16. Consensus Rankings • a1: Fish > Chicken > Beef • a2: Chicken > Beef > Fish • a3: Beef > Fish > Chicken • Note that sometimes we wish to minimize • d(a1,x)² + d(a2,x)² + … + d(ap,x)² • A ranking x that minimizes this is called a Kemeny-Snell mean. • In this example, there is one mean: the ranking declaring all three alternatives tied.

  17. Consensus Rankings • a1: Fish > Chicken > Beef • a2: Chicken > Beef > Fish • a3: Beef > Fish > Chicken • If x is the ranking declaring Fish, Chicken and Beef tied, then • d(a1,x)² + d(a2,x)² + d(a3,x)² = 3² + 3² + 3² = 27. • Not hard to show this is the minimum.
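The same brute-force idea verifies the mean: minimize the sum of squared distances over all 13 rankings-with-ties. Again a sketch with my own representation (candidate-to-tier dicts), not code from the talk:

```python
from itertools import combinations

def ks_dist(a, b):
    # Kemeny-Snell distance; a ranking is a dict: candidate -> tier number
    d = 0
    for i, j in combinations(a, 2):
        sa = (a[i] > a[j]) - (a[i] < a[j])
        sb = (b[i] > b[j]) - (b[i] < b[j])
        if sa != sb:
            d += 2 if sa == -sb else 1
    return d

def weak_orders(items, tier=0):
    # enumerate every ranking-with-ties (ordered set partition) of items
    items = list(items)
    if not items:
        yield {}
        return
    n = len(items)
    for mask in range(1, 1 << n):
        top = [items[i] for i in range(n) if (mask >> i) & 1]
        rest = [x for x in items if x not in top]
        for tail in weak_orders(rest, tier + 1):
            yield {**{x: tier for x in top}, **tail}

# the voter's-paradox profile from the slide
voters = [
    {"Fish": 0, "Chicken": 1, "Beef": 2},   # a1
    {"Chicken": 0, "Beef": 1, "Fish": 2},   # a2
    {"Beef": 0, "Fish": 1, "Chicken": 2},   # a3
]
mean = min(weak_orders(["Fish", "Chicken", "Beef"]),
           key=lambda x: sum(ks_dist(a, x) ** 2 for a in voters))
print(sum(ks_dist(a, mean) ** 2 for a in voters))  # 27
```

The minimizer found is the all-tied ranking, matching the slide's value of 27.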

  18. Consensus Rankings • Theorem (Bartholdi, Tovey, and Trick, 1989; Wakabayashi, 1986): Computing the Kemeny median of a set of rankings is an NP-complete problem.

  19. Consensus Rankings • Okay, so what does this have to do with practical computer science questions?

  20. Consensus Rankings • I mean really practical computer science questions.

  21. Social Choice and CS: Outline • Consensus Rankings • Meta-search and Collaborative Filtering • Computational Approaches to Information Management in Group Decision Making • Large Databases and Inference • Consensus Computing, Image Processing • Computational Intractability of Consensus Functions • Electronic Voting • Software and Hardware Measurement • Power of a Voter

  22. Meta-search and Collaborative Filtering • Meta-search • A consensus problem • Combine page rankings from several search engines • Dwork, Kumar, Naor, Sivakumar (2000): Kemeny-Snell medians have good spam resistance in meta-search (a page spams if it causes the meta-search to rank it too highly) • Approximation methods make this computationally tractable

  23. Meta-search and Collaborative Filtering • Collaborative Filtering • Recommending books or movies • Combine book or movie ratings • Produce ordered list of books or movies to recommend • Freund, Iyer, Schapire, Singer (2003): “Boosting” algorithm for combining rankings. • Related topic: Recommender Systems

  24. Meta-search and Collaborative Filtering • A major difference from SS applications: • In SS applications, number of voters is large, number of candidates is small. • In CS applications, number of voters (search engines) is small, number of candidates (pages) is large. • This makes for major new complications and research challenges.

  25. Social Choice and CS: Outline • Consensus Rankings • Meta-search and Collaborative Filtering • Computational Approaches to Information Management in Group Decision Making • Large Databases and Inference • Consensus Computing, Image Processing • Computational Intractability of Consensus Functions • Electronic Voting • Software and Hardware Measurement • Power of a Voter

  26. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation • Successful group decision making (social choice) requires efficient elicitation of information and efficient representation of the information elicited. • Old problems in the social sciences. • Computational aspects becoming a focal point because of need to deal with massive and complex information.

  27. Computational Approaches to Information Management in Group Decision Making • Representation and Elicitation • Example I: Social scientists study preferences: “I prefer beef to fish” • Extracting and representing preferences is key in group decision making applications.

  28. Computational Approaches to Information Management in Group Decision Making • Representation and Elicitation • “Brute force” approach: For every pair of alternatives, ask which is preferred to the other. • Often computationally infeasible.

  29. Computational Approaches to Information Management in Group Decision Making • Representation and Elicitation • In many applications (e.g., collaborative filtering), it is important to elicit preferences automatically. • CP-nets were introduced as a tool to represent preferences succinctly and provide ways to make inferences about preferences (Boutilier, Brafman, Domshlak, Hoos, Poole 2004).

  30. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation • Example II: combinatorial auctions. • Auctions increasingly used in business and government. • Information technology allows complex auctions with huge numbers of bidders.

  31. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation • Bidding functions maximizing expected profit can be exceedingly difficult to compute. • Determining the winner of an auction can be extremely hard. (Rothkopf, Pekec, Harstad 1998)

  32. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation • Combinatorial Auctions • Multiple goods auctioned off. • Submit bids for combinations of goods. • This leads to NP-complete allocation problems. • Might not even be able to feasibly express all possible preferences for all subsets of goods. • Rothkopf, Pekec, Harstad (1998): determining winner is computationally tractable for many economically interesting kinds of combinations.
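To make the allocation problem concrete, here is a toy brute-force winner-determination solver. It is purely illustrative (all bid data is made up, and the exhaustive search is exponential in the number of bids, which is exactly why the general problem is hard):

```python
from itertools import combinations

def winner_determination(bids):
    """Brute force: try every subset of bids, keep the best revenue
    among subsets whose bundles are pairwise disjoint."""
    best_value, best_sel = 0, []
    for r in range(1, len(bids) + 1):
        for sel in combinations(bids, r):
            bundles = [bundle for bundle, _ in sel]
            # bundles are pairwise disjoint iff no good appears twice
            if sum(len(b) for b in bundles) == len(set().union(*bundles)):
                value = sum(price for _, price in sel)
                if value > best_value:
                    best_value, best_sel = value, list(sel)
    return best_value, best_sel

# hypothetical bids: (bundle of goods, offered price)
bids = [({"a", "b"}, 5), ({"b", "c"}, 7), ({"c"}, 3), ({"a"}, 4)]
value, winners = winner_determination(bids)
print(value)  # 11 -- sell {"b","c"} for 7 and {"a"} for 4
```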

  33. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation Combinatorial Auctions • Decision maker needs to elicit preferences from all agents for all plausible combinations of items in the auction. • Similar problem arises in optimal bundling of goods and services. • Elicitation requires exponentially many queries in general.

  34. Computational Approaches to Information Management in Group Decision Making Representation and Elicitation • Challenge: Recognize situations in which efficient elicitation and representation are possible. • One result: Fishburn, Pekec, Reeds (2002) • Even more complicated: when the objects in the auction have complex structure. • This problem arises in: • Legal reasoning, sequential decision making, automatic decision devices, collaborative filtering.

  35. Social Choice and CS: Outline • Consensus Rankings • Meta-search and Collaborative Filtering • Computational Approaches to Information Management in Group Decision Making • Large Databases and Inference • Consensus Computing, Image Processing • Computational Intractability of Consensus Functions • Electronic Voting • Software and Hardware Measurement • Power of a Voter

  36. Large Databases and Inference • Real data often in form of sequences • Here, concentrate on bioinformatics • GenBank has over 7 million sequences comprising 8.6 billion bases. • The search for similarity or patterns has extended from pairs of sequences to finding patterns that appear in common in a large number of sequences or throughout the database: “consensus sequences”. • Emerging field of “Bioconsensus”: applies SS consensus methods to biological databases.

  37. Large Databases and Inference Why look for such patterns? Similarities between sequences or parts of sequences lead to the discovery of shared phenomena. For example, it was discovered that the sequence for platelet-derived growth factor, which causes growth in the body, is 87% identical to the sequence for v-sis, a cancer-causing gene. This led to the discovery that v-sis works by stimulating growth.

  38. Large Databases and Inference Example Bacterial Promoter Sequences studied by Waterman (1989): RRNABP1: ACTCCCTATAATGCGCCA TNAA: GAGTGTAATAATGTAGCC UVRBP2: TTATCCAGTATAATTTGT SFC: AAGCGGTGTTATAATGCC Notice that if we are looking for patterns of length 4, each sequence has the pattern TAAT.



  41. Large Databases and Inference Example However, suppose that we add another sequence: M1 RNA: AACCCTCTATACTGCGCG The pattern TAAT does not appear here. However, it almost appears, since the pattern TACT appears, and this has only one mismatch from the pattern TAAT. So, in some sense, the pattern TAAT is a good consensus pattern.

  42. Large Databases and Inference Example We make this precise using best mismatch distance. Consider two sequences a and b with b longer than a. Then d(a,b) is the smallest number of mismatches in all possible alignments of a as a consecutive subsequence of b.

  43. Large Databases and Inference Example a = 0011, b = 111010. There are three alignments of a as a consecutive subsequence of b (starting at positions 1, 2, and 3 of b), with 3, 3, and 2 mismatches respectively. The best-mismatch distance is 2, achieved in the third alignment.
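The best-mismatch distance is a one-liner to sketch in code (my own illustration; it assumes, as in the slide, that a is no longer than b):

```python
def best_mismatch(a, b):
    """Best-mismatch distance d(a,b): fewest mismatches over all
    alignments of the shorter word a as a consecutive block of b."""
    return min(sum(x != y for x, y in zip(a, b[i:]))
               for i in range(len(b) - len(a) + 1))

print(best_mismatch("0011", "111010"))  # 2

# The promoter sequences from the earlier slides contain TAAT exactly,
# while M1 RNA is at distance 1 from TAAT (via its TACT).
print(best_mismatch("TAAT", "ACTCCCTATAATGCGCCA"))   # 0
print(best_mismatch("TAAT", "AACCCTCTATACTGCGCG"))   # 1
```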

  44. Large Databases and Inference • Smith-Waterman • Let Σ be a finite alphabet of size at least 2 and Π be a finite collection of words of length L over Σ. • Let F(Π) be the set of words of length k ≥ 2 that are our consensus patterns. (Assume L ≥ k.) • Let Π = {a1, a2, …, an}. • One way to define F(Π) is as follows. • Let d(a,b) be the best-mismatch distance. • Consider nonnegative parameters s_d that are monotone decreasing with d and let F(a1, a2, …, an) be all those words w of length k that maximize • S(w) = Σi s_{d(w,ai)}

  45. Large Databases and Inference • We call such an F a Smith-Waterman consensus. • In particular, Waterman and others use the parameters • s_d = (k−d)/k. • Example: • An alphabet used frequently is the purine/pyrimidine alphabet {R,Y}, where R = A (adenine) or G (guanine) and Y = C (cytosine) or T (thymine). • For simplicity, it is easier to use the digits 0,1 rather than the letters R,Y. • Thus, let Σ = {0,1} and k = 2. Then the possible pattern words are 00, 01, 10, 11.

  46. Large Databases and Inference • Suppose a1 = 111010, a2 = 111111. How do we find F(a1,a2)? • We have: • d(00,a1) = 1, d(00,a2) = 2 • d(01,a1) = 0, d(01,a2) = 1 • d(10,a1) = 0, d(10,a2) = 1 • d(11,a1) = 0, d(11,a2) = 0 • S(00) = Σ s_{d(00,ai)} = s1 + s2 • S(01) = Σ s_{d(01,ai)} = s0 + s1 • S(10) = Σ s_{d(10,ai)} = s0 + s1 • S(11) = Σ s_{d(11,ai)} = s0 + s0 • As long as s0 > s1 > s2, it follows that 11 is the consensus pattern, according to the Smith-Waterman consensus.
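With the weights s_d = (k−d)/k, the consensus can be found by scoring all |Σ|^k candidate patterns. A sketch (my own code, not from the talk; it uses exact rational arithmetic so score comparisons are safe) that reproduces this example and the three-word example that follows:

```python
from fractions import Fraction
from itertools import product

def best_mismatch(a, b):
    # fewest mismatches over all alignments of a inside the longer word b
    return min(sum(x != y for x, y in zip(a, b[i:]))
               for i in range(len(b) - len(a) + 1))

def sw_consensus(words, alphabet, k):
    """Smith-Waterman consensus patterns of length k, using the
    weights s_d = (k - d)/k."""
    def score(w):
        return sum(Fraction(k - best_mismatch(w, a), k) for a in words)
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    top = max(map(score, patterns))
    return sorted(w for w in patterns if score(w) == top)

print(sw_consensus(["111010", "111111"], "01", 2))            # ['11']
print(sw_consensus(["000000", "100000", "111110"], "01", 3))  # ['000', '100']
```

The second call shows the tie noted on the next slide: with these weights, 000 and 100 score equally.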

  47. Example: • Let Σ = {0,1}, k = 3, and consider F(a1,a2,a3) where • a1 = 000000, a2 = 100000, a3 = 111110. The possible pattern words are: 000, 001, 010, 011, 100, 101, 110, 111. • d(000,a1) = 0, d(000,a2) = 0, d(000,a3) = 2, • d(001,a1) = 1, d(001,a2) = 1, d(001,a3) = 2, • d(100,a1) = 1, d(100,a2) = 0, d(100,a3) = 1, etc. • S(000) = s2 + 2s0, S(001) = s2 + 2s1, S(100) = 2s1 + s0, etc. • Now, s0 > s1 > s2 implies that S(000) > S(001). • Similarly, one shows that the score is maximized by S(000) or S(100). • Monotonicity doesn’t say which of these is highest.

  48. Large Databases and Inference • The Special Case s_d = (k−d)/k • Then it is easy to show that the words w that maximize S(w) are exactly the words w that minimize • Σi d(w,ai). • In other words: in this case, the Smith-Waterman consensus is exactly the median. • Algorithms for computing consensus sequences such as Smith-Waterman are important in modern molecular biology.

  49. Large Databases and Inference • Other Topics in “Bioconsensus” • Alternative phylogenies (evolutionary trees) are produced using different methods and we need to choose a consensus tree. • Alternative taxonomies (classifications) are produced using different models and we need to choose a consensus taxonomy. • Alternative molecular sequences are produced using different criteria or different algorithms and we need to choose a consensus sequence. • Alternative sequence alignments are produced and we need to choose a consensus alignment.

  50. Large Databases and Inference • Other Topics in “Bioconsensus” • Several recent books on bioconsensus: • Day and McMorris [2003] • Janowitz et al. [2003] • Bibliography compiled by Bill Day: in molecular biology alone, hundreds of papers use consensus methods. • Large database problems in CS are being approached using methods of “bioconsensus” that have their origin in the social sciences.
