
Exam Review Session






Presentation Transcript


  1. Exam Review Session William Cohen

  2. Outline - what you had questions on • Randomized Algorithms • Graph Algorithms • Spectral Methods • Similarity Joins • First-Order Methods • Hint: review coding projects like naïve Bayes, phrases algorithm, approx PR…

  3. General hints in studying • Understand what you’ve done and why • There will be questions that test your understanding of the techniques implemented • why will/won’t this shortcut work? • what does the analysis say about the method? • I tend not to ask a lot about my own work • But it does tell you what I think is important! • Techniques covered in class but not assignments: • When/where/how to use them • That usually includes understanding the analytic results presented in class • E.g.: the derivation of the lazy logreg regularizer, vs. just the update rule

  4. General hints in exam taking • You can bring in one 8 ½ by 11” sheet • Look over everything quickly and skip around • probably nobody will know everything on the test • If you’re not sure what we’ve got in mind: state your assumptions clearly in your answer.

  5. Randomized Algorithms

  6. Randomized Algorithms • Hash kernels • Count-min sketch • Bloom filters • LSH • What are they, and what are they used for? When would you use which one? • Why do they work - i.e., what analytic results have we looked at?

  7. Bloom filters - review • An implementation • Allocate M bits, bit[0], …, bit[M-1] • Pick K hash functions hash(1,s), hash(2,s), … • E.g.: hash(i,s) = hash(s + randomString[i]) • To add string s: • For i=1 to K, set bit[hash(i,s)] = 1 • To check contains(s): • For i=1 to K, test bit[hash(i,s)] • Return “true” if they’re all set; otherwise, return “false” • We’ll discuss how to set M and K soon, but for now: • Let M = 1.5*maxSize // less than two bits per item! • Let K = 2*log(1/p) // about right with this M
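
A minimal Python sketch of the implementation above. The salted-hash trick, the md5 choice, and the demo strings are illustrative assumptions; M and K just follow the rules of thumb on the slide.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, max_size, p=0.01):
        # Rules of thumb from the slide: M = 1.5 * maxSize bits, K = 2 * log(1/p) hashes.
        self.M = int(1.5 * max_size)
        self.K = max(1, int(round(2 * math.log(1.0 / p))))
        self.bits = [0] * self.M

    def _hash(self, i, s):
        # hash(i, s) = hash(s + randomString[i]); here the "random string" is just str(i).
        digest = hashlib.md5((s + "#" + str(i)).encode()).hexdigest()
        return int(digest, 16) % self.M

    def add(self, s):
        for i in range(self.K):
            self.bits[self._hash(i, s)] = 1

    def contains(self, s):
        return all(self.bits[self._hash(i, s)] for i in range(self.K))

bf = BloomFilter(max_size=1000)
bf.add("apple")
print(bf.contains("apple"), bf.contains("banana"))  # True, (almost certainly) False
```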

  8. from: Minos Garofalakis — CM Sketch Structure • Each string s is mapped to one bucket per row: an update <s, +c> adds c to CM[k, hk(s)] for each row k = 1…d • Estimate A[j] by taking mink { CM[k, hk(j)] } • Errors are always over-estimates • Sizes: d = log 1/δ rows, w = 2/ε columns ⇒ error is usually less than ε||A||1 — A Quick Intro to Data Stream Algorithmics – CS262
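
A minimal Python sketch of the structure above: d rows of w counters, one hash function per row, updates add c to one bucket per row, and point queries take the min over rows. The hashing scheme and the parameter rounding are illustrative assumptions.

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, eps=0.01, delta=0.01):
        self.w = int(math.ceil(2.0 / eps))               # width: w = 2/eps
        self.d = int(math.ceil(math.log(1.0 / delta)))   # depth: d = log 1/delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, k, s):
        digest = hashlib.md5(("%d:%s" % (k, s)).encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, s, c=1):
        # Update <s, +c>: add c to one bucket per row
        for k in range(self.d):
            self.table[k][self._hash(k, s)] += c

    def query(self, s):
        # Estimate A[s] = min_k CM[k, h_k(s)]; always an over-estimate
        return min(self.table[k][self._hash(k, s)] for k in range(self.d))
```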

  9. from: Minos Garofalakis — CM Sketch Guarantees • [Cormode, Muthukrishnan’04] The CM sketch guarantees that, with probability at least 1−δ, the approximation error on a point query is less than ε||A||1, using space O(1/ε log 1/δ) • This is sometimes enough: • Estimating a multinomial: if A[s] = Pr(s|…) then ||A||1 = 1 • Multiclass classification: if Ax[s] = Pr(x in class s) then ||Ax||1 is probably small, since most x’s will be in only a few classes — A Quick Intro to Data Stream Algorithmics – CS262

  10. An Application of a Count-Min Sketch • Problem: find the semantic orientation of a word (positive or negative) using a large corpus. • Idea: • positive words co-occur more frequently than expected near positive words; likewise for negative words • so pick a few pos/neg seeds, count how often x appears near each seed y, and compute an orientation score from those counts
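
The slide’s scoring formula isn’t reproduced here, so the sketch below only illustrates the counting idea: tally, in two count-min sketches (using the CountMinSketch class sketched above), how often each word appears near a positive vs. a negative seed, and compare the counts. The seed lists, window size, and log-ratio score are assumptions for illustration, not the slide’s formula.

```python
import math

# Hypothetical seed lists for illustration.
POS_SEEDS = {"good", "excellent"}
NEG_SEEDS = {"bad", "terrible"}

def count_cooccurrences(tokens, sketch_pos, sketch_neg, window=4):
    # For each word, record whether it occurs near a positive or negative seed.
    for i, w in enumerate(tokens):
        context = tokens[max(0, i - window): i + window + 1]
        if any(s in context for s in POS_SEEDS):
            sketch_pos.add(w)
        if any(s in context for s in NEG_SEEDS):
            sketch_neg.add(w)

def orientation(word, sketch_pos, sketch_neg):
    # Positive score => word co-occurs more with positive seeds than negative ones.
    return math.log((1 + sketch_pos.query(word)) / (1 + sketch_neg.query(word)))

tokens = "the movie was good and the acting was excellent not bad at all".split()
pos, neg = CountMinSketch(), CountMinSketch()
count_cooccurrences(tokens, pos, neg)
print(orientation("movie", pos, neg))
```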

  11. LSH: key ideas • Goal: • map feature vector x to bit vector bx • ensure that bx preserves “similarity” • Basic idea: use random projections of x • Repeat many times: • Pick a random hyperplane r by picking random weights for each feature (say from a Gaussian) • Compute the inner product of r with x • Record whether x is “close to” r (r.x >= 0) • that’s the next bit in bx • Theory says that if x’ and x have small cosine distance then bx and bx’ will have small Hamming distance
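
A minimal sketch of the random-projection idea above: draw Gaussian random hyperplanes, take one bit per hyperplane, and compare signatures by Hamming distance. The dimensions and the small perturbation used in the demo are arbitrary choices for illustration.

```python
import numpy as np

def lsh_signature(x, hyperplanes):
    # One bit per random hyperplane r: bit = 1 iff r.x >= 0
    return (hyperplanes @ x >= 0).astype(np.uint8)

def hamming(b1, b2):
    return int(np.sum(b1 != b2))

d, n_bits = 100, 64
rng = np.random.default_rng(0)
R = rng.normal(size=(n_bits, d))      # random weights for each feature, per hyperplane

x  = rng.normal(size=d)
x2 = x + 0.05 * rng.normal(size=d)    # a vector with small cosine distance to x

bx, bx2 = lsh_signature(x, R), lsh_signature(x2, R)
print(hamming(bx, bx2))               # small Hamming distance, as the theory predicts
```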

  12. LSH applications • Compact storage of data • and we can still compute similarities • LSH also gives very fast approximations: • approx nearest neighbor method • just look at other items with bx’=bx • also very fast nearest-neighbor methods for Hamming distance • very fast clustering • cluster = all things with same bx vector

  13. Graph Algorithms

  14. Graph Algorithms • Properties of large data graphs • Spectral algorithms - multiple questions • APR algorithm - how and why it works • when we’d use it • what the analysis says

  15. Degree distribution • Plot cumulative degree • X axis is degree • Y axis is #nodes that have degree at least k • Typically use a log-log scale • Straight lines are a power law; normal curve dives to zero at some point • Left: trust network in epinions web site from Richardson & Domingos
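
A small sketch of how such a plot could be produced, assuming the graph is a dict of adjacency lists and matplotlib is available (both are assumptions for illustration):

```python
import matplotlib.pyplot as plt
from collections import Counter

def plot_cumulative_degree(adj):
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = Counter(degrees)
    ks = sorted(counts)
    # y value for degree k = number of nodes with degree at least k
    at_least, total, seen = [], len(degrees), 0
    for k in ks:
        at_least.append(total - seen)
        seen += counts[k]
    plt.loglog(ks, at_least, marker="o")   # log-log scale: a power law looks like a straight line
    plt.xlabel("degree k")
    plt.ylabel("# nodes with degree >= k")
    plt.show()
```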

  16. Homophily • Another def’n: excess edges between common neighbors of v

  17. Approximate PageRank: Algorithm

  18. Approximate PageRank: Key Idea • By definition PageRank is the fixed point of: pr(α, s) = α·s + (1−α)·pr(α, s)·W • Claim: pr(α, s) = α·s + (1−α)·pr(α, sW), i.e., recursively compute PageRank of the “neighbors of s” (= sW), then adjust • Key idea in apr: • do this “recursive step” repeatedly • focus on nodes where finding PageRank from neighbors will be useful

  19. Analysis • By linearity of pr(α, ·), re-grouping, and linearity again: pr(α, r − r(u)χu) + (1−α)·pr(α, r(u)χuW) = pr(α, r − r(u)χu + (1−α)·r(u)χuW)
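
A minimal sketch of the push-style loop this analysis justifies: keep an α fraction of the residual at u, spread the rest over u’s out-neighbors (the r(u)χuW term), and only push nodes whose residual is still large. This is one common variant with uniform out-edge weights and no lazy walk, not necessarily the exact pseudocode from the slides.

```python
from collections import defaultdict

def approx_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    # adj: dict node -> list of out-neighbors; seed: the start node s
    p = defaultdict(float)                  # approximate PageRank estimate
    r = defaultdict(float, {seed: 1.0})     # residual vector, initialized to chi_s
    frontier = [seed]
    while frontier:
        u = frontier.pop()
        out = adj.get(u, [])
        if not out or r[u] <= eps * len(out):
            continue                        # only push nodes with a large residual
        p[u] += alpha * r[u]                # keep an alpha fraction of the residual at u
        share = (1 - alpha) * r[u] / len(out)   # spread the rest over u's out-neighbors
        r[u] = 0.0
        for v in out:
            r[v] += share
            frontier.append(v)              # v may now be worth pushing too
    return dict(p)

adj = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
print(approx_pagerank(adj, "a"))
```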

  20. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • W: normalized so columns sum to 1 • [Figure: small example graph with nodes A–J]

  21. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • Q: How do I pick v to be an eigenvector for a block-stochastic matrix?

  22. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • [Figure: eigenvalue spectrum λ1, λ2, λ3, λ4, λ5,… with eigenvectors e1, e2, e3; the “eigengap” separates the informative eigenvalues from the rest] [Shi & Meila, 2002]

  23. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • [Figure: scatter plot of nodes by their e2 and e3 coordinates; nodes from the same cluster (x, y, z) land close together] [Shi & Meila, 2002]
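
One way to make the picture above concrete: compute a few top eigenvectors of the row-normalized W and use each node’s entries in e2, e3, … as coordinates, so nodes in the same block land near each other. A minimal sketch assuming W = D⁻¹A for a small symmetric adjacency matrix; the example graph is made up and this is not the exact procedure from the slides.

```python
import numpy as np

def spectral_embed(A, k=2):
    # A: symmetric 0/1 adjacency matrix; W = D^-1 A is the row-normalized walk matrix
    W = A / A.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)        # sort eigenvalues lambda_1 >= lambda_2 >= ...
    # Skip e1 (the trivial, near-constant eigenvector); use e2, e3, ... as coordinates
    return vecs[:, order[1:k + 1]].real

# A tiny two-block graph: nodes 0-2 form one clique, 3-5 another, with one bridge edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_embed(A, k=2))   # rows for nodes 0-2 cluster together, as do rows for 3-5
```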

  24. Spectral Clustering: Pros and Cons • Elegant, and well-founded mathematically • Tends to avoid local minima • Optimal solution to relaxed version of mincut problem (Normalized cut, aka NCut) • Works quite well when relations are approximately transitive (like similarity, social connections) • Expensive for very large datasets • Computing eigenvectors is the bottleneck • Approximate eigenvector computation not always useful • Noisy datasets sometimes cause problems • Picking number of eigenvectors and k is tricky • “Informative” eigenvectors need not be in top few • Performance can drop suddenly from good to terrible

  25. • D − A is the Laplacian • I − W is the random-walk Laplacian (NCut); other variants lead to other spectral clustering methods • For a 0/1 indicator vector y, yT(D − A)y = size of CUT(y) • NCut: roughly, minimize the ratio of transitions between classes vs. transitions within classes
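
A tiny numeric check of the identity above: for a 0/1 indicator vector y, yT(D − A)y counts exactly the edges crossing the cut. The small adjacency matrix is made up for illustration.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                  # the (unnormalized) Laplacian

y = np.array([1, 1, 0, 0], dtype=float)    # indicator of one side of the cut
print(y @ L @ y)                           # 2.0 = number of edges crossing the cut
```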

  26. Repeated averaging with neighbors on a sample problem… • Create a graph, connecting all points in the 2-D initial space to all other points • Weighted by distance • Run power iteration (averaging) for 10 steps • Plot node id x vs v10(x) • nodes are ordered and colored by actual cluster number (blue, green, red) • [Figure: v10(x) vs node id; nodes from the same cluster take similar values] • this graph is sometimes called a data manifold
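
A minimal sketch of the repeated-averaging procedure above: build a fully connected, distance-weighted graph over the 2-D points, row-normalize it, and run a few steps of v ← Wv. The Gaussian weighting and the step count are illustrative assumptions.

```python
import numpy as np

def power_iteration_embedding(X, steps=10, sigma=1.0):
    # X: n x 2 array of 2-D points; connect all pairs, weighted by distance
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    W = W / W.sum(axis=1, keepdims=True)    # row-normalize: averaging with neighbors
    v = np.random.default_rng(0).uniform(size=len(X))
    for _ in range(steps):
        v = W @ v                            # v_{t+1}(x) = weighted average of v_t over x's neighbors
    return v                                 # plot node id vs v[x]; clusters show up as plateaus
```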

  27. Experimental results: best-case assignment of class labels to clusters

  28. Similarity Joins

  29. Similarity Joins • Key ideas • indexing to find candidates quickly • special properties of TFIDF • short-cuts, upper bounds and A* search

  30. Build index on-the-fly • When finding matches for x consider y before x in ordering • Keep x[i] in inverted index for i • so you can find dot product dot(x,y) without using y
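
A minimal sketch of the on-the-fly index above for sparse (e.g., unit-normalized TFIDF) vectors stored as dicts: when x is processed, the index contains only earlier vectors, and dot(x, y) is accumulated from the postings so y itself never has to be re-read. The threshold and the data format are assumptions for illustration.

```python
from collections import defaultdict

def all_pairs_scores(vectors, threshold=0.5):
    # vectors: list of dicts {feature: weight}, assumed L2-normalized (so dot = cosine)
    index = defaultdict(list)           # feature i -> list of (id, weight) postings
    matches = []
    for x_id, x in enumerate(vectors):
        scores = defaultdict(float)     # candidate y -> partial dot(x, y)
        for i, w in x.items():
            for y_id, w_y in index[i]:
                scores[y_id] += w * w_y
        for y_id, s in scores.items():
            if s >= threshold:
                matches.append((x_id, y_id, s))
        for i, w in x.items():          # index x only after scoring it, so each pair is found once
            index[i].append((x_id, w))
    return matches
```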

  31. Build index on-the-fly • only index enough of x so that you can be sure to find it • maxweighti(V) * x[i] >= best score for matching on feature i, where x[i] is in x’, the unindexed part of x • the best possible score of things only reachable by non-indexed features must be < t • so the total mass of what you index needs to be large enough • correction: • indexes no longer have enough info to compute dot(x,y) • ordering common→rare features is a heuristic (any order is ok)

  32. First-Order Learning

  33. First-order learning • Expressiveness: when should you use it? • Analytics: what’s the cost of the operations?

  34. General hints in studying • Understand what you’ve done and why • There will be questions that test your understanding of the techniques implemented • why will/won’t this shortcut work? • what does the analysis say about the method? • I tend not to ask a lot about my own work • But it does tell you what I think is important! • Techniques covered in class but not assignments: • When/where/how to use them • That usually includes understanding the analytic results presented in class • E.g.: the derivation of the lazy logreg regularizer, vs. just the update rule

  35. for example Learning to Reason with Extracted Information William W. Cohen, Carnegie Mellon University joint work with: William Wang, Kathryn Rivard Mazaitis, Stephen Muggleton, Tom Mitchell, Ni Lao, Richard Wang, Frank Lin, Ni Lao, Estevam Hruschka, Jr., Burr Settles, Partha Talukdar, Derry Wijaya, Edith Law, Justin Betteridge, Jayant Krishnamurthy, Bryan Kisiel, Andrew Carlson, Weam Abu Zaki, Bhavana Dalvi, Malcolm Greaves, Lise Getoor, Jay Pujara, Hui Miao, …

  36. for example Motivation • MLNs (and comparable probabilistic first-order logics) are very general tools for constructing learning algorithms • But: they’re computationally expensive • converting to Markov nets: O(n^k) • where k is predicate arity, n is the size of the database (#facts about the problem) • inference in Markov nets (even small ones) is intractable • and inference really needs to be in the inner loop of the learner

  37. for example Motivation • What would a tractable version of MLNs look like? • inference would have to be constrained • MLNs allow: (a ^ b → c V d) == (~a V ~b V c V d) • Horn clauses: (a ^ b → c) == (~a V ~b V c) • but that’s not enough: • even binary (a → b) clauses become hard to evaluate as MLNs • you’d have to build a small “network” (or something like it) from a large database • how?

  38. for example Motivation • What would a tractable version of MLNs look like? • would it still be rich enough to be useful?

  39. Markov Logic • Logical language: First-order logic • Probabilistic language: Markov networks • Syntax: First-order formulas with weights • Semantics: Templates for Markov net features • Learning: • Parameters: Generative or discriminative • Structure: ILP with arbitrary clauses and MAP score • Inference: • MAP: Weighted satisfiability • Marginal: MCMC with moves proposed by SAT solver • Partial grounding + Lazy inference

  40. Markov Networks: [Review] • Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough) • Potential functions defined over cliques: P(x) = (1/Z) ∏c φc(xc), where Z normalizes over all joint states
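
A tiny numeric version of the definition above: the joint probability is a product of clique potentials divided by Z. The clique structure (one potential per pair) and the potential values below are made up for illustration.

```python
import itertools

# Illustrative pairwise cliques over the slide's variables; the values are made up.
phi_sc = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # phi(Smoking, Cancer)
phi_ac = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 6.0}   # phi(Asthma, Cough)

def unnormalized(s, c, a, k):
    # Product of potential functions over the cliques
    return phi_sc[(s, c)] * phi_ac[(a, k)]

Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))
print(unnormalized(1, 1, 0, 0) / Z)   # P(Smoking=1, Cancer=1, Asthma=0, Cough=0)
```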

  41. Inference in Markov Networks • Goal: Compute marginals & conditionals of the joint P(X) • Exact inference is #P-complete • Conditioning on the Markov blanket is easy: the conditional of Xi depends only on the potentials over cliques containing Xi • Gibbs sampling exploits this
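
A minimal sketch of the Gibbs step the slide refers to: resample one variable at a time from its conditional, which only requires the potentials over cliques containing that variable (its Markov blanket). The toy model and potential values are illustrative assumptions, not from the slide.

```python
import random

def gibbs_step(state, var_cliques):
    # state: dict var -> 0/1
    # var_cliques: var -> list of (clique_vars, phi) for the cliques containing var
    for v in state:
        weights = []
        for val in (0, 1):
            state[v] = val
            w = 1.0
            for clique_vars, phi in var_cliques[v]:   # only cliques touching v matter
                w *= phi[tuple(state[u] for u in clique_vars)]
            weights.append(w)
        state[v] = 1 if random.random() < weights[1] / (weights[0] + weights[1]) else 0
    return state

# Toy usage with a single potential over (Smoking, Cancer); values are illustrative.
phi_sc = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
var_cliques = {"Smoking": [(("Smoking", "Cancer"), phi_sc)],
               "Cancer":  [(("Smoking", "Cancer"), phi_sc)]}
state = {"Smoking": 0, "Cancer": 0}
for _ in range(1000):
    state = gibbs_step(state, var_cliques)
print(state)
```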

  42. Markov Logic • Logical language: First-order logic • Probabilistic language: Markov networks • Syntax: First-order formulas with weights • Semantics: Templates for Markov net features • Learning: • Parameters: Generative or discriminative • Structure: ILP with arbitrary clauses and MAP score • Inference: • MAP: Weighted satisfiability • Marginal: MCMC with moves proposed by SAT solver • Partial grounding + Lazy inference

  43. MCMC is Insufficient for Logic • Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow • Solution: Combine MCMC and WalkSAT → MC-SAT algorithm [Poon & Domingos, 2006]

  44. Relation to Statistical Models • Special cases (obtained by making all predicates zero-arity): Markov networks, Markov random fields, Bayesian networks, log-linear models, exponential models, max. entropy models, Gibbs distributions, Boltzmann machines, logistic regression, hidden Markov models, conditional random fields • Markov logic allows objects to be interdependent (non-i.i.d.)

  45. Q/A
