
Exam Review Session






Presentation Transcript


  1. Exam Review Session William Cohen

  2. Outline - what you had questions on • Randomized Algorithms • Graph Algorithms • Spectral Methods • Similarity Joins • First-Order Methods • Hint: review coding projects like naïve Bayes, phrases algorithm, approx PR…

  3. General hints in studying • Understand what you’ve done and why • There will be questions that test your understanding of the techniques implemented • why will/won’t this shortcut work? • what does the analysis say about the method? • I tend not to ask a lot about my own work • But it does tell you what I think is important! • Techniques covered in class but not assignments: • When/where/how to use them • That usually includes understanding the analytic results presented in class • E.g.: the derivation of the lazy logreg regularizer, vs. just the update rule

  4. General hints in exam taking • You can bring in one 8 ½ by 11” sheet • Look over everything quickly and skip around • probably nobody will know everything on the test • If you’re not sure what we’ve got in mind: state your assumptions clearly in your answer.

  5. Randomized Algorithms

  6. Randomized Algorithms • Hash kernels • Count-min sketch • Bloom filters • LSH • What are they, and what are they used for? When would you use which one? • Why do they work - i.e., what analytic results have we looked at?

  7. Bloom filters - review • An implementation • Allocate M bits, bit[0], …, bit[M-1] • Pick K hash functions hash(1,s), hash(2,s), … • E.g.: hash(i,s) = hash(s + randomString[i]) • To add string s: • For i=1 to K, set bit[hash(i,s)] = 1 • To check contains(s): • For i=1 to K, test bit[hash(i,s)] • Return “true” if they’re all set; otherwise, return “false” • We’ll discuss how to set M and K soon, but for now: • Let M = 1.5*maxSize // less than two bits per item! • Let K = 2*log(1/p) // about right with this M
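
A minimal Python sketch of the implementation above. The salted-hash trick, the md5 choice, and the demo strings are illustrative assumptions; M and K just follow the rules of thumb on the slide.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, max_size, p=0.01):
        # Rules of thumb from the slide: M = 1.5 * maxSize bits, K = 2 * log(1/p) hashes.
        self.M = int(1.5 * max_size)
        self.K = max(1, int(round(2 * math.log(1.0 / p))))
        self.bits = [0] * self.M

    def _hash(self, i, s):
        # hash(i, s) = hash(s + randomString[i]); here the "random string" is just str(i).
        digest = hashlib.md5((s + "#" + str(i)).encode()).hexdigest()
        return int(digest, 16) % self.M

    def add(self, s):
        for i in range(self.K):
            self.bits[self._hash(i, s)] = 1

    def contains(self, s):
        return all(self.bits[self._hash(i, s)] for i in range(self.K))

bf = BloomFilter(max_size=1000)
bf.add("apple")
print(bf.contains("apple"), bf.contains("banana"))  # True, (almost certainly) False
```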

  8. from: Minos Garofalakis — CM Sketch Structure • Each string s is mapped to one bucket per row: an update <s, +c> adds c to CM[k, hk(s)] for each row k = 1…d • Estimate A[j] by taking mink { CM[k, hk(j)] } • Errors are always over-estimates • Sizes: d = log 1/δ rows, w = 2/ε columns ⇒ error is usually less than ε||A||1 — A Quick Intro to Data Stream Algorithmics – CS262
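
A minimal Python sketch of the structure above: d rows of w counters, one hash function per row, updates add c to one bucket per row, and point queries take the min over rows. The hashing scheme and the parameter rounding are illustrative assumptions.

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, eps=0.01, delta=0.01):
        self.w = int(math.ceil(2.0 / eps))               # width: w = 2/eps
        self.d = int(math.ceil(math.log(1.0 / delta)))   # depth: d = log 1/delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, k, s):
        digest = hashlib.md5(("%d:%s" % (k, s)).encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, s, c=1):
        # Update <s, +c>: add c to one bucket per row
        for k in range(self.d):
            self.table[k][self._hash(k, s)] += c

    def query(self, s):
        # Estimate A[s] = min_k CM[k, h_k(s)]; always an over-estimate
        return min(self.table[k][self._hash(k, s)] for k in range(self.d))
```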

  9. from: Minos Garofalakis — CM Sketch Guarantees • [Cormode, Muthukrishnan’04] The CM sketch guarantees that, with probability at least 1−δ, the approximation error on a point query is less than ε||A||1, using space O(1/ε log 1/δ) • This is sometimes enough: • Estimating a multinomial: if A[s] = Pr(s|…) then ||A||1 = 1 • Multiclass classification: if Ax[s] = Pr(x in class s) then ||Ax||1 is probably small, since most x’s will be in only a few classes — A Quick Intro to Data Stream Algorithmics – CS262

  10. An Application of a Count-Min Sketch • Problem: find the semantic orientation of a word (positive or negative) using a large corpus. • Idea: • positive words co-occur more frequently than expected near positive words; likewise for negative words • so pick a few pos/neg seeds, count how often x appears near each seed y, and compute an orientation score from those counts
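
The slide’s scoring formula isn’t reproduced here, so the sketch below only illustrates the counting idea: tally, in two count-min sketches (using the CountMinSketch class sketched above), how often each word appears near a positive vs. a negative seed, and compare the counts. The seed lists, window size, and log-ratio score are assumptions for illustration, not the slide’s formula.

```python
import math

# Hypothetical seed lists for illustration.
POS_SEEDS = {"good", "excellent"}
NEG_SEEDS = {"bad", "terrible"}

def count_cooccurrences(tokens, sketch_pos, sketch_neg, window=4):
    # For each word, record whether it occurs near a positive or negative seed.
    for i, w in enumerate(tokens):
        context = tokens[max(0, i - window): i + window + 1]
        if any(s in context for s in POS_SEEDS):
            sketch_pos.add(w)
        if any(s in context for s in NEG_SEEDS):
            sketch_neg.add(w)

def orientation(word, sketch_pos, sketch_neg):
    # Positive score => word co-occurs more with positive seeds than negative ones.
    return math.log((1 + sketch_pos.query(word)) / (1 + sketch_neg.query(word)))

tokens = "the movie was good and the acting was excellent not bad at all".split()
pos, neg = CountMinSketch(), CountMinSketch()
count_cooccurrences(tokens, pos, neg)
print(orientation("movie", pos, neg))
```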

  11. LSH: key ideas • Goal: • map feature vector x to bit vector bx • ensure that bx preserves “similarity” • Basic idea: use random projections of x • Repeat many times: • Pick a random hyperplane r by picking random weights for each feature (say from a Gaussian) • Compute the inner product of r with x • Record whether x is “close to” r (r.x >= 0) • that’s the next bit in bx • Theory says that if x’ and x have small cosine distance then bx and bx’ will have small Hamming distance
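
A minimal sketch of the random-projection idea above: draw Gaussian random hyperplanes, take one bit per hyperplane, and compare signatures by Hamming distance. The dimensions and the small perturbation used in the demo are arbitrary choices for illustration.

```python
import numpy as np

def lsh_signature(x, hyperplanes):
    # One bit per random hyperplane r: bit = 1 iff r.x >= 0
    return (hyperplanes @ x >= 0).astype(np.uint8)

def hamming(b1, b2):
    return int(np.sum(b1 != b2))

d, n_bits = 100, 64
rng = np.random.default_rng(0)
R = rng.normal(size=(n_bits, d))      # random weights for each feature, per hyperplane

x  = rng.normal(size=d)
x2 = x + 0.05 * rng.normal(size=d)    # a vector with small cosine distance to x

bx, bx2 = lsh_signature(x, R), lsh_signature(x2, R)
print(hamming(bx, bx2))               # small Hamming distance, as the theory predicts
```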

  12. LSH applications • Compact storage of data • and we can still compute similarities • LSH also gives very fast approximations: • approx nearest neighbor method • just look at other items with bx’=bx • also very fast nearest-neighbor methods for Hamming distance • very fast clustering • cluster = all things with same bx vector

  13. Graph Algorithms

  14. Graph Algorithms • Properties of large data graphs • Spectral algorithms - multiple questions • APR algorithm - how and why it works • when we’d use it • what the analysis says

  15. Degree distribution • Plot cumulative degree • X axis is degree • Y axis is #nodes that have degree at least k • Typically use a log-log scale • Straight lines are a power law; normal curve dives to zero at some point • Left: trust network in epinions web site from Richardson & Domingos
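
A small sketch of how such a plot could be produced, assuming the graph is a dict of adjacency lists and matplotlib is available (both are assumptions for illustration):

```python
import matplotlib.pyplot as plt
from collections import Counter

def plot_cumulative_degree(adj):
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = Counter(degrees)
    ks = sorted(counts)
    # y value for degree k = number of nodes with degree at least k
    at_least, total, seen = [], len(degrees), 0
    for k in ks:
        at_least.append(total - seen)
        seen += counts[k]
    plt.loglog(ks, at_least, marker="o")   # log-log scale: a power law looks like a straight line
    plt.xlabel("degree k")
    plt.ylabel("# nodes with degree >= k")
    plt.show()
```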

  16. Homophily • Another def’n: excess edges between common neighbors of v

  17. Approximate PageRank: Algorithm

  18. Approximate PageRank: Key Idea • By definition PageRank is the fixed point of: pr(α, s) = α·s + (1−α)·pr(α, s)·W • Claim: pr(α, s) = α·s + (1−α)·pr(α, sW), i.e., recursively compute PageRank of the “neighbors of s” (= sW), then adjust • Key idea in apr: • do this “recursive step” repeatedly • focus on nodes where finding PageRank from neighbors will be useful

  19. Analysis • By linearity of pr(α, ·), re-grouping, and linearity again: pr(α, r − r(u)χu) + (1−α)·pr(α, r(u)χuW) = pr(α, r − r(u)χu + (1−α)·r(u)χuW)
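
A minimal sketch of the push-style loop this analysis justifies: keep an α fraction of the residual at u, spread the rest over u’s out-neighbors (the r(u)χuW term), and only push nodes whose residual is still large. This is one common variant with uniform out-edge weights and no lazy walk, not necessarily the exact pseudocode from the slides.

```python
from collections import defaultdict

def approx_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    # adj: dict node -> list of out-neighbors; seed: the start node s
    p = defaultdict(float)                  # approximate PageRank estimate
    r = defaultdict(float, {seed: 1.0})     # residual vector, initialized to chi_s
    frontier = [seed]
    while frontier:
        u = frontier.pop()
        out = adj.get(u, [])
        if not out or r[u] <= eps * len(out):
            continue                        # only push nodes with a large residual
        p[u] += alpha * r[u]                # keep an alpha fraction of the residual at u
        share = (1 - alpha) * r[u] / len(out)   # spread the rest over u's out-neighbors
        r[u] = 0.0
        for v in out:
            r[v] += share
            frontier.append(v)              # v may now be worth pushing too
    return dict(p)

adj = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
print(approx_pagerank(adj, "a"))
```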

  20. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • W: normalized so columns sum to 1 • [Figure: small example graph with nodes A–J]

  21. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • Q: How do I pick v to be an eigenvector for a block-stochastic matrix?

  22. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • [Figure: eigenvalue spectrum λ1, λ2, λ3, λ4, λ5,… with eigenvectors e1, e2, e3; the “eigengap” separates the informative eigenvalues from the rest] [Shi & Meila, 2002]

  23. Spectral Clustering: Graph = Matrix • W*v1 = v2 “propagates weights from neighbors” • [Figure: scatter plot of nodes by their e2 and e3 coordinates; nodes from the same cluster (x, y, z) land close together] [Shi & Meila, 2002]
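
One way to make the picture above concrete: compute a few top eigenvectors of the row-normalized W and use each node’s entries in e2, e3, … as coordinates, so nodes in the same block land near each other. A minimal sketch assuming W = D⁻¹A for a small symmetric adjacency matrix; the example graph is made up and this is not the exact procedure from the slides.

```python
import numpy as np

def spectral_embed(A, k=2):
    # A: symmetric 0/1 adjacency matrix; W = D^-1 A is the row-normalized walk matrix
    W = A / A.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)        # sort eigenvalues lambda_1 >= lambda_2 >= ...
    # Skip e1 (the trivial, near-constant eigenvector); use e2, e3, ... as coordinates
    return vecs[:, order[1:k + 1]].real

# A tiny two-block graph: nodes 0-2 form one clique, 3-5 another, with one bridge edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_embed(A, k=2))   # rows for nodes 0-2 cluster together, as do rows for 3-5
```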

  24. Spectral Clustering: Pros and Cons • Elegant, and well-founded mathematically • Tends to avoid local minima • Optimal solution to relaxed version of mincut problem (Normalized cut, aka NCut) • Works quite well when relations are approximately transitive (like similarity, social connections) • Expensive for very large datasets • Computing eigenvectors is the bottleneck • Approximate eigenvector computation not always useful • Noisy datasets sometimes cause problems • Picking number of eigenvectors and k is tricky • “Informative” eigenvectors need not be in top few • Performance can drop suddenly from good to terrible

  25. • D − A is the Laplacian • I − W is the random-walk Laplacian (NCut); other variants lead to other spectral clustering methods • For a 0/1 indicator vector y, yT(D − A)y = size of CUT(y) • NCut: roughly, minimize the ratio of transitions between classes vs. transitions within classes
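
A tiny numeric check of the identity above: for a 0/1 indicator vector y, yT(D − A)y counts exactly the edges crossing the cut. The small adjacency matrix is made up for illustration.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                  # the (unnormalized) Laplacian

y = np.array([1, 1, 0, 0], dtype=float)    # indicator of one side of the cut
print(y @ L @ y)                           # 2.0 = number of edges crossing the cut
```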

  26. Repeated averaging with neighbors on a sample problem… • Create a graph, connecting all points in the 2-D initial space to all other points • Weighted by distance • Run power iteration (averaging) for 10 steps • Plot node id x vs v10(x) • nodes are ordered and colored by actual cluster number (blue, green, red) • [Figure: v10(x) vs node id; nodes from the same cluster take similar values] • this graph is sometimes called a data manifold
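
A minimal sketch of the repeated-averaging procedure above: build a fully connected, distance-weighted graph over the 2-D points, row-normalize it, and run a few steps of v ← Wv. The Gaussian weighting and the step count are illustrative assumptions.

```python
import numpy as np

def power_iteration_embedding(X, steps=10, sigma=1.0):
    # X: n x 2 array of 2-D points; connect all pairs, weighted by distance
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    W = W / W.sum(axis=1, keepdims=True)    # row-normalize: averaging with neighbors
    v = np.random.default_rng(0).uniform(size=len(X))
    for _ in range(steps):
        v = W @ v                            # v_{t+1}(x) = weighted average of v_t over x's neighbors
    return v                                 # plot node id vs v[x]; clusters show up as plateaus
```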

  27. Experimental results: best-case assignment of class labels to clusters

  28. Similarity Joins

  29. Similarity Joins • Key ideas • indexing to find candidates quickly • special properties of TFIDF • short-cuts, upper bounds and A* search

  30. Build index on-the-fly • When finding matches for x consider y before x in ordering • Keep x[i] in inverted index for i • so you can find dot product dot(x,y) without using y
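
A minimal sketch of the on-the-fly index above for sparse (e.g., unit-normalized TFIDF) vectors stored as dicts: when x is processed, the index contains only earlier vectors, and dot(x, y) is accumulated from the postings so y itself never has to be re-read. The threshold and the data format are assumptions for illustration.

```python
from collections import defaultdict

def all_pairs_scores(vectors, threshold=0.5):
    # vectors: list of dicts {feature: weight}, assumed L2-normalized (so dot = cosine)
    index = defaultdict(list)           # feature i -> list of (id, weight) postings
    matches = []
    for x_id, x in enumerate(vectors):
        scores = defaultdict(float)     # candidate y -> partial dot(x, y)
        for i, w in x.items():
            for y_id, w_y in index[i]:
                scores[y_id] += w * w_y
        for y_id, s in scores.items():
            if s >= threshold:
                matches.append((x_id, y_id, s))
        for i, w in x.items():          # index x only after scoring it, so each pair is found once
            index[i].append((x_id, w))
    return matches
```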

  31. Build index on-the-fly • only index enough of x so that you can be sure to find it • maxweighti(V) * x[i] >= best score for matching on feature i, where x[i] is in x’, the unindexed part of x • the best possible score of things only reachable by non-indexed features must be < t • so the total mass of what you index needs to be large enough • correction: • indexes no longer have enough info to compute dot(x,y) • ordering common→rare features is a heuristic (any order is ok)

  32. First-Order Learning

  33. First-order learning • Expressiveness: when should you use it? • Analytics: what’s the cost of the operations?

  34. General hints in studying • Understand what you’ve done and why • There will be questions that test your understanding of the techniques implemented • why will/won’t this shortcut work? • what does the analysis say about the method? • I tend not to ask a lot about my own work • But it does tell you what I think is important! • Techniques covered in class but not assignments: • When/where/how to use them • That usually includes understanding the analytic results presented in class • E.g.: the derivation of the lazy logreg regularizer, vs. just the update rule

  35. for example Learning to Reason with Extracted Information William W. Cohen, Carnegie Mellon University joint work with: William Wang, Kathryn Rivard Mazaitis, Stephen Muggleton, Tom Mitchell, Ni Lao, Richard Wang, Frank Lin, Ni Lao, Estevam Hruschka, Jr., Burr Settles, Partha Talukdar, Derry Wijaya, Edith Law, Justin Betteridge, Jayant Krishnamurthy, Bryan Kisiel, Andrew Carlson, Weam Abu Zaki, Bhavana Dalvi, Malcolm Greaves, Lise Getoor, Jay Pujara, Hui Miao, …

  36. for example Motivation • MLNs (and comparable probabilistic first-order logics) are very general tools for constructing learning algorithms • But: they’re computationally expensive • converting to Markov nets: O(n^k) • where k is predicate arity, n is the size of the database (#facts about the problem) • inference in Markov nets (even small ones) is intractable • and inference really needs to be in the inner loop of the learner

  37. for example Motivation • What would a tractable version of MLNs look like? • inference would have to be constrained • MLNs allow: (a ^ b → c V d) == (~a V ~b V c V d) • Horn clauses: (a ^ b → c) == (~a V ~b V c) • but that’s not enough: • even binary (a → b) clauses become hard to evaluate as MLNs • you’d have to build a small “network” (or something like it) from a large database • how?

  38. for example Motivation • What would a tractable version of MLNs look like? • would it still be rich enough to be useful?

  39. Markov Logic • Logical language: First-order logic • Probabilistic language: Markov networks • Syntax: First-order formulas with weights • Semantics: Templates for Markov net features • Learning: • Parameters: Generative or discriminative • Structure: ILP with arbitrary clauses and MAP score • Inference: • MAP: Weighted satisfiability • Marginal: MCMC with moves proposed by SAT solver • Partial grounding + Lazy inference

  40. Markov Networks: [Review] • Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough) • Potential functions defined over cliques: P(x) = (1/Z) ∏c φc(xc), where Z normalizes over all joint states
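
A tiny numeric version of the definition above: the joint probability is a product of clique potentials divided by Z. The clique structure (one potential per pair) and the potential values below are made up for illustration.

```python
import itertools

# Illustrative pairwise cliques over the slide's variables; the values are made up.
phi_sc = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # phi(Smoking, Cancer)
phi_ac = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 6.0}   # phi(Asthma, Cough)

def unnormalized(s, c, a, k):
    # Product of potential functions over the cliques
    return phi_sc[(s, c)] * phi_ac[(a, k)]

Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))
print(unnormalized(1, 1, 0, 0) / Z)   # P(Smoking=1, Cancer=1, Asthma=0, Cough=0)
```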

  41. Inference in Markov Networks • Goal: Compute marginals & conditionals of the joint P(X) • Exact inference is #P-complete • Conditioning on the Markov blanket is easy: the conditional of Xi depends only on the potentials over cliques containing Xi • Gibbs sampling exploits this
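
A minimal sketch of the Gibbs step the slide refers to: resample one variable at a time from its conditional, which only requires the potentials over cliques containing that variable (its Markov blanket). The toy model and potential values are illustrative assumptions, not from the slide.

```python
import random

def gibbs_step(state, var_cliques):
    # state: dict var -> 0/1
    # var_cliques: var -> list of (clique_vars, phi) for the cliques containing var
    for v in state:
        weights = []
        for val in (0, 1):
            state[v] = val
            w = 1.0
            for clique_vars, phi in var_cliques[v]:   # only cliques touching v matter
                w *= phi[tuple(state[u] for u in clique_vars)]
            weights.append(w)
        state[v] = 1 if random.random() < weights[1] / (weights[0] + weights[1]) else 0
    return state

# Toy usage with a single potential over (Smoking, Cancer); values are illustrative.
phi_sc = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
var_cliques = {"Smoking": [(("Smoking", "Cancer"), phi_sc)],
               "Cancer":  [(("Smoking", "Cancer"), phi_sc)]}
state = {"Smoking": 0, "Cancer": 0}
for _ in range(1000):
    state = gibbs_step(state, var_cliques)
print(state)
```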

  42. Markov Logic • Logical language: First-order logic • Probabilistic language: Markov networks • Syntax: First-order formulas with weights • Semantics: Templates for Markov net features • Learning: • Parameters: Generative or discriminative • Structure: ILP with arbitrary clauses and MAP score • Inference: • MAP: Weighted satisfiability • Marginal: MCMC with moves proposed by SAT solver • Partial grounding + Lazy inference

  43. MCMC is Insufficient for Logic • Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow • Solution: Combine MCMC and WalkSAT → MC-SAT algorithm [Poon & Domingos, 2006]

  44. Relation to Statistical Models • Special cases (obtained by making all predicates zero-arity): Markov networks, Markov random fields, Bayesian networks, log-linear models, exponential models, max. entropy models, Gibbs distributions, Boltzmann machines, logistic regression, hidden Markov models, conditional random fields • Markov logic allows objects to be interdependent (non-i.i.d.)

  45. Q/A
