
Dimensionality Reduction and Embeddings

Presentation Transcript


  1. Dimensionality Reduction and Embeddings

  2. SVD: The mathematical formulation
  • Normalize the dataset by moving the origin to its center.
  • Find the eigenvectors of the data (or covariance) matrix; these define the new space.
  • Sort the eigenvalues in “goodness” order.
  [Figure: original axes f1, f2 and new axes e1, e2]
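A minimal NumPy sketch of this procedure (the helper name pca_project and its interface are illustrative, not from the slides):

```python
import numpy as np

def pca_project(X, k):
    """Project an n x m data matrix X onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)                      # move the origin to the center of the dataset
    # The right singular vectors of Xc are the eigenvectors of the covariance matrix.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Singular values (and hence the eigenvalues S**2) come back already sorted in "goodness" order.
    return Xc @ Vt[:k].T                         # coordinates in the new k-dimensional space
```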

  3. Compute Approximate SVD efficiently
  • Exact SVD is expensive: O(min{n^2 m, n m^2}). So, we try to compute it approximately.
  • We exploit the fact that, if A = UΛV^T, then AA^T = UΛ^2U^T and A^TA = VΛ^2V^T.
  1. Random projection + SVD. Cost: O(m n log n).
  2. Random sampling (p rows) and then SVD on the samples. Cost: O(max{m p^2, p^3}) or O(p^4)!! (caution: constants can be high!)

  4. Approximate SVD
  • We can guarantee an approximation bound of the following form, where P is the computed rank-k approximation, Ak is the best rank-k approximation, and ε is the error parameter:
    ||A – P||_F^2 <= ||A – Ak||_F^2 + ε ||A||_F^2

  5. A randomized SVD
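The algorithm itself is not reproduced in the transcript; the sketch below shows one standard random-projection variant, in the spirit of option 1 on the previous slide (the Gaussian test matrix and the oversampling parameter are assumptions on my part):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD via random projection; a sketch, not the slide's exact algorithm."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                   # orthonormal basis for the sampled range of A
    B = Q.T @ A                                      # small (k + oversample) x m matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)  # cheap exact SVD of the small matrix
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]            # approximate U, singular values, V^T
```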

  6. SVD Cont’d
  • Advantages:
    • Optimal dimensionality reduction (for linear projections)
  • Disadvantages:
    • Computationally expensive… but can be improved with random sampling
    • Sensitive to outliers and non-linearities

  7. Embeddings
  • Given a metric distance matrix D, embed the objects in a k-dimensional vector space using a mapping F such that D(i,j) is close to D’(F(i), F(j)).
  • Isometric mapping: exact preservation of distances.
  • Contractive mapping: D’(F(i), F(j)) <= D(i,j).
  • D’ is some Lp measure.

  8. Multi-Dimensional Scaling (MDS)
  • Map the items into a k-dimensional space, trying to minimize the stress (the discrepancy between the embedded and the original distances).
  • Steepest Descent algorithm:
    • Start with an assignment.
    • Minimize the stress by moving points.
  • But the running time is O(N^2), and O(N) to add a new item.
  • Another method: stress minimization by iterative majorization.
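A toy steepest-descent sketch of the idea (the raw-stress objective and the step size are my assumptions; production MDS code typically uses majorization instead):

```python
import numpy as np

def mds(D, k=2, steps=500, lr=0.01, seed=0):
    """Toy metric MDS: move points to reduce stress = sum_{i<j} (||xi - xj|| - D[i,j])^2."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, k))               # start with a random assignment
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]      # pairwise difference vectors, n x n x k
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)               # diagonal diffs are zero; this just avoids 0/0
        grad = ((dist - D) / dist)[:, :, None] * diff
        X -= lr * grad.sum(axis=1)                # move every point down the stress gradient
    return X
```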

  9. FastMap
  • What if we have a finite metric space (X, d)?
  • Faloutsos and Lin (1995) proposed FastMap as a metric analogue to PCA. Imagine that the points lie in a Euclidean space.
  • Select two pivot points xa and xb that are far apart.
  • Compute a pseudo-projection of the remaining points along the “line” xa xb.
  • “Project” the points onto an orthogonal subspace and recurse.

  10. FastMap
  • We want to find e1 first.
  [Figure: original axes f1, f2 and principal axes e1, e2]

  11. Selecting the Pivot Points
  • The pivot points should lie along the principal axes, and hence should be far apart.
  • Select any point x0.
  • Let x1 be the point furthest from x0.
  • Let x2 be the point furthest from x1.
  • Return (x1, x2).
  [Figure: x0, x1, x2, with x1 and x2 at opposite extremes of the point set]

  12. Pseudo-Projections
  • Given pivots (xa, xb), for any third point y, we use the law of cosines to determine the position of y along the line xa xb.
  • The pseudo-projection for y is
      cy = (d(a,y)^2 + d(a,b)^2 – d(b,y)^2) / (2 d(a,b))
  • This is the first coordinate.
  [Figure: triangle xa, xb, y with side lengths d(a,y), d(a,b), d(b,y) and projection cy onto xa xb]

  13. “Project to orthogonal plane”
  • Given the coordinates along xa xb, we can compute distances within the “orthogonal hyperplane” using the Pythagorean theorem:
      d’(y,z)^2 = d(y,z)^2 – (cy – cz)^2
  • Using d’(·,·), recurse until k features have been chosen.
  [Figure: y, z and their projections y’, z’ onto the hyperplane orthogonal to xa xb]

  14. Compute the next coordinate
  • Now we have projected all objects into a subspace orthogonal to the first dimension (the line xa xb).
  • We can apply FastMap recursively on the new projected dataset: FastMap(k–1, d’, D).
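Putting slides 9–14 together, a compact sketch (the pivot heuristic is simplified to a single pass, and negative squared distances that can arise from non-Euclidean data are clipped to zero):

```python
import numpy as np

def fastmap(D, k):
    """FastMap sketch: embed n objects with pairwise distance matrix D into R^k."""
    n = D.shape[0]
    D2 = D.astype(float) ** 2                        # work with squared distances
    coords = np.zeros((n, k))
    for dim in range(k):
        # Choose pivots: the point furthest from an arbitrary start, then the point furthest from it.
        a = int(np.argmax(D2[0]))
        b = int(np.argmax(D2[a]))
        dab2 = D2[a, b]
        if dab2 == 0:                                # all remaining distances are zero; stop early
            break
        # Law-of-cosines pseudo-projection onto the line through the pivots (slide 12).
        coords[:, dim] = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        # Project onto the orthogonal hyperplane: d'(y,z)^2 = d(y,z)^2 - (cy - cz)^2 (slide 13).
        c = coords[:, dim]
        D2 = np.maximum(D2 - (c[:, None] - c[None, :]) ** 2, 0.0)
    return coords
```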

  15. Embedding using ML
  • We can try to use learning techniques to “learn” the best mapping.
  • Works for general metric spaces, not only “Euclidean spaces”.
  • Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, George Kollios: BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 30(1): 89-104 (2008).

  16. BoostMap Embedding
  [Figure: database objects x1, x2, x3, …, xn in the original space]

  17. Embeddings
  [Figure: an embedding F maps the database objects x1, …, xn to vectors in R^d]

  18. Embeddings
  [Figure: as above, with a query object q in the original space]

  19. Embeddings
  [Figure: the query q is also mapped by F to a vector in R^d]

  20. Embeddings
  • Measure distances between vectors (typically much faster).
  [Figure: the query is compared to the database via their vectors in R^d]

  21. Embeddings
  • Measure distances between vectors (typically much faster).
  • Caveat: the embedding must preserve similarity structure.
  [Figure: the query is compared to the database via their vectors in R^d]

  22. Reference Object Embeddings
  [Figure: objects in the original space X]

  23. Reference Object Embeddings
  • r: reference object
  [Figure: r marked among the objects in the original space X]

  24. Reference Object Embeddings
  • r: reference object
  • Embedding: F(x) = D(x,r), where D is the distance measure in X.
  [Figure: r marked among the objects in the original space X]

  25. Reference Object Embeddings
  • r: reference object
  • Embedding: F(x) = D(x,r), where D is the distance measure in X.
  [Figure: F maps the original space X to the real line]

  26. Reference Object Embeddings
  • F(r) = D(r,r) = 0
  • Embedding: F(x) = D(x,r), where D is the distance measure in X and r is the reference object.
  [Figure: F maps the original space X to the real line]

  27. Reference Object Embeddings
  • F(r) = D(r,r) = 0
  • If a and b are similar, their distances to r are also similar (usually).
  • Embedding: F(x) = D(x,r), where D is the distance measure in X and r is the reference object.
  [Figure: F maps similar objects a and b to nearby points on the real line]

  29. F(x) = D(x, Lincoln)
      F(Sacramento)    = 1543
      F(Las Vegas)     = 1232
      F(Oklahoma City) =  437
      F(Washington DC) = 1207
      F(Jacksonville)  = 1344

  30. F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))
      F(Sacramento)    = ( 386, 1543, 2920)
      F(Las Vegas)     = ( 262, 1232, 2405)
      F(Oklahoma City) = (1345,  437, 1291)
      F(Washington DC) = (2657, 1207,  853)
      F(Jacksonville)  = (2422, 1344,  141)
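In code, this reference-object embedding is just a lookup of distances to the chosen reference cities (the dictionary below contains only the numbers shown on the slide):

```python
# Distances to the reference cities, as listed on the slide.
dist_to = {
    "Sacramento":    {"LA":  386, "Lincoln": 1543, "Orlando": 2920},
    "Las Vegas":     {"LA":  262, "Lincoln": 1232, "Orlando": 2405},
    "Oklahoma City": {"LA": 1345, "Lincoln":  437, "Orlando": 1291},
    "Washington DC": {"LA": 2657, "Lincoln": 1207, "Orlando":  853},
    "Jacksonville":  {"LA": 2422, "Lincoln": 1344, "Orlando":  141},
}

def F(city, references=("LA", "Lincoln", "Orlando")):
    """Reference-object embedding: map a city to its distances to the reference objects."""
    return tuple(dist_to[city][r] for r in references)

print(F("Sacramento"))   # (386, 1543, 2920)
```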

  32. Basic Questions
  • What is a good way to optimize an embedding?

  33. Basic Questions
  • F(x) = (D(x, LA), D(x, Denver), D(x, Boston))
  • What is a good way to optimize an embedding?
  • What are the best reference objects?
  • What distance should we use in R^3?

  34. Key Idea
  • Embeddings can be seen as classifiers.
  • Embedding construction can be seen as a machine learning problem.
  • The formulation is natural.
  • We optimize exactly what we want to optimize.

  35. Ideal Embedding Behavior
  • Notation: NN(q) is the nearest neighbor of q.
  • For any q: if a = NN(q), we want F(a) = NN(F(q)).
  [Figure: F maps q and a = NN(q) from the original space X to R^d]

  36. A Quantitative Measure
  • If b is not the nearest neighbor of q, F(q) should be closer to F(NN(q)) than to F(b).
  • For how many triples (q, NN(q), b) does F fail?

  37. A Quantitative Measure
  • F fails on five triples (in the figure’s example).
  [Figure: a query q, its nearest neighbor a, and the objects on which F fails]

  38. Embeddings Seen As Classifiers
  • Classification task: is q closer to a or to b?

  39. Embeddings Seen As Classifiers
  • Classification task: is q closer to a or to b?
  • Any embedding F defines a classifier F’(q, a, b).
  • F’ checks whether F(q) is closer to F(a) or to F(b).

  40. Classifier Definition
  • Classification task: is q closer to a or to b?
  • Given an embedding F: X → R^d:
    • F’(q, a, b) = ||F(q) – F(b)|| – ||F(q) – F(a)||
    • F’(q, a, b) > 0 means “q is closer to a.”
    • F’(q, a, b) < 0 means “q is closer to b.”

  41. Classifier Definition
  • Goal: build an F such that F’ has a low error rate on triples of the type (q, NN(q), b).
  • Given an embedding F: X → R^d:
    • F’(q, a, b) = ||F(q) – F(b)|| – ||F(q) – F(a)||
    • F’(q, a, b) > 0 means “q is closer to a.”
    • F’(q, a, b) < 0 means “q is closer to b.”
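A direct transcription of this definition (the helper names F_prime and triple_error_rate are mine):

```python
import numpy as np

def F_prime(F, q, a, b):
    """Classifier induced by an embedding F: positive means q looks closer to a than to b."""
    Fq, Fa, Fb = np.asarray(F(q)), np.asarray(F(a)), np.asarray(F(b))
    return np.linalg.norm(Fq - Fb) - np.linalg.norm(Fq - Fa)

def triple_error_rate(F, triples):
    """Fraction of (q, NN(q), b) triples on which F ranks the wrong object closer to q."""
    return sum(F_prime(F, q, nn, b) <= 0 for q, nn, b in triples) / len(triples)
```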

  42. 1D Embeddings as Weak Classifiers
  • 1D embeddings define weak classifiers: better than a random classifier (50% error rate).
  • We can define lots of different classifiers: every object in the database can be a reference object.
  • Question: how do we combine many such classifiers into a single strong classifier?

  43. 1D Embeddings as Weak Classifiers
  • 1D embeddings define weak classifiers: better than a random classifier (50% error rate).
  • We can define lots of different classifiers: every object in the database can be a reference object.
  • Question: how do we combine many such classifiers into a single strong classifier?
  • Answer: use AdaBoost, a boosting-based machine learning method designed for exactly this problem.

  44. Using AdaBoost
  • Output: H = w1F’1 + w2F’2 + … + wdF’d
  • AdaBoost chooses 1D embeddings and weighs them.
  • Goal: achieve low classification error.
  • AdaBoost trains on triples chosen from the database.
  [Figure: 1D embeddings F1, F2, …, Fn each map the original space X to the real line]
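A heavily simplified sketch of this training loop (it considers only reference-object 1D embeddings, takes a precomputed distance matrix D and fixed training triples as inputs, and omits many details of the actual BoostMap algorithm):

```python
import numpy as np

def boostmap_train(D, triples, candidates, rounds):
    """Pick `rounds` reference objects and AdaBoost weights for the combined classifier H."""
    y = np.ones(len(triples))                       # each triple (q, a, b) has a closer to q: label +1
    # h[r, t] = prediction of the weak classifier based on reference object r on triple t.
    h = np.array([[np.sign(abs(D[q, r] - D[b, r]) - abs(D[q, r] - D[a, r]))
                   for (q, a, b) in triples] for r in candidates])
    w = np.full(len(triples), 1.0 / len(triples))   # AdaBoost weights over the training triples
    refs, alphas = [], []
    for _ in range(rounds):
        errs = ((h * y) <= 0).astype(float) @ w     # weighted error of every candidate weak classifier
        best = int(np.argmin(errs))
        err = min(max(errs[best], 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)       # classic AdaBoost classifier weight
        w *= np.exp(-alpha * y * h[best])           # up-weight the triples this classifier gets wrong
        w /= w.sum()
        refs.append(candidates[best])
        alphas.append(alpha)
    return refs, alphas
```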

  45. From Classifier to Embedding
  • AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d
  • What embedding should we use?
  • What distance measure should we use?

  46. From Classifier to Embedding
  • AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d
  • BoostMap embedding: F(x) = (F1(x), …, Fd(x))

  47. From Classifier to Embedding
  • AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d
  • BoostMap embedding: F(x) = (F1(x), …, Fd(x))
  • Distance measure: D((u1, …, ud), (v1, …, vd)) = Σ_{i=1..d} wi |ui – vi|

  48. From Classifier to Embedding
  • AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d
  • BoostMap embedding: F(x) = (F1(x), …, Fd(x))
  • Distance measure: D((u1, …, ud), (v1, …, vd)) = Σ_{i=1..d} wi |ui – vi|
  • Claim: Let q be closer to a than to b. H misclassifies the triple (q, a, b) if and only if, under the distance measure D, F maps q closer to b than to a.
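In code, the resulting embedding and its weighted L1 distance look like this (assuming, as above, that each Fi is a reference-object embedding Fi(x) = D(x, ri); BoostMap also admits other 1D embedding types):

```python
import numpy as np

def boostmap_embed(x, refs, D):
    """BoostMap embedding built from the chosen 1D embeddings Fi(x) = D(x, ri)."""
    return np.array([D(x, r) for r in refs])

def weighted_l1(u, v, weights):
    """Distance in the embedded space: D(u, v) = sum_i wi * |ui - vi|."""
    return float(np.sum(np.asarray(weights) * np.abs(np.asarray(u) - np.asarray(v))))
```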

  49. i=1 i=1 i=1 d d d Proof H(q, a, b) = = wiF’i(q, a, b) = wi(|Fi(q) - Fi(b)| - |Fi(q) - Fi(a)|) = (wi|Fi(q) - Fi(b)| - wi|Fi(q) - Fi(a)|) = D(F(q), F(b)) – D(F(q), F(a)) = F’(q, a, b)

  50. Significance of Proof
  • AdaBoost optimizes a direct measure of embedding quality.
  • We have converted a database indexing problem into a machine learning problem.
