
Matrix Factorization


Presentation Transcript


  1. Matrix Factorization

  2. Recovering latent factors in a matrix: V is an n × m matrix (n users, m movies) with V[i,j] = user i’s rating of movie j.

  3. Recovering latent factors in a matrix: the diagram approximates V (n users × m movies) by a product of two smaller matrices; V[i,j] = user i’s rating of movie j.

  4. KDD 2011 talk pilfered from  …..

  5. Recovering latent factors in a matrix: V (n users × m movies) ≈ W H, where r is the rank of the factorization; V[i,j] = user i’s rating of movie j.
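
To pin the shapes down, here is a minimal numpy sketch of the rank-r approximation in the diagram. I take W to be n × r (user factors) and H to be r × m (movie factors); the random initialization and the example indices are illustrative only, not from the talk.

```python
import numpy as np

n, m, r = 1000, 500, 10      # n users, m movies, rank r of the approximation

# V (n x m) is approximated by the product W H,
# with W (n x r) holding user factors and H (r x m) holding movie factors.
W = np.random.rand(n, r)
H = np.random.rand(r, m)

V_hat = W @ H                # reconstructed rating matrix, shape (n, m)

# A single predicted rating: user i's rating of movie j.
i, j = 3, 42
prediction = W[i, :] @ H[:, j]
```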

  6. Matrix factorization for image denoising

  7. Matrix factorization as SGD (update rule, with the step size called out)
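
The update equation itself is an image on the slide; as a stand-in, here is a hedged sketch of the standard per-rating SGD step for L2-regularized squared loss. The function name, the default step size, and the regularizer lam are my own choices, not taken from the talk.

```python
import numpy as np

def sgd_step(V, W, H, i, j, step=0.01, lam=0.05):
    """One SGD update on the single observed entry V[i, j] (squared loss + L2)."""
    err = V[i, j] - W[i, :] @ H[:, j]      # prediction error for this rating
    w_old = W[i, :].copy()                 # keep old user factors for H's update
    W[i, :] += step * (err * H[:, j] - lam * W[i, :])
    H[:, j] += step * (err * w_old - lam * H[:, j])
    return err ** 2
```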

  8. Matrix factorization as SGD - why does this work?

  9. Matrix factorization as SGD - why does this work? Here’s the key claim:

  10. Checking the claim • Think of SGD for logistic regression • the LR loss compares y and ŷ = dot(w,x) • similar here, but now we update both w (user weights) and x (movie weights)
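
Spelled out (standard textbook forms, reconstructed rather than copied from the slides), the analogy looks roughly like this, with η the step size:

```latex
\begin{aligned}
\text{LR:}\quad & \hat{y} = \sigma(w \cdot x), &
  w &\leftarrow w + \eta\,(y - \hat{y})\,x \\
\text{MF:}\quad & \hat{v}_{ij} = W_{i\cdot} H_{\cdot j}, &
  W_{i\cdot} &\leftarrow W_{i\cdot} + \eta\,(V_{ij} - \hat{v}_{ij})\,H_{\cdot j}, \quad
  H_{\cdot j} \leftarrow H_{\cdot j} + \eta\,(V_{ij} - \hat{v}_{ij})\,W_{i\cdot}
\end{aligned}
```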

  11. What loss functions are possible? N1, N2 are diagonal matrices, sort of like IDF factors for the users/movies; also a “generalized” KL-divergence.
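
For reference, the standard forms of the two losses the slide names look roughly like this (my reconstruction; the talk's exact weighting may differ):

```latex
\begin{aligned}
\text{weighted squared loss:}\quad
  & L(W,H) = \bigl\lVert\, N_1\,(V - WH)\,N_2 \,\bigr\rVert_F^2,
    \qquad N_1, N_2 \text{ diagonal} \\[4pt]
\text{generalized KL-divergence:}\quad
  & D(V \,\Vert\, WH) = \sum_{i,j}\Bigl( V_{ij}\,\log\frac{V_{ij}}{(WH)_{ij}}
      - V_{ij} + (WH)_{ij} \Bigr)
\end{aligned}
```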

  12. What loss functions are possible?

  13. What loss functions are possible?

  14. ALS = alternating least squares
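
For contrast with the SGD approach, here is a minimal dense-matrix sketch of ALS: hold H fixed and solve a regularized least-squares problem for W, then swap. It assumes numpy, a fully observed V, and a ridge parameter lam, none of which come from the talk; a real recommender would solve only over observed ratings.

```python
import numpy as np

def als(V, r=10, n_iters=20, lam=0.05):
    """Alternating least squares: each half-step is a closed-form ridge regression."""
    n, m = V.shape
    W = np.random.rand(n, r)
    H = np.random.rand(r, m)
    I = np.eye(r)
    for _ in range(n_iters):
        # Fix H, solve (H H^T + lam I) W^T = H V^T for the user factors W.
        W = np.linalg.solve(H @ H.T + lam * I, H @ V.T).T
        # Fix W, solve (W^T W + lam I) H = W^T V for the movie factors H.
        H = np.linalg.solve(W.T @ W + lam * I, W.T @ V)
    return W, H
```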

  15. KDD 2011 talk pilfered from  …..

  16. Similar to McDonnell et al. with perceptron learning

  17. Slow convergence…..

  18. More detail….
  • Randomly permute rows/cols of the matrix
  • Chop V, W, H into blocks of size d x d
    • m/d blocks in W, n/d blocks in H
  • Group the data:
    • Pick a set of blocks with no overlapping rows or columns (a stratum)
    • Repeat until all blocks in V are covered
  • Train the SGD (sketched below):
    • Process strata in series
    • Process blocks within a stratum in parallel
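
Below is a serial Python sketch of that blocking scheme, just to pin down the control flow; in the real system the blocks of one stratum run on different workers. The details (d as the number of blocks per side, zeros treated as unobserved, the step size and regularizer) are my assumptions, not taken from the talk.

```python
import numpy as np

def dsgd_epoch(V, W, H, d, step=0.01, lam=0.05):
    """One epoch of blocked SGD, simulated serially.

    V is cut into a d x d grid of blocks.  A stratum is a set of d blocks that
    share no rows or columns, so its blocks could be trained in parallel;
    here the strata run in series and the blocks in a plain loop."""
    n, m = V.shape
    row_edges = np.linspace(0, n, d + 1, dtype=int)
    col_edges = np.linspace(0, m, d + 1, dtype=int)

    for shift in range(d):                      # one stratum per shift
        for b in range(d):                      # d non-overlapping blocks
            rb, cb = b, (b + shift) % d         # block (rb, cb) of the grid
            for i in range(row_edges[rb], row_edges[rb + 1]):
                for j in range(col_edges[cb], col_edges[cb + 1]):
                    if V[i, j] == 0:            # treat zeros as unobserved
                        continue
                    err = V[i, j] - W[i, :] @ H[:, j]
                    w_old = W[i, :].copy()
                    W[i, :] += step * (err * H[:, j] - lam * W[i, :])
                    H[:, j] += step * (err * w_old - lam * H[:, j])
    return W, H
```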

  19. More detail…. (the paper’s Z is the matrix called V here)

  20. More detail….
  • Initialize W, H randomly (not at zero)
  • Choose a random ordering (random sort) of the points in a stratum in each “sub-epoch”
  • Pick the strata sequence by permuting the rows and columns of M, and using M’[k,i] as the column index of row i in sub-epoch k
  • Use “bold driver” to set the step size (see the sketch below):
    • increase the step size when the loss decreases (in an epoch)
    • decrease the step size when the loss increases
  • Implemented in Hadoop and R/Snowfall
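
The “bold driver” rule in code form, as a hedged sketch; the multiplicative factors (grow by ~5% on improvement, halve on a loss increase) are common defaults, not values from the talk.

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adjust the SGD step size once per epoch: increase it slightly if the
    loss went down, cut it sharply if the loss went up."""
    return step * grow if curr_loss < prev_loss else step * shrink

# Typical use, once per epoch:
#   step = bold_driver(step, prev_loss, curr_loss)
```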

  21. Wall clock time (8 nodes, 64 cores, R/snow)

  22. Number of Epochs

  23. Varying rank (100 epochs for all)

  24. Hadoop scalability: Hadoop process setup time starts to dominate

  25. Hadoop scalability
