Memory Bounded Inference on Topic Models Paper by R. Gomes, M. Welling, and P. Perona Included in Proceedings of ICML 2008 Presentation by Eric Wang 1/9/2009
Outline • Motivation/Background: what is “memory bounded lifelong learning?” • Review of the Variational Topic Model • Note: The title “Topic Model” is misleading; this paper actually applies an LDA-based model to image analysis. • Introduce the Memory Bounded Variational Topic Model (a 2-step process) • Defining “Clumping” and “Grouping” • Model Building Step • Compression Step • Empirical Results • Conclusion
Background/Motivation • The goal of this paper is to create a class of algorithms which can • Expand model complexity as more data arrives, and • Do so in a speedy and memory-efficient manner. • Traditional non-parametric algorithms, including the Dirichlet process (DP), can address the first issue but not the second. • DP and its derivatives are batch algorithms. They require that the entire dataset be cached in memory, and each time the dataset is expanded, they infer a new model based on ALL of the data. • This paper proposes a new class of models which produces results similar to those of batch algorithms while meeting both requirements.
Background/Motivation • Memory Bounded Models compress the data and the internal representation of the model after each dataset expansion, according to a set memory bound. • The net effect is that the memory requirements of the model grow much more gradually over time than those of a comparable batch-processing model. • This paper demonstrates results with a memory-bounded approximation to LDA which validate the authors’ claims.
Review: Variational Topic Model • Consider the following Bayesian topic model where, unlike Blei’s form of LDA, the topic distributions are assumed to be in the exponential family with a conjugate prior.
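A sketch of the assumed generative process, written in the standard LDA form with exponential-family emissions; the notation below is introduced here for illustration (the specific emission family, e.g. Gaussian for image features, is an assumption):

```latex
% d indexes documents (images), n indexes words (patches), k indexes topics.
\begin{align*}
\beta_k  &\sim p(\beta_k \mid \lambda)            && \text{conjugate prior on each topic's parameters} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)          && \text{per-document topic proportions} \\
z_{dn}   &\sim \mathrm{Multinomial}(\theta_d)      && \text{topic indicator for word/patch } n \\
x_{dn}   &\sim \mathrm{ExpFam}(\beta_{z_{dn}})     && \text{observation drawn from the selected topic}
\end{align*}
```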
Review: Variational Topic Model • The variational posteriors are then Dirichlet over the document topic proportions (as in LDA) and multinomial over the topic indicators. • The updates of the variational parameters take the standard mean-field form; a sketch follows below.
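For reference, a sketch of the standard mean-field updates under the factorization q(θ_d) = Dirichlet(γ_d), q(z_dn) = Multinomial(φ_dn); this notation is introduced here and may differ from the slide’s:

```latex
\begin{align*}
\phi_{dnk} &\propto \exp\!\big( \mathbb{E}_q[\log \theta_{dk}] + \mathbb{E}_q[\log p(x_{dn} \mid \beta_k)] \big) \\
\gamma_{dk} &= \alpha_k + \textstyle\sum_{n} \phi_{dnk}
\end{align*}
% q(\beta_k) is updated by adding the expected sufficient statistics of the data,
% weighted by \phi_{dnk}, to the natural parameters of the conjugate prior.
```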
Review: Variational Topic Model • As in LDA, Gamma priors are assumed on the hyperparameters of the Dirichlet, • and the corresponding hyperparameter update equations follow.
“Clumping” and “Grouping” • Tie some of the word-level variational distributions, within and across different documents (images), to a shared “clump distribution.” • Further, tie some of the document-specific topic distributions to a shared “topic group” distribution.
“Clumping” and “Grouping” • Definitions: the number of documents in each group, the number of words* in each word clump, and the number of words from clump c appearing in group s. • The update equations can be derived similarly to those of the original topic model; a sketch is given below. * “Words” is used loosely; here it may mean an image patch.
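A hedged sketch of how the tied updates might look, with notation introduced here rather than taken from the slide: N_c is the number of words in clump c, N_{cs} the number of those words appearing in group s, φ_{csk} the shared responsibility of clump c (within group s) for topic k, and γ_{sk} the group-level Dirichlet parameter:

```latex
\begin{align*}
\phi_{csk} &\propto \exp\!\Big( \mathbb{E}_q[\log \theta_{sk}]
              + \tfrac{1}{N_c}\textstyle\sum_{x \in c} \mathbb{E}_q[\log p(x \mid \beta_k)] \Big) \\
\gamma_{sk} &= \alpha_k + \textstyle\sum_{c} N_{cs}\,\phi_{csk}
\end{align*}
% The counts N_{cs} scale each clump's contribution so that it stands in for
% all of the historical words it represents.
```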
“Clumping” and “Grouping” • Updates of the hyperparameters are also similar to those of the original topic model. • Since a clump may represent more than one “word,” its effect on the parameters must be scaled appropriately.
Memory Bounded Variational Topic Model • Suppose the data arrives in time-sequential sets denoted “epochs.” • The proposed family of models estimates a new model for the current epoch of data using the previous epoch’s model as a starting point (model building), then compresses the model and data by assigning words and documents to clumps and groups (model compression). Historical data are thereafter represented only as clumps and groups. • The data-level distributions and the document-level topic distributions are then discarded, while the clump sufficient statistics and the estimated model are retained as the initial model for the next epoch.
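A minimal sketch of the per-epoch loop just described; all function and variable names here are hypothetical placeholders, not the authors’ code:

```python
def build_model(epoch_data, clumps, groups, init=None):
    """Placeholder for the model-building step (variational updates plus split/merge moves)."""
    raise NotImplementedError


def compress_model(model, epoch_data, clumps, groups, bound):
    """Placeholder for the compression step (clump/group extraction under the memory bound)."""
    raise NotImplementedError


def memory_bounded_inference(epochs, memory_bound):
    """Epoch-by-epoch loop as described above; names are illustrative only."""
    model = None             # topics and hyperparameters carried across epochs
    clumps, groups = [], []  # compressed representation of all historical data

    for epoch_data in epochs:
        # Model building: variational inference over the new raw data plus the
        # existing clumps/groups, initialized from the previous epoch's model.
        model = build_model(epoch_data, clumps, groups, init=model)

        # Model compression: assign words to clumps and documents to groups so
        # that only sufficient statistics within the memory bound are retained.
        clumps, groups = compress_model(model, epoch_data, clumps, groups,
                                        bound=memory_bound)

        # The per-epoch data-level and document-level variational distributions
        # are discarded here; only the model and the clump/group sufficient
        # statistics move on to the next epoch.

    return model, clumps, groups
```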
Model Building • Variational Parameters • Hyperparameters
Model Building • In this step, the goal is to build the model exactly as in the non-memory bounded case, based on the model from the previous epoch. • The purpose of this step is to assign new data to specific topics, which is necessary for the following steps.
Model Building • This step ranks which components to split or merge based on criteria from the Split/Merge EM (SMEM) algorithm. • SMEM was originally proposed by Ueda et al. to help mixture models avoid local maxima by increasing the number of clusters in under-populated regions and decreasing the number of clusters in over-populated regions.
Model Building • The idea behind SMEM is to evaluate whether splitting a cluster or merging two clusters will increase the expected log-likelihood function. • This paper uses only the split/merge criterion from Ueda et al., and not the SMEM algorithm.
Model Building • Merge Criterion • The criterion uses the parameters estimated from standard EM and, for each mixture component, the vector of posterior probabilities of every data point under that component (a sketch follows). • Merge those components whose posterior vectors are most similar.
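The merge criterion of Ueda et al., as I understand it (notation introduced here): with θ* the parameters from standard EM and P_i(θ*) the length-N vector of posterior responsibilities of component i, pairs of components with the most similar responsibility vectors are merge candidates:

```latex
J_{\mathrm{merge}}(i, j; \theta^{*}) \;=\; P_i(\theta^{*})^{\top} P_j(\theta^{*}),
\qquad
P_i(\theta^{*}) = \big( p(i \mid x_1; \theta^{*}), \ldots, p(i \mid x_N; \theta^{*}) \big)^{\top}
```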
Model Building • Split Criterion • Defined as the KL divergence between an estimated mixture component k and the empirical density of the data belonging to that component (a sketch follows). • Split the component which most disagrees with its empirical density.
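Likewise, the split criterion from Ueda et al. (sketched here): the local KL divergence between the empirical density around component k, f_k(x; θ*), formed by weighting the data by their posterior responsibilities for k, and the component density itself:

```latex
J_{\mathrm{split}}(k; \theta^{*}) \;=\;
\int f_k(x; \theta^{*}) \, \log \frac{f_k(x; \theta^{*})}{p(x \mid \theta^{*}_k)} \, dx
```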
Model Building • Each document in the current epoch is initially assumed to be in its own group, and each data point in its own clump. Notice that the tied update equations then collapse to the original equations.
Model Building • Each topic is split into two daughter components, and the data from the current epoch with high posterior probability under that topic are then hard-assigned to one of the two daughter topics.
Model Building • To summarize: the clumps and new data are considered together during model inference, but the effect of a clump’s assignment to a topic must be scaled to represent all the data contained in the clump. In this way, the historical data points in each clump don’t need to be considered.
Model Building • Merges are done by choosing two topics, pooling the data with the highest posterior probability under those two topics, and combining the topics into a new meta-topic.
Model Building • In this step, a specific action (splitting a certain topic or merging two topics) is chosen to maximize the resulting improvement in the lower bound.
Model Building • After taking the action of either split or merge, the algorithm is run again considering all the data and clumps, as well as all the mixture components.
Model Compression • Data points and clumps are first hard-assigned to topics; this is done so that later, for each cluster, only the data with high posterior probability under that cluster needs to be considered.
Model Compression • Define an artificially scaled data sample size and an artificially scaled number of documents, • expressed in terms of the expected lifetime data size and the expected lifetime number of documents.
Model Compression • Variational Parameters • Hyperparameters
Model Compression • Due to the scaled sample size and document count, the lower bound equation is also modified
Model Compression • A key point is that this splitting procedure serves only to extract new data clumps, not to refine the model itself. • The splits from this recursive procedure are cached temporarily and extracted as clumps at the end, once the memory bound has been reached (a rough sketch follows).
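A rough sketch of the recursive clump-extraction idea under a memory budget; the bisection rule and all names here are illustrative assumptions, not the authors’ implementation:

```python
import numpy as np

def extract_clumps(points, max_clumps):
    """Illustrative sketch: recursively bisect a topic's hard-assigned data
    (rows of `points`) until at most `max_clumps` cells exist, then keep each
    cell only as sufficient statistics (count, sum, sum of outer products).
    Splitting the largest cell on its highest-variance feature at the median is
    an assumption for the sketch, not the paper's split criterion."""
    cells = [np.asarray(points, dtype=float)]
    while len(cells) < max_clumps:
        # Refine the largest remaining cell (a stand-in for "most in need of a split").
        idx = max(range(len(cells)), key=lambda i: len(cells[i]))
        cell = cells.pop(idx)
        if len(cell) < 2:
            cells.append(cell)
            break
        dim = int(np.argmax(cell.var(axis=0)))        # highest-variance feature
        mask = cell[:, dim] <= np.median(cell[:, dim])
        left, right = cell[mask], cell[~mask]
        if len(left) == 0 or len(right) == 0:         # degenerate split: stop refining
            cells.append(cell)
            break
        cells.extend([left, right])
    # Each cell becomes a clump stored via Gaussian sufficient statistics only.
    return [(len(c), c.sum(axis=0), c.T @ c) for c in cells]
```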
Model Compression • The raw data is not cached at the end of each epoch. Instead, each clump stores the sufficient statistics of a full-covariance Gaussian mixture component (in this application). • The total memory needed therefore depends on the feature dimensionality d and the number of document groups |S|.
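One plausible accounting of the storage, assuming each clump keeps one count, a d-dimensional mean, a full covariance (stored as its upper triangle), and one word count per document group; this breakdown is my reading of the slide, not a formula quoted from the paper:

```latex
\text{memory} \;\approx\; |C|\left(1 + d + \frac{d(d+1)}{2} + |S|\right) \ \text{real numbers},
\qquad |C| = \text{number of clumps}
```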
Model Compression • Another way to control memory usage is to merge document groups by means of a heuristic. • Initially each document is in its own “group,” but after the first iteration we are in a position to merge multiple documents together into groups.
Experimental Results • Two different tests were conducted: an image segmentation task and an object recognition/retrieval task. • The goal is not to show state-of-the-art performance, but rather to compare accuracy and memory usage between a conventional batch algorithm and its memory-bounded equivalent.
Experimental Results • The first experiment is image segmentation. • Dataset: 435 facial images from the “Caltech-Easy” dataset. • Each pixel is represented as a 5-dimensional feature vector: {x, y, color1, color2, color3}. • Each image was scaled to 200 × 160 pixels; 435 images with 5-dimensional per-pixel features amounts to 69,600,000 real numbers to store. • The baseline (non-memory-bounded) algorithm ran out of memory after storing 30 images.
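As a quick check, the storage figure follows directly from the image size and feature dimensionality:

```latex
435 \times (200 \times 160) \times 5 = 69{,}600{,}000 \ \text{real numbers}
```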
Experimental Results • (a) Original image. • (b) Result from the baseline method. • (c) Result from the memory-bounded model using 30% of the dataset, 125 clumps, and image grouping. • (d) Result from the memory-bounded model using 30% of the dataset, 125 clumps, and no image grouping.
Experimental Results • The top row shows sample clumps, with colors corresponding to different clumps; the bottom row shows the corresponding segmentations based on those clumps.
Experimental Results • Results when trained on the entire 435-image dataset; no document merges were used. • The total memory requirement was 257 MB, compared to 9.3 GB for the baseline batch-processing model. • Each epoch consisted of 10 images, resulting in 10 learning rounds. Total runtime was 15 hours.
Experimental Results • The second test was object recognition/retrieval. • The training set consisted of 3,000 images; the test set consisted of 1,000 images. • Features: 128-dimensional SIFT features from 500 randomly chosen locations in each image. • The batch model was not compared here because it could not handle the data size. • Epochs consisted of 60 images at a time.
Experimental Results • A single topic model was built for all training images, and a score for each test image was computed • A nearest neighbor (1-NN) algorithm was then used for classification.
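A minimal sketch of the retrieval step, assuming (as a simplification) that each image is summarized by its inferred topic-proportion score vector and that the 1-NN comparison uses Euclidean distance; none of this is taken verbatim from the paper:

```python
import numpy as np

def one_nn_classify(train_scores, train_labels, test_scores):
    """Assign each test image the label of its nearest training image,
    where the score arrays hold per-image topic-proportion vectors."""
    train_scores = np.asarray(train_scores)   # shape (n_train, n_topics)
    test_scores = np.asarray(test_scores)     # shape (n_test, n_topics)
    # Pairwise Euclidean distances between test and training score vectors.
    dists = np.linalg.norm(test_scores[:, None, :] - train_scores[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)            # index of the closest training image
    return [train_labels[i] for i in nearest]
```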
Experimental Results • When the 1-NN accuracy is plotted against the memory bound, the accuracy appears to saturate for M > 10,000. • When M is fixed at 10,000, performance drops significantly only when a small number of document groups is permitted.
Conclusion • The authors introduced a new family of variational Bayesian models whose capacity is not limited and which do not require that the entire historical dataset be cached. • The method speeds up inference and allows processing of much larger datasets than could previously be considered, with negligible (claimed) loss in accuracy. • A drawback of the algorithm is that, at each epoch, inference (either complete or restricted) must be performed several times.