
Multi-core Structural SVM Training


Presentation Transcript


  1. Multi-core Structural SVM Training Kai-Wei Chang Department of Computer Science University of Illinois at Urbana-Champaign Joint Work With Vivek Srikumar and Dan Roth

  2. Motivation • Decisions are structured in many applications. • Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. • It is essential to make coherent decisions in a way that takes the interdependencies into account. • E.g., part-of-speech tagging (sequential labeling): • Input: a sequence of words. • Output: part-of-speech tags {NN, VBZ, …}. "A cat chases a mouse" => "DT NN VBZ DT NN". • The assignment to each tag y_i can depend on both the input words and the other tags. • The feature vector φ(x, y) is defined on both input and output variables; features can be, e.g., an emission pair ("NN: cat") or a tag transition ("VBZ-VBZ"). (A feature-extraction sketch follows this slide.)
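A minimal sketch of such a joint feature vector for sequence tagging. The two templates here (word/tag emission and tag-bigram transition) are illustrative assumptions, not the exact templates from the talk:

```python
from collections import Counter

def joint_features(words, tags):
    """Sparse joint feature vector phi(x, y) for a tagged sentence.

    Two illustrative templates: emission (word, tag) and
    transition (previous tag, tag).
    """
    phi = Counter()
    prev = "<START>"
    for word, tag in zip(words, tags):
        phi[("emit", word.lower(), tag)] += 1   # e.g. ("emit", "cat", "NN")
        phi[("trans", prev, tag)] += 1          # e.g. ("trans", "NN", "VBZ")
        prev = tag
    return phi

# "A cat chases a mouse" => "DT NN VBZ DT NN"
phi = joint_features("A cat chases a mouse".split(),
                     ["DT", "NN", "VBZ", "DT", "NN"])
```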

  3. Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations. Example: "Dole 's wife, Elizabeth , is a native of N.C." (Figure: entities E1, E2, E3 linked by relations R12, R23.) Improvement over no inference: 2-5%.

  4. Structured Learning and Inference • Structured prediction: predicting a structured output variable y based on the input variable x. • Output variables form a structure: sequences, clusters, trees, or arbitrary graphs. • Structure comes from interactions between the output variables, through mutual correlations and constraints. • TODAY: • How to efficiently learn models that are used to make global decisions. • We focus on training a structural SVM model. • Various approaches have been proposed [Joachims et al. 09, Chang and Yih 13, Lacoste-Julien et al. 13], but they are single-threaded. • DEMI-DCD: a multi-threaded algorithm for training structural SVM.

  5. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  6. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  7. Structured Prediction: Inference • Inference constitutes predicting the best scoring structure: y* = argmax_{y in Y(x)} w^T φ(x, y), where φ(x, y) is the feature vector defined on the input-output pair, w holds the weight parameters (to be estimated during learning), and Y(x) is the set of allowed structures, often specified by constraints. • Efficient inference algorithms have been proposed for some specific structures. • An integer linear programming (ILP) solver can deal with general structures. (A brute-force sketch follows.)
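As a concrete illustration of the argmax, a brute-force version of this inference for sequence tagging, reusing joint_features from the sketch above. Real systems would use Viterbi for sequences or, as the slide notes, an ILP solver for general constraint structures; enumerating Y(x) is only feasible for tiny examples:

```python
import itertools

def dot(w, phi):
    """Linear score w . phi for sparse feature dicts."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def predict(w, words, tag_set, features):
    """Brute-force inference: argmax over all tag sequences.

    Exponential in sentence length -- for illustration only.
    """
    best, best_score = None, float("-inf")
    for tags in itertools.product(tag_set, repeat=len(words)):
        s = dot(w, features(words, tags))
        if s > best_score:
            best, best_score = list(tags), s
    return best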

  8. Structural SVM • Given a set of training examples {(x_i, y_i)}, solve the following optimization problem to learn w: min_w (1/2) w^T w + C Σ_i ℓ(ξ_i) s.t. w^T φ(x_i, y_i) - w^T φ(x_i, y) ≥ Δ(y, y_i) - ξ_i for all samples i and all feasible structures y. • Here w^T φ(x_i, y_i) is the score of the gold structure, w^T φ(x_i, y) the score of a predicted structure, Δ(y, y_i) the loss function, and ξ_i a slack variable. • w^T φ(x, y) is the scoring function used in inference. • We use ℓ(ξ) = ξ² (L2-loss structural SVM). (A sketch of this objective follows.)
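A sketch of evaluating this objective under the sparse-dict conventions above. The helper names (features, structures, loss) are hypothetical stand-ins for a feature extractor, an enumerator of Y(x), and Δ:

```python
def primal_objective(w, data, C, features, structures, loss):
    """L2-loss structural SVM primal:
        0.5 * ||w||^2 + C * sum_i xi_i^2,
    with xi_i = max_y [ loss(y, y_i) - w.phi(x_i, y_i) + w.phi(x_i, y) ].
    """
    def dot(w, phi):
        return sum(w.get(f, 0.0) * v for f, v in phi.items())

    obj = 0.5 * sum(v * v for v in w.values())
    for x, y_gold in data:
        gold_score = dot(w, features(x, y_gold))
        xi = max(loss(y, y_gold) - gold_score + dot(w, features(x, y))
                 for y in structures(x))
        obj += C * max(0.0, xi) ** 2
    return obj
```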

  9. Dual Problem of Structural SVM • The dual is a quadratic program with bound constraints: min_{α ≥ 0} (1/2) ||w(α)||² + (1/(4C)) Σ_i (Σ_y α_{i,y})² - Σ_{i,y} Δ(y, y_i) α_{i,y}, where w(α) = Σ_{i,y} α_{i,y} (φ(x_i, y_i) - φ(x_i, y)). • Each dual variable α_{i,y} corresponds to a different (example, structure) pair, so the number of variables can be exponentially large. • For a linear model: maintain the relationship between w and α throughout the learning process [Hsieh et al. 08].

  10. Active Set • Maintain an active set A of dual variables: identify the α_{i,y} that will be non-zero at the end of the optimization process. • In a single-thread implementation, training consists of two phases: • Select and maintain A (active set selection step): requires solving a loss-augmented inference problem, argmax_y [w^T φ(x_i, y) + Δ(y, y_i)], for each example; solving loss-augmented inferences is usually the bottleneck. • Update the values of α (learning step): requires (approximately) solving a sub-problem. • Related to cutting-plane methods. (A brute-force sketch of the selection step follows.)
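A brute-force sketch of the loss-augmented inference behind the selection step, with Hamming loss as an example Δ. In practice this is solved with the same efficient or ILP-based machinery as prediction; the dot and features callables are the ones from the earlier sketches:

```python
import itertools

def hamming(y_gold, y):
    """Example loss Delta(y, y_gold): number of mismatched tags."""
    return sum(a != b for a, b in zip(y_gold, y))

def loss_augmented_inference(w, words, y_gold, tag_set, features, dot):
    """argmax_y [ w . phi(x, y) + Delta(y, y_gold) ], brute force."""
    best, best_val = None, float("-inf")
    for tags in itertools.product(tag_set, repeat=len(words)):
        val = dot(w, features(words, tags)) + hamming(y_gold, tags)
        if val > best_val:
            best, best_val = list(tags), val
    return best
```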

  11. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  12. Overview of DEMI-DCD • DEMI-DCD: Decouple Model-update and Inference with Dual Coordinate Descent. • Let p be the number of threads, and split the training data into p - 1 parts. • Active set selection (inference) threads: each thread selects and maintains the active set A_i for every example i in its part of the data. • Learning thread: loop over all examples and update the model w. • w and the active sets A are shared between threads using shared memory buffers. (A threading sketch follows.)
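A minimal threading sketch of this decoupling, assuming hypothetical select/update callables that stand in for the routines on the neighboring slides. This shows only the control flow; the actual implementation is in Java (JLIS), where threads run truly in parallel:

```python
import threading

class SharedModel:
    """Shared buffer for w: the learning thread publishes snapshots,
    active set selection threads read them."""
    def __init__(self):
        self._w, self._lock = {}, threading.Lock()

    def publish(self, w):
        with self._lock:
            self._w = dict(w)

    def snapshot(self):
        with self._lock:
            return dict(self._w)

def selection_thread(shared, part, A, select, stop):
    # Repeatedly refresh the local copy of w and grow the active sets
    # of this thread's data part via loss-augmented inference.
    while not stop.is_set():
        w = shared.snapshot()
        for i, example in part:
            select(w, i, example, A)

def learning_thread(shared, data, A, update, stop):
    # Loop over examples, run dual coordinate descent on the current
    # active sets, and periodically publish w to the shared buffer.
    w = {}
    while not stop.is_set():
        for i, example in data:
            update(w, i, example, A)
        shared.publish(w)
```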

  13. Learning Thread • Sequentially visit each instance i and update each α_{i,y} with y in the active set A_i. • To update α_{i,y}, solve the following one-variable sub-problem: min_d D(α + d e_{i,y}) s.t. α_{i,y} + d ≥ 0, which has a closed-form solution. • Then α_{i,y} ← α_{i,y} + d and w ← w + d (φ(x_i, y_i) - φ(x_i, y)). • Shrinking heuristic: remove α_{i,y} from A_i if α_{i,y} = 0. (A sketch of the step follows.)
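A sketch of the closed-form step for the L2-loss dual, in the style of the coordinate descent updates of [Hsieh et al. 08] and [Chang and Yih 13]. The exact expression is my reconstruction, not copied from the slide:

```python
def dcd_step(w, alpha, key, phi_diff, delta, C, alpha_sum_i):
    """One-variable update for alpha[key], key = (i, y).

    phi_diff:    sparse dict for phi(x_i, y_i) - phi(x_i, y)
    delta:       loss Delta(y, y_i)
    alpha_sum_i: sum of alpha[i, y'] over the active set of example i
    """
    a = alpha.get(key, 0.0)
    wphi = sum(w.get(f, 0.0) * v for f, v in phi_diff.items())
    grad = wphi - delta + alpha_sum_i / (2.0 * C)          # dual gradient
    qii = sum(v * v for v in phi_diff.values()) + 1.0 / (2.0 * C)
    d = max(-a, -grad / qii)                               # projected step
    if d != 0.0:
        alpha[key] = a + d
        for f, v in phi_diff.items():                      # keep w in sync
            w[f] = w.get(f, 0.0) + d * v
    return d
```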

  14. Synchronization • DEMI-DCD requires little synchronization. • Only the learning thread can write w, and only one active set selection thread can modify each A_i. • Each thread maintains a local copy of w and copies it to/from a shared buffer after a fixed number of iterations.

  15. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  16. A Parallel Dual Coordinate Descent Algorithm • A master-slave architecture (MS-DCD): • Given p processors, split the data into p parts. • At each iteration: • The master sends the current model w to the slave threads. • Each slave thread solves the loss-augmented inference problems associated with its data block and updates the active set A. • After all slave threads finish, the master thread updates the model according to the active set. • Implemented in JLIS. (Diagram: master sends the current w to the slaves; slaves solve loss-augmented inference and update A; the master updates w based on A. A one-round sketch follows.)
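A one-round sketch of this master-slave loop, again with hypothetical select/update callables standing in for the routines sketched earlier. Note the implicit barrier: the master waits for all slaves before touching w, which is exactly the idle time DEMI-DCD avoids:

```python
from concurrent.futures import ThreadPoolExecutor

def ms_dcd_round(w, blocks, A, select, update):
    """One MS-DCD iteration: broadcast w, let slaves extend the
    active sets in parallel, then update w from A on the master."""
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # Slaves: loss-augmented inference on each data block.
        list(pool.map(lambda block: select(w, block, A), blocks))
    return update(w, A)  # Master: dual coordinate descent over A
```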

  17. Structured Perceptron and its Parallel Version • Structured Perceptron [Collins 02]: • At each iteration, pick an example (x, y) and find the best structured output ŷ according to the current model w. • Then update w ← w + η (φ(x, y) - φ(x, ŷ)) with a learning rate η. • SP-IPM [McDonald et al. 10]: • Split the data into p parts. • Train a structured Perceptron on each data block in parallel. • Mix the models using a linear combination. • Repeat step 2, using the mixed model as the initial model. (A sketch of both steps follows.)
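A sketch of the Perceptron update and the SP-IPM mixing step, under the same sparse-dict conventions as the earlier sketches; predict and features are the callables defined there:

```python
def perceptron_epoch(w, data, predict, features, eta=1.0):
    """One pass of structured Perceptron [Collins 02]."""
    for x, y_gold in data:
        y_hat = predict(w, x)
        if y_hat != y_gold:
            # w <- w + eta * (phi(x, y_gold) - phi(x, y_hat))
            for f, v in features(x, y_gold).items():
                w[f] = w.get(f, 0.0) + eta * v
            for f, v in features(x, y_hat).items():
                w[f] = w.get(f, 0.0) - eta * v
    return w

def mix_models(models, coeffs):
    """SP-IPM mixing: linear combination of per-block models."""
    w = {}
    for m, c in zip(models, coeffs):
        for f, v in m.items():
            w[f] = w.get(f, 0.0) + c * v
    return w
```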

  18. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  19. Experiment Settings • POS tagging (POS-WSJ): • Assign a POS label to each word in a sentence. • We use the standard Penn Treebank Wall Street Journal corpus with 39,832 sentences. • Entity and relation recognition (Entity-Relation): • Assign entity types to mentions and identify relations among them. • 5,925 training samples. • Inference is solved by an ILP solver. • We compare the following methods: • DEMI-DCD: the proposed method. • MS-DCD: a master-slave style parallel implementation of DCD. • SP-IPM: parallel structured Perceptron.

  20. Convergence on Primal Function Value (Figure: relative primal function value difference, log scale, along training time, on POS-WSJ and Entity-Relation.)

  21. Test Performance (Figure: test performance along training time on POS-WSJ. Callout: SP-IPM converges to a different model.)

  22. Test Performance (Figure: test performance along training time on the Entity-Relation task: entity F1 and relation F1.)

  23. Moving Average of CPU Usage (Figure: moving average of CPU usage on POS-WSJ and Entity-Relation. Callouts: DEMI-DCD fully utilizes CPU power; CPU usage drops because of the synchronization.)

  24. Outline • Structural SVM: Inference and Learning • DEMI-DCD for Structural SVM • Related Work • Experiments • Conclusions

  25. Conclusion We proposed DEMI-DCD for training structural SVMs on multi-core machines. The proposed method decouples the model-update and inference phases of learning. As a result, it can fully utilize all available processors to speed up learning. Software will be available at: http://cogcomp.cs.illinois.edu/page/software Thank you.
