This guide explores the implementation of parallelized boosting algorithms to identify strong hypotheses relating attributes to labels. Users specify options through a configuration file (number of nodes/cores, memory, number of iterations) and through behavioral classes that define the hypotheses. In the pre-processing stage, data is formatted, hypotheses are defined, and a parallelizable task framework is established. The training phase uses parallel processing to minimize error through weak learners, followed by post-processing that reports the results. The approach handles large datasets efficiently while making full use of the available cores and memory.
Parallelized Boosting. Mehmet Basbug, Burcin Cakir, Ali Javadi Abhari. Date: 10 Jan 2013
Motivating Example
• Many examples, many attributes
• Can we find a good (strong) hypothesis relating the attributes to the final labels?
Table 1. Example Data Format (rows: Examples; columns: Attributes and Labels)
User Interface
• User specifies the desired options in two ways:
• Configuration File: information about the number of nodes/cores, memory, number of iterations. To be parsed by the preprocessor.
• Behavioral Classes: defining the hypotheses' "behaviors"
---------------------------------------------------------------------------
<configurations.config>
---------------------------------------------------------------------------
[Configuration 1]
working_directory = '/scratch/pboost/example'
data_files = 'diabetes_train.dat'
test_files = 'diabetes_test.dat'
fn_behavior = 'behaviors_diabetes.py'
boosting_algorithm = 'confidence_rated'
max_memory = 2
xval_no = 10
round_no = 1000
---------------------------------------------------------------------------
<behaviors_diabetes.py>
---------------------------------------------------------------------------
from parallel_boosting.utility.behavior import Behavioral

class BGL_Day_Av(Behavioral):
    def behavior(self, bgl_m, bgl_n, bgl_e):
        # Average of the morning, noon and evening blood glucose readings
        return (self.data[:, bgl_m] + self.data[:, bgl_n] + self.data[:, bgl_e]) / 3

    def fn_generator(self):
        # Columns come in (morning, noon, evening) triples; register one
        # averaging hypothesis per day
        for k in range(1, (self.data.shape[1] - 4) // 3 + 1):
            bgl_m = 3 * k
            bgl_n = 3 * k + 1
            bgl_e = 3 * k + 2
            self.insert(bgl_m, bgl_n, bgl_e)
---------------------------------------------------------------------------
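The configuration file above uses standard INI syntax, so the preprocessor can read it with Python's built-in configparser. A minimal sketch under that assumption (the read_config helper and the Config tuple are illustrative, not part of pboost):
---------------------------------------------------------------------------
from configparser import ConfigParser
from collections import namedtuple

# Illustrative container for the fields shown in configurations.config
Config = namedtuple("Config", ["working_directory", "data_files", "test_files",
                               "fn_behavior", "boosting_algorithm",
                               "max_memory", "xval_no", "round_no"])

def read_config(path, section="Configuration 1"):
    parser = ConfigParser()
    parser.read(path)
    sec = parser[section]
    unquote = lambda s: s.strip("'")          # values are quoted in the example file
    return Config(
        working_directory=unquote(sec["working_directory"]),
        data_files=unquote(sec["data_files"]),
        test_files=unquote(sec["test_files"]),
        fn_behavior=unquote(sec["fn_behavior"]),
        boosting_algorithm=unquote(sec["boosting_algorithm"]),
        max_memory=int(sec["max_memory"]),    # GB of memory, per the example
        xval_no=int(sec["xval_no"]),          # number of cross-validation folds
        round_no=int(sec["round_no"]),        # number of boosting rounds
    )
---------------------------------------------------------------------------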
Pre-Processing
• User-defined Python classes: to obtain different function behaviors and a set of hypotheses.
• Configuration file: to get the path of the required data and definitions.
• Function Definitions Table: to store the hypotheses and make them available to different cores.
• Hypothesis Result Matrix
• Sorting Index Matrix: to save the sorting indices of each example.
Table 2. Function Definitions Table
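To make the Hypothesis Result Matrix and the Sorting Index Matrix concrete, here is a minimal sketch of how they could be built with NumPy (the function and array names are assumptions; this is not the pboost code itself):
---------------------------------------------------------------------------
import numpy as np

def build_matrices(data, hypotheses):
    """Evaluate every hypothesis on every example and cache the sort order.

    data       : (n_examples, n_attributes) array
    hypotheses : list of callables, each mapping the data to one value per example
    """
    n_examples = data.shape[0]
    n_hyp = len(hypotheses)

    # Hypothesis Result Matrix: one row per hypothesis, one column per example
    results = np.empty((n_hyp, n_examples))
    for j, h in enumerate(hypotheses):
        results[j, :] = h(data)

    # Sorting Index Matrix: 16-bit indices that sort each hypothesis row,
    # so thresholds can later be scanned in a single pass over the data
    sort_idx = np.argsort(results, axis=1).astype(np.uint16)
    return results, sort_idx
---------------------------------------------------------------------------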
Pre-Processing (cont'd)
Applying each function to each example is a parallelizable task. Therefore, another important step in the pre-processing stage is to read the machine information from the configuration file.
Table 3. Function Output Table
Table 4. Sorting Index Table
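Because every (hypothesis, example) evaluation is independent, the evaluation loop can be split across the cores named in the configuration. A minimal sketch with the standard multiprocessing module (splitting the work by hypothesis chunks is an assumption about how the task is divided):
---------------------------------------------------------------------------
import numpy as np
from multiprocessing import Pool

def _apply_chunk(args):
    # Evaluate one chunk of hypotheses on the full data set
    data, chunk = args
    return np.vstack([h(data) for h in chunk])

def parallel_apply(data, hypotheses, n_cores):
    # Hypotheses must be picklable (module-level callables) for multiprocessing.
    # Split the hypothesis list into one chunk per core and evaluate in parallel.
    chunks = np.array_split(np.array(hypotheses, dtype=object), n_cores)
    with Pool(n_cores) as pool:
        parts = pool.map(_apply_chunk, [(data, list(c)) for c in chunks])
    return np.vstack(parts)
---------------------------------------------------------------------------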
Training the boosting algorithm
The sorting index matrix is partitioned across the slaves, and each slave keeps its own error matrices.
Weak Learner (Slave):
• Calculate the error of every combination (hypothesis, labeling, threshold) for the hypotheses in its partition, under the given distribution over the examples (Dt)
• Return the hypothesis with the least error
Boosting (Master):
• Start with a distribution over the examples (Dt)
• For each round t = 1...T: send Dt to each slave; receive the best hypothesis from each slave (h1t, h2t, ...); find the one with the least error (ht); update Dt using ht; calculate the coefficient at
• Return the linear combination of the ht's
Features:
- Super fast: memory based, single pass through the data, stores indexes rather than results (16 bit vs. 64 bit), uses LAPACK & numexpr, embarrassingly parallel
- Several boosting algorithms
- Flexible xval structure
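For readers unfamiliar with the master loop above, here is a minimal sketch of a plain AdaBoost-style version (the confidence-rated algorithm named in the configuration computes the coefficient and update differently; the weak_learner callable standing in for the slave step is an assumption):
---------------------------------------------------------------------------
import numpy as np

def boost(X, y, weak_learner, T):
    """AdaBoost-style master loop.

    X            : (n_examples, n_attributes) data
    y            : labels in {-1, +1}
    weak_learner : callable(X, y, D) -> h, where h(X) returns {-1, +1} predictions;
                   stands in for the "best hypothesis from each slave" step
    T            : number of boosting rounds
    """
    n = X.shape[0]
    D = np.full(n, 1.0 / n)          # D_t: distribution over examples
    ensemble = []                    # list of (coefficient a_t, hypothesis h_t)

    for t in range(T):
        h = weak_learner(X, y, D)    # in pboost: the min-error hypothesis over all slaves
        pred = h(X)
        err = np.sum(D * (pred != y))
        if err >= 0.5:               # weak learner no better than chance; stop
            break
        a = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # coefficient a_t
        D = D * np.exp(-a * y * pred)                     # reweight the examples
        D = D / D.sum()                                   # renormalize to get D_{t+1}
        ensemble.append((a, h))

    # Final classifier: sign of the linear combination of the h_t's
    def H(Xnew):
        return np.sign(sum(a * h(Xnew) for a, h in ensemble))
    return H, ensemble
---------------------------------------------------------------------------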
Post-Processing
• Combines and reports the collected results
• The result after each round of iteration is stored by the master: the set of hypotheses (ht), their respective coefficients (at), and the error
• Plot training and testing error vs. number of rounds
• Plot ROC curves for training and testing
• Confusion matrix showing false/true positives/negatives
• Create a standalone final classifier
• Report running time, amount of memory used, number of cores, ...
• Clean up extra intermediary data stored on disk
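As an illustration of how the stored (at, ht) pairs could be turned into the reported error curves and a confusion matrix, here is a minimal sketch (the in-memory ensemble format and the matplotlib calls are assumptions, not the pboost reporting code):
---------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt

def error_per_round(ensemble, X, y):
    """Cumulative error after each boosting round; ensemble is a list of (a_t, h_t)."""
    score = np.zeros(X.shape[0])
    errors = []
    for a, h in ensemble:
        score += a * h(X)
        errors.append(np.mean(np.sign(score) != y))
    return errors

def report(ensemble, X_train, y_train, X_test, y_test):
    # Training/testing error vs. number of rounds
    plt.plot(error_per_round(ensemble, X_train, y_train), label="train")
    plt.plot(error_per_round(ensemble, X_test, y_test), label="test")
    plt.xlabel("round"); plt.ylabel("error"); plt.legend()
    plt.savefig("error_vs_rounds.png")

    # Confusion matrix on the test set (labels assumed to be in {-1, +1})
    pred = np.sign(sum(a * h(X_test) for a, h in ensemble))
    tp = np.sum((pred == 1) & (y_test == 1))
    fp = np.sum((pred == 1) & (y_test == -1))
    tn = np.sum((pred == -1) & (y_test == -1))
    fn = np.sum((pred == -1) & (y_test == 1))
    print("true pos:", tp, "false pos:", fp, "true neg:", tn, "false neg:", fn)
---------------------------------------------------------------------------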