This paper presents a novel approach to decision tree induction in large distributed computing environments, where data-intensive and synchronization-heavy computations are costly. By leveraging a hierarchical architecture, the proposed algorithm simplifies communication and synchronization, allowing it to handle vast datasets efficiently. The approach collects only partial global statistics per attribute, which significantly reduces communication bandwidth while maintaining accuracy. Key performance metrics demonstrate a 99% reduction in communication costs. Future work will explore text mining and incremental algorithms in real-world grid systems.
Decision Tree Induction in Hierarchic Distributed Systems
With: Amir Bar-Or, Ran Wolff, Daniel Keren
Motivation
Large distributed computations are costly
• Especially data-intensive and synchronization-intensive ones, e.g., data mining
• Decision tree induction:
• Collect global statistics (thousands) for every attribute (thousands) in every tree node (hundreds)
• Global statistics require global synchronization
Motivation
Hierarchy helps
• Simplifies synchronization
• Synchronize at each level
• Simplifies communication
• An “industrial strength” architecture
• The way real systems (including grids) are often organized
Motivation
Mining high-dimensional data
• Thousands of sources
• Central control
• Examples:
• Genomically enriched healthcare data
• Text repositories
Objectives of the Algorithm
• Exact results
• Common approaches would either
• Collect a sample of the data, or
• Build independent models at each site and then use centralized meta-learning on top of them
• Communication efficiency
• Naive approach: collecting exact statistics for each tree node would result in GBytes of communication
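To see where the GBytes figure comes from, here is a rough back-of-the-envelope estimate using the numbers quoted in the Motivation slide (the 8-byte counter size is our assumption):

\[
\underbrace{10^{3}}_{\text{statistics/attribute}} \times \underbrace{10^{3}}_{\text{attributes}} \times \underbrace{10^{2}}_{\text{tree nodes}} = 10^{8}\ \text{counters},
\qquad
10^{8} \times 8\ \text{bytes} \approx 800\ \text{MB per site},
\]

so collecting exact statistics from thousands of data sources quickly adds up to GBytes of traffic.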
Decision Tree in a Teaspoon
• A tree where, at each level, the learning samples are split according to one attribute’s value
• A hill-climbing heuristic is used to induce the tree
• The attribute that maximizes a gain function is taken
• Gain functions: Gini or Information Gain
• No real need to compute the gain exactly, only to identify the attribute that maximizes it
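As a concrete reference for the two gain functions mentioned above, here is a minimal Python sketch (illustrative only, not the paper’s code) that scores a candidate split from class-count statistics:

```python
import math

def entropy(counts):
    """Shannon entropy of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0) if n else 0.0

def gini(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gain(parent_counts, child_counts, impurity=entropy):
    """Impurity reduction of a split: parent impurity minus the weighted
    impurity of the children (Information Gain with entropy, Gini gain with gini)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in child_counts)
    return impurity(parent_counts) - weighted

# Example: 14 samples (9 positive, 5 negative) split by a binary attribute.
print(gain([9, 5], [[6, 2], [3, 3]]))          # Information Gain
print(gain([9, 5], [[6, 2], [3, 3]], gini))    # Gini gain
```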
Main Idea
• Infer deterministic bounds on the gain of each attribute
• Improve bounds until the best attribute is provably better than the rest
• Communication efficiency is achieved because bounds require only limited data
• Partial statistics for promising attributes
• Rough bound on irrelevant attributes
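In symbols (our notation, not taken from the slides), the algorithm maintains an interval for each attribute’s gain and stops as soon as one attribute is provably best:

\[
\underline{G}(a) \;\le\; G(a) \;\le\; \overline{G}(a) \quad \text{for every attribute } a,
\qquad
\text{split on } a^{*} \ \text{once} \ \underline{G}(a^{*}) \;\ge\; \max_{b \ne a^{*}} \overline{G}(b).
\]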
Hierarchical Algorithm
• At each level of the hierarchy:
• Wait for reports from all descendants
• Reports contain upper and lower bounds on the gain of each attribute, plus the number of samples from each class
• Use the descendants’ reports to compute cumulative bounds
• If there is no clear separation, request descendants to tighten bounds by sending more data
• At worst, all data is gathered
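A minimal sketch of the per-level control loop described above; all helper names (report, combine_bounds, request_tightening) are hypothetical, and the bound-combination rule is left abstract since the paper’s exact formulas are not reproduced here:

```python
def choose_split_attribute(children, combine_bounds, request_tightening):
    """One level of the hierarchy: gather interval bounds on each attribute's
    gain from all descendants, and ask for more data only while no attribute
    is provably the best."""
    while True:
        # Wait for a report from every descendant: per-attribute (lower, upper)
        # gain bounds plus per-class sample counts.
        reports = [child.report() for child in children]

        # Combine descendant bounds into cumulative bounds at this level.
        bounds = combine_bounds(reports)  # {attribute: (lower, upper)}

        # Candidate: the attribute with the highest lower bound.
        best = max(bounds, key=lambda a: bounds[a][0])
        best_lower = bounds[best][0]
        runner_up_upper = max(
            (hi for a, (lo, hi) in bounds.items() if a != best),
            default=float("-inf"),
        )

        # Clear separation: the candidate is provably better than the rest.
        if best_lower >= runner_up_upper:
            return best

        # Otherwise ask descendants to tighten their bounds by sending more
        # detailed statistics; in the worst case all data is gathered.
        request_tightening(children, best, runner_up_upper)
```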
Deterministic Bounds
• Upper bound
• Lower bound
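As a simple illustration of a deterministic bound of this kind (an example of ours, looser than the paper’s actual bounds): the Information Gain of any attribute can never exceed the class entropy, which is computable from the per-class sample counts already included in each report:

\[
0 \;\le\; G_{\mathrm{IG}}(a) \;=\; H(C) - \sum_{v} \frac{|S_v|}{|S|}\, H(C \mid a = v) \;\le\; H(C).
\]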
Performance Figures
• 99% reduction in communication bandwidth
• Out of 1000 SNPs, only ~12 were reported to higher levels of the hierarchy
• The percentage declines with hierarchy level
More Performance Figures
• Larger datasets require lower bandwidth
• Outlier noise is not a big issue
• White noise is handled even better
Future Work
• Text mining
• Incremental algorithm
• Accommodation of failure
• Testing on a real grid system
• Is this a general framework?