This paper presents a novel approach to decision tree induction in large distributed computing environments, where data-intensive and synchronization-heavy computations are costly. By leveraging a hierarchical architecture, the proposed algorithm simplifies communication and synchronization, allowing it to handle vast datasets efficiently. The approach collects only partial global statistics per attribute, which significantly reduces communication bandwidth while maintaining accuracy. Key performance metrics demonstrate a 99% reduction in communication costs. Future work will explore text mining and incremental algorithms in real-world grid systems.
Decision Tree Induction in Hierarchic Distributed Systems
With: Amir Bar-Or, Ran Wolff, Daniel Keren
Motivation
Large distributed computations are costly
• Especially data-intensive and synchronization-intensive ones, e.g., data mining
• Decision tree induction:
• Collect global statistics (thousands) for every attribute (thousands) in every tree node (hundreds)
• Global statistics require global synchronization
Motivation
Hierarchy helps
• Simplifies synchronization
• Synchronize at each level
• Simplifies communication
• An “industrial strength” architecture
• The way real systems (including grids) are often organized
Motivation
Mining high-dimensional data
• Thousands of sources
• Central control
• Examples:
• Genomically enriched healthcare data
• Text repositories
Objectives of the Algorithm
• Exact results
• Common approaches would either
• Collect a sample of the data, or
• Build independent models at each site and then use centralized meta-learning on top of them
• Communication efficiency
• Naive approach: collecting exact statistics for each tree node would result in GBytes of communication
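To see where the GBytes figure comes from, here is a rough back-of-the-envelope estimate using the numbers quoted in the Motivation slide (the 8-byte counter size is our assumption):

\[
\underbrace{10^{3}}_{\text{statistics/attribute}} \times \underbrace{10^{3}}_{\text{attributes}} \times \underbrace{10^{2}}_{\text{tree nodes}} = 10^{8}\ \text{counters},
\qquad
10^{8} \times 8\ \text{bytes} \approx 800\ \text{MB per site},
\]

so collecting exact statistics from thousands of data sources quickly adds up to GBytes of traffic.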
Decision Tree in a Teaspoon
• A tree where, at each level, the learning samples are split according to one attribute’s value
• A hill-climbing heuristic is used to induce the tree
• The attribute that maximizes a gain function is taken
• Gain functions: Gini or Information Gain
• No real need to compute the gain exactly, only to identify the attribute that maximizes it
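As a concrete reference for the two gain functions mentioned above, here is a minimal Python sketch (illustrative only, not the paper’s code) that scores a candidate split from class-count statistics:

```python
import math

def entropy(counts):
    """Shannon entropy of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0) if n else 0.0

def gini(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gain(parent_counts, child_counts, impurity=entropy):
    """Impurity reduction of a split: parent impurity minus the weighted
    impurity of the children (Information Gain with entropy, Gini gain with gini)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in child_counts)
    return impurity(parent_counts) - weighted

# Example: 14 samples (9 positive, 5 negative) split by a binary attribute.
print(gain([9, 5], [[6, 2], [3, 3]]))          # Information Gain
print(gain([9, 5], [[6, 2], [3, 3]], gini))    # Gini gain
```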
Main Idea
• Infer deterministic bounds on the gain of each attribute
• Improve bounds until the best attribute is provably better than the rest
• Communication efficiency is achieved because bounds require only limited data
• Partial statistics for promising attributes
• Rough bound on irrelevant attributes
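In symbols (our notation, not taken from the slides), the algorithm maintains an interval for each attribute’s gain and stops as soon as one attribute is provably best:

\[
\underline{G}(a) \;\le\; G(a) \;\le\; \overline{G}(a) \quad \text{for every attribute } a,
\qquad
\text{split on } a^{*} \ \text{once} \ \underline{G}(a^{*}) \;\ge\; \max_{b \ne a^{*}} \overline{G}(b).
\]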
Hierarchical Algorithm
• At each level of the hierarchy:
• Wait for reports from all descendants
• Reports contain upper and lower bounds on the gain of each attribute, plus the number of samples from each class
• Use the descendants’ reports to compute cumulative bounds
• If there is no clear separation, request descendants to tighten bounds by sending more data
• At worst, all data is gathered
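A minimal sketch of the per-level control loop described above; all helper names (report, combine_bounds, request_tightening) are hypothetical, and the bound-combination rule is left abstract since the paper’s exact formulas are not reproduced here:

```python
def choose_split_attribute(children, combine_bounds, request_tightening):
    """One level of the hierarchy: gather interval bounds on each attribute's
    gain from all descendants, and ask for more data only while no attribute
    is provably the best."""
    while True:
        # Wait for a report from every descendant: per-attribute (lower, upper)
        # gain bounds plus per-class sample counts.
        reports = [child.report() for child in children]

        # Combine descendant bounds into cumulative bounds at this level.
        bounds = combine_bounds(reports)  # {attribute: (lower, upper)}

        # Candidate: the attribute with the highest lower bound.
        best = max(bounds, key=lambda a: bounds[a][0])
        best_lower = bounds[best][0]
        runner_up_upper = max(
            (hi for a, (lo, hi) in bounds.items() if a != best),
            default=float("-inf"),
        )

        # Clear separation: the candidate is provably better than the rest.
        if best_lower >= runner_up_upper:
            return best

        # Otherwise ask descendants to tighten their bounds by sending more
        # detailed statistics; in the worst case all data is gathered.
        request_tightening(children, best, runner_up_upper)
```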
Deterministic Bounds
• Upper bound
• Lower bound
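As a simple illustration of a deterministic bound of this kind (an example of ours, looser than the paper’s actual bounds): the Information Gain of any attribute can never exceed the class entropy, which is computable from the per-class sample counts already included in each report:

\[
0 \;\le\; G_{\mathrm{IG}}(a) \;=\; H(C) - \sum_{v} \frac{|S_v|}{|S|}\, H(C \mid a = v) \;\le\; H(C).
\]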
Performance Figures
• 99% reduction in communication bandwidth
• Out of 1000 SNPs, only ~12 were reported to higher levels of the hierarchy
• The percentage declines with hierarchy level
More Performance Figures
• Larger datasets require lower bandwidth
• Outlier noise is not a big issue
• White noise is handled even better
Future Work
• Text mining
• Incremental algorithm
• Accommodation of failure
• Testing on a real grid system
• Is this a general framework?