1 / 6

DISC-Finder: A distributed algorithm for identifying galaxy clusters

DISC-Finder: A distributed algorithm for identifying galaxy clusters. Two galaxies are “friends” if they are close to each other. We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges. We need to identify its connected components.

ailsa
Télécharger la présentation

DISC-Finder: A distributed algorithm for identifying galaxy clusters

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DISC-Finder:A distributed algorithmfor identifying galaxy clusters

  2. Two galaxies are “friends” if they are close to each other • We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges • We need to identify its connected components Friends-of-Friends (FoF) technique: Identification of galaxy clusters Sequential algorithms • Exact: O((n∙ log n)1.5) • Approximate: O(n)

  3. Divide the space into “slightly overlapping” cubes Load balancing: • Randomly select a subset of galaxies • Apply the kd-tree construction to build a balanced partition for the subset • Use it for the full set of galaxies • Distributed computation: Apply a sequential FoF algorithm to find the clusters within each cube • Use any sequential FoF • Allocate different cores to cubes • Identify cross-cube edges and merge the respective clusters • Apply the union-find algorithm to thegalaxies in the cube overlaps Distributed procedure

  4. REDUCER MAPPER MAPPER MAPPER REDUCER REDUCER Distributed procedure localclusters galaxysets apply localsequential FoF divide the spaceinto cubes

  5. Advantages • Scalable: We can apply it to massive datasets and use all available cores • Black-box use of a sequential FoF:We can utilize any FoF algorithm • Hadoop friendly: We have mapped all main operations into the Hadoop framework, which has resulted in very compact code (800 lines)

  6. 14,800 mlngalaxies 500 mlngalaxies Scalability Time (min) 240 SCALES TO LARGE DATA SETS EFFECTIVELY USES AVAILBLE CORES 60 1000 mlngalaxies 15 4 8 16 32 64 256 128 Number of cores

More Related