I2.2 Large-Scale Information Network Processing Mid-Year Report

I2.2 Large-Scale Information Network Processing Mid-Year Report CharuAggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan (UCSB)

Task Setting • Indexing, Partitioning, and Distributed Processing • on Time-Varying Networks

INARC I2.2 Mid-Year Report Objectives Novel graph index model and advanced graph distributed computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science Key Technical Innovations: Dynamic graph indexing models and structures Scalable graph processing Graph partition overlapping and re-balancing theory Primary Members Xifeng Yan (UCSB), Ambuj Singh (UCSB), CharuAggarwal (IBM), Christos Faloutsos (CMU) Collaborative Members Z. Wen IBM/SCNARC, J. Bao RPI/IRC, J. Han UIUC/INARC, M. Srivatsa IBM/INARC, V. Kawadia BBN/IRC, S. Desai Army

I2.2: Large-Scale Information Network Processing • Key Objective: • Novel graph index model and advanced graph distributed computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science • Deliverables: • Q1: Data collection and cleaning for graph indexing and distributed graph computing • Q2: Design graph indices with time-varying concerns • Q3: Design and test distributed graph computing strategies • Q4: Hypotheses validation and research paper submission • Impact: • Provide fast, scalable, and linked information access to soldiers and commanders • Key Technical Innovations • Dynamic graph indexing models and structures to resolve graph queries in time-varying information networks • Query cost models for distributed graph processing • Graph partition overlapping and re-balancing theory to (1) improve locality of data for parallel computing, and (2) accommodate dynamic network data updates and query workload changes • Self evolving distributed graph processing environment to adjust graph partitions dynamically

Large-Scale Information Network Processing: Inventscalable information network infrastructure Facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks The models and the proposed experimental systems provide fundamental analysis of How indexing of dynamic network data affects query performance, How graph partitioning schemes affect distributed query processing, How the models and laws of real networks affect the design of graph indexing and partitioning strategy Advance State-of-the-Art Network Science 5 5

Subtask 1: Graph Index and Search (UCSB, IBM) Fast access and processing of time-varying information networks is the key for tasks such as intelligence service and query processing. Simply speaking, we cannot access networks nodes by nodes! Subtask 2: Graph over MapReduce (CMU) To process overwhelming amount of data on the Web, social networks, emails, telecommunications, to distill important information such as people’s opinion about extremists, to find potential radical groups, to identify influential nodes, we need powerful graph processing methods. Needed by any large-scale network data processing including information, social and communication networks Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU) Military information is often distributed in many devices, distributed graph processing run graph algorithms without putting all data together in the same machine Military Relevance 6 6

Subtask 1: Graph Index and Search • Indexing Methods for Large Scale Static and Dynamic Networks • Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC)) • Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB)) • Dynamic structural index for label-based queries (Aggarwal (IBM) and Li (UCSB)): SDM 2011 accepted. • Analysis of significant substructures in time-varying networks (Singh (UCSB) et al.) • Find highest scoring substructures combines structure and time

gDensity: Model-Based Indexing • Problem definition (labeled proximity search) • Label-based graph proximity search, seeks to find the top-k vertex subsets with the smallest diameters, for a given query of distinct labels. Each subset must cover all the labels specified in the query. Q=(a, b, c) Q=(“reconnaissance”, “biometric matching”, “failure modeling”) d=3 d=2 • Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

gDensity: Ideas and Results • Can we do better? Which one is more promising? u’s density distribution v’s density distribution • 10 – 300 times faster • Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Graph Search: a Model-Based Approach • Ideas • Use information propagation model to propagate labels in information networks • Convert vertices to vectors • Align sets of vectors (a) linkedin (b) facebook • Query Speed: 0.1 sec for WebGraph:10M vertices, 213M edges • Information Propagation Model A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11 • Align two networks

Search Algorithm • Step 1: Match a node u of target graph G with some node v of query graph Q, if L(v) ⊆ L(u) and cost(u,v) is less than a predefined cost threshold ε. • Step 2: Discard the labels of the unmatched nodes in the target graph. • Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further. u1 u2 v1 f v3 v2 u3 v4 u4 u5 u6 Q G

Dynamic Updates Dynamic Update in Index vs. Re-indexing (DBLP) • Indexing is performed for h=2 hops.

Investigate graph properties and graph algorithms using MapReduce Spectral Analysis of Billion-Scale Graphs Patterns on the Connected Components of Terabyte-Scale Graphs Study the limitation of the MapReduce architecture on processing network-centric data Using the discovered patterns of terabyte-scale real-life graphs. Subtask 2: Graph Over MapReduce 13 13

Spectral Analysis of Billion-Scale Graphs • Billion-Scale Eigen-solver • Computes top-k eigen-values and eigenvectors • Find anomalies in large graphs. • Many application: SVD, triangle counting, spectral clustering, … • A careful implementation of Lanczos on hadoop can give excellent accuracy as well as scalability • Contribution: • HEigen: a billion-scale eigensolver which can handle 1000x larger matrices than previous methods • Application of the eigensolver on the twitter graph helps us spot abnormal users (adult advertisers) U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

Patterns on the Connected Components of Terabyte-Scale Graphs • A large graph is composed of many connected components • Q1: static patterns? • Q2: evolution patterns? • Q3: model? Count Metric: Graph Fractal Dimension(G): log |E| / log |V| Size YahooWeb graph |V| = 1.4 billion |E| = 6.7 billion 120 GBytes U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010

Subtask 3: Graph Partition for Distributed Graph Computing • Are typical techniques efficient for graph queries? • Graph partitioning and distribution techniques (e.g., Pregel) Limitations: • Unavailable to the public • Unbalanced workload due to skewed uniformly distributed graph queries. • Communication overhead due to inter‐machine (cross partition) communication. • Sedge: distributed graph processing • Model-based Graph Partitioning Techniques • First-of-Its Kind Distributed Graph Computing Platform for Information, Social, and Communication Networks • Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing • submitted to VLDB 2011

Graph Partition Models • Dynamic Workload: Replicate Partitions - Replicate partitions that are intensively accessed by many queries • Complementary Partitions • - Generate partitions sets that are complementary to each other • Dynamic Workload: New Partitions • - Generate new partitions that are intensively accessed by many cross-partition queries • Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing • submitted to VLDB 2011

Graph partitioning with region constraint • Optimal Solution: • Where: • NP-hard

Global Optimization • Before each iteration, increase the weight of edges in each region wrt. its priority … • Iteratively repartition the graph …

Graph Partition for Distributed Graph Computing • 10,000 random queries. Increase partition number by adding more machines. # of Machines vs. Throughput Improvement Ratio

Collaborations and Path Ahead • Collaborations within I2 • Monthly meeting • Strong connection between I2.1 and I2.2: One problem, two sides. information network processing on DTN and Clusters • (I2.1) Work with ArunIyengar and MudahakarSrivatsa (IBM), who has done much work on DTN and Storage. Shengqi Yang will intern at IBM this summer. • Collaborations with researchers in other networks • (S1.1) Work with Zhen Wen (IBM), on the social network application of graph density indexing. U Kang was a summer intern at IBM • (E1.1, R2.3) Work with JieBao (RPI), on RDF queries using neighborhood-based graph search. • (T2.3) Work with VikasKawadia (BBN), on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas • Graph search has connection with (T2.4) M. Goldberg’s work on trust structure. • Work with Sachi Desai (Army) on graph query language/system. 21

Next Six Months and Path Ahead to 2012 • Continue research on large-scale information network processing (more specific) • Graph indexing on multiple time-varying graph snapshots • Compression-based, Model-based Info Network Processing (3) Edge lay-out on Hadoop file system for better compression and better performance (4) Complementary graph partitioning theories. • Other research topics planned • Models and methods for building complex graph queries • Models and methods for routing complex graph queries to data sources (for both I2.1 and I2.2) • Tensor analysis on Hadoop 22

Research Papers (Accepted/Published) A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011. Nicholas D Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over Large Domains", EDBT'11 C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011 U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the Connected Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia. U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11 U Kang, DuenHorngChau, and Christos Faloutsos. Mining Large Graphs: Algorithms, Inference, and Discoveries. IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.

Research Papers Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011 Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal PetkoBogdanov, MisaelMogiovi, Ambuj Singh, Mining Heavy-Edges Subnetworks in Time, to be submitted to VLDB Journal C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk Resident Graphs, Research Report, to be submitted to VLDB Journal C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal

Big Picture Stage 1: How to distribute graphs (we are here) Stage 2: How to construct queries Stage 3: How to execute/route queries Make Information Network Accessible by Soldiers and Commanders

Questions?

I2.2 Large-Scale Information Network Processing Mid-Year Report

I2.2 Large-Scale Information Network Processing Mid-Year Report

Presentation Transcript

Mid-year Status Report

Complete Network Analysis Network Connections: Large-Scale network structure

Large-scale Processing with MapReduce

Large-Scale Data Processing with MapReduce

Large-scale cGPS processing and prototyping solutions

LARGE SCALE

Large-scale Data Processing Challenges

Mid-year Report

Principal’s Mid-Year Report

ISSA Mid-Year Report

A Large-Scale Network Testbed

Large scale data processing

Data Retreat – ELA Mid-Year Report

Principal’s Mid-Year Report

The Large Scale Clinical Trial Network

Large – Scale Sensor network

Large scale

Service Deployment in Large Scale Network

Large Scale Data Processing with DryadLINQ

Large Scale Network Growth Techniques

MID-YEAR EVALUATION REPORT OVERVIEW