
Universal Information Graphs via Hierarchical Graph Maps and Graph Fusion
James Abello, DyDAn – Rutgers University
Portions of this work have been done jointly with J. Crobak (Rutgers), R. Dementiev (U. Karlsruhe), I. Khan (Rutgers), H. Schulz (U. Rostock).
Partially supported by LLNL (in consultation with Scott Kohn).

Universal Information Graphs
Encode multiple relations among a set of entities. Usually, each entity has an associated set of possibly labeled attributes.
• Entities: people, organizations, places, events, documents, telecommunication activities, computer addresses, web page descriptors, images, videos, parts of speech, etc.
• Entities are associated when they co-occur in a “logical unit” of interest.
• Associated entity pairs are tagged with a vector of labels and with a weight vector that measures the strength of the associations.
• Patterns correspond to “special” subsets of entities and their interrelations.

Overall Goal
To efficiently uncover and classify patterns that can be used as triggers for preventive actions against potential threats to society.
How?
1. Design similarity measures among data entities via a “semantic” dot product. This amounts to quantifying, via some weighting mechanism, the set of attributes “shared” by a set of entities.
* The main questions are how to learn these weights and how to determine the levels of agreement that correspond to cluster/pattern formation in the data.
* Instead of some form of relational database, use vertex- and edge-weighted labeled multigraphs.
2. Since scale and interactivity are essential, we opt for Hierarchical Graph Map methods that are
a. I/O efficient, and
b. parameterized by the amount of RAM and screen real estate available to an analyst posing queries (semi-external algorithms).
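The “semantic” dot product of step 1 can be illustrated with a minimal sketch. It assumes that each entity carries a set of attribute strings and that each attribute has a fixed weight. The weights, entity names, and the functions `semantic_dot` and `similarity_edges` are hypothetical and chosen only for illustration; how the weights are learned is exactly the open question stated above.

```python
def semantic_dot(attrs_a, attrs_b, weights):
    """Sum the weights of the attributes shared by two entities.

    Attributes without an explicit weight count as 1.0 (an assumption).
    """
    shared = set(attrs_a) & set(attrs_b)
    return sum(weights.get(attr, 1.0) for attr in shared)


def similarity_edges(entities, weights, threshold):
    """Return weighted edges between entity pairs whose score reaches threshold.

    Note the quadratic loop over all pairs of entities.
    """
    names = sorted(entities)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = semantic_dot(entities[a], entities[b], weights)
            if score >= threshold:
                edges[(a, b)] = score
    return edges


# Hypothetical entities and attribute weights.
entities = {
    "e1": {"city:Newark", "org:Rutgers", "topic:graphs"},
    "e2": {"org:Rutgers", "topic:graphs"},
    "e3": {"city:Newark"},
}
weights = {"org:Rutgers": 2.0, "topic:graphs": 0.5, "city:Newark": 1.0}

edges = similarity_edges(entities, weights, threshold=1.0)
```

The threshold plays the role of the “level of agreement”: raising it keeps only the strongly associated pairs as edges.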

Take a graph. Create a hierarchical map of the graph. Abstract the graph at will.
What have we done?
A Client-Server System
• The server uses our C++ library hgv. It creates hierarchical maps of semi-external graphs (up to a billion edges on a commodity machine with 8 GB of RAM) and provides parameterized abstractions of the data.
• The client is used to interactively navigate the graph answers returned by the server. It builds on our previous work on graph visualization systems (GraphView).

Major Computations: Structural Groupings
• Connected components
• Peripheral trees
• Biconnected components
• Clusters based on topology and labeling information
• Graph abstractions (via sparse cuts)
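Two of the groupings listed above can be sketched in a few lines for a graph that fits in memory. The sketch assumes that “peripheral trees” are the tree-like fringes obtained by repeatedly deleting vertices of degree one; that reading, and the function names below, are assumptions for illustration and not the hgv API.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def connected_components(edges):
    """Return the connected components as sorted vertex lists (BFS)."""
    adj = build_adjacency(edges)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in sorted(adj[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components


def peel_peripheral_trees(edges):
    """Repeatedly strip degree-one vertices.

    Returns (set of peeled vertices, sorted list of remaining core edges).
    """
    adj = build_adjacency(edges)
    leaves = deque(v for v in adj if len(adj[v]) == 1)
    peeled = set()
    while leaves:
        leaf = leaves.popleft()
        if leaf in peeled or len(adj[leaf]) != 1:
            continue  # already removed, or no longer a leaf
        (parent,) = adj[leaf]
        peeled.add(leaf)
        adj[parent].discard(leaf)
        adj[leaf].clear()
        if len(adj[parent]) == 1:
            leaves.append(parent)  # the parent has become a leaf
    core = sorted((u, v) for u in adj for v in adj[u] if u < v)
    return peeled, core


# A triangle 1-2-3 with a path 3-4-5 hanging off it, plus a separate edge 6-7.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (6, 7)]
components = connected_components(edges)
peeled, core = peel_peripheral_trees(edges)
```

Peeling leaves only the triangle as the core, which is the part that still needs the more expensive groupings such as biconnected components.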

Server Side Hierarchy Creation
• INPUT: Weighted simple undirected graph G = (V, E).
• |E| is possibly larger than will fit in RAM.
• Apply an external memory weighted contraction algorithm (MST or Matching)¹.
• Find an antichain/slice/cut in the hierarchy that just fits in RAM by iteratively contracting edges.
Figures: Binary hierarchy on semi-external graph. Binary hierarchy on internal graph.
¹ J. Abello, “Hierarchical Graph Maps”, Computers & Graphics, 2004.
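The two steps above, building a binary hierarchy by weighted contraction and then cutting it at a level that fits a memory budget, can be sketched in memory as follows. This is an illustrative sketch only: it uses a Kruskal-style (MST) contraction with union-find rather than an external memory algorithm, it measures the budget simply as a number of hierarchy nodes, and the function names are hypothetical.

```python
def build_binary_hierarchy(num_vertices, weighted_edges):
    """Contract edges in increasing weight order (Kruskal style).

    Each accepted edge merges two clusters under a new internal node.
    Leaves are 0..num_vertices-1; internal nodes get increasing ids.
    Returns (children, roots), where children maps an internal node
    to its pair of child nodes.
    """
    parent = list(range(num_vertices))        # union-find forest
    cluster_node = list(range(num_vertices))  # hierarchy node of each cluster
    children = {}
    next_id = num_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge lies inside an already contracted cluster
        children[next_id] = (cluster_node[ru], cluster_node[rv])
        parent[rv] = ru
        cluster_node[ru] = next_id
        next_id += 1

    roots = sorted({cluster_node[find(v)] for v in range(num_vertices)})
    return children, roots


def antichain_within_budget(children, roots, budget):
    """Refine the hierarchy top-down while the cut has at most `budget` nodes.

    The most recently created (coarsest) internal node is split first.
    """
    cut = set(roots)
    while True:
        internal = [n for n in cut if n in children]
        if not internal or len(cut) + 1 > budget:
            return sorted(cut)
        node = max(internal)
        cut.remove(node)
        cut.update(children[node])


# Hypothetical input: (weight, u, v) triples on 5 vertices.
weighted_edges = [(1, 0, 1), (2, 2, 3), (3, 1, 2), (4, 3, 4), (5, 0, 4)]
children, roots = build_binary_hierarchy(5, weighted_edges)
cut = antichain_within_budget(children, roots, budget=3)
```

Every returned cut is an antichain of the hierarchy: no node in it is an ancestor of another, and together the nodes cover all leaves. A larger budget yields a finer cut.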

Visualization Client Interactivity
• Process each chunk of IH edges using the hierarchical map obtained via our topological grouping/clustering algorithms.
• Starting from the higher level groups/clusters, lay out each subgraph on demand.
• If we hit the leaf nodes of the current chunk, load the next chunk into memory, process it in the same way, and hook it to the hierarchy.

Visualization Samples Wordnet, Sample of General Queries, Terrorist Incidents, From an SQL query to its graph (via PubMed Descriptors).

Global Terrorist Data Base Picks
Terrorist Incidents in Country_Name perpetrated by Group_Name
• Sample 1: “Terrorist Incidents in Cyprus perpetrated by Lebanese”. Comment: The answer set is very focused.
Terrorist Incidents in <Country>
• Sample 2: “Terrorist Incidents in Bolivia”. Comment: The answer set is split into several connected pieces, each with a distinctive characteristic.

Global Terrorist Data Base Picks (cont)
• Sample 2: “Terrorist Incidents in Bolivia” (continued). Comment: Notice the connection with the pro-Palestinian group.

Global Terrorist Data Base Picks (cont)
Terrorist Incidents with <Attack_Type> and <Fatalities>
• “Terrorist Incidents with Kidnapping and Fatalities”. Comment: A more varied answer set. Notice the several connected pieces with distinctive characteristics.

Global Terrorist Data Base Picks (cont)
• “Terrorist Incidents with Kidnapping and Fatalities” (continued). Comment: A more varied answer set. Notice the several connected pieces with distinctive characteristics.

Kidnappings with fatalities (cont)

Graph Fusion (supported by LLNL)
Problem Statement: Given a collection of entities, each with an associated set of attributes, the corresponding Universal Graph presupposes that we can efficiently compute a notion of semantic similarity between every pair of entities (a quadratic computation). Graph Fusion is the reverse process: how much of the Universal Graph can we efficiently recover from its collection of projections onto selected subsets of attributes?
Typical scenario: an entity may have identities in email, call detail, web pages, blogs, instant messaging, chat rooms, etc. As such, it has its own graph neighborhood in each of these networks. The question is how to “approximate”, from these local neighborhoods, the neighborhood of this “persona” in the Universal Graph.
Our current approach deals with the restricted case in which there is a known Taxonomy for the set of entity attributes of the Universal Graph. We have developed some graph fusion mechanisms for this case and have applied them to the PubMed database (research supported by LLNL). These methods are based on Spanning Trees and Matchings.
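To make the typical scenario concrete, the sketch below fuses per-network neighborhoods into persona-level edges, each tagged with a weight vector indexed by network. It is a naive baseline for illustration, not the Spanning Tree and Matching based mechanisms mentioned above. It also assumes that the mapping from network-local identities to personas (`identity_of`) is already known, which is a strong assumption; all names and data are hypothetical.

```python
from collections import defaultdict


def fuse_neighborhoods(projections, identity_of):
    """Merge per-network edges into persona-level edges.

    projections: {network_name: {(local_id_a, local_id_b): weight}}
    identity_of: {local_id: persona}  (assumed known in this sketch)
    Returns {(persona_a, persona_b): {network_name: summed_weight}}.
    """
    fused = defaultdict(dict)
    for network, edges in projections.items():
        for (a, b), weight in edges.items():
            pa, pb = identity_of[a], identity_of[b]
            if pa == pb:
                continue  # both identities belong to the same persona
            key = tuple(sorted((pa, pb)))
            fused[key][network] = fused[key].get(network, 0.0) + weight
    return dict(fused)


# Hypothetical projections of the same personas onto two networks.
projections = {
    "email": {("alice@x", "bob@x"): 3.0},
    "phone": {("555-1", "555-2"): 1.0, ("555-1", "555-3"): 2.0},
}
identity_of = {
    "alice@x": "A", "bob@x": "B",
    "555-1": "A", "555-2": "B", "555-3": "C",
}

fused = fuse_neighborhoods(projections, identity_of)
```

In the fused result, persona A's neighborhood is the union of its email and phone neighborhoods, which is the simplest possible approximation of its neighborhood in the Universal Graph.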

PubMed Data Base Samples
KeyPhrases: ‘eyes’ OR ‘vitamin’

Conclusions
• Advantages
* With parameter tuning and “adequate” RAM resources, these types of systems remain interactive even when dealing with very large graphs, via graph maps derived from hierarchy trees. We are currently extending the approach to graph maps derived from hierarchy DAGs.
* The approach is applicable to almost any type of graph, regardless of density. Large vertex degree, usually considered just a nuisance, becomes a delicate issue.
* Parameterized interactivity allows it to run on less powerful systems.
* The structural grouping/clustering algorithms perform well (removal of subtrees helps a lot).
• Needed Improvements/Extensions
* Extend the system to directed multi-graphs (currently it handles undirected graphs).
* Add data streaming capabilities.

Conclusions (cont)
• Needed Improvements/Extensions (continued)
* Improved ways to effectively summarize the contents of a group/cluster. Currently, we use a hierarchical frequency-based group labeling algorithm.
* Ways to better “guide” users to potentially “interesting” pieces of the data.
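One possible reading of hierarchical frequency-based group labeling is sketched below: each group in the hierarchy is labeled with the most frequent terms among the leaves beneath it. This is an assumption made for illustration and may differ from the algorithm actually used in the system; the function name, the `top_k` parameter, and the sample terms are hypothetical.

```python
from collections import Counter


def label_groups(children, leaf_terms, root, top_k=2):
    """Label every node with the top_k most frequent terms below it.

    children:   {internal_node: (child, child, ...)}
    leaf_terms: {leaf_node: [term, ...]}
    Ties are broken alphabetically so the result is deterministic.
    """
    labels = {}

    def visit(node):
        if node in children:
            counts = Counter()
            for child in children[node]:
                counts.update(visit(child))
        else:
            counts = Counter(leaf_terms.get(node, ()))
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        labels[node] = [term for term, _ in ranked[:top_k]]
        return counts

    visit(root)
    return labels


# Hypothetical hierarchy: node 4 groups leaves 0 and 1; node 5 groups 4 and 2.
children = {4: (0, 1), 5: (4, 2)}
leaf_terms = {
    0: ["eye", "vitamin"],
    1: ["eye", "retina"],
    2: ["vitamin", "diet"],
}

labels = label_groups(children, leaf_terms, root=5)
```

Because term counts are accumulated bottom-up, each group's label summarizes everything below it, so coarser groups receive more general labels.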

Questions?
Contact Info
• James Abello, abello@dimacs.rutgers.edu
Related Publications
• “Semi-External Induced Subgraphs”, J. Abello and R. Dementiev, in preparation.
• “HGV: A C++ Library to compute Hierarchical Graph Views”, J. Abello and J. Crobak, in preparation.
• “Name That Cluster”, J. Abello, H. Schulz, B. Gaudin, C. Tominski, in Infovis 2006, IEEE, Sacramento, CA.
• “CVG: Coordinate Graph Visualizations”, J. Abello, C. Tominski, H. Schumann, in Infovis 2006, IEEE, Sacramento, CA.
