Mining Networks through Visual Analytics Incremental Hypothesis Building and Validation David Auber Romain Bourqui Guy Melançon CNRS LaBRI UMR 5800 & INRIA Futurs – GRAVITÉ Bordeaux, France
InfoVis CyberInfraStructure – Pajek • “A picture is worth a thousand words” • Chinese proverb (?)
Graph Viz Framework Tulip • “It’s all visual” • R. Feynman (Nobel prize in Physics)
Voronoï Treemaps • “The purpose of computing is insight, not numbers” • R. Hamming (1973)
Cushion Treemaps
“Visualization uses computer graphics to help provide insight on complicated problems, models or systems” • “Scientific visualization is exploring data and information graphically, gaining understanding and insights into the data” • R.A. Earnshaw (a pioneer in computer graphics, 1973) Munzner’s Hyperbolic Browser
Visualize? • Inselberg – « creator » of parallel coordinates • « Insight through images » • « Goal: Visual Model to Help our Intuition » • « Involves: Geometry, Cognition, Art ? »
Visual graph mining related to security issues • “Recognize” structural properties • Identify key actors • Identify their neighborhood • Community structure • Connectivity between communities • … “Chess players recognize patterns”
Example from NCTC data • Extracted about 8000 incidents from WITS • Identified terrorist groups when possible (directly or through AFP) • Identified countries where incidents took place • Added territorial information (continents, world regions) to help organize the overall map
Example from NCTC data • About 8000 incidents • 9419 nodes • 18486 edges • Layout is time consuming • Does not provide any clue about structure • Filter out incidents with no identified group
Example from NCTC data • Interactivity • « Play » with the network • Apply various metrics • Attribute-based node filtering • Tulip Graph Viz Framework • Open source • Plug-in architecture • www.tulip-software.org
Massive data • Information big bang: « How much information » project, University of California, Berkeley • In 2001, about 1 exabyte (1 million terabytes) of data was generated worldwide per year, 99.997% of it available only in digital form • In 2003, each individual produced about 800 megabytes per year
Massive data • 100 million FedEx transactions / day • 150 million VISA transactions / day • 300 million long distance calls / day over AT&T’s network • 35 billion e-mails / day worldwide • 600 billion IP packets / day over the DE-CIX backbone Keim, VIEW Workshop 2006
Visualization and Moore’s law Daniel Keim - Keynote Address, VIEW 2006
Visualization and Moore’s law • Issues that won’t be solved by hardware only • Design interaction together with visualization • Understand how and why visualization pays • Collaborate with other fields • Integrate visualization together with other technology NIH-NSF Visualization Research Challenges Report, 2006
Added value of visual and interactive mining • KDD Panel « The Perfect Data Mining Tool » [Ankerst 2002] • The human eye is an excellent tool for spotting natural patterns • Getting rid of the human in the loop? Wrong decision! • Increase human participation through visualization in the data exploration and knowledge discovery processes
« Sense making loop » J. Thomas – Visual Analytics Initiative
« Visualization mantras » • Visual Information Seeking Mantra • Overview, Zoom-in / Filter, and Details on Demand (Shneiderman, 1996) • Visual Analytics Mantra • Analyse first, Show the Important, Zoom, filter and analyse, Details on demand (Keim 2006)
Visualization “pipeline” • A designer’s view on the visualization process
Visualize? Protein interaction network (yeast); Barabási 2000
Organize data prior to visualization • Layer or hierarchize data based on: • node/edge metrics (eigenvalues, centralities, …) • topological feature detection • Use relevant drawing methods • Combine with interaction
Case study: ITA 2000 passenger air traffic • Cities connect through direct flights • Edge weights: number of passengers • Questions: • Can the carriers’ motivations be read from the organization of the network? • Territorial logic? • Political? Economic?
TopoLayout – (Topological) Feature-based Hierarchization • Search the graph for components of growing complexity • Subtrees • Biconnected components (« blocks ») • Grid-like • « Clusters »
TopoLayout – (Topological) Feature-based Hierarchization • Search the graph for components of growing complexity • Subtrees • Biconnected components • Grid-like • « Clusters » • Need to identify articulation points (“pivots”) • The graph builds into a “tree of biconnected components”
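A minimal sketch of the articulation-point step, not TopoLayout's actual code: it uses the classic depth-first search with low-link values to find the articulation points ("pivots") and the biconnected components, then links each block to the pivots it contains to obtain the "tree of biconnected components". The function names `biconnected_components` and `block_cut_tree` are hypothetical, the graph is assumed to be undirected and simple, and the recursive search is only suitable for small examples.

```python
from collections import defaultdict


def biconnected_components(edges):
    """Return (blocks, cut_nodes) for an undirected simple graph.

    blocks is a list of node sets, cut_nodes is the set of articulation points.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    index = {}        # discovery order of each node in the DFS
    low = {}          # lowest discovery index reachable from the node's subtree
    edge_stack = []   # edges of the blocks currently being explored
    blocks = []
    cut_nodes = set()
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        children = 0
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt not in index:
                children += 1
                edge_stack.append((node, nxt))
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] >= index[node]:
                    # node separates the subtree rooted at nxt from the rest;
                    # the DFS root is a pivot only if it has several children
                    if parent is not None or children > 1:
                        cut_nodes.add(node)
                    block = set()
                    while True:
                        a, b = edge_stack.pop()
                        block.update((a, b))
                        if (a, b) == (node, nxt):
                            break
                    blocks.append(block)
            elif index[nxt] < index[node]:
                # back edge to an ancestor
                edge_stack.append((node, nxt))
                low[node] = min(low[node], index[nxt])

    for start in list(adj):
        if start not in index:
            dfs(start, None)
    return blocks, cut_nodes


def block_cut_tree(blocks, cut_nodes):
    """Link each block (by position in blocks) to the pivots it contains."""
    return [(i, node) for i, block in enumerate(blocks) for node in block & cut_nodes]


# Two triangles sharing node "c", plus a pendant edge e-f.
example_edges = [("a", "b"), ("b", "c"), ("c", "a"),
                 ("c", "d"), ("d", "e"), ("e", "c"), ("e", "f")]
example_blocks, example_cuts = biconnected_components(example_edges)
example_tree = block_cut_tree(example_blocks, example_cuts)
```

In this toy graph the pivots are "c" and "e", and the three blocks (two triangles and the pendant edge) hang off them in a tree.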
TopoLayout – (Topological) Feature-based Hierarchization • Search the graph for components of growing complexity • Subtrees • Biconnected components (« blocks ») • Grid-like (eigenvalues) • « Clusters »
TopoLayout • Components naturally organize as a hierarchy through the search process
TopoLayout + interaction: Grouse • Explore the graph by unfolding/folding the hierarchy • The user’s navigation triggers layout of components • Higher level graphs (quotient graphs) are built from metanodes • Improved readability / Fewer visual elements • Faster layout, based on topology of quotient graph
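As an illustration of how a quotient graph can be built from metanodes, here is a short sketch. It is not Grouse's implementation; `quotient_graph` and `cluster_of` are hypothetical names, and cluster labels are assumed to be mutually comparable (strings here). Each cluster collapses into one metanode, and edges between two clusters merge into a single meta-edge that keeps a count of the original edges.

```python
def quotient_graph(edges, cluster_of):
    """Collapse each cluster into a metanode.

    edges: iterable of (u, v) pairs of the original graph.
    cluster_of: dict mapping each node to its cluster label.
    Returns a dict mapping an ordered pair of cluster labels to the
    number of original edges running between those two clusters.
    """
    meta_edges = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # internal edge: hidden while the metanode stays folded
        key = (cu, cv) if cu <= cv else (cv, cu)
        meta_edges[key] = meta_edges.get(key, 0) + 1
    return meta_edges


# Six nodes in two clusters, connected by two cross edges.
demo_edges = [("a", "b"), ("b", "c"), ("c", "a"),
              ("d", "e"), ("e", "f"),
              ("c", "d"), ("a", "f")]
demo_clusters = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}
demo_quotient = quotient_graph(demo_edges, demo_clusters)
```

The quotient graph here has two metanodes and one meta-edge, which is why laying it out is cheaper than laying out the seven original edges.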
TopoLayout + interaction: Grouse • Multilevel hierarchy: recursive grouping of metanodes
TopoLayout + interaction: Grouse • Multilevel Hierarchy for Abstraction: Cut
Multilevel navigation of small world networks • Small world networks: social networks, web graphs, transportation networks (ITA), … • Small world networks organize into several levels (hierarchy) [Adamic, Huberman] • Idea: capture the hierarchy and use it as a navigation paradigm
Small world networks • Centralities • Bottleneck passageways • The network organizes around those « pivot » nodes
Small world networks • Centralities • Betweenness centrality • Eigenvector centrality • Betweenness centrality has high computational cost (global) • Prefer a local index • Degree • Edge strength
Small world networks • Edge strength: proportion of cycles of length 3 and 4 that contain the edge (Jaccard 1912) (Tanimoto 1958) • Figure: edge e = (u, v) with the node sets Wuv, Mu = Nu\Nv and Mv = Nv\Nu • Auber et al. 2003, Radicchi et al. 2004
Small world networks • Edge strength • Costs linear time if degree is bounded, otherwise quadratic …
Small world networks • Edge strength • Cost still lower than most centralities (local versus global indices) • Incremental: local modifications of the graph require only local recomputation
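The following sketch covers only the length-3 (triangle) part of edge strength, written as a Jaccard-style ratio. It rests on assumptions that the slides do not spell out: Wuv is taken to be the set of common neighbours of u and v, Nu and Nv exclude the edge's own endpoints, and the length-4 cycle term is left out entirely. The function name `triangle_strength` is hypothetical.

```python
from collections import defaultdict


def triangle_strength(edges):
    """Triangle-based strength of every edge of an undirected simple graph.

    For e = (u, v): |W_uv| / (|M_u| + |W_uv| + |M_v|), where W_uv is assumed
    to be the common neighbours of u and v, M_u = N_u \\ N_v, M_v = N_v \\ N_u.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    strength = {}
    for u, v in edges:
        n_u = adj[u] - {v}   # neighbours of u, excluding v itself
        n_v = adj[v] - {u}   # neighbours of v, excluding u itself
        w_uv = n_u & n_v     # each common neighbour closes a triangle on e
        m_u = n_u - n_v
        m_v = n_v - n_u
        total = len(m_u) + len(w_uv) + len(m_v)
        strength[(u, v)] = len(w_uv) / total if total else 0.0
    return strength


# Two triangles joined by the bridge c-d.
bridge_edges = [("a", "b"), ("b", "c"), ("c", "a"),
                ("c", "d"),
                ("d", "e"), ("e", "f"), ("f", "d")]
bridge_strength = triangle_strength(bridge_edges)
```

Only the neighbourhoods of the two endpoints are read for each edge, which is what makes the index local: changing the graph near one edge leaves the strength of distant edges untouched.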
Community structure of small world networks • Filter out weak edges • Capture components • Infer quotient graph (metanodes) • Recurse over each component
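The first two steps above can be sketched as follows, assuming each edge already carries a strength value and a threshold has been chosen. `strong_components` is a hypothetical name, and this is an illustration rather than the authors' code. Edges below the threshold are dropped and the connected components of what remains are returned as candidate communities.

```python
def strong_components(nodes, weighted_edges, threshold):
    """Drop edges weaker than threshold, then return the connected components.

    nodes: iterable of node identifiers.
    weighted_edges: iterable of (u, v, strength) triples.
    """
    nodes = list(nodes)
    adj = {n: set() for n in nodes}
    for u, v, w in weighted_edges:
        if w >= threshold:          # keep only strong edges
            adj[u].add(v)
            adj[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # iterative depth-first traversal from an unvisited node
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Two tight triangles joined by one weak edge (strength 0.0).
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "a", 0.5),
             ("c", "d", 0.0),
             ("d", "e", 0.5), ("e", "f", 1.0), ("f", "d", 0.5)]
toy_communities = strong_components(toy_nodes, toy_edges, threshold=0.25)
```

Each returned component can become a metanode of the quotient graph, and the same function can be applied again inside a component with a higher threshold to recurse.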
Community structure of small world networks • Filter out weak edges: • Q. What threshold to choose? • A. Best possible one (!) • Use quality criteria • MQ (modularity quality)
“Quality” criteria MQ • C = (C1, C2, …, Cp) is a clustering of a graph G • [Figure: clusters C1, C2, …, Cp; the formula for MQ(C; G) appeared as an image]
MQ / Nice properties • MQ varies over a bounded interval [-1, 1] • MQ behaves like a Gaussian distribution
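Since the formula itself did not survive in this transcript, the sketch below uses one common form of MQ as an assumption, not as the definition used on the slide: the mean edge density inside clusters minus the mean edge density between pairs of clusters. Both terms lie in [0, 1], which matches the stated range [-1, 1]. The function name `mq` is hypothetical; the graph is assumed to be undirected and simple, and a singleton cluster is given internal density 0.

```python
from itertools import combinations


def mq(clusters, edges):
    """Assumed MQ: mean intra-cluster density minus mean inter-cluster density."""
    clusters = [set(c) for c in clusters]
    p = len(clusters)
    label = {node: i for i, c in enumerate(clusters) for node in c}

    intra = [0] * p   # number of edges inside each cluster
    inter = {}        # number of edges between each pair of clusters
    for u, v in edges:
        i, j = label[u], label[v]
        if i == j:
            intra[i] += 1
        else:
            key = (min(i, j), max(i, j))
            inter[key] = inter.get(key, 0) + 1

    def intra_density(i):
        n = len(clusters[i])
        pairs = n * (n - 1) // 2          # possible edges inside the cluster
        return intra[i] / pairs if pairs else 0.0

    cohesion = sum(intra_density(i) for i in range(p)) / p
    if p < 2:
        return cohesion                   # no pair of clusters to compare
    coupling = sum(
        inter.get((i, j), 0) / (len(clusters[i]) * len(clusters[j]))
        for i, j in combinations(range(p), 2)
    ) / (p * (p - 1) / 2)
    return cohesion - coupling


# Two triangles joined by the bridge c-d.
mq_edges = [("a", "b"), ("b", "c"), ("c", "a"),
            ("c", "d"),
            ("d", "e"), ("e", "f"), ("f", "d")]
good_score = mq([{"a", "b", "c"}, {"d", "e", "f"}], mq_edges)
bad_score = mq([{"a", "d"}, {"b", "e"}, {"c", "f"}], mq_edges)
```

Scoring the clustering obtained at each candidate threshold and keeping the highest score is one way to read "best possible one" for the weak-edge filter; here the two-triangle clustering scores higher than the scrambled one.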