Data Sciences at Sandia National Laboratories. September 2014. Steven Castillo, Ph.D., ISR Systems Engineering and Decision Support. Presentation to the University of Illinois at Urbana-Champaign. Approved for UUR: SAND2014-17938 PE
Motivation Slide. Mission Context • The analysis of data is a common point of success or failure. • We see exponential growth in the size, complexity and information density of data, far exceeding available human analysis capabilities. • This research challenge intersects most Sandia Mission Areas. Why Sandia? • Envision, create the future • Pathfinder, data-centric systems for high-consequence decision making: • Sandia delivers sensors into several mission areas. • Sandia has access to relevant data sets and analysts. • We live the full data science cycle at the cutting edge as part of executing our broad national security missions.
Contributors to current limitations: • Only a small fraction of the data is ever examined by analysts. Workflows are labor intensive and devoid of effective computational tools. • Systems do not exploit the relationship discovery potential of the data or identify meaningful, defensible trends and patterns.
[Figure: Analyst (2014) and Analyst (2017); labels: Increasing Value, Decreasing Volume] Sensors: • Increasing density • Increasing data rates • Increasing information density
Focus: High Consequence Decision Making Human Analyst Centric – Transforming Information into Actionable Intelligence • Discovery and Disambiguation • Patterns of Life: anomaly detection • Patterns of Life: predictive analytics • Robust Data Analysis: incomplete, vulnerable and uncertain data • Intelligent Data Collection: tasking of sensors, optimal data sets Crosscutting Capabilities: Fundamental mathematics and science transformed into advanced mission capability Sandia World Class S&T: Research Foundations Supporting National Security Missions: Nuclear Weapons, Defense & Intelligence
Research Portfolio – An Investment in Next Generation Analytics for the Nation • Portfolio focus on innovative R&D • Foster high-risk projects • Promote R&D impact on mission stakeholders • Current portfolio • PANTHER Grand Challenge – Pixels to Intelligence: Pattern Analytics • Counter Adversarial Data Analytics (internal research & development) • Adversarial Modeling (internal research & development) • DARPA-sponsored graph modeling • Customer-sponsored streaming/graph algorithms • Customer-funded cyber defense
Grand Challenges – Internal R&D • Laboratory Directed Research & Development (LDRD) • Sandia initiates 1-2 GC LDRDs/yr • Cross-laboratory, interdisciplinary • High risk, urgent needs, low TRL (technology readiness level) • Three-year effort • Create a Lasting National Technical Capability • Required to assemble a highly engaged external advisory board • Cooperative development with partners to shape technical vision
PANTHER Grand Challenge: Pattern Analytics to Support High Performance Exploitation and Reasoning: Crosscutting Innovation in Geospatial-Temporal Analysis. PANTHER R&D: Scalable, temporal and geospatial relationship discovery for robust, mission-relevant pattern analysis. How is PANTHER different? Develop new mathematical approaches to this problem with a deep understanding of human capabilities and the information landscape. • Challenge Problem Categories: • Signature Search – enable searches for signatures that are difficult to discretize and subject to interruptions in space and time. • Motion & Trajectory Analysis – discover patterns in motion datasets at multiple semantic scales under conditions of intermittent data.
Panther Team - Interdisciplinary PI: Kristina Czuchlewski, PhD PM: Bill Hart, PhD Team Leads: Jim Chow, Ph.D., Randy Brost, Ph.D., Laura McNamara, Ph.D. and David Stracuzzi, Ph.D.
R&D Challenge Questions • Signature Search: Where are chemical processing plants? Where are active businesses? Did someone arrive in a car and enter a building? Which one(s)? • Trajectory Analysis: Which aircraft flights are point-to-point? Are any aircraft flying search patterns? Are any aircraft flying search patterns over sensitive locations? • Temporal Patterns: Is there an activity surge? Is an aircraft flight pattern departing from normal? What might it do next? • Limited Data: What can we infer given limited data? • Uncertainty/Quality: For any of the above, what is the result confidence?
FY14 Highlights • Unsupervised computational methods for detecting outliers (with no a priori knowledge) within a TB-scale database in under 30 minutes. • New geometric feature vector enables comparisons. • Result: 44 seconds on our Netezza DB machine to compute a trajectory segmentation from points. • Result: 20 minutes of wall-clock time to turn 1.2 billion points into 15 million trajectories. • Geospatial feature extraction and classification for signature relationship and temporal trending searches. • Pixel statistics extracted via new superpixel segmentation algorithm. • Unique temporal attributes of coherent changes exploited for labeling.
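The "points into trajectories" step can be illustrated with a small sketch. This is not the PANTHER or Netezza implementation, which the slides do not describe; the record layout `(object_id, t, x, y)`, the function name `segment_trajectories`, and the `max_gap` rule (start a new trajectory after a long time gap) are all assumptions made for illustration.

```python
from collections import defaultdict

def segment_trajectories(points, max_gap=600.0):
    """Group (object_id, t, x, y) points into trajectories.

    Assumed rule: a new trajectory starts whenever two consecutive
    points of the same object are more than max_gap seconds apart.
    """
    by_object = defaultdict(list)
    for obj, t, x, y in points:
        by_object[obj].append((t, x, y))

    trajectories = []
    for obj, samples in by_object.items():
        samples.sort()  # order each object's points by time
        current = [samples[0]]
        for prev, cur in zip(samples, samples[1:]):
            if cur[0] - prev[0] > max_gap:
                # Gap too long: close the current trajectory.
                trajectories.append((obj, current))
                current = []
            current.append(cur)
        trajectories.append((obj, current))
    return trajectories

# Hypothetical input: object "A" has a long gap, so it yields two trajectories.
example_points = [("A", 0, 0.0, 0.0), ("A", 60, 1.0, 0.0),
                  ("A", 5000, 5.0, 5.0), ("B", 10, 0.0, 0.0)]
example_trajectories = segment_trajectories(example_points)
```

At the scale quoted on the slide this grouping would run inside the database or a distributed system; the sketch only shows the logic on an in-memory list.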
FY14 Highlights, cont. • New, efficient graph algorithm formulation developed to enable geospatial-temporal topological search complexity. • Multi-source data search under one framework • Representation and search under heterogeneous temporal conditions (ephemeral and activity). • Complex sensor feature data “compressed” for analysis with pointers back to original sensor source. • Intermittency/interruption nodes implemented. • Eye-tracking experiments are enabling visual cognition/search models and eventual user interface design(s). • Pattern match quality, statistical and probabilistic approaches are under investigation for characterizing uncertainty in geospatial-temporal graph representations. [Figure labels: Shadow, Bright Roof]
GeoSpatial Semantic Graph Representations of Features & Activity • Graph includes activity: • At t=1, the graph includes objects with location • From t=1 to t=2, the graph encodes change • Nodes for activity events. • Node attributes include time observed. • No persistence expected. • Spatial and temporal relationship edges.
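A minimal sketch of such a graph, assuming a simple in-memory representation: the class names `Node` and `GeoTemporalGraph`, the `kind` field, and the relation names (`adjacent`, `near`, `before`) are hypothetical and are not taken from the PANTHER code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    label: str            # e.g. "building" or "vehicle_arrival"
    kind: str             # "object" (durable) or "activity" (ephemeral event)
    time_observed: float  # time at which the feature or event was observed

@dataclass
class GeoTemporalGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation) triples

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

    def neighbors(self, node_id, relation=None):
        """Ids of nodes linked to node_id, optionally filtered by relation."""
        out = []
        for src, dst, rel in self.edges:
            if relation is not None and rel != relation:
                continue
            if src == node_id:
                out.append(dst)
            elif dst == node_id:
                out.append(src)
        return out

g = GeoTemporalGraph()
g.add_node(Node("b1", "building", "object", 1.0))
g.add_node(Node("r1", "road", "object", 1.0))
# Activity events observed between t=1 and t=2 become their own nodes.
g.add_node(Node("a1", "vehicle_arrival", "activity", 2.0))
g.add_node(Node("a2", "person_enters", "activity", 2.5))
g.add_edge("b1", "r1", "adjacent")  # spatial relationship
g.add_edge("a1", "b1", "near")      # spatial relationship
g.add_edge("a1", "a2", "before")    # temporal relationship
```

Only the change between time slices is added as new nodes and edges; the durable objects from t=1 are not duplicated.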
Signature Search • A signature (over space/time) encodes a desired question. • For example, “Where are buildings with nearby grass, pavement, and dirt?” • Graph template: [Figure: template graph with nodes labeled Grass, Building, Pavement, Dirt]
Matches • Graph search finds all matches to signature template. • In this case all red nodes with adjacent green, grey, and tan nodes. • New approach to graph search, polynomial time complexity • Searches are saved
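The matching idea (find every building node whose neighbors include grass, pavement, and dirt) can be sketched as a direct scan over a labeled graph. This is a naive illustration under assumed data structures (`labels` and `adjacency` dictionaries, and the function name `find_star_matches`); it is not the polynomial-time algorithm the slide refers to, whose details are not given here.

```python
def find_star_matches(labels, adjacency, center_label, required_neighbors):
    """Return ids of nodes labeled center_label whose direct neighbors
    include every label in required_neighbors."""
    required = set(required_neighbors)
    matches = []
    for node, label in labels.items():
        if label != center_label:
            continue
        neighbor_labels = {labels[n] for n in adjacency.get(node, ())}
        if required <= neighbor_labels:  # all required labels are present
            matches.append(node)
    return matches

# Hypothetical scene: b1 touches all three surfaces, b2 lacks dirt.
labels = {"b1": "building", "b2": "building", "g1": "grass",
          "g2": "grass", "p1": "pavement", "d1": "dirt"}
adjacency = {"b1": ["g1", "p1", "d1"], "b2": ["g2", "p1"],
             "g1": ["b1"], "g2": ["b2"], "p1": ["b1", "b2"], "d1": ["b1"]}

matches = find_star_matches(labels, adjacency, "building",
                            ["grass", "pavement", "dirt"])
```

With the scene above, only `b1` satisfies the template. Saving a search then amounts to storing the `(center_label, required_neighbors)` pair for reuse.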
Advantages • Efficient representations in time (only store change). • Relationship, change, and temporal analysis over multiple times and heterogeneous spatial ensembles in the same query. • Change detection • Activity characterization • Efficient search algorithms – computation not limited by brittle relational databases. • Potential to take full advantage of graph topology search advances, enhanced by geospatial-temporal semantics. • Feature-based analysis • Multi-modality, in a single search representation. • Sensor agnostic – PANTHER emphasizes SAR.
Extracting Statistical Distributions Using Superpixel Segmentation [Figure: Superpixel Segmentation of SAR Image] • Superpixel Segmentation • Divides the image into compact regions of pixels that are close in position and similar in intensity • Derives from a large body of research in the optical image processing community, extended to SAR imagery • PANTHER Approach • Superpixel segmentation algorithms enable high-quality segmentations and efficient execution • De-noising advances enable the use of novel image processing techniques.
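Once an image has been divided into superpixels, per-region statistics can be gathered by grouping pixel values by segment id. The sketch below assumes the segmentation already exists as a 2-D grid of integer ids (`segment_ids`) aligned with the intensity grid; it does not implement the segmentation algorithm itself, and the function name `superpixel_statistics` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

def superpixel_statistics(intensity, segment_ids):
    """Return {segment_id: (pixel_count, mean, variance)} for a 2-D image.

    intensity and segment_ids are equally sized lists of rows; each pixel
    contributes its value to the superpixel it belongs to.
    """
    groups = defaultdict(list)
    for row_values, row_ids in zip(intensity, segment_ids):
        for value, seg in zip(row_values, row_ids):
            groups[seg].append(value)
    return {seg: (len(vals), mean(vals), pvariance(vals))
            for seg, vals in groups.items()}

# Hypothetical 2x3 image with two superpixels (ids 0 and 1).
image = [[1.0, 1.0, 6.0],
         [1.0, 2.0, 4.0]]
segments = [[0, 0, 1],
            [0, 1, 1]]
stats = superpixel_statistics(image, segments)
```

Each superpixel is then described by a few numbers rather than by its raw pixels, which is the kind of compact feature a later classification or search step can consume.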
Initial Automated Static Feature Extraction Result • Exploit ~21 days' worth of change via coherence • Result: automatically extract paved roads
Query algorithms must handle disconnections & interruptions. Did someone arrive by car and visit a building? Data semantics: Building (B), Fence (F), Dirt Road (DR), Gravel (G), Desert (D), Shadow (S), Low Vegetation (LV), Human Tracks (HT), Vehicle Tracks (VT), Vehicle (V). [Figure: Static and ephemeral features for time slice A*] Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity.
Example: Connected Signature. [Figure: Result from star-graph algorithm*; four matches marked “Correct”, one site annotated “Why wasn’t this found?”] Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity.
Result of Disconnected Signature Search. [Figure: Result from star-graph algorithm*; five matches marked “Correct”] Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity.
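One way to picture the difference between a connected and a disconnected signature is to let a required feature be reachable through an intervening node (a shadow, for example) rather than only by direct adjacency. The sketch below is an assumption-laden illustration of that idea using a bounded breadth-first search; the `max_hops` parameter, the function names, and the example scene are hypothetical, and the slides do not state how the star-graph algorithm actually handles interruptions.

```python
from collections import deque

def labels_within(adjacency, labels, start, max_hops):
    """Labels of all nodes reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found.add(labels[nxt])
                frontier.append((nxt, depth + 1))
    return found

def find_matches(adjacency, labels, center_label, required, max_hops=1):
    """Nodes labeled center_label with every required label within max_hops."""
    required = set(required)
    return [node for node, label in labels.items()
            if label == center_label
            and required <= labels_within(adjacency, labels, node, max_hops)]

# Hypothetical scene: at B2 a shadow (S1) interrupts the human tracks.
labels = {"B1": "building", "HT1": "human_tracks", "V1": "vehicle",
          "B2": "building", "S1": "shadow", "HT2": "human_tracks",
          "V2": "vehicle"}
adjacency = {"B1": ["HT1", "V1"], "HT1": ["B1"], "V1": ["B1"],
             "B2": ["S1", "V2"], "S1": ["B2", "HT2"], "HT2": ["S1"],
             "V2": ["B2"]}

connected = find_matches(adjacency, labels, "building",
                         ["human_tracks", "vehicle"], max_hops=1)
disconnected = find_matches(adjacency, labels, "building",
                            ["human_tracks", "vehicle"], max_hops=2)
```

With direct adjacency only, `B2` is missed because the shadow sits between the building and the tracks; allowing one intervening node recovers it.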
Diversity of Problems • Site Activity Analysis • Tank Complex Search • Power Plant Search • Activity Analysis (Interrupted Signature) • Construction Analysis. All of these were solved by the same code.
Discovery of Flight Patterns • Collections of geometric descriptions can describe a trajectory. • Extensions: impact of time. [Figure panels: Holding Pattern, Mapping, Avoid, Forgot Something]
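As an illustration of how geometric descriptions can characterize a trajectory, the sketch below computes two simple descriptors: straightness (end-to-end distance divided by path length) and total absolute turning. These two features are assumptions chosen for clarity; the slides do not list the features PANTHER actually uses.

```python
import math

def geometric_features(path):
    """path: list of (x, y) points. Returns (straightness, total_turning).

    straightness is 1.0 for a straight line and near 0.0 for a path that
    returns to its start; total_turning is the sum of absolute heading
    changes in radians.
    """
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    end_to_end = math.dist(path[0], path[-1])
    straightness = end_to_end / length if length > 0 else 1.0

    turning = 0.0
    for a, b, c in zip(path, path[1:], path[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        # Wrap the heading change into [-pi, pi) before accumulating.
        delta = (h2 - h1 + math.pi) % (2 * math.pi) - math.pi
        turning += abs(delta)
    return straightness, turning

point_to_point = geometric_features([(0, 0), (1, 0), (2, 0)])
out_and_back = geometric_features([(0, 0), (2, 0), (0, 0)])
```

A point-to-point flight scores high on straightness with little turning, while an out-and-back flight such as the "Forgot Something" example would have straightness near zero and a turn of about pi radians.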
Big Feature Space Advantage. Forgot Something, Revisited • Did not specify “find this,” only told the routine to “make groups of similar flights.” • This was one of many clusters that had distinctive shapes. Clustering, leading to unsupervised learning techniques • Previous examples showed searching the space for a specific pattern or a specific volume of the feature space. • But with clustering, the computer can group the different patterns in the feature space without knowing a priori what they are. • Perhaps most importantly, many clustering algorithms specifically identify outliers in the feature space that correspond to odd behaviors.
Discovery of Odd Flights • Note: we have ~5M points/day, ~1GB/day, currently >300GB • Shown: approximately 700 out of a total of 50,000 flights from one day • Clustering done based on geometric features • Many clusters found, but what remains is…
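The slides do not name the clustering algorithm used. As a stand-in, the sketch below uses a small DBSCAN-style density clustering, which is one well-known method that labels outliers explicitly; the parameter values and the 2-D feature vectors are made up for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for an outlier.

    Simple O(n^2) version for illustration: a point is a core point if at
    least min_pts points (itself included) lie within distance eps.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels

# Hypothetical per-flight feature vectors: two groups and one odd flight.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (10.0, 0.0)]
flight_labels = dbscan(features, eps=0.5, min_pts=3)
odd_flights = [i for i, lab in enumerate(flight_labels) if lab == -1]
```

Flights that fall into no cluster (label -1) are the candidates for "what remains" after the common patterns have been grouped. For data volumes like those quoted above, a spatial index or a distributed implementation would be needed in place of the quadratic neighbor search.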
Better knowledge & deeper insights from Big Data in minutes, not months; over months, not hours or minutes; covering hundreds, not tens, of km²
Technical Areas for Collaboration • Graph Analytics • Tensor Analysis • Computational Geometry • Digital Signal and Image Processing • Machine Learning • Scalable and Highly Distributed Architectures • Human Performance and Cognitive Understanding • Machine Feature Identification in Imagery (EO, SAR, LIDAR) • Uncertainty Quantification and Propagation – from Sensor to Answer • Robust Data Analysis • Intelligent Data Collection
University Collaborations • PhD Interns: Colorado State University, Utah State University • Faculty have clearances and clear expertise in critical areas • Students can gain clearances • Year-round appointments give flexibility for Sandia work and university requirements • Win-win-win: Sandia gains valuable technical support in critical challenge areas; the student gets a degree, publications, and potential future employment; the university and professor fulfill their mission and gain a strong supporter. • Critical Skills Master’s Program (CSMP): University of Illinois • Similar to the old One-Year-on-Campus Program • NDAs: USU, CSU, UIUC (in progress)