An Overview of Data Analytics at DIMACS and DyDAn



Presentation Transcript


  1. An Overview of Data Analytics at DIMACS and DyDAn • Paul Kantor • Fred Roberts

  2. What is DIMACS? • DIMACS was formed as an NSF Science and Technology Center in 1989 to foster research & education programs at the interface between discrete math and theoretical computer science • Built around multi-year themes called “special foci” • Hosts related workshops and education programs and leads related research projects • Primary areas of research: discrete math & theoretical CS and their applications; interfaces between mathematics and biology; homeland security • DIMACS has both industry and academic institutional partners and nearly 300 affiliated scientists • Many of the world’s leaders in discrete mathematics and theoretical computer science and their applications • Statisticians, biologists, psychologists, chemists, epidemiologists, and engineers • None are paid by DIMACS, but they join in DIMACS projects

  3. A Selection of DIMACS Projects • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy

  4. What is DyDAn? • DyDAn is a DHS Center of Excellence for research on advanced methods for information analysis • Established as one of four centers for research in “discrete sciences” • DyDAn serves as coordinator of the 4 centers, based at Rutgers, University of Illinois, USC, and University of Pittsburgh • DyDAn is based at Rutgers and has 5 university and 2 industry partners • 40+ researchers in fields of mathematics, computer science, statistics, operations research, engineering and biology • DyDAn is based at DIMACS • DyDAn is developing novel technologies to find patterns & relationships in dynamic, nonstationary, massive datasets

  5. DyDAn Researchers Work On: • Counter-terrorism • Intelligence analysis • Disease surveillance (natural/man-caused) • Customs and border protection • Law enforcement • Data management in emergency situations • Nuclear detection/sensors • Image, audio, and text analysis • [Images: avian flu; containers for inspection] • We hope to make DyDAn an informatics resource for the homeland security enterprise

  6. Project: Sensor Management for Nuclear Detection • Our approach: Develop mathematical models specific to nuclear detection • Detection contexts: borders & seaports; special events; moving sources • Key problems: sensor diagnostics (detect malfunctioning sensors; use archival data for diagnostics); locating sensors; interpreting real-time data (manage false alarms & risk); tracking moving radiation sources • Leverage existing methods: Bayes classifiers; approx. dynamic programming; sensor location in other domains; data sampling; sequential diagnostics; image analysis

  7. Project: Universal Information Graphs • A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, … • Problem: Coordinate information from multiple sources to identify “interesting” collaborative information networks • Model each data source as a large multigraph, but there will be too much information to actually fuse all these multigraphs into one... • We want to “virtually fuse” these disparate multigraphs: • Develop computationally efficient node rank functions (as in Web search ranking) • Develop linkage metrics between nodes to understand patterns of communication • Approximate linkage metrics with limited time and space resources • Hierarchy Tree tools developed by team members offer a uniform method for large data exploration; they are particularly well suited to external-memory graphs • I/O and screen bottlenecks are handled uniformly • Hierarchical slices allow the incorporation of different data types
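
The "node rank functions (as in Web search ranking)" bullet can be illustrated with a minimal sketch. It assumes a PageRank-style power iteration in which parallel edges of a multigraph act as integer weights; the function name, the damping value, and the toy call records are illustrative assumptions, not the project's actual method.

```python
from collections import Counter, defaultdict

def multigraph_pagerank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration; parallel edges act as integer weights."""
    weights = Counter(edges)  # (src, dst) -> edge multiplicity
    nodes = sorted({n for edge in weights for n in edge})
    out_weight = defaultdict(int)
    for (src, _dst), w in weights.items():
        out_weight[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

# Hypothetical call records: ("a", "b") appears twice, so it counts double.
calls = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = multigraph_pagerank(calls)
top_node = max(ranks, key=ranks.get)
```

Because each source is kept as its own edge list, a function like this can be run per source and the resulting rank tables compared or combined, which is one possible reading of "virtual" fusion rather than a description of the team's tools.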

  8. Project: Monitoring Message Streams • Algorithmic Methods for Automatic Processing of Messages • Monitor huge communication streams, in particular, streams of textualized communication to automatically detect pattern changes and "significant" events • Components of automatic message processing: 1) text compression; 2) text representation; 3) matching scheme; 4) learning method; 5) fusion scheme • Project Premise: Existing methods don’t exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data • Our approach is to develop/explore methods for each component and then to combine them • In the first phase of the project, we did over 5000 complete experiments with different combinations of methods • Nearest neighbor • Bayesian methods - the Bayesian Regression software we developed constitutes the most efficient software in the world for ultra-high dimensional Bayesian logistic regression.
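
As a rough illustration of what Bayesian logistic regression computes, the sketch below finds a maximum a posteriori (MAP) weight vector by plain gradient ascent. The zero-mean Gaussian prior, the optimizer, the function names, and the toy term-presence data are all assumptions made for the example; this is not the project's software and makes no attempt at ultra-high-dimensional efficiency.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_map_logistic(X, y, prior_variance=1.0, learning_rate=0.1, epochs=500):
    """MAP weights for logistic regression with a zero-mean Gaussian prior."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        # Gradient of the log posterior: prior term first, then likelihood.
        grad = [-wj / prior_variance for wj in w]
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj + learning_rate * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical messages: [bias, contains term A, contains term B].
X = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
w = fit_map_logistic(X, y)
p_relevant = sigmoid(w[0] + w[1])    # message containing only term A
p_irrelevant = sigmoid(w[0] + w[2])  # message containing only term B
```

The prior variance controls how strongly weights are pulled toward zero, which is what keeps the fit well behaved when there are far more features than training messages.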

  9. MMS: Goal • Monitor huge communication streams, in particular, streams of textualized communication to automatically detect pattern changes and "significant" events • Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition) • Emphasis in this phase of project: Entity Resolution

  10. MMS Key Goal: • Produce an entity resolution module that is • robust, general, and well-founded • based on our current Bayesian logistic regression framework • able to be integrated with software for a variety of applications.

  11. Outline • Bayesian logistic regression • Advances • Using domain knowledge to reduce the need for training data • Speeding up training and classification • Online training

  12. Speeding Up Classification • Completed new version of BMRclassify • Replaces old BBRclassify and BMRclassify • More flexible • Can apply thousands of binary and polytomous classifiers simultaneously • Allows meaningful names for features • Inverted index to classifier suites for speed • 25x speedup over old BMRclassify and BBRclassify
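
The "inverted index to classifier suites" idea can be sketched as follows, assuming sparse linear classifiers keyed by feature name. The data layout and function names are hypothetical and are not taken from BMRclassify; the point is only that scoring a document touches just the classifiers that actually use its features.

```python
from collections import defaultdict

def build_inverted_index(classifiers):
    """Map each feature name to the (classifier, weight) pairs that use it."""
    index = defaultdict(list)
    for name, model in classifiers.items():
        for feature, weight in model["weights"].items():
            if weight != 0.0:
                index[feature].append((name, weight))
    return index

def score_document(doc, classifiers, index):
    """Return the linear score of every classifier for one sparse document."""
    scores = {name: model["bias"] for name, model in classifiers.items()}
    for feature, value in doc.items():
        # Features that no classifier uses cost a single dictionary lookup.
        for name, weight in index.get(feature, ()):
            scores[name] += weight * value
    return scores

# Hypothetical suite of two binary classifiers with named features.
classifiers = {
    "finance": {"bias": -1.0, "weights": {"bank": 2.0, "loan": 1.5}},
    "travel": {"bias": -0.5, "weights": {"flight": 2.0, "bank": 0.25}},
}
index = build_inverted_index(classifiers)
scores = score_document({"bank": 1.0, "flight": 2.0, "river": 1.0}, classifiers, index)
```

With thousands of sparse classifiers, the work per document scales with the postings of the features it contains rather than with the total number of weights in the suite.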

  13. Rutgers DIMACS KDDMMS: Clustering and Entity Resolution • The intelligence problem • documents • entities (people, organizations) • first order associations (between different types) • second order associations (within types) • Example • people (in the role of authors) • scientific publications • research groups - “invisible colleges”

  14. Current ER activity • Join the multiple-models approach with the multiple options offered by the BMR modeling package • Integrate all methods with online algorithms • Reality checks: synthesized or challenge data • Models for whether a pair of items is “from same agent, or different agents” • Uses BXR to identify salient features for this task, which carry over to agents who have never been seen before.
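
One way to see why pairwise "same agent or different agents" models carry over to unseen agents is that their features describe the pair rather than any individual. The sketch below assumes two simple, hypothetical pair features (token-set Jaccard overlap and word-count gap); it does not reproduce the features BXR selects, and a classifier would still have to be trained on labeled pairs.

```python
def pair_features(text_a, text_b):
    """Features describing a pair of items, independent of who wrote them."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    jaccard = len(tokens_a & tokens_b) / len(union) if union else 0.0
    length_gap = abs(len(text_a.split()) - len(text_b.split()))
    return {"jaccard": jaccard, "length_gap": length_gap}

# Hypothetical titles: the first pair shares vocabulary, the second does not.
same = pair_features("sparse Bayesian text models", "Bayesian models for sparse text")
diff = pair_features("sparse Bayesian text models", "port inspection scheduling")
```

Since no feature names a specific author, a model trained on pairs from known agents can be applied unchanged to pairs involving agents that never appeared in training.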

  15. Collaborative Tools • Paul Kantor (LIS) • Barry Sopher (Economics) • Rutgers AEF Support

  16. Collaboration requires • The right software • the right incentive mechanisms • “mechanism design” • research laboratory at Rutgers • experiments in collaboration • find the right mix • build systems that make the rewards • reliable; prompt; automatic

  17. AntWorld

  18. Potential DOD applications • Intelligence activity • asynchronous collaboration • 24/7 monitoring of open source and SIGINT traffic • sharing of hypotheses and insights • Coordination among warfighters • combine roving information to provide “red/green” indications
