In this session led by Jeff Jonas, IBM Distinguished Engineer and Chief Scientist of IBM Entity Analytics, we explore innovative technologies and strategies applicable to mass declassification efforts. The volume of documents to review is beyond brute-force human effort, so some form of machine triage is needed to manage disclosure risks effectively. This presentation examines how accumulating context enables better predictions and insights, and how statistics built from prior human dispositions can help distinguish genuine disclosure risks from material that can safely be released. Join us for a deep dive into the frameworks and methodologies reshaping the future of information analysis and management.
Mass Declassification: What If? Jeff Jonas, IBM Distinguished Engineer; Chief Scientist, IBM Entity Analytics JeffJonas@us.ibm.com September 23, 2010
The Ask • What emerging technology or innovative approaches come to mind … which may have applicability to this task? • Use your imagination. What if? • Not talking about any specific products • Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.)
The Problem at Hand • Volumes may be beyond human, brute force review (@5min/ea = 18,382 FTEs) • Necessitates some form of machine triage • Red: A disclosure risk • Yellow: A possible disclosure risk • Green: No disclosure risk • Reliable machine triage requires substantially better prediction systems • Even then, advanced means for humans to deal with the remaining large volumes of “possibles” is still required
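Below is a minimal sketch of the red/yellow/green machine-triage idea. The risk score, the thresholds, and the names are illustrative assumptions added here, not anything prescribed in the talk.

```python
# Minimal triage sketch: map a predicted disclosure-risk score to a color.
# The 0.7 / 0.1 thresholds are assumed purely for illustration.

def triage(disclosure_risk: float, red: float = 0.7, green: float = 0.1) -> str:
    """Red = a disclosure risk, yellow = a possible risk, green = no risk found."""
    if disclosure_risk >= red:
        return "red"
    if disclosure_risk <= green:
        return "green"
    return "yellow"

# Only the "yellow" documents land in the human review queue.
scores = {"doc-001": 0.92, "doc-002": 0.04, "doc-003": 0.41}
review_queue = [doc for doc, s in scores.items() if triage(s) == "yellow"]
print(review_queue)  # ['doc-003']
```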
Background • Early ’80s: Founded Systems Research & Development (SRD), a custom software consultancy • 1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA) • 2001/2003: Funded by In-Q-Tel • 2005: IBM acquires SRD • Cumulatively: I have had a hand in a number of systems with multiple billions of rows describing hundreds of millions of entities • Affiliations: • Member, Markle Foundation Task Force on National Security in the Information Age • Senior Associate, Center for Strategic and International Studies (CSIS) • Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems • Member, EPIC advisory board • Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body
In Today’s Session • Intro to context accumulating systems • Predictions and data points needed for mass declassification • Strawman architecture • Challenges • Q&A
Contextualization: From Pixels to Pictures to Insight [Diagram: observations flow to a consumer (an analyst, a system, the sensor itself, etc.); as context accumulates, relevance emerges]
Context, definition of: Better understanding something by taking into account the things around it.
Without Context [Slide: a single observation, the e-mail address scrila34@msn.com, shown in isolation]
Consequences • Algorithms flat-lining (e.g., alert queues) • Enterprise amnesia on the rise • Overwhelmed by false positives and false negatives? You have seen nothing yet • Not enough humans to fix this with brute force • Risk assessment becomes the risk
Context Accumulation [Slide: the same e-mail address, scrila34@msn.com, now linked across observations: job applicant, trusted supplier, known terrorist, stolen identity]
Puzzle Metaphor Primer • Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors • What it represents is unknown – there is no picture on hand • Is it one puzzle, 15 puzzles, or 1,500 puzzles? • Some pieces are duplicates and some are missing • Some pieces are incomplete, low quality, or have been misinterpreted • Some pieces may even be professionally fabricated lies • Until you take the pieces to the table, you don’t know what you are dealing with
How Context Accumulates • With each new observation … one of three assertions is made: 1) un-associated; 2) near-like neighbors; or 3) connections • Asserted connections must favor the false negative • New observations sometimes reverse earlier assertions • Some observations produce novel discovery • As the working space expands, computational effort increases • The emerging picture helps focus collection interests • Given sufficient observations, there can come a tipping point • Thereafter, confidence improves while computational effort decreases!
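As a rough illustration of the three assertions, here is a toy resolver. Matching only on hard identifiers (SSN, driver's license) before falling back to a crude name comparison is an assumed simplification of "favoring the false negative"; real resolution logic is far richer.

```python
def assert_observation(obs, known_entities):
    """Return 'connection', 'near-like neighbor', or 'un-associated'."""
    strong = {obs.get("ssn"), obs.get("dl")} - {None}
    for entity in known_entities:
        if strong & entity["identifiers"]:
            return "connection"              # shares a hard identifier
    name = obs.get("name", "").split()
    for entity in known_entities:
        for known in entity["names"]:
            parts = known.split()
            if name and parts and name[0] == parts[0] and name[-1] == parts[-1]:
                return "near-like neighbor"  # looks similar, but not provably the same
    return "un-associated"

entities = [{"identifiers": {"443-43-0000"}, "names": {"Mark Smith"}}]
print(assert_observation({"name": "Mark R Smith", "dl": "00001234"}, entities))
# -> near-like neighbor
print(assert_observation({"name": "Mark Randy Smith", "ssn": "443-43-0000"}, entities))
# -> connection
print(assert_observation({"name": "Jane Doe"}, entities))
# -> un-associated
```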
False Negatives Overstate the Universe [Chart: unique identities counted vs. observations received; missed matches push the count above the true population]
Counting Is Difficult [Slide: two records that may describe the same person. File 1: Mark Smith, 6/12/1978, 443-43-0000. File 2: Mark R Smith, (707) 433-0000, DL: 00001234]
The Rise and Fall of a Population [Chart: unique identities counted vs. observations received; the count overshoots the true population, then falls back toward it as later observations glue earlier records together]
Data Triangulation [Slide: a new record, Mark Randy Smith, 443-43-0000, DL: 00001234, shares the SSN of File 1 and the driver's license of File 2, revealing that all three records describe one person]
Increasing Accuracy and Performance [Chart: unique identities counted vs. observations received; with sufficient observations the count converges on the true population]
“Expert Counting” is Fundamental to Prediction • Is it 5 people each with 1 account … or is it 1 person with 5 accounts? • If one cannot count … one cannot estimate vector or velocity (direction and speed). • Without vector and velocity … prediction is nearly impossible. • Therefore, if you can’t count, you can’t predict.
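A hedged sketch of expert counting, using the Mark Smith records from the preceding slides: three observations resolve to one identity once the triangulating record arrives. The union-on-shared-identifier rule is an assumption made for illustration only.

```python
# Count identities, not observations: group records that share any hard identifier.

def count_identities(observations):
    """Union observations that share an identifier, then count the groups."""
    groups = []                                            # one set of identifiers per identity
    for obs in observations:
        ids = {obs.get("ssn"), obs.get("dl")} - {None}
        matches = [g for g in groups if g & ids]
        merged = ids.union(*matches) if matches else ids   # a triangulating record
        groups = [g for g in groups if not (g & ids)]      # may glue prior groups together
        groups.append(merged)
    return len(groups)

observations = [
    {"name": "Mark Smith",       "ssn": "443-43-0000"},                    # File 1
    {"name": "Mark R Smith",     "dl": "00001234"},                        # File 2
    {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"},  # new record
]
print(len(observations), "observations ->", count_identities(observations), "identity")
# 3 observations -> 1 identity
```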
Mass Declassification Predictions • Whose equity is it? • Machine triage – disposition • Queue prioritization
Using What Data Points? FOR EXAMPLE: • 450M target documents • Dirty words • Previous declassifications • Previous declassification denials • FOIAs • Intellipedia • Wikipedia • WikiLeaks • Deceased persons • Publicly available accounts/facts
Open Source Discovery/Scoring • “Height of Pakistan’s Mufasa missile.” • What is 15.5 meters? • New York Times, Sept 21, 2010, C3 “Pakistan unveils Mufasa 7 Warhead” • Wikipedia: Mufasa_7_Warhead
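A small, assumed sketch of the open-source check: does a classified element already appear in public references? The reference list below is made up only of the two examples on the slide.

```python
# Find public references that already mention a given classified element.

open_source_refs = [
    ("New York Times, Sept 21, 2010, C3", "Pakistan unveils Mufasa 7 Warhead"),
    ("Wikipedia", "Mufasa_7_Warhead"),
]

def open_source_hits(element, refs):
    """Return the sources whose text mentions the element (case/underscore-insensitive)."""
    needle = element.lower().replace("_", " ")
    return [src for src, text in refs
            if needle in text.lower().replace("_", " ")]

print(open_source_hits("Mufasa 7 Warhead", open_source_refs))
# ['New York Times, Sept 21, 2010, C3', 'Wikipedia']
```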
Context Accumulation [Slide: the element "Mufasa 7 Warhead" accumulates context: an open-source reference, a FOIA release from March 2010, and a dirty word asserted as classified]
Context Accumulation + Statistics

Document Element | Total | Declass | Class-Default | Class-Asserted
Author: “Billy K” | 4,503 | 1,600 | 403 | 0
Codeword: “Tomatoe” | 4,818 | 4,600 | 218 | 0
Classification: “SI/TK/001” | 23 | 22 | 1 | 0
Actors: “Salam Ahmed” | 782 | 700 | 82 | 0

Declassification dispositions are becoming a force multiplier: the more human dispositions, the more automated dispositions.

Human Dispositions | Automated Dispositions
5,000 | 20
10,000 | 4,000
100,000 | 65,000
1,000,000 | 17,000,000
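One way to read the first table as a force multiplier is sketched below: per-element declassification rates from historical human dispositions drive an automated recommendation. The 0.95 threshold and the min() combination rule are assumptions for illustration, not figures from the talk.

```python
# Per-element disposition statistics (from the table above) drive an auto-recommendation.

element_stats = {
    # element                     (total, declassified, class_default, class_asserted)
    "author:Billy K":             (4503, 1600, 403, 0),
    "codeword:Tomatoe":           (4818, 4600, 218, 0),
    "classification:SI/TK/001":   (23, 22, 1, 0),
    "actor:Salam Ahmed":          (782, 700, 82, 0),
}

def recommend(doc_elements, stats, threshold=0.95):
    """Recommend 'declassify' only when every element's historical declassification
    rate clears the threshold and nothing is still asserted classified."""
    rates = []
    for el in doc_elements:
        total, declass, _default, asserted = stats[el]
        if asserted > 0:
            return "human review"        # an equity holder still asserts classification
        rates.append(declass / total)
    return "declassify" if min(rates) >= threshold else "human review"

print(recommend(["codeword:Tomatoe", "classification:SI/TK/001"], element_stats))
# both rates ~0.95+ -> 'declassify'
print(recommend(["author:Billy K", "codeword:Tomatoe"], element_stats))
# author rate ~0.36 -> 'human review'
```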
Policy Questions • What related information is already available in the public domain? • Evidence: exists in open source • What damage might conceivably result from disclosure, and what benefits might ensue? • Evidence: same text already released (by the same equity holder)
Strawman Architecture [Diagram: 450M documents pass through feature extraction & classification into context accumulation, which also draws on dirty words, historical dispositions, etc.; predictions(*) feed a workflow system, and its dispositions flow back into the context. (*) Recommendations: equity of, disposition, priority]
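A toy pass over the strawman flow, with every component reduced to a stub: extract features, compare them against dirty words and prior dispositions, and emit the three starred recommendations (equity, disposition, priority). All names and rules here are placeholders assumed for the sketch, not a specification.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:          # the (*) outputs: equity, disposition, priority
    doc_id: str
    equity: str
    disposition: str           # red / yellow / green
    priority: int

def extract_features(doc_text: str) -> set:
    return set(doc_text.lower().split())          # stand-in for real extraction

def predict(doc_id, features, dirty_words, historical_dispositions) -> Recommendation:
    """Combine dirty-word hits with accumulated prior dispositions."""
    hits = features & dirty_words
    prior_releases = sum(historical_dispositions.get(w, 0) for w in hits)
    disposition = "yellow" if hits and prior_releases == 0 else "green"
    return Recommendation(doc_id, equity="unknown",
                          disposition="red" if len(hits) > 1 else disposition,
                          priority=1 if hits else 9)

dirty_words = {"tomatoe", "mufasa"}
historical = {"mufasa": 3}     # e.g., three prior declassifications mention it
rec = predict("doc-001", extract_features("Codeword Tomatoe appears here"),
              dirty_words, historical)
print(rec)   # yellow: one dirty word, no prior release history
```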
Another Idea: Crowd Sourcing • Can you predict which specific people have the privileges and knowledge to evaluate selected documents, and route those documents to them? • Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing?
Another Idea: Better Classification • Using the overall declassification platform to assist in proper classification (real-time) • And, better pre-tagging to assist in future auto-declassification
Challenges • Entity extraction is imperfect • Predictions may still not be good enough, often enough • Documents not in English • The user work surface and its distribution • Consequences of an inappropriate release • With super access and super tools, this may call for stronger audit and insider-threat protections • Your contracting cycle and the creation of the system might take until mid-2011, or 2012, or 2013
Closing Thoughts • Contextualization is essential to better prediction • There are not enough humans to ask every question every day • “Human attention directing” systems are critical to the mission • The data must find the data, the relevance must find the user
Worst Case Scenario • Rich context enables better hints for users, results in faster dispositions • Rich context enables improved sequencing of the work
Related Blog Posts • Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems • Data Finds Data • Puzzling: How Observations Are Accumulated Into Context • The Fast Last Puzzle Piece • Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel • How to Use a Glue Gun to Catch a Liar • It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You • Smart Systems Flip-Flop
Questions? Blogging at: www.JeffJonas.TypePad.com (Information Management, Privacy, National Security, and Triathlons)
The Problem at Hand • 450M documents • × 5 min/document = 2.25B minutes • ÷ 60 = 37.5M hours • ÷ 2,040 hours per FTE-year = 18,382 FTEs
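The same arithmetic, spelled out; the 2,040 hours per FTE-year figure is taken directly from the slide's divisor.

```python
# Brute-force review arithmetic from the slide.
docs = 450_000_000
minutes = docs * 5            # 5 minutes per document
hours = minutes / 60
ftes = hours / 2040           # ~2,040 working hours per FTE per year
print(f"{minutes:,} minutes, {hours:,.0f} hours, {ftes:,.0f} FTEs")
# 2,250,000,000 minutes, 37,500,000 hours, 18,382 FTEs
```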