Situational Business Intelligence

Volker Markl, Technische Universität Berlin

Agenda • Traditional Business Intelligence • Next Generation Business Intelligence • Building Blocks • Cloud Computing, MapReduce, Hadoop, and Pig Latin • UIMA, Social Tagging • The Long Tail of Situational Applications • Situational Business Intelligence • Challenges

Traditional Business Intelligence

How Did We Get Here? • Figure: actual and forecasted BI tools software revenue as reported by IDC • Stages labeled in the figure: BI over Text, Web enabled Business Intelligence, Client Server Business Intelligence, Query/Reporting, OLAP, Batch Reporting • Sources: IDC, Gartner

2008 CIO Priorities • 2008 CIO Technology Priorities • Survey question: To what extent will each of the following technologies be a Top 5 priority for you in 2008?
Technology | Rank 2008 | Rank 2007 | Rank 2006 | 2008 Increase*
Business Intelligence Applications | 1 | 1 | 1 | 11.20%
Enterprise Applications (ERP, SCM, and CRM) | 2 | 2 | ** | 8.02%
Server and Storage Technologies (Virtualization) | 3 | 5 | 9 | 8.45%
Legacy Application Modernization | 4 | 3 | 10 | 5.79%
Security Technologies | 5 | 6 | 2 | 8.53%
Technical Infrastructure | 6 | 8 | 12 | 4.67%
Networking, Voice, and Data Communications (VoIP) | 7 | 4 | 8 | 6.83%
Collaboration Technologies | 8 | 10 | 4 | 7.75%
Document Management | 9 | 9 | ** | 7.91%
Service-Oriented Technologies (SOA and SOBA) | 10 | 7 | 6 | 6.71%
* Unweighted average budget change
** New question for 2007
Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008

What are CIOs missing? • Survey question: Please give me an example of how your business intelligence solution could better meet your organization's main objective. • Better/more information: 22.9% • Faster/quick retrieval: 14.3% • Accurate/updated data: 11.4% • Consistent platform: 8.6% • Better integration: 8.6% • Standardization: 8.6% • Other single mentions: 40.0% • Source: Business Intelligence Survey, IDC

Next Generation Business Intelligence • Example questions: Who is leading in American Idol? Who are the biggest players in the Linux market? Which insurance policy customers are at risk of being hit by a current storm? • Figure labels: Internet, Intranet; Text, XLS, XML; Information Extraction; Semantic Integration; Schema and Entities; Load/Refresh or ad hoc; Analysis; Data Warehouse, Data Marts • The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the Internet

Answering an NGBI Query • Who are the biggest players in the “Linux” market? • Web 2.0 documents from 332 Wiki News docs (January-March 2007)

Data Source Identification • Data Warehouse • Master data • Information Providers • Information Marketplaces • Crawling (Internet/Intranet) • Process steps shown in the figure: Data Source Identification, Atomic Entity Extraction, Schema Extraction, Data Cleansing, Data Fusion

Atomic Entity Extraction • Out-of-the-box data: Web Services for complex, atomic and named entities • Frameworks: Infrastructures for extracting, managing and scalably storing named entities; Web Services for extracting named entities • Basic Components: Screen scraper • Figure axis: additional extraction and data cleansing effort

Ad hoc analysis process • Data Source Identification • Atomic Entity Extraction • Schema Extraction • Data Cleansing • Data Fusion

Schema Extraction • Process steps shown in the figure: Pre Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion • Company Technology -> Technology • Company Technology -> Company

Data Cleansing • Duplicates

Data Fusion • Information integration steps: Schema Mapping, Duplicate Detection, Data Fusion • Input tuples from Data Source A and Data Source B: (Apple, iPhone 3 Gen, 299.95) and (Apple, iPhone 3G, 199.99) • Resolution functions per column: match, max length, min • Fused tuple: (Apple, iPhone 3 Gen, 199.99) • e.g., HumMer (U Potsdam)
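The fusion step of this example can be sketched in Python. This is a minimal sketch and not the implementation of the tool named on the slide: the attribute names (vendor, product, price) and all function names are assumptions chosen for illustration. Only the two input tuples and the three resolution functions (match, max length, min) come from the slide.

```python
def match(values):
    """Return the common value; all inputs are expected to agree."""
    distinct = set(values)
    if len(distinct) != 1:
        raise ValueError(f"values do not match: {sorted(distinct)}")
    return values[0]


def max_length(values):
    """Keep the longest string."""
    return max(values, key=len)


# One resolution function per attribute (attribute names are hypothetical).
RESOLUTION = {"vendor": match, "product": max_length, "price": min}


def fuse(duplicates):
    """Fuse tuples that were already detected as duplicates into one tuple."""
    return {
        attr: resolve([t[attr] for t in duplicates])
        for attr, resolve in RESOLUTION.items()
    }


t1 = {"vendor": "Apple", "product": "iPhone 3 Gen", "price": 299.95}
t2 = {"vendor": "Apple", "product": "iPhone 3G", "price": 199.99}
fused = fuse([t1, t2])
print(fused)
```

Keeping the resolution functions in a per-attribute table means a different conflict policy (for example, the most recent value) can be swapped in without touching the fusion loop.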

Data Fusion • Integration of complementary tuples • Elimination of identical tuples • Elimination of subsumed tuples • Conflict resolution with a resolution function, e.g. f(b,e)
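The tuple-level operations named on this slide can be illustrated with a small Python sketch. It assumes tuples of equal length in which `None` stands for a missing value (shown as "-" on the slide); the function names and the sample tuples are hypothetical and not taken from the talk.

```python
def remove_identical(tuples):
    """Drop exact duplicates, keeping the first occurrence."""
    return list(dict.fromkeys(tuples))


def subsumes(t, u):
    """t subsumes u if t differs from u only where u has a missing value."""
    return t != u and all(b is None or a == b for a, b in zip(t, u))


def remove_subsumed(tuples):
    """Drop every tuple that is subsumed by another tuple in the list."""
    return [u for u in tuples if not any(subsumes(t, u) for t in tuples)]


def complement(t, u):
    """Merge two tuples that agree wherever both have a value, else None."""
    if any(a is not None and b is not None and a != b for a, b in zip(t, u)):
        return None  # conflicting values need a resolution function instead
    return tuple(a if a is not None else b for a, b in zip(t, u))


rows = [("a", "b", None, None), ("a", "b", None, None), ("a", "b", "c", None)]
deduplicated = remove_identical(rows)
minimal = remove_subsumed(deduplicated)
merged = complement(("a", None, "c", None), ("a", "b", None, "d"))
conflict = complement(("a", "b", None), ("a", "e", None))
print(deduplicated, minimal, merged, conflict)
```

When `complement` returns `None`, the two tuples hold contradicting values, which is the case the slide hands over to a conflict resolution function such as f(b,e).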

Address Uncertainty: Query Refinement • Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...) • “Everything” about Dell? • The market of “Linux” from 2007-2008? • “What's the average analyst quote about the IBM stock price for the last month?” • Drill down on region, time, organization, ... • Figure axes: QUERY (S, U) and DATA (S, U)

Building Blocks • Cloud Computing • Map Reduce • Pig • UIMA • Social Tagging

Cloud Computing • What is Cloud Computing? • Computing platform architecture • Scales to any application • High fault tolerance • No generally accepted definition available • Separation from Utility or Grid Computing is not obvious

Cloud Computing • How does Cloud Computing work? • Lots of loosely coupled computers • Use of commodity hardware • Flexible up- or downgrading of resources • APIs offer access to cloud computing systems • Software takes care of parallelization, hardware failures and error handling • Resources (e.g. storage, computing power) can be bought as services (paying for usage, e.g. Amazon)

MapReduce – Programming Model • Program logic is split into 2 functions: Map(k,v) and Reduce(k,list(v)) • Functions receive and produce (Key, Value)-pairs • Map(k,v) computes for each (k,v)-pair an intermediate (ki,vi)-pair • Reduce(k,list(v)) merges all values with the same key k and outputs the result. • MapReduce programs are easy to develop • Frameworks provide libraries • Frameworks take care of parallelization, distribution and error handling • Only application specific source code is required (no parallelization and error handling code)

MapReduce – Group AVG Example • Input Data: NewYork, US, 10; LosAngeles, US, 40; London, GB, 20; Berlin, DE, 60; Glasgow, GB, 10; Munich, DE, 30; … • MAP(k,v) emits the intermediate (K,V)-pairs: (US,10), (US,40), (GB,20), (DE,60), (GB,10), (DE,30) • The pairs are grouped by key: US: (US,10), (US,40); GB: (GB,20), (GB,10); DE: (DE,60), (DE,30) • REDUCE(k,list(v)) outputs the result: (DE,45), (GB,15), (US,25)
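The group-average example can be reproduced with a small single-process Python sketch. It only imitates the programming model: the `run_mapreduce` driver is a hypothetical stand-in for a framework such as Hadoop, and it performs the grouping step in memory rather than across nodes.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one input line 'City, CC, amount' into a (country, amount) pair."""
    _city, country, amount = [part.strip() for part in value.split(",")]
    yield country, int(amount)


def reduce_fn(key, values):
    """Reduce: average all values that share the same key."""
    return key, sum(values) / len(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


lines = [
    "NewYork, US, 10",
    "LosAngeles, US, 40",
    "London, GB, 20",
    "Berlin, DE, 60",
    "Glasgow, GB, 10",
    "Munich, DE, 30",
]
records = list(enumerate(lines))  # key = line number, value = line text
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)  # {'DE': 45.0, 'GB': 15.0, 'US': 25.0}
```

Only `map_fn` and `reduce_fn` are application-specific here; everything in the driver is what a MapReduce framework would provide.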

MapReduce • MapReduce Programming Model • For processing of huge amounts of data • Massive parallelization of computing tasks • Applicable to many real world applications • MapReduce programs are easy to implement • MapReduce Engine • Environment to run MapReduce programs • Distributes computing tasks • Errors are transparently handled • Very scalable architecture • Examples: Google MapReduce & Apache Hadoop

Hadoop • What is Hadoop? • Free software framework for data intensive applications • Enables distributed processing of vast amounts of data on cloud computing architectures • Supports clouds with 1000+ nodes • Two components: • Hadoop Distributed File System (HDFS) • MapReduce Engine • Where can you get Hadoop? • Top-level Apache Project: http://hadoop.apache.org/core/

Hadoop - HDFS • Inspired by Google File System • Distributed storage for large files • Files are split up in multiple parts (default size 64MB) • Parts are spread over the HDFS nodes • Each part replicated (default 3 times)
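The splitting and replication arithmetic can be sketched as follows. This is an illustration only: the `plan_blocks` function and the round-robin placement are assumptions made for the example, whereas HDFS itself chooses replica locations with its own rack-aware policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default part size named on the slide (64 MB)
REPLICATION = 3                # default number of replicas named on the slide


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return, for each part of the file, the nodes that hold a replica.

    Placement is plain round-robin, which is an assumption of this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    ]


plan = plan_blocks(200 * 1024 * 1024, ["n1", "n2", "n3", "n4"])
for block_id, replicas in enumerate(plan):
    print(block_id, replicas)
```

A 200 MB file yields four parts (three full 64 MB parts and one 8 MB remainder), each stored on three different nodes.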

Hadoop – MapReduce Engine • Runs MapReduce programs • Libraries for Java and C++ • Assigns Map and Reduce tasks to computing nodes • Reduction of data transfer volume • Tasks are assigned to nodes holding the data • Node failures are transparently handled • Tasks are restarted on a node holding a replica of the data • Figure: the TaskManager assigns MAP() tasks to nodes; one MAP() task fails
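The two scheduling ideas on this slide, running a task where its data lives and restarting it on another replica holder after a failure, can be mimicked in a few lines of Python. This is a sequential toy simulation under stated assumptions: `run_tasks`, `map_task`, the node names and the block layout are all hypothetical, and Hadoop's actual scheduler is considerably more involved.

```python
def run_tasks(block_locations, run_on_node):
    """Run one task per block on a node that holds a replica of the block.

    block_locations maps a block id to the nodes holding a replica.
    run_on_node(node, block) raises RuntimeError if the node has failed.
    """
    results = {}
    for block, nodes in block_locations.items():
        for node in nodes:
            try:
                results[block] = (node, run_on_node(node, block))
                break  # task succeeded, move on to the next block
            except RuntimeError:
                continue  # restart the task on the next replica holder
        else:
            raise RuntimeError(f"no replica of {block} could be processed")
    return results


failed_nodes = {"n2"}  # simulated node failure


def map_task(node, block):
    """Hypothetical map task that fails when its node is down."""
    if node in failed_nodes:
        raise RuntimeError(f"{node} is down")
    return f"map({block})"


locations = {"b0": ["n1", "n3"], "b1": ["n2", "n3"], "b2": ["n2", "n1"]}
outcome = run_tasks(locations, map_task)
print(outcome)
```

Because every candidate node already stores the block, a restarted task still reads its input locally, which is how replication supports both fault tolerance and low data transfer.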

Hadoop • Who uses Hadoop? • Amazon A9.com (Search Index Building, Analytics) • Facebook (Logfile Analysis) • Google & IBM (University Initiative to Address Internet-Scale Computing Challenges) • Yahoo! (Crawling, Indexing, Searching) • Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds • And many others… (see http://wiki.apache.org/hadoop/PoweredBy) • Hadoop resembles Google's MapReduce Framework • J. Dean, S. Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”

The Pig Project • A platform for analyzing large data sets • Pig consists of two parts: • PigLatin: A Data Processing Language • Pig Infrastructure (Grunt): An Evaluator for PigLatin programs • Where can you get Pig? • Apache Incubator Project: http://incubator.apache.org/pig • Alternatives: • HIVE (Facebook) • JAQL (IBM Research)

PigLatin Data Processing Language • PigLatin is imperative (whereas SQL is declarative) • Step-by-step programming approach • PigLatin queries are easy to write and understand • Fully nestable data model • Atomic values, tuples, bags, maps • Operators of two flavors: • Relational style operators (filter, join, etc.) • Functional-programming style operators (map, reduce) • Easy to extend by user functions • Example: “Find the top 10 most visited pages in each category”
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Example taken from: “Pig Latin: A Not-So-Foreign Language For Data Processing” Talk, SIGMOD 2008
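For readers unfamiliar with the dataflow style, the same steps (group and count, join, group, top-n) can be mirrored in plain Python. This is a single-machine sketch for illustration, not how Pig evaluates the program; the function name and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def top_urls_per_category(visits, url_info, n=10):
    """Mirror the steps of the example: count visits per url, join, group, top-n."""
    # group visits by url and count them
    visit_counts = Counter(url for _user, url, _time in visits)
    # join the counts with urlInfo on url, then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:  # inner join: keep urls present on both sides
            by_category[category].append((url, visit_counts[url]))
    # keep the n most visited urls of each category
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


# Hypothetical sample data: (user, url, time) and (url, category, pRank)
visits = [
    ("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.co.uk", 3),
    ("eve", "espn.com", 4), ("bob", "espn.com", 5), ("amy", "espn.com", 6),
]
url_info = [
    ("cnn.com", "news", 0.9), ("bbc.co.uk", "news", 0.8), ("espn.com", "sports", 0.7),
]
top = top_urls_per_category(visits, url_info, n=1)
print(top)
```

Each intermediate variable corresponds to one named relation of the step-by-step program, which is the property the slide contrasts with a single declarative SQL statement.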

Pig Infrastructure • Currently two modes: • Local: PigLatin programs are evaluated locally (run in a single JVM) • MapReduce: PigLatin programs are compiled to sequences of MapReduce programs and executed (e.g. on Hadoop) • Example plan: Map1: LOAD visits, GROUP BY url; Reduce1: FOREACH url GENERATE count; Map2: LOAD urlinfo, JOIN on url; Reduce2; Map3: GROUP by category; Reduce3: FOREACH category GENERATE top10(urls) • Example taken from: “Pig Latin: A Not-So-Foreign Language For Data Processing” Talk, SIGMOD 2008

UIMA

UIMA • Pre-Processing • Analysis Phase • Post-Processing

UIMA • Annotators for Part of Speech detection, Named-Entity detection and Relation detection.

The Stratosphere Project • Many BI queries exceed the capabilities of today's BI systems • “Who are the biggest players in the Linux market?” • “Which insurance policy customers are at risk of being hit by a current storm?” • The Internet offers valuable information • Enterprise announcements and public business reports • User generated content: Blogs, Wikis, Reviews, Comments, etc. • News websites and feeds • Next Generation Business Intelligence (NGBI) requires joint analysis of Internet and enterprise data • Internet, Intranet, Data Warehouse and Local Data must be processed • The goal of the Stratosphere Project is to build an NGBI system on a Cloud Computing platform

Stratosphere - Architecture • Data sources: Internet, Intranet, Data Warehouse • Further data sources: Office documents (spreadsheets), Email • Computing Cloud (Hadoop) • Retrieve: Crawl, Scan, Cache • Extract: Extract (UIMA) • Process: Filter, Join, Group • UI: Query, Query Translation, Query Plan, Result

Stratosphere – Research Challenges • Definition of an algebra for expressing NGBI queries • Includes: traditional database operators, data retrieval operators, information extraction operators, and information integration operators • Implementation of NGBI query operators • Requirements: highly scalable, robust, self-tuning • Leveraging Hadoop and MapReduce frameworks • Implementation of a cloud computing monitoring infrastructure • Enabling self-tuning NGBI operators

Related Project: DBLife

Related Projects: Avatar Email Search

Situational Business Intelligence Example • Which insurance policy customers are at risk of being hit by a current storm? • Severe weather: Meet Pete, an insurance agent in Louisiana. 1. He sees a news report of a severe storm. What is the company's risk? 2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000. 3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts and GIS databases from … 4. Pete connects his spreadsheet to www.floodlevels.com. 5. He then forwards a risk summary to executives. • Figure labels: (Zipcode), (HUC = Hydrological Unit Code), (Geocode = Latitude/Longitude); edc.usgs.gov/, http://water.usgs.gov/waterwatch/, http://www.dotd.florida.gov/
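Steps 2 to 5 of this scenario amount to a filter, a join on zipcode, and a summary. The Python sketch below illustrates that flow under loud assumptions: the policy records, the zipcode-to-flood-level table, the field names and the `flood_threshold` parameter are all hypothetical, and nothing here calls the mashup site named on the slide.

```python
def at_risk_policies(policies, flood_level_by_zip, min_insured=250_000,
                     flood_threshold=1.0):
    """Filter high-value policies and join them with predicted flood levels."""
    report = []
    for policy in policies:
        if policy["insured_value"] <= min_insured:
            continue  # step 2: keep only properties insured for more than the limit
        level = flood_level_by_zip.get(policy["zipcode"])  # step 4: join on zipcode
        if level is not None and level >= flood_threshold:
            report.append({**policy, "flood_level": level})
    # step 5: largest exposure first in the risk summary
    return sorted(report, key=lambda row: row["insured_value"], reverse=True)


# Hypothetical spreadsheet rows and flood predictions (levels in arbitrary units)
policies = [
    {"holder": "A", "zipcode": "70112", "insured_value": 300_000},
    {"holder": "B", "zipcode": "70112", "insured_value": 200_000},
    {"holder": "C", "zipcode": "70801", "insured_value": 500_000},
    {"holder": "D", "zipcode": "71101", "insured_value": 400_000},
]
flood_level_by_zip = {"70112": 2.5, "70801": 1.2, "71101": 0.3}

summary = at_risk_policies(policies, flood_level_by_zip)
total_exposure = sum(row["insured_value"] for row in summary)
print([row["holder"] for row in summary], total_exposure)
```

In the scenario the flood levels would come from a web mashup rather than a dictionary, so the join key (zipcode here) has to be standardized on both sides first.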

Flood Risk Assessment Mashup • Sources: policy XLS, water.usgs.gov, edc.usgs.gov, dotd.louisiana.gov • Steps shown in the figure: Search, Screen Scraping, Standardization, Mashup (www.floodlevels.com), Report • Lineage

Situational BI Evolution • IT Dept: Mission Critical; Data Mart, Data Warehouse, SCA, Portals; Lots of Time • Line of Business: Best Effort, Ad Hoc; New Initiatives, Proof of Concept; Mashups; Limited Time, Immediate

Select Literature
(Algebraic) Extraction • Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. International Joint Conference on Artificial Intelligence (IJCAI) 2007: 2670-2676 • Frederick Reiss, Shivakumar Vaithyanathan, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu: An Algebraic Approach to Rule-Based Information Extraction. International Conference on Data Engineering (ICDE) 2008: 933-942
Schema generation from extracted uncertain data • Xin Dong, Alon Y. Halevy: Malleable Schemas: A Preliminary Report. WebDB 2005: 139-144 • Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. International Conference on Very Large Data Bases (VLDB) 2007: 663-674
Optimization • Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
BI over text • Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645 • Raghu Ramakrishnan, Andrew Tomkins: Towards a PeopleWeb. IEEE Computer 40(8): 63-72 • Alexander Löser, Gregor Hackenbroich, Hong-Hai Do, Henrike Berthold: Web 2.0 Business Analytics. Datenbank Spektrum 25/2008 • T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee: Estimating Statistical Aggregates on Probabilistic Data Streams. PODS 2007

Conclusion • BI over text will tap into a huge set of additional information for BI • The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and Internet • Situational BI will create ad hoc applications to answer complex questions over integrated data sources • Open research problems: • Which is the right extraction service? • “How much” schema can be generated? • “How much” optimization does the user have to add? • How to optimize UIMA-based extraction plans on a Hadoop cloud? • What is a suitable query language over Hadoop? • Data cleansing, completion, and duplicate detection of extracted data? • Data explanation: Lineage, but also: Why do I NOT see that data tuple?

Acknowledgements • Discussions at IBM Research and IBM SWG • Anant Jhingran • Hamid Pirahesh • Kevin Beyer • David Simmen • Mehmet Altinel • et al. • My team at TU Berlin • Alexander Löser • Fabian Hüske • Stephan Ewen • Helko Glathe

Thank You (English) • Gracias (Spanish) • Obrigado (Brazilian Portuguese) • Danke (German) • Grazie (Italian) • Merci (French)
