
Impliance: an Information Management Appliance

Bishwaranjan Bhattacharjee (IBM Watson Research Center); Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart (IBM Almaden Research Center)


Presentation Transcript


  1. Impliance: an Information Management Appliance Bishwaranjan Bhattacharjee IBM Watson Research Center Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart Almaden Research Center

  2. Agenda • Motivation: Observations → Requirements • What is Impliance? • How is Impliance different from…? • Research opportunities • Conclusions

  3. After all our successes (and last night’s revelry), it’s easy to become self-congratulatory. Sorry, time for…

  4. Some embarrassing questions: • Why is most (>80%) of the world’s data still not in databases? • Didn’t we “solve” this problem in the 1980s with object-relational systems? • Do you use a database to store your data on your laptop? • Why not? (You are a database bigot, aren’t you?) • Have you ever tried to query (with SQL) a database that: • You didn’t create, and… • Had more than 500 tables? • Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines? • Have “self-managing” databases significantly simplified administration?

  5. ObservationRequirements (1 of 5) Observation #1: Information converging • Many types of data in today’s enterprise • Structured (traditional Data Base) • Semi-structured (traditional Content Management, XML) • Unstructured (text, multimedia) • Each needs a different search interface, today • SQL • JSR-170 • Keyword search / Information Retrieval Requirement #1: Store / Search / Analyze all data • Need to rapidly relate information of different types • With one unified interface! • Real use cases in paper

  6. ObservationRequirements (2 of 5) Observation #2: Awash in data, but not information • Typical complaint: “I can’t find what I’m looking for!” • But just finding data isn’t enough! • Today’s Business Intelligence is too human-intensive Requirement #2: Pro-actively derive useful information • Need to glean more business value from enterprise data • What sort of analytics exploit unstructured data? • Need to automatically extract the semantics of text • A rebirth of data mining?

  7. Observation → Requirements (3 of 5) Obs. #3: Total Cost of Ownership (TCO) is paramount • People costs dominate TCO • Hardware often less than 50% of TCO • Minimize Time To Value • Databases take too long to set up! • Wizards & Advisors simply mask complexity, add brittleness Reqmt. #3: System must be simple, robust, & secure • Sacrifice resource utilization for radical simplification of: • Setup / Configuration / Deployment (e.g., Self-Organizing) • Operation • KISS (you know this one) • KIWI – Kill It With Iron [Weikum]! • Example: “Good enough” plans exploiting massive parallelism

  8. ObservationRequirements (4 of 5) Observation #4: Data volumes growing fast • Data is kept longer • Lots of new kinds of data: RFID, email, photos, videos • Disk densities improving, but not seek times! • 1 TB disk for $399 (Hitachi) Requirement #4: Simple & massive scale-out • 1000s of nodes • With low management overhead • No single point of failure

  9. ObservationRequirements (5 of 5) Obs. #5: Today’s Info. Mgmt. software based upon hardware 30 yrs. ago • Example: Update-in-place databases due to expensive disk • Today: Cheap CPUs, large storage, fast networks Requirement #5: Need new (software) architecture • Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon: • Hardware economics • e.g., cheap (multi-core) CPUs, storage, memory, network • Software: • Formats (e.g., XML, semi-structured data) • Functionality required (e.g., unstructured search, analytics) • Specified in the right order: • Service requirements  Software  Hardware

  10. What is Impliance? • Administrator-less: • Low Time to Value by Self-Organizing • Low Total Cost of Ownership • Scalable: • Massively parallel scale-out… • …to Petabytes! • Bundled: • HW & SW • Pre-configured • Pre-tuned • Limited APIs • Manage and Search All Data: • Structured (Tables), Semi-Structured (XML), … • …Even Unstructured Text! • Pro-actively Mine Information: • Glean business insight from data

  11. What Does Impliance Actually Do? • All enterprise information: • Stores & Retrieves (Search / Query) • Composes / Integrates / Mashups • Finds trends & exceptions (Business Intelligence)

  12. Think of Impliance as… • Content Management on steroids (beyond JSR-170) • File System with all content searchable • Data Warehouse with all your enterprise’s data • Not just structured information • Excluding high-rate OLTP (web site) • A Jambalaya

  13. Where does Impliance fit? [Diagram: positioning chart with axes “Types of Data” (Structured / Semi-Structured (XML) / Un-Structured) and “Lifetime of Data” (Transaction Ingestion / OLTP / Warehousing-OLAP / Archiving); DBMS, Content Management, Archiving Products, and Impliance each occupy regions of this space]

  14. How is Impliance related to… • Google Base? • Primary data store • Appliance (product, i.e., sits in customer site), not a Service • Enterprise, not “the masses” • DataSpaces / Google “Pay as you go”? • Primary data store (vs. lazy federation of existing data sources) • Enterprise, not “the web” • Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)? • Not just structured (relational) data • Discovery of semantics • More pro-active

  15. Research Opportunities • Reducing TCO – Make categories of administration just GO AWAY • Self-Organizing to obviate database design • Exploit appliance’s limited externalized interfaces • New HW & SW architectures using off-the-shelf components • Achieving fine-grained scale-out • Targeting robust, “good enough” designs • Exploiting integration of components • Data and query models that • Unify all data, yet are simple • Tolerate “schema chaos” • Combine best features of keyword search & SQL • Automated discovery of • Data & query semantics for • Improving precision of queries • Organizing data adaptively • Trends, exceptions, etc. (pro-active Business Intelligence)

  16. Conclusions • We’ve come a long way towards • the autonomic dream • incorporating all data • But we can do much more! • Impliance provides exciting opportunity for DB research • To lower TCO for information management • To exploit today’s hardware and software advances • To rethink information management in a fundamentally new way • Join us!

  17. Thank You • Gracias (Spanish) • Obrigado (Brazilian Portuguese) • Danke (German) • Grazie (Italian) • Merci (French) • [slide also shows “Thank You” in Hindi, Thai, Russian, Arabic, Tamil, Japanese, Korean, Traditional Chinese, and Simplified Chinese] Impliance – Information Management Appliance

  18. Appendix

  19. Redefining Information Systems – Players • Web 2.0-oriented next-generation systems (delivered through services or appliances): • Google, Yahoo, MSN, (IBM) • Google Base (a semi-structured/unstructured information base) • Google OneBox • NextGen systems built by integrating successful open source (Greenplum) • Data models: RSS/Atom/Wiki/… • Architecture: DB + Search + Content systems (e.g., MySQL + Lucene + Jackrabbit) • Entrenched HW/storage/middleware companies: • Storage-driven: • EMC – moving up the value chain, brought in a classic Content system • IBM – IDS: synergy between classic CM (JCR) and storage • Server-driven: • Netezza, DATAllegro (for BI) • Zantaz (for email compliance) • DataPower (XSLT filtering) • Middleware-driven (IBM, Oracle, Microsoft): • Oracle Secure Enterprise Search

  20. Research Focus 1: Reducing TCO • Make entire categories of administration JUST GO AWAY • Reducing time-to-value through new design principles • Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG • Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc. • New software architecture • Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron) • Componentization • Each component simple, robust, and adaptive • Virtual service model • Service Broker optimizes resources and assigns the workload • Exploit integrated hardware and storage systems to provide • Built-in redundancy and availability • Automated backup and archiving (ILM) • Easy cluster management • Schema chaos support at storage level (semantic storage) • Ability to use new types of grid elements (cell blade server) seamlessly

  21. Research Focus 2: Scalability • True Grid Model • Off-the-shelf, commodity hardware • Dedicate blades to different tasks • Data: storage and simple filtering • Analytical: aggregation & mining • Transaction: search, transactional get/put • Supports Mixed Workloads • Analytics, Search, Content, … • Fine-grained scale-out • Different blade types scale independently • From SMB to largest enterprises • Integrating modern HW & storage, e.g. • BC3, intelligent bricks • Logic pushdown into storage • Predicate application • Aggregation • Redundancy management [Diagram: a Transactional Cluster of Transaction Blades and an Analytic Grid of Analytic Blades, connected over a Commodity Interconnect to Data Blades with RAID Data Arrays; Xaction, Data, Content, and Archive/ILM streams; Data+Content+Search+Digital Media]
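The "logic pushdown into storage" bullet can be sketched in a few lines. This is a hypothetical illustration (blade contents, partition layout, and function names are all invented): each data blade applies the predicate and pre-aggregates locally, so only small partial results, not raw rows, cross the commodity interconnect.

```python
# Hypothetical sketch of predicate application + aggregation pushed down
# to the data blades; only partial sums travel to the coordinator.
# Blade names and data are invented for illustration.

blades = {
    "blade0": [("east", 10), ("west", 5), ("east", 7)],
    "blade1": [("west", 3), ("east", 1)],
}

def blade_scan(rows, region):
    """Runs ON the blade: filter by region, return one partial sum."""
    return sum(v for r, v in rows if r == region)

def query_total(region):
    """Coordinator: merge the small partial results from every blade."""
    return sum(blade_scan(rows, region) for rows in blades.values())

print(query_total("east"))  # 10 + 7 + 1 = 18
print(query_total("west"))  # 5 + 3 = 8
```

With the scan on the blade, network traffic per query is one number per blade regardless of table size, which is what makes fine-grained scale-out of data blades attractive.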

  22. Parallel Run-time: Comparison of Plumbing

  23. [Architecture diagram] • Data Analyzer / Discovery / Query: large-scale computation • Data Modeler: simple, generic • SRRS (Scalable Reliable Runtime Support): fault tolerant • DDS (Distributed Data Store): provides reliability • VSCR (Virtual Storage and Computing Resource): commodity HW [Diagram: Applications reach JCR content, SQL relational data, XSLT/XML, HTTP web pages, video, objects, and Archive/ILM through the Data/Query Modeler and Resource Modeler, layered over Scalable Reliable Runtime Support, the Distributed Data Store, the Virtual Storage and Computing Resource, and Security Control]

  24. Research Focus 3: Information Modeling and Querying • Simple, rich, unified information model & associated query languages, e.g. • Google Base approach promising: • Defined typed attributes for navigation • Defined label for keyword search • Infosphere, MUSIC • Open community (RSS / Atom / wiki) • Automatic schema discovery and integration – self-organizing! • Integrating solutions from Infosphere, CLIO • Intelligence discovery • Automatic discovery of semantics (UIMA, Web Fountain, Avatar) • Pro-active, continuous mining (vs. passive BI model) • Contextual information supply • Including reporting and advanced analytics
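The Google-Base-style model the slide calls promising can be made concrete with a small sketch. The items, attribute names, and functions below are invented for illustration: each item carries typed attributes (used for navigation, like a WHERE clause) and a free-text label (used for keyword search).

```python
# Illustrative sketch of items with typed attributes for navigation
# plus a free-text label for keyword search. All content is invented.

items = [
    {"label": "used mountain bike, great condition",
     "attrs": {"kind": "bike", "price": 120}},
    {"label": "road bike, carbon frame",
     "attrs": {"kind": "bike", "price": 900}},
    {"label": "bike helmet, never worn",
     "attrs": {"kind": "helmet", "price": 25}},
]

def navigate(attr, value):
    """Typed-attribute navigation, analogous to a WHERE clause."""
    return [i for i in items if i["attrs"].get(attr) == value]

def keyword(term):
    """Keyword search over the free-text label."""
    return [i for i in items if term.lower() in i["label"].lower()]

print(len(navigate("kind", "bike")))  # 2: precise, attribute-based
print(len(keyword("bike")))           # 3: recall-oriented, text-based
```

The two access paths over the same items are what "combine best features of keyword search & SQL" would mean in practice.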

  25. Eliminate Admin Tasks… …Rather than adding layers (1 of 3): • Special-purpose, turn-key appliances for basic services • vs. today’s general-purpose SW (but still uses off-the-shelf hardware!) • Bundled, Pre-installed, Pre-configured, Pre-tuned software! • Examples: • Information Management appliance • Web Server appliance • Minimizes interfaces user has to worry about • No need to externalize underlying operating system, storage details • Eliminates need to install, configure, and tune • Self-organizing data systems • Automatic discovery of data structure • Obviates need to • Define logical and physical schema a priori, reducing time to value • Migrate schema when organization changes

  26. Eliminate Admin Tasks (2 of 3): • Universal Data Management • Today: • Plethora of special-purpose data managers: • Databases for structured data • Content managers for semi-structured data • File systems for unstructured data • For each, very different • User interfaces (SQL, JSR 170, file interface) • Degrees of semantic knowledge about the data’s contents • Degrees of searchability • Consistency semantics (e.g., transactions) when updated • Management capabilities and interfaces • Tomorrow: Single mechanism for managing all data • Uniform interfaces for all types of data, for • Searching • Updating • Management • Universal indexing (“Google model”) of all data – default search mechanism • Plus more precise searching for auto-discovered (above) structured information • Obviates need to impose naming conventions to find desired data
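The universal indexing ("Google model") idea above can be sketched minimally. This is my own illustration, with invented item identifiers and content: every item, whatever its source (a table row, a document, a file), is tokenized into one shared inverted index that serves as the default search mechanism.

```python
# Minimal sketch of a universal inverted index over heterogeneous items.
# Item ids and contents are invented for illustration.
import re
from collections import defaultdict

index = defaultdict(set)  # token -> set of item ids

def ingest(item_id, content):
    """Index any item by the word-like tokens found anywhere in it."""
    for token in re.findall(r"[a-z0-9]+", str(content).lower()):
        index[token].add(item_id)

ingest("row:42", {"name": "invoice", "total": 99})      # a table row
ingest("doc:7", "quarterly invoice summary")             # a document
ingest("file:a", "photo of the launch event")            # a file

def lookup(word):
    """Default search: one mechanism for all item types."""
    return sorted(index.get(word.lower(), set()))

print(lookup("invoice"))  # ['doc:7', 'row:42']
```

Because everything goes through `ingest`, no naming convention is needed to find an item; more precise, attribute-aware search could then be layered on top for data whose structure has been auto-discovered.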

  27. Eliminate Admin Tasks (3 of 3): • Robust storage mechanisms to eliminate need for backups • Never throw out data – keep versions! • Update-in-place • Is an anachronism from days of expensive disk • Increases complexity of transactions • Jeopardizes compliance requirements (Sarbanes-Oxley) • Versions permit queries “as of” some time • Exploits storage density increases (relative to number of disk arms) • RAID provides local reliability • Widely accepted and deployed • Weaver Codes extend to multiple simultaneous failures • How to provide universal reliability (i.e., against site disasters)? • Selective, automated replication of new versions? • Cross-site RAID? • Universal “Call Home” technology for remote management of • Monitoring • Problem determination • Software maintenance & upgrades
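The "never throw out data" point can be illustrated with a hedged sketch (my own, not the appliance's storage layer; the class and timestamps are invented): every put appends a new version instead of updating in place, so reads can be "as of" any earlier point in time.

```python
# Sketch of an append-only versioned store: no update-in-place,
# and queries "as of" some time. Invented for illustration only.
import itertools

class VersionedStore:
    def __init__(self):
        self._log = []                      # append-only (ts, key, value)
        self._clock = itertools.count(1)    # toy logical timestamps

    def put(self, key, value):
        ts = next(self._clock)
        self._log.append((ts, key, value))  # never overwrite old versions
        return ts

    def get(self, key, as_of=None):
        """Latest value for key at or before timestamp `as_of`."""
        best = None
        for ts, k, v in self._log:
            if k == key and (as_of is None or ts <= as_of):
                best = v
        return best

store = VersionedStore()
t1 = store.put("balance", 100)
t2 = store.put("balance", 250)
print(store.get("balance"))            # 250: the current version
print(store.get("balance", as_of=t1))  # 100: the value "as of" t1
```

Keeping the full log is exactly the trade the slide proposes: spend cheap storage density to get simpler transactions, compliance-friendly history, and time-travel queries.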

  28. Observation / Requirements • Information converging: Store / Search / Analyze ALL data • Structured (traditional Data Base) • Semi-structured (traditional Content Management, XML, multi-media, call center records) • Unstructured (text) • Same advanced functionality required • Data volume growing fast: On Demand strategy requires massive scale-out • Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built) • Data is kept longer, due to compliance requirements • Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile) • People costs dominate TCO: Hardware often less than 50% of TCO • Hence, sacrifice resource utilization for radical simplification • Delivered in services or appliances • Today’s IM software based upon hardware 30 yrs ago: Need new software architecture • Cheap CPUs, large storage, fast network in hardware • Opportunity to radically rethink IM software architecture, based upon: • Hardware economics (e.g., cheap CPUs, storage, memory, & network) • Data: • Formats (e.g., XML, semi-structured data) • Functionality required (e.g., unstructured search, analytics)

  29. Total Cost of Ownership is the Driver

  30. Changing Characteristics of Data [Chart: data classes – transactions and structured data; text and other human data; machine-generated and unstructured data – compared on axes of Actionability, Heterogeneity, and Scale] • Seat on an airplane: easy to find, structured • Life-science data (protein folding, gene expression): difficult to analyze, but we know where to look • Satellite and surveillance data: an infinite space of “patterns”

  31. Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance • A box with software pre-installed • How delivered to enterprise: appliance or service • What functions? • Store and manage all information: accept all types of enterprise data • Deliver all intelligence: integrate cross-silo information; advanced analytics with richer semantics • What properties? • Low TCO: easy to deploy (“plug & play”), simple and stable • Scalability: from SMB to Very Large (Petabytes) (Not for high-end OLTP!) [Diagram: Impliance exposes native retrieval and native update/load interfaces over JCR content, SQL relational data, XSLT/XML, HTTP web pages, video, Archive/ILM, … – Data+Content+Digital Media]
