Lecture 14: Overview of Post-Relational Development

Oct. 13, 2006, ChengXiang Zhai

New Challenges in Databases • Traditional: RDBMS functions, relational data, traditional users • New: data/info management functions? New data types? New users?

New Kinds of Data • Text data • Multimedia data • Scientific data • Sensor data • Log data • Personal data • Web/Email/Blog • ... Callouts: Ranking in DB; “Schema Lean/Last” (semi-structured data model); Complex object indexing; Stream data; Data mining; Data integration; Internet computing applications

New Users • Everyone?

New Functions • Information integration • Navigation • Ranking • Pattern finding (data mining) • Decision support Callouts: New/more general data model/architecture? (Object-oriented); New algorithms; Adding intelligence to DB

New Computing Environment • Distributed computing/Networks (Internet) • Mobile devices (cell phones, PDAs) Callouts: Distributed DB; Peer-to-Peer (P2P) DB; Mobile DB?

Web Changes Everything Observations: • Publishing of data is almost free • many are simultaneously producer and consumer • Web is becoming a huge database • of distributed data online (published by everyone) • of autonomous databases online • Trends: • static HTML pages --> dynamic pages presenting DB • HTML --> XML for better describing structured data Slide from Kevin Chang’s presentation

Web Changes Everything What is needed: • Content producers: • tools for building huge data stores • Content consumers: • tools for discovering and querying info. on the web Slide from Kevin Chang’s presentation

Database Technology Timeline • Scope: Simple Data Management --> Global Enterprise Management • Periods: Early 80s; Late 80s; Early-Mid 90s; Late 90s-21st C • Generations: Pre-relational; Early Relational; Client-server Relational; Enterprise-capable Relational; Internet Computing • Workloads: Simple OLTP; Data Warehouse & Hi-end OLTP; Packaged & Vertical Applications • Features: Simple transactions, on-line backup & recovery; Stored procedures, triggers; Active Database; Scaleable OLTP, parallel query, partitioning, cluster support, row-level locking, high availability; Support for all types of data, extensibility, objects; Middleware (messaging, queues, events); Java, CORBA, Web interfaces Slide from Anil Nori’s presentation

Current State of DBMSs • OLTP applications • Large amounts of data • Simple data, simple queries and updates • Update statement from debit/credit transaction: UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid; • Typically update intensive • Large number of concurrent users (transactions) • Data warehousing applications • Large amounts of data • Simple data but complex querying • Typically read intensive • Large number of users Slide from Anil Nori’s presentation
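
The debit/credit update on this slide can be tried out with a small sketch. It assumes Python's standard sqlite3 module and an in-memory database; the accounts table layout and the helper name apply_delta are illustrative assumptions, not part of the original slide.

```python
import sqlite3

def apply_delta(conn, aid, delta):
    """Apply one debit/credit update as a single transaction (hypothetical helper)."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid",
            {"delta": delta, "aid": aid},
        )
        if cur.rowcount != 1:
            raise KeyError(f"no such account: {aid}")

# Assumed toy schema: one account with a starting balance of 100.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (aid INTEGER PRIMARY KEY, abalance INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

apply_delta(conn, 1, -30)  # debit 30 from account 1
balance = conn.execute(
    "SELECT abalance FROM accounts WHERE aid = 1"
).fetchone()[0]
```

The statement touches one row by key, which is what makes this workload "simple data, simple updates": the difficulty lies in running very many such transactions concurrently, not in any single one.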

Current State of DBMSs • These applications require: • Support for large numbers of users/transactions • High performance • High availability (7x24 operations) • Scalability • High levels of security • Administrative support • Good utilities Slide from Anil Nori’s presentation

Internet Applications: Challenges [comparison table; cell alignment lost in extraction] • Sections: Transaction Processing (Larger User Populations); Data Warehousing • Row labels: Users, Systems, Network, Management, Usage, Operations, Hours, Importance, Size • Cell values: Trained, Self-Service; Independent, Integrated; Gigabytes, Terabytes; Batch, Immediate; Simple, Intelligent; Local, Global; Useful, Business-Critical; Analysts, Every Employee Slide from Anil Nori’s presentation

Internet Applications: Challenges [comparison table; cell alignment lost in extraction] • Sections: E-commerce/Apps; Information Management; Site Operation • Row labels: APIs, Type, Applications, Delivery, Access, Content, Management, Availability • Cell values: Proprietary, Open; Tabular, Heterogeneous; Standalone, Integrated; Generic, Personalized; Read/write, Lots of read-only; Direct, Search; Low TCO, Mission Critical; Occasional, 24X7 Slide from Anil Nori’s presentation

Internet Challenges • Availability • Need near 100% availability • Must be easy to manage • Replication, hot standby, foolproof system? • Scalability • Number of users is orders of magnitude higher • Security • Global users • Managing millions of users • Encryption • Performance • Internet user expectations • Speed vs. correctness (e.g., search engines vs. blade/cartridge/extender) • Availability vs. correctness Slide from Anil Nori’s presentation

Selected Current Topics • Text Database and Information Retrieval • Ranking in Databases • Data Integration • P2P Databases • Data Warehousing & OLAP • Data Mining • Stream Data Processing • Web Services • Semi-Structured Data (XML)

Today’s Topic • Evolution of data models • Object-oriented DBs vs. Object relational DBs • XML “revolution”

Nine Historical Epochs • Hierarchical (IMS): late 1960’s and 1970’s • Network (CODASYL): 1970’s • Relational: 1970’s and early 1980’s • Entity-relationship: 1970’s • Extended relational: early 1980’s • Semantic: late 1970’s and 1980’s • Object-oriented: late 1980’s and early 1990’s • Object-relational: late 1980’s and early 1990’s • Semi-structured (XML): late 1990’s to present

Pre-Relational Era • IMS (hierarchical data model): Lessons • L1: Physical and logical data independence are highly desirable • L2: Tree structured data models are very restrictive • L3: It is a challenge to provide sophisticated logical reorganization of tree structured data • L4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard • CODASYL (network data model): Lessons • L5: Networks are more flexible than hierarchies but more complex • L6: Loading and recovering networks is more complex than hierarchies

Relational Era • The “relational” vs. CODASYL debate was settled by • The success of the VAX • The non-portability of CODASYL engines • The complexity of IMS logical data bases • Lessons: • L7: Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence • L8: Logical data independence is easier with a simple data model than with a complex one • L9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology • L10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers
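
Lessons L4, L7 and L10 contrast record-at-a-time with set-at-a-time access. A minimal sketch of the difference follows, assuming Python's sqlite3 module; the supplier/supply tables and both function names are made up for illustration. In the first function the programmer hard-codes the access path (scan suppliers, then probe supply); in the second, a single declarative query leaves that choice to the optimizer.

```python
import sqlite3

# Assumed toy data: three suppliers and the parts they supply.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE supply (sno INTEGER, pno INTEGER, qty INTEGER);
    INSERT INTO supplier VALUES (1, 'Boston'), (2, 'Urbana'), (3, 'Boston');
    INSERT INTO supply VALUES (1, 10, 5), (2, 10, 7), (3, 11, 2), (3, 10, 1);
""")

def boston_parts_record_at_a_time(conn):
    """Navigate one record at a time; the loop order is a manual query plan."""
    parts = set()
    suppliers = conn.execute("SELECT sno, city FROM supplier").fetchall()
    for sno, city in suppliers:
        if city != "Boston":
            continue
        # One probe of supply per qualifying supplier.
        for (pno,) in conn.execute(
            "SELECT pno FROM supply WHERE sno = ?", (sno,)
        ):
            parts.add(pno)
    return sorted(parts)

def boston_parts_set_at_a_time(conn):
    """State what is wanted; the DBMS picks the join order and access path."""
    rows = conn.execute("""
        SELECT DISTINCT s.pno
        FROM supply AS s JOIN supplier AS r ON r.sno = s.sno
        WHERE r.city = 'Boston'
        ORDER BY s.pno
    """)
    return [pno for (pno,) in rows]

navigational = boston_parts_record_at_a_time(conn)
declarative = boston_parts_set_at_a_time(conn)
```

Both functions return the same answer, but only the second keeps working unchanged if an index is added or the physical layout changes, which is the physical data independence argument of L7.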

The Entity-Relationship Era • Proposed in mid 1970’s by Peter Chen • Never gained acceptance as the underlying data model implemented by a DBMS • No query language? • Over-shadowed by the relational model? • Looked too much like a “cleaned up” version of CODASYL? • But widely successful for DB schema design • DB design using normalization was “dead in the water” • It was straightforward to convert an ER diagram into a set of tables in 3rd normal form • Lessons: • L11: Functional dependencies are too difficult for mere mortals to understand. Another reason for KISS (Keep it simple, stupid).
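
The claim that an ER diagram converts mechanically into tables can be sketched as follows. This is a simplified illustration, not a full design tool: the dictionary encoding of the diagram, the function name er_to_ddl, and the column types (TEXT for entity attributes, INTEGER for keys and relationship attributes) are all assumptions. Each entity becomes a table keyed by its identifier, and each many-to-many relationship becomes a table whose key is the combination of the participating entity keys.

```python
import sqlite3

# Hypothetical ER diagram: two entities and one many-to-many relationship.
entities = {
    "supplier": {"key": "sno", "attrs": ["sname", "city"]},
    "part": {"key": "pno", "attrs": ["pname"]},
}
relationships = {
    "supply": {"between": ["supplier", "part"], "attrs": ["qty"]},
}

def er_to_ddl(entities, relationships):
    """Return one CREATE TABLE statement per entity and per relationship."""
    ddl = []
    for name, ent in entities.items():
        cols = [f"{ent['key']} INTEGER PRIMARY KEY"]
        cols += [f"{attr} TEXT" for attr in ent["attrs"]]
        ddl.append(f"CREATE TABLE {name} ({', '.join(cols)})")
    for name, rel in relationships.items():
        keys = [entities[e]["key"] for e in rel["between"]]
        # Foreign keys back to the participating entities.
        cols = [
            f"{key} INTEGER NOT NULL REFERENCES {ent}({key})"
            for key, ent in zip(keys, rel["between"])
        ]
        cols += [f"{attr} INTEGER" for attr in rel["attrs"]]
        cols.append(f"PRIMARY KEY ({', '.join(keys)})")
        ddl.append(f"CREATE TABLE {name} ({', '.join(cols)})")
    return ddl

ddl = er_to_ddl(entities, relationships)
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Because every non-key column in each generated table describes only that table's key, the designer gets well-normalized tables without ever writing down a functional dependency, which is the point of lesson L11.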

Extended Relational (R++) Era • Beginning in the early 1980’s • A sizeable collection of papers following this template: • Consider an application, call it X • Try to implement X on a relational DBMS • Show why the queries are difficult or why poor performance is observed • Add a new “feature” to the relational model to correct the problem • Valuable contributions • Set-valued attributes (e.g., available colors of an item) • Aggregation (tuple-reference as a data type, e.g., supply(PT, SR, qty, price), where “PT” and “SR” are pointers to tuples) • Generalization (inheritance) • Lessons: • L12: Unless there is a big performance or functionality advantage, new constructs will go nowhere.
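
Two of the R++ constructs, set-valued attributes and tuple references, can be illustrated with plain Python dataclasses. This is only an analogy under stated assumptions: the class and field names follow the supply(PT, SR, qty, price) example on the slide, while Item, flatten_colors and the sample values are invented for the sketch. The last function shows what a flat relational schema needs instead of a set-valued attribute: one (item, color) row per element.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Part:
    pno: int
    name: str

@dataclass(frozen=True)
class Supplier:
    sno: int
    city: str

@dataclass
class Item:
    name: str
    colors: set = field(default_factory=set)  # set-valued attribute

@dataclass
class Supply:
    PT: Part        # reference to a Part tuple, not a copied key value
    SR: Supplier    # reference to a Supplier tuple
    qty: int
    price: float

def flatten_colors(items):
    """Flat relational equivalent of Item.colors: one row per (item, color)."""
    return sorted((item.name, color) for item in items for color in item.colors)

bolt = Part(10, "bolt")
acme = Supplier(1, "Boston")
shipment = Supply(PT=bolt, SR=acme, qty=5, price=0.25)
shirt = Item("shirt", {"red", "blue"})
color_rows = flatten_colors([shirt])
supplier_city = shipment.SR.city  # follow the reference instead of joining
```

Following shipment.SR.city replaces a join on the supplier key, which is the convenience these proposals offered; lesson L12 is that such convenience alone was not enough to displace the plain relational model.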

The Semantic Data Model (SDM) Era • Early 1980’s • Motivation: relational data model is “semantically impoverished” (can’t easily express a class of data of interest) • Define more general classes, allowing multiple inheritance • Most SDMs were very complex, and were generally paper proposals • Had the same problems as the R++ work

Object-Oriented (OO) Era • Beginning in the mid 1980’s • Motivation: “impedance mismatch” between relational DBs and languages like C++ • DBs have their own naming systems, data type systems, and conventions for returning data as results • Need conversions between DB conventions and programming language conventions • Like “gluing an apple onto a pancake” • As a result, persistent programming languages attracted much attention
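
The "impedance mismatch" can be made concrete with a short sketch. Python stands in here for C++ (an assumption made for brevity), and the emp/emp_skill tables, the Employee class and load_employee are all hypothetical. The database hands back flat tuples of its own types, spread over two tables; the program wants one object with a native date and a list. The glue code in load_employee is exactly the conversion layer the slide complains about.

```python
import sqlite3
from dataclasses import dataclass
from datetime import date

@dataclass
class Employee:
    name: str
    hired: date      # native date object in the program
    skills: list     # nested collection in the program

# Assumed schema: dates stored as ISO text, skills in a separate table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (eid INTEGER PRIMARY KEY, name TEXT, hired TEXT);
    CREATE TABLE emp_skill (eid INTEGER, skill TEXT);
    INSERT INTO emp VALUES (1, 'Ann', '2006-10-13');
    INSERT INTO emp_skill VALUES (1, 'SQL'), (1, 'C++');
""")

def load_employee(conn, eid):
    """Convert DB rows and DB types into the program's object and types."""
    name, hired_text = conn.execute(
        "SELECT name, hired FROM emp WHERE eid = ?", (eid,)
    ).fetchone()
    skills = [
        skill
        for (skill,) in conn.execute(
            "SELECT skill FROM emp_skill WHERE eid = ? ORDER BY skill", (eid,)
        )
    ]
    return Employee(name=name, hired=date.fromisoformat(hired_text), skills=skills)

ann = load_employee(conn, 1)
```

Every class in the application needs a loader (and a matching saver) of this kind, and persistent programming languages were an attempt to make that layer disappear.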

Persistent Programming Language • Characteristics • Variables can represent disk-based data as well as main memory data • DB search criteria = language constructs • Early prototypes (late 1970’s): Pascal-R, Rigel, … • Cleaner than SQL embedding • However, compiler must be extended with DBMS-oriented functionality (not very successful) • No technology transfer
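
As a rough modern analogy for the two characteristics above, Python's standard shelve module lets a dictionary-like variable stand for disk-based data, and ordinary language constructs act as the search criteria. This is not one of the historical systems (Pascal-R, Rigel); the file location and the circuit records are made-up examples.

```python
import os
import shelve
import tempfile

# Hypothetical store location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "circuits")

# First "program run": assign to the persistent variable like a normal dict.
with shelve.open(path) as db:
    db["adder"] = {"gates": ["xor", "and"], "pins": 3}
    db["inverter"] = {"gates": ["not"], "pins": 2}

# Later "program run": the data is still there, and a comprehension is the query.
with shelve.open(path) as db:
    restored = db["adder"]
    many_pins = sorted(name for name, circuit in db.items() if circuit["pins"] > 2)
```

The comprehension is evaluated by the language runtime, one object at a time, with no optimizer behind it. That is cleaner than embedding SQL strings, but it also hints at why such designs struggled against declarative query languages.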

Object-Oriented Data Bases • In the mid 1980’s, C++ triggered resurgence of interest in persistent programming languages • Research systems: Garden, Exodus • Startups: Ontologic, Object Design, Versant • General goal: persistent C++ • Extend C++ as a data model • Any C++ structure can be persisted • Support “relationship” • Application/market domain: engineering DBs • Typically, open a large object (e.g., electronic circuit), process it exclusively and close it. • No need for a declarative query language (only need to reference objects) • No fancy transaction management is needed (one-user-at-a-time) • Performance has to be competitive with conventional C++

Current Status of OODB • Market never got very large (too many vendors competing for a “niche” market) • The OODB vendors either have failed or repositioned their companies to offer something else • E.g., Object Design is now Excelon and selling XML services • Reasons for the failure • For their own market: absence of leverage, no standard, relink the world • For competing with Relational DBs: lack of transactions, low-level record-at-a-time (with the exception of O2, which embedded a declarative language, i.e., OQL into a programming language) • Lesson: • L13: Packages will not sell to users unless they are in “major pain”

The Object-Relational Era • Motivated by the need for handling geographic data • Question: How to extend a relational DB to handle new data types? • The object-relational proposal: add the following to SQL (Postgres): • User-defined data types • User-defined operators • User-defined functions, and • User-defined access methods • Commercially successful: • Postgres->Illustra (acquired by Informix) • Lessons: • L14: The major benefit of OR is two-fold: putting code in the database (thereby blurring the distinction between code and data) and user-defined access methods • L15: Widespread adoption of new technology requires standards and/or an elephant pushing hard
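
Two of the four object-relational extensions, user-defined types and user-defined functions, can be sketched with Python's sqlite3 module, which exposes hooks for both. SQLite is used here only because it ships with Python; it is not Postgres, and user-defined operators and access methods are not shown. The Point type, its "x;y" text encoding, the distance function and the city table are assumptions made for the example.

```python
import math
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

# User-defined type: tell sqlite3 how to store and reload Point values.
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
sqlite3.register_converter(
    "point", lambda raw: Point(*map(float, raw.decode().split(";")))
)

def distance(a, b):
    """User-defined function: runs inside the query, on stored 'x;y' text."""
    ax, ay = map(float, a.split(";"))
    bx, by = map(float, b.split(";"))
    return math.hypot(ax - bx, ay - by)

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.create_function("distance", 2, distance)
conn.execute("CREATE TABLE city (name TEXT, loc point)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("A", Point(0.0, 0.0)), ("B", Point(3.0, 4.0)), ("C", Point(30.0, 40.0))],
)

# The geographic predicate is evaluated by the database engine, not the client.
near_origin = conn.execute(
    "SELECT name FROM city WHERE distance(loc, ?) <= 5 ORDER BY name",
    (Point(0.0, 0.0),),
).fetchall()
loc_b = conn.execute("SELECT loc FROM city WHERE name = 'B'").fetchone()[0]
```

Without a user-defined access method the engine still has to evaluate distance on every row; that gap is why L14 singles out access methods, alongside code in the database, as the major benefit.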

Semi-Structured Data • Motivation: abundance of semi-structured data, exchange format, … • Early system: Lore • Current standards: XMLSchema, XQuery • Two major points • Schema last • Complex network-oriented data model

Schema Last • Application categories • Rigidly structured data • Rigidly structured data with some text fields • Semi-structured data (need to handle semantic heterogeneity) • Text • Very few examples of the 3rd category • The 3rd category can be converted to 1 and 2.
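
A small sketch of what "schema last" means in practice, and of the claim that semi-structured data can be converted into the rigid categories. The records, field names and both helper functions are hypothetical. Data is entered first as self-describing records with differing fields; the schema is derived afterwards, and the records are then padded into a fixed set of columns.

```python
# Self-describing records entered without any schema declared up front.
records = [
    {"name": "Ann", "email": "ann@example.org"},
    {"name": "Bob", "phone": "555-0100", "blog": "http://example.org/bob"},
]

def infer_schema(records):
    """Derive the schema after the fact: field name -> set of observed types."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

def to_rigid(records):
    """Convert to rigidly structured rows, using None for missing fields."""
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows

schema = infer_schema(records)
columns, rows = to_rigid(records)
```

The conversion is mechanical only when equal field names mean the same thing in every record. Semantic heterogeneity (for example, "phone" in one source and "tel" in another) is not resolved by deferring the schema.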

XML Data Model • XML Records can be hierarchical as in IMS • Have “links” as in CODASYL • Have set-based attributes as in SDM • Inherit from other records as in SDM • And others that are known to be hard to implement • Possible scenarios: • XMLSchema will fail • A data-oriented subset of XMLSchema will be proposed • Repeat the “great debate” • Lessons: • L16: Schema-last is probably a niche market • L17: XQuery is pretty much OR SQL with a different syntax • L18: XML will not solve the semantic heterogeneity problem either inside or outside the enterprise
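
Three of the listed features (hierarchy, links, set-based attributes) can be seen in one small document. The sketch below uses Python's standard xml.etree.ElementTree; the catalog document, its element names and the id/supplier attributes are invented for illustration, and the path expressions are ElementTree's limited XPath subset rather than XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: nesting (hierarchy), an id reference (link),
# and repeated <color> elements (set-based attribute).
doc = ET.fromstring("""
<catalog>
  <supplier id="s1"><city>Boston</city></supplier>
  <part id="p1" supplier="s1">
    <name>bolt</name>
    <color>red</color>
    <color>blue</color>
  </part>
</catalog>
""")

# Hierarchical access: descend from the root to one part record.
part = doc.find("part[@id='p1']")

# Set-based attribute: every <color> child of that part, in document order.
colors = [color.text for color in part.findall("color")]

# Link: follow the supplier reference by hand, CODASYL-style.
supplier_ref = part.get("supplier")
supplier = doc.find(f"supplier[@id='{supplier_ref}']")
city = supplier.findtext("city")
```

Following the link takes an explicit second lookup written by the programmer, a small reminder of the navigational style that lessons L4 and L5 warn about.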

What You Should Know • New developments in databases are mostly driven by new applications • The impact of a technology highly depends on the market (the right time, right environment, …) • Cycles of data models (complex->simple->complex…)
