Tackling IT Challenges: Insights and Solutions from Industry Expert Michael Stonebraker

So What To Do Next? Michael Stonebraker Adjunct Professor Massachusetts Institute of Technology (stonebraker@lcs.mit.edu)

Where To Find Problems • State of affairs • Interesting industrial problems • Mike’s picks • My whine on XML • Grand challenges

State of Affairs • IT failure rate • Software half-life • No knobs

State of Affairs • ~50-75% of IT projects fail • if we built bridges, our profession would be fired • and the same mistakes are repeated over and over (excessive ambition, rolling specs, bad design, failure to load a large data set early)

What To Do? • We typically don’t teach this stuff • probably because we don’t (can’t) spend any time in industry to figure it out Action item: at the very least read a couple of Robert L. Glass’s books

State of Affairs • Hardware “half-life” is 18 months • Software half-life is 18 years (or more)!

What To Do? • Much higher level design environments • we are stuck at the general purpose programming level (conceivable benefit limited) • workflow and other higher level graphical notations probably a good idea

What To Do? • special purpose languages nice (why are report writers shunned?) • higher level versions of SQL and XQuery • See Informix Visionary for a cool example

State of Affairs • Commercial products are way too hard to use • takes people in white lab coats to get them up and keep them up • Full employment act for DBAs forever

What To Do? • “No knobs” • only buttons are “go” and “stop” • all tuning automatic • index selection is one of the minor ones (buffer pool size, partitioning, log buffer pool size, …) • Error reporting stinks

Interesting Industrial Problems Should Focus Research • BBC • OZ entertainment • Cisco • Akamai • Fidelity My suggestion: NSF should require a letter of support from a CIO with each grant proposal.

Interesting Problems -- BBC • Digitize 50 years of British television creativity • want to serve it up on demand • especially British soccer games • media is wearing out • Random access to 1 Petabyte (or so) • By the 200 million unwashed internet users

CNN Variation • On-line digital news editing by 300 news directors • who want to find Monica Lewinsky • and 30 seconds of footage on suffering in Bosnia

What To Do? • Content outlives support for the content format • Automatic content indexing • cannot afford a librarian • Global scale distributed system • Staging and caching • high locality of reference

What To Do? • Query model meets visualization systems • the unwashed will not learn XQuery • Rights management • incredibly sticky issue in whole area

Interesting Problem - OZ Entertainment • New theme park near Kansas City • “no lines” • no lost kids • virtual theme park as teaser

What To Do? • Large scale GIS • update intensive! • Large scale triggering problem • alert me if there is a cancellation at X and I am within 300 yards
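The "alert me if there is a cancellation at X and I am within 300 yards" trigger can be sketched as a standing subscription that is checked against current positions when a cancellation arrives. This is a minimal sketch only: the attraction names, visitor names, flat grid in yards, and the function `visitors_to_alert` are all hypothetical, and the linear scan stands in for the spatial index a park-scale, update-intensive system would need.

```python
import math

ALERT_RADIUS_YARDS = 300.0

# Hypothetical attraction locations, in yards on a flat local grid.
attractions = {"coaster": (0.0, 0.0), "ferris_wheel": (1000.0, 0.0)}
# Latest known visitor positions; in a real park these are updated constantly.
positions = {"ann": (100.0, 200.0), "bob": (400.0, 0.0), "cy": (0.0, 300.0)}
# Standing triggers: visitor -> attractions whose cancellations they care about.
subscriptions = {
    "ann": {"coaster"},
    "bob": {"coaster"},
    "cy": {"coaster", "ferris_wheel"},
}

def visitors_to_alert(attraction, radius=ALERT_RADIUS_YARDS):
    """Return subscribed visitors currently within `radius` yards of the attraction."""
    ax, ay = attractions[attraction]
    alerts = []
    for visitor, wanted in subscriptions.items():
        if attraction not in wanted:
            continue
        x, y = positions[visitor]
        if math.hypot(x - ax, y - ay) <= radius:
            alerts.append(visitor)
    return sorted(alerts)

# A cancellation at the coaster fires the trigger for nearby subscribers only.
coaster_alerts = visitors_to_alert("coaster")
```

The hard part at scale is that both sides move: positions change continuously and cancellations arrive at any time, so the check cannot be a full scan per event.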

Interesting Problem - Cisco Systems • Supply chain of 60K suppliers for custom goods • Want to query the transitive closure of this supply chain • can I make 10 more routers next week?
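Querying the transitive closure of a supply chain can be sketched, on a single machine, as a graph traversal followed by a capacity check. This is an illustrative sketch only: the parts, the spare-capacity numbers, and the assumption that each router needs exactly one of every upstream input are hypothetical. In the real problem the graph is spread over tens of thousands of independent suppliers.

```python
from collections import deque

# Hypothetical supply graph: item -> items it is directly built from.
built_from = {
    "router": ["chassis", "line_card"],
    "line_card": ["asic", "pcb"],
    "asic": ["wafer"],
    "chassis": [],
    "pcb": [],
    "wafer": [],
}

# Hypothetical spare capacity next week, in units, for each input.
spare_next_week = {"chassis": 50, "line_card": 12, "asic": 10, "pcb": 40, "wafer": 8}

def upstream_closure(item):
    """All direct and indirect inputs of `item` (breadth-first traversal)."""
    seen = set()
    queue = deque(built_from.get(item, []))
    while queue:
        part = queue.popleft()
        if part in seen:
            continue
        seen.add(part)
        queue.extend(built_from.get(part, []))
    return seen

def can_build(item, quantity):
    """True if every upstream input has spare capacity for `quantity` units.

    Assumes one unit of each input per unit of `item` (a simplification).
    """
    return all(
        spare_next_week.get(part, 0) >= quantity
        for part in upstream_closure(item)
    )

ten_more_routers = can_build("router", 10)
```

Here the answer to "can I make 10 more routers next week?" is decided by the scarcest input anywhere in the closure, not by the direct suppliers alone.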

What To Do? • Huge federated system • central metadata a non-starter • no single DBA • global query optimizer a non-starter • Adapters for 1M (or so) legacy systems • how to write them semi-automatically?

Interesting Problem - Akamai • Billing is 95/5 • 5 minute intervals • pay for bandwidth of 95th percentile • 300 Gbytes a day (compressed) of click stream data Biggest warehouses on the planet will soon be click stream data!
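The 95/5 billing rule can be sketched as follows. This is a minimal sketch assuming one bandwidth sample per 5-minute interval and the nearest-rank definition of a percentile; the function name and the sample values are hypothetical, and actual billing contracts may define the percentile differently.

```python
def billable_rate_95th(samples_mbps):
    """Return the 95th-percentile sample using the nearest-rank method.

    Equivalent to sorting the samples, discarding the top 5%, and
    billing on the highest sample that remains.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    # ceil(0.95 * n) in integer arithmetic, to avoid floating-point rounding.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]

# Hypothetical day: 288 five-minute samples, mostly 100 Mbps with
# ten short bursts at 900 Mbps (under 5% of the intervals).
samples = [100.0] * 278 + [900.0] * 10
rate = billable_rate_95th(samples)
```

Because the bursts occupy fewer than 5% of the intervals, they fall in the discarded tail and the customer is billed at the 100 Mbps level.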

Click Stream Data • Customers want to mine their click stream • And Akamai only has a portion of it • i.e. huge distributed database • Query is “tell me something interesting” • e.g. why are 95% of the shopping carts abandoned? • and not a pile of statistics

Interesting Problem - Fidelity • Financial portal for high net worth individuals • must connect to several hundred Fidelity systems • Customers want to know fairly complex things • e.g. rank my money manager against all value managers for 1, 3 and 5 years

What to Do? • Voice to NL to structured data • voice to NL works in focused verticals (weather, airline schedules) • but this is a pretty broad app • NL to structured data requires some work • put in the joins • look up vocabulary in the DBMS

What to Do? • How to join unstructured data to structured data • tell me the news stories about all stocks which have increased in value more than 10% today
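The example query can be sketched as a structured filter followed by a text match. This is a deliberately naive sketch: the symbols, prices, and headlines are made up, and matching on ticker or company name with a regular expression stands in for whatever entity recognition a real system would need.

```python
import re

# Hypothetical structured data: symbol -> (company name, open price, latest price).
quotes = {
    "ACME": ("Acme Corp", 20.0, 23.0),
    "GLOBX": ("Globex", 50.0, 51.0),
    "INIT": ("Initech", 10.0, 11.5),
}

# Hypothetical unstructured data: news headlines.
stories = [
    "Acme Corp beats earnings estimates",
    "Globex announces new chief executive",
    "Analysts upgrade INIT after product launch",
    "Weather delays flights across the Midwest",
]

def big_gainers(threshold=0.10):
    """Structured side: symbols whose price rose by more than `threshold` today."""
    return {
        sym
        for sym, (_, open_price, last_price) in quotes.items()
        if (last_price - open_price) / open_price > threshold
    }

def stories_about(symbols):
    """Unstructured side: (symbol, story) pairs where the story mentions the stock."""
    results = []
    for sym in sorted(symbols):
        name = quotes[sym][0]
        pattern = re.compile(
            r"\b(?:%s|%s)\b" % (re.escape(sym), re.escape(name)),
            re.IGNORECASE,
        )
        results.extend((sym, story) for story in stories if pattern.search(story))
    return results

joined = stories_about(big_gainers())
```

The join predicate is the weak point: "mentions the stock" is easy to state and hard to evaluate, which is why this is listed as an open problem.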

Mike’s Picks • Too much middleware • Akamai for structured data

Interesting Problem - Middleware • Average enterprise has • one (or more) app servers • one (or more) EAI packages • one (or more) ETL packages • one (or more) portal products • one (or more) application packages • and maybe someday a federated DBMS

All of these systems • Contain transformation engines • And often do function activation (app service) • And often have adapters to legacy systems Huge overlap in functionality!!

What to Do? • Consolidate weaker paradigms under stronger ones • e.g. federated DBMS subsumes ETL • OR DBMS subsumes app service Middleware becomes DBMS-centric!

Interesting Problem - Caching • Akamai et al. cache HTML • closer to the browser that wants it • Would be nice to cache structured data • need to cache the application that uses the data • and the data

What to Do? • Materialized views are a predefined solution • Nice to have a more dynamic one • Cache (query, answer) pairs?
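Caching (query, answer) pairs can be sketched as a dictionary keyed on the query text, with entries dropped when a table they read is modified. This is a minimal sketch under stated assumptions: the class `QueryAnswerCache`, the `stock` table, and the caller-supplied table names are hypothetical, whitespace collapsing is a crude notion of query equivalence, and table-level invalidation is far coarser than a real edge cache would want.

```python
import sqlite3

class QueryAnswerCache:
    """Cache of (query, answer) pairs in front of a SQLite connection."""

    def __init__(self, conn):
        self.conn = conn
        self.answers = {}  # normalized query text -> (tables read, rows)
        self.hits = 0

    @staticmethod
    def _normalize(sql):
        # Collapse runs of whitespace so trivially different spellings match.
        return " ".join(sql.split())

    def query(self, sql, tables):
        """Return cached rows if present; otherwise run the query and cache it."""
        key = self._normalize(sql)
        if key in self.answers:
            self.hits += 1
            return self.answers[key][1]
        rows = self.conn.execute(sql).fetchall()
        self.answers[key] = (frozenset(tables), rows)
        return rows

    def write(self, sql, params, table):
        """Apply an update and drop every cached answer that read `table`."""
        self.conn.execute(sql, params)
        self.answers = {
            key: entry for key, entry in self.answers.items()
            if table not in entry[0]
        }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sym TEXT, price REAL)")
cache = QueryAnswerCache(conn)
cache.write("INSERT INTO stock VALUES (?, ?)", ("ACME", 10.0), "stock")

first = cache.query("SELECT sym, price FROM stock ORDER BY sym", ["stock"])
second = cache.query("SELECT sym,  price FROM stock ORDER BY sym", ["stock"])  # served from cache
cache.write("INSERT INTO stock VALUES (?, ?)", ("INIT", 5.0), "stock")
third = cache.query("SELECT sym, price FROM stock ORDER BY sym", ["stock"])  # recomputed
```

Unlike a materialized view, nothing is declared in advance: the cache fills with whatever queries actually arrive.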

History Lesson (Codd) • Putting semantics into data order is bad • restricts storage options • hidden meaning bad • Hierarchical representations for data are bad • rewrite the queries when representation changes (data independence) • Complexity is bad

My Spin on XML (XMLSchema) • As a storage format, XML is good for documents not data • Codd’s thinking has not been repealed (order, hierarchy, complexity) • no binary format • inline tags are inefficient • SGML run amok…

My Spin on XML • As an “on the wire” notation, XML is ok for data • but don’t try to move too much stuff • and don’t try to move it too fast • Remember why client-server put in binary movement!

XQuery For Data • Won’t store data in XML • Necessary to design something that is easy to translate into SQL • Alternate syntax for OR SQL • which is much cleaner (// is a user defined function in Informix)

XML Summary • Focus attention on XMLSchema as a document description system not a data description system • Focus XQuery on documents not data W3C use cases do not do this!

OR DBMS • XML is merely this year’s data type • Next year it will be WML or ... • OR is still not finished • query optimization • database design • physical storage layout

Grand Challenge #1 • Preponderance of web accessible data is structured • much more than “facts and figures” • Construct a system to access “the rest of” the web

What To Do • GUI problem (NL or Vis) • Query notation problem • Discovery problem • how do you “scrape” a structured data web site to figure out the meaning of its data? • Federation problem

Grand Challenge #2 • Everything of material importance is geo-positioned (lojacked) • Construct the mother of all GIS systems • complete automation of supply chains • “where is my wife” (or the closest restroom)

What To Do • Most of the issues in GC #1 • The mother of all triggering problems • The mother of all security/privacy problems
