EOSDIS Alternate Architecture Study

Presentation Transcript


  1. EOSDIS Alternate Architecture Study • Jim Gray, McKay Fellow, UC Berkeley, 1 May 1995, gray@crl.com • 1. Background: problem and proposed solution • 2. What California proposed • Co-workers: • Mike Stonebraker: Producer / Director / Script Writer / Propeller Head • Bill Farrell: Ramrod and Computer-literate DirtBag • Jeff Dozier: Godfather • Special effects: • Earth Science: Frank Davis, C. Roberto Mechoso, Jim Frew • Computer Science: Reagan Moore, Jim Gray, Joe Pasquale • Administration: Claire Mosher • Writing: Stephanie Sides • Prototypes: many, many people

  2. What’s The Problem? • Antarctica is melting -- 77% of fresh water liberated • => sea level rises 70 meters • => Chico & Memphis are beachfront property • => New York, Washington, SF, LA, London, Paris • Let’s study it! Mission to Planet Earth • EOS: Earth Observing System (17B$ => 10B$) • ~50 instruments on 10 satellites, 1997-2001 • Plus Landsat (added later) • EOSDIS: Data Information System: • 3-5 MB/s raw, 30-50 MB/s processed • 4 TB/day, 15 PB by year 2007 • Issues: • How to store it? • How to serve it to users?
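The slide's rates are roughly self-consistent; a quick back-of-the-envelope check, using only the slide's own numbers (any SQL engine will evaluate this):

```sql
-- Sanity check on the slide's rates: ~50 MB/s of processed data is
-- ~4 TB/day, and ~4 TB/day over the roughly ten years to 2007 is ~15 PB.
SELECT 50e6 * 86400 / 1e12    AS tb_per_day,   -- = 4.32
       4e12 * 365 * 10 / 1e15 AS pb_by_2007;   -- = 14.6
```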

  3. What Happened? • 1986: Mission to Planet Earth • 1989: Bids from Hughes & TRW • 1993: contract awarded; public review: • customers do not want it (tape/mainframe-centric) • 1994: Alternate Architecture study • Three “outside teams” • Wyoming: Internet 20,000,000 • Maryland: Software Engineering • California: DB-centric • One “home team”: CORBA & Z 39.50 & UNIX • 1995: Drifting in the Sequoia direction

  4. The Hughes Plan • 8 DAACs (Distributed Active Archive Centers) = Bytes • (one per congressional district?) • N SCFs (Scientific Computation Facilities) = MIPS • (typically instrument or science teams) • Thin wires among them • 90% of DAAC processing is PUSH: • building standard data products • fixed pipeline: calibrate, grid, derive • Typical subscriber gets tapes or CD-ROMs • (standard data products) • One “chauffeur” per 10 customers (high ops costs) • Build everything (operations, HSM, DBMS, ...) from scratch • CORBA and Z 39.50 are the glue. • Criticism: not evolvable, not open, not online, not useful.

  5. What California Proposed • 0. Design for success: expect that millions will use the system (online) • 1. DBMS-centric design automates discovery, access, management • 2. Object-relational databases enable: • automated access to data, so that the NASA 500, Global Change 10,000 and Internet 20,000,000 can use the system • Cache popular results, not all results (saves 3x or more) • Compute on demand (saves lots of storage and CPU). • Emphasize pull processing rather than push processing. • Use parallelism to get scaleup. • Do batch as a data pump. • 3. Be Smart Shoppers: • Use COTS hardware/software (saves 400M$) • Just-in-time acquisition (saves 400M$) • Use workstation, not mainframe, technology (gives 10x more stuff) • Depreciate over 3 years (ends in 2007 with "fresh" equipment) • 4. 2 + N node architecture: • 2 Super-DAACs for fault tolerance and for growth. • Unify the 2 "big" data storage centers with 2 big data analysis centers. • Allow many “little” Peer-DAACs at science/user groups.

  6. Meta-Model for Sequoia Proposal • Be technological optimists: • couldn’t build it today; count on progress. • ride the technology wave (= not water-cooled) • Buy or seed, do not build. • Use COTS where possible • Fund 2 or more COTS vendors if a needed product is missing: • OR DBMS • HSM • Operations • Replace people with technology (= OR DBMS): • automate data discovery, access, visualization • DBMS-centric view.

  7. DBMS-Centric View • This is a database problem (no kidding)! • This is not: • a file system problem (file is the wrong abstraction) • an RPC problem (CORBA is the wrong abstraction) • a Z 39.50 problem (Z 39.50 is just a FAP: formats and protocols). • This is an operations problem: • hierarchical storage management • network management • source code control • client-server tools • You can BUY all this stuff. Fund COTS. • BUILD AS LITTLE AS POSSIBLE

  8. What California Proposed • 0. Design for success: expect that millions will use the system (online) • 1. DBMS-centric design automates discovery, access, management • 2. Object-relational databases enable: • automated access to data, so that the NASA 500, Global Change 10,000 and Internet 20,000,000 can use the system • Cache popular results, not all results (saves 3x or more) • Compute on demand (saves lots of storage and CPU). • Emphasize pull processing rather than push processing. • Use parallelism to get scaleup. • Do batch as a data pump. • 3. Be Smart Shoppers: • Use COTS hardware/software (saves 400M$) • Just-in-time acquisition (saves 400M$) • Use workstation, not mainframe, technology (gives 10x more stuff) • Depreciate over 3 years (ends in 2007 with "fresh" equipment) • 4. 2 + N node architecture: • 2 Super-DAACs for fault tolerance and for growth. • Unify the 2 "big" data storage centers with 2 big data analysis centers. • Allow many “little” Peer-DAACs at science/user groups.

  9. Design for Success: Expect Lots of Users • Expect that millions will use the system (online) • Three user categories: • NASA 500 -- funded by NASA to do science • Global Change 10 K -- other dirt bags • Internet 20 M -- everyone else: • grain speculators • environmental impact reports • new applications • => discovery & access must be automatic • Allow anyone to set up a Peer-DAAC & SCF • Design for ad hoc queries, not standard data products. • If push is 90%, then 10% of data is read (on average). • => A failure: no one uses the data. In DSS, push is 1% or less. • => Computation demand is 100x the Hughes estimate • (pull is 10x to 100x greater than push)

  10. The Process Flow • [diagram: data flows through push-processing and pull-processing paths, alongside other data sources] • Data arrives and is pre-processed: • instrument data is calibrated, • gridded, • averaged • geophysical data is derived • Users ask for stored data • OR to analyze and combine data. • Can make the pull-push split dynamically

  11. The Software Model: Global View • SQL* is the FAP and API. • Applications use it to access data. • It includes: • stored procedures (so RPC) • GC class libraries • Computation is data-driven • Gateways for other interfaces: • HTTP, Z 39.50, CORBA & COM • TP or TP-lite manages workflow
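To illustrate "stored procedures (so RPC)": a server-side function invoked through plain SQL stands in for a separate RPC layer. A minimal sketch in modern PostgreSQL-flavored syntax; the tiles table and the function are invented for illustration, not anything the study specified:

```sql
-- A granule catalog; the polygon footprint plays the role of a
-- Global Change spatial datatype.
CREATE TABLE tiles (
    tile_id     BIGINT PRIMARY KEY,
    acquired_at TIMESTAMP,
    footprint   POLYGON,
    image       BYTEA
);

-- "Stored procedures (so RPC)": the client calls server-side code
-- through SQL* itself instead of an RPC stub.
CREATE FUNCTION granules_in(search_area POLYGON)
RETURNS SETOF BIGINT AS $$
    SELECT tile_id FROM tiles
    WHERE  footprint && search_area;   -- && is geometric "overlaps"
$$ LANGUAGE sql;

-- client side: SELECT * FROM granules_in('((0,0),(2,0),(2,2),(0,2))');
```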

  12. Automate access to data • Invest in: • Design the global change schema • (cooperate with standards groups) • OR DBMS class libraries for GC datatypes • Develop a browser to do resource discovery • Community will develop access & vis tools • OR DBMS will do: • PUSH processing: triggers and workflow • PULL processing: query optimization • (some assembly required).
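A minimal sketch of what "PUSH processing: triggers" could look like, in PostgreSQL-flavored syntax. The tables and the calibrate()/grid() functions are hypothetical stand-ins for the Global Change class libraries the slide calls for:

```sql
-- As raw granules arrive, a trigger runs the fixed pipeline
-- (calibrate, grid) and files the result as a standard product.
CREATE TABLE raw_granules (
    granule_id  BIGSERIAL PRIMARY KEY,
    instrument  TEXT,
    acquired_at TIMESTAMP,
    footprint   POLYGON,
    data        BYTEA
);

CREATE TABLE standard_products (
    granule_id  BIGINT,
    acquired_at TIMESTAMP,
    footprint   POLYGON,
    product     BYTEA
);

CREATE FUNCTION push_pipeline() RETURNS trigger AS $$
BEGIN
    INSERT INTO standard_products
    VALUES (NEW.granule_id, NEW.acquired_at, NEW.footprint,
            grid(calibrate(NEW.data)));   -- hypothetical class-library UDFs
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER on_new_granule
    AFTER INSERT ON raw_granules
    FOR EACH ROW EXECUTE FUNCTION push_pipeline();
```

PULL processing then needs no extra machinery: ad hoc queries against standard_products simply go through the optimizer.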

  13. How Well Did SQL Work? • Bill Farrell and others did 30 user scenarios: schema, application, SQL, performance • Snow cover, CO2, GCM, ... • The average ad hoc scenario generated about 30% of the EOSDIS baseline processing • => validated PULL over PUSH demand • SQL was indeed a power tool: • Many scenarios became a few simple SQL queries. • Need a spatial & temporal SQL. • Personal view: • It’s great! Much better than Farrell or I expected.
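For flavor, here is the shape such a scenario query might take. The schema (the standard_products table from the sketch above plus a regions table) and the snow_fraction() function are invented, since the actual scenario SQL is not in the slides:

```sql
-- Hypothetical "snow cover" scenario: weekly average snow-covered
-- fraction over a named region for a decade -- a few lines of
-- spatial & temporal SQL instead of a hand-built pipeline.
-- Assumes regions(name TEXT, boundary POLYGON) and a
-- snow_fraction(BYTEA) class-library function.
SELECT date_trunc('week', p.acquired_at) AS week,
       avg(snow_fraction(p.product))     AS snow_cover
FROM   standard_products p
JOIN   regions r ON r.name = 'Sierra Nevada'
WHERE  p.footprint && r.boundary          -- spatial overlap
  AND  p.acquired_at >= '1985-01-01'
  AND  p.acquired_at <  '1995-01-01'      -- temporal window
GROUP  BY week
ORDER  BY week;
```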

  14. Compute on demand • 90% of data is NEVER used (according to Hughes). • Some data is used only once. • Data is often re-calculated: • repair hardware/software bugs, • new & better algorithms • Optimization: store only popular data. • Compute this based on past use • (of this data and related data) • Balance two costs: • 1. Re_Compute_Cost / Re_Use_Interval • 2. Storage_Cost x Re_Use_Interval • Recompute is often cheaper (saves 3x, we think).
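The balance reads as: keep a product cached when storing it for one reuse interval costs less than one recomputation. A sketch of that policy as a query over a hypothetical usage catalog (all table and column names invented):

```sql
-- Cache-or-recompute, per product, from past-use statistics.
-- storage_cost is $/GB/day, recompute_cost is $ per derivation,
-- reuse_interval is days between requests (estimated from the
-- history of this data and related data).
SELECT product_id,
       CASE WHEN size_gb * storage_cost * reuse_interval
                 < recompute_cost
            THEN 'cache'        -- cheap to keep, costly to rebuild
            ELSE 'recompute'    -- cheap to rebuild on demand
       END AS policy
FROM   product_stats;
```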

  15. Use parallelism to get scaleup • Many queries look at 100s or 1,000s of data tiles. • e.g., weekly Landsat images of Berkeley since 1972 • = 1,000 tape accesses • = 4,000 tape minutes = 6 days • Done 1,000-way parallel: 4 minutes. • Disk & tape demands are huge: multi-GOX • Computation demands are huge: tera-ops. • Only solution: • use parallel execution • use parallel data access • SQL* does this for you automatically.
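One way a DBMS gets that 1,000-way fan-out is declustering: the tile table is hash-partitioned across the storage nodes, so a single declarative query scans all partitions in parallel. A sketch in PostgreSQL partitioning syntax, standing in for whatever SQL* would actually do:

```sql
-- Decluster tiles across many nodes; the optimizer then runs one
-- declarative query against all partitions in parallel
-- (4,000 tape-minutes of work across 1,000 drives ~= 4 minutes).
CREATE TABLE tiles_parallel (
    tile_id     BIGINT,
    acquired_at TIMESTAMP,
    footprint   POLYGON,
    image       BYTEA
) PARTITION BY HASH (tile_id);

-- one partition per storage node (p0 shown; p1..p999 are analogous)
CREATE TABLE tiles_p0 PARTITION OF tiles_parallel
    FOR VALUES WITH (MODULUS 1000, REMAINDER 0);
```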

  16. Data Pump • Compute small jobs on demand: • less than 1,000 tape mounts • less than 100 M disk accesses • less than 100 TeraOps • (less than 30-minute response time) • For BIG JOBS: scan the entire 15 PB database • once a day / week • Any BIG JOB can piggyback on this data scan. • [table: projected DAAC configuration in 2007 -- not transcribed]
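The data-pump idea, schematically: big jobs register themselves, and one shared sequential pass over the archive evaluates all of them, so N big jobs cost one scan instead of N. Everything below is invented to illustrate the batching:

```sql
-- BIG jobs queue up; the daily/weekly pump applies them all
-- during a single scan of the archive.
CREATE TABLE pump_jobs (
    job_id    BIGSERIAL PRIMARY KEY,
    owner     TEXT,
    predicate TEXT,   -- which tiles the job wants
    action    TEXT    -- the per-tile work to apply
);

-- Conceptually, one pass services every queued job:
--   FOR EACH tile IN (sequential scan of the 15 PB archive)
--       FOR EACH job IN pump_jobs
--           IF tile matches job.predicate THEN apply job.action
```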

  17. What California Proposed • 0. Design for success: expect that millions will use the system (online) • 1. DBMS-centric design automates discovery, access, management • 2. Object-relational databases enable: • automated access to data, so that the NASA 500, Global Change 10,000 and Internet 20,000,000 can use the system • Cache popular results, not all results (saves 3x or more) • Compute on demand (saves lots of storage and CPU). • Emphasize pull processing rather than push processing. • Use parallelism to get scaleup. • Do batch as a data pump. • 3. Be Smart Shoppers: • Use COTS hardware/software (saves 400M$) • Just-in-time acquisition (saves 400M$) • Use workstation, not mainframe, technology (gives 10x more stuff) • Depreciate over 3 years (ends in 2007 with "fresh" equipment) • 4. 2 + N node architecture: • 2 Super-DAACs for fault tolerance and for growth. • Unify the 2 "big" data storage centers with 2 big data analysis centers. • Allow many “little” Peer-DAACs at science/user groups.

  18. Use COTS hardware/software (saves 400M$) • Defense contractors want to build (and maintain) stuff. • (they do it for the money) • Fund SQL* (SQL-2007): Object-Relational (extensible): • supports Global Change data types • automates access • reliable storage • tertiary storage • parallel data search (automatic) • workflow (job control) • Fund Operations software companies (Tivoli, ...)

  19. Use workstation technology (NOW) • Use workstation hardware technology, • not supercomputers: • 0.5$/MB of disk vs 30$/MB of disk • 100$/MIPS vs 18,000$/MIPS • 3K$/tape drive vs 50K$/tape drive • Processor, disk, tape ARRAYS, connected by ATM: • a NOW (Network of Workstations) • Gives 10x (?100x) more stuff for the same dollars • Allows an ad hoc query load • Allows a scaleable design • Allows the same hardware: SuperDAACs = PeerDAACs

  20. Use workstation technology (NOW) • Study used RS/6000 and DEC 7000 as the workstation • (they are 100K$/slice). • Should have used Compaq. • [table: price for 20 GFlops, 24 TB disk, 2 PB tape TODAY; Compaq/DLT prices computed by Gray -- not transcribed] • A 10% Peer-DAAC costs 3M$ today; a 1% Micro-DAAC (200 TB) costs 300K$

  21. Just-in-time acquisition (saves 400M$) • Hardware prices decline 20%-40%/year • So buy at the last moment • Buy the best product that day: commodity • Depreciate over 3 years so that the facility stays fresh. • (after 3 years, cost is 23% of original) • [chart residue: “60% decline peaks at 10M$” -- original figure not transcribed]
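Checking the "23% of original after 3 years" figure: it corresponds to the fast end of the quoted 20%-40%/year decline. A one-liner over the slide's range:

```sql
-- Residual value after 3 years at various annual price declines;
-- ~39%/year gives the slide's "cost is 23% of original".
SELECT rate                         AS annual_decline,
       round(power(1 - rate, 3), 2) AS fraction_after_3_years
FROM   (VALUES (0.20), (0.30), (0.39), (0.40)) AS r(rate);
-- 0.20 -> 0.51   0.30 -> 0.34   0.39 -> 0.23   0.40 -> 0.22
```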

  22. What California Proposed • 0. Design for success: expect that millions will use the system (online) • 1. DBMS-centric design automates discovery, access, management • 2. Object-relational databases enable: • automated access to data, so that the NASA 500, Global Change 10,000 and Internet 20,000,000 can use the system • Cache popular results, not all results (saves 3x or more) • Compute on demand (saves lots of storage and CPU). • Emphasize pull processing rather than push processing. • Use parallelism to get scaleup. • Do batch as a data pump. • 3. Be Smart Shoppers: • Use COTS hardware/software (saves 400M$) • Just-in-time acquisition (saves 400M$) • Use workstation, not mainframe, technology (gives 10x more stuff) • Depreciate over 3 years (ends in 2007 with "fresh" equipment) • 4. 2 + N node architecture: • 2 Super-DAACs for fault tolerance and for growth. • Unify the 2 "big" data storage centers with 2 big data analysis centers. • Allow many “little” Peer-DAACs at science/user groups.

  23. 2+N DAAC architecture • 2 Super-DAACs: two BIG sites which • each store ALL the data (back each other up) • no other way to archive these 15 PB databases • each service 1/2 the queries and run a data pump • each produce 1/2 the standard data products • each have a BIG MIP farm next to the byte farm • (an SCF: science computation facility) • N Peer-DAACs: • each stores part of the data (obtained from a Super-DAAC) • can be NASA-sponsored or private • same software and hardware as Super-DAACs • Super-DAACs are “banks” (careful); Peer-DAACs are “pubs” (anything goes)

  24. Minimize Operations Costs • Reduced sites (DAACs) have reduced costs • Use a Mosaic / email / telephone user-support model • Count on vendors to provide: • network management (NetView & SNMP) • data replication • application software version control • workflow control • help desk software • more reliable hardware/software

  25. Unify data storage centers with data analysis • Data analysis (Science Computation Facilities) • needs quick, high-bandwidth access to the DB. • WAN technology is good, but not that good. • WAN technology is not free. • => Co-locate DAACs and SCFs. • => two super SCFs, many peer SCFs. • Instrument teams often find a bug or a new algorithm • => reprocess all the base data to make a new data set • => ripple effect to data consumers • => must track data lineage.
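"Must track data lineage" could be as simple as a provenance table recording inputs and algorithm versions per product, so a repaired bug ripples to exactly the affected consumers. A hypothetical sketch (table, names, and versions all invented):

```sql
-- Which inputs and which algorithm version produced each product.
CREATE TABLE lineage (
    product_id BIGINT,
    input_id   BIGINT,   -- raw granule or upstream product
    algorithm  TEXT,
    version    TEXT
);

-- After fixing a bug in snow_fraction v1.3, find every product
-- directly derived with it, to schedule reprocessing:
SELECT DISTINCT product_id
FROM   lineage
WHERE  algorithm = 'snow_fraction' AND version = '1.3';
-- (a recursive query over lineage would catch downstream ripple too)
```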

  26. Budget • We had a VERY difficult time discovering a budget. • So we did our own. • It was less. • Big savings in operations and development • Hardware savings could give bigger DAACs

  27. What California Proposed • 0. Design for success: expect that millions will use the system (online) • 1. DBMS-centric design automates discovery, access, management • 2. Object-relational databases enable: • automated access to data, so that the NASA 500, Global Change 10,000 and Internet 20,000,000 can use the system • Cache popular results, not all results (saves 3x or more) • Compute on demand (saves lots of storage and CPU). • Emphasize pull processing rather than push processing. • Use parallelism to get scaleup. • Do batch as a data pump. • 3. Be Smart Shoppers: • Use COTS hardware/software (saves 400M$) • Just-in-time acquisition (saves 400M$) • Use workstation, not mainframe, technology (gives 10x more stuff) • Depreciate over 3 years (ends in 2007 with "fresh" equipment) • 4. 2 + N node architecture: • 2 Super-DAACs for fault tolerance and for growth. • Unify the 2 "big" data storage centers with 2 big data analysis centers. • Allow many “little” Peer-DAACs at science/user groups.

  28. Challenging Problems • Design the Global Change schema • Understand data lineage • Build discovery, analysis, visualization tools • Build an OR DBMS, including: • distributed, • parallel, • lazy-eager evaluation, • tertiary storage, • spatial & temporal SQL, • workflow • Build a decent & reliable HSM • Build a way to operate a 1,000-node NOW.
