
The Data Avalanche

Explore the exponential growth of information in the digital age, foreseeing a future with personal petabytes and advanced interfaces. Learn about data storage capabilities, human-computer interaction advancements, and the changing landscape of information technology.



Presentation Transcript


  1. The Data Avalanche Talk at University of Tokyo, Japan October 2005 Jim Gray Microsoft Research Gray@Microsoft.com http://research.microsoft.com/~Gray

  2. Numbers: TeraBytes and Gigabytes are BIG! • Mega – a house in San Francisco • Giga – a very rich person • Tera – ~ the Bush national debt • Peta – more than all the money in the world • A Gigabyte: the Human Genome • A Terabyte: a 150-mile-long shelf of books.
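The slide's ladder of prefixes can be sketched as a small helper that expresses a byte count with the largest applicable unit. This is an illustrative sketch (names like `humanize` are made up here), using decimal powers of 1000 as the talk does.

```python
# Map a byte count to the decimal SI-style prefixes the slide lists.
PREFIXES = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(n_bytes: float) -> str:
    """Return n_bytes expressed with the largest prefix that keeps the value >= 1."""
    value, idx = float(n_bytes), 0
    while value >= 1000 and idx < len(PREFIXES) - 1:
        value /= 1000
        idx += 1
    return f"{value:g} {PREFIXES[idx]}"

print(humanize(3e9))   # the human genome, roughly: 3 GB
print(humanize(1e12))  # a terabyte: 1 TB
```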

  3. Outline (scale graphic: Kilo Mega Giga Tera Peta Exa Zetta Yotta – “We are here”) Historical trends imply that in 20 years: • we can store everything in cyberspace. The personal petabyte. • computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR Implications: • The information avalanche will only get worse. • The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing • Organizing, summarizing, prioritizing information is a key technology.

  4. How much information is there? Everything! Recorded (scale graphic: Kilo…Yotta, with markers for A Book, A Photo, a Movie, All books (words), All Books MultiMedia) • Soon everything can be recorded and indexed • Most bytes will never be seen by humans. • Data summarization, trend detection, and anomaly detection are key technologies See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ (Negative prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli)

  5. Things Have Changed 1956 • IBM 305 RAMAC • 10 MB disk • ~1M$ (y2004 $)

  6. ops/s/$ Had Three Growth Curves 1890-1990 (WordSize*ops/s/sysprice; combination of Hans Moravec + Larry Roberts + Gordon Bell) • 1890-1945 Mechanical, Relay: 7-year doubling • 1945-1985 Tube, transistor, …: 2.3-year doubling • 1985-2004 Microprocessor: 1.0-year doubling • The next 50 years will see MORE CHANGE

  7. Constant Cost or Constant Function? • 100x improvement per decade • Same function 100x cheaper • 100x more function for same price (chart: Mainframe, SMP, Constellation, Cluster at constant price; Mini, SMP, Constellation, Workstation, Graphics/storage, PDA, Camera/browser at lower price – new category)

  8. Growth Comes From NEW Apps • The 10M$ computer of 1980 costs 1k$ today • If we were still doing the same things,IT would be a 0 B$/y industry • NEW things absorb the new capacity

  9. The Surprise-Free Future in 20 years • 10,000x more power for same price • Personal supercomputer • Personal petabyte stores • Same function for 10,000x less cost • Smart dust – the penny PC? • The 10 peta-op computer (for 1,000$).

  10. 10,000x would change things • Human computer interface • Decent computer vision • Decent computer speech recognition • Decent computer speech synthesis • Vast information stores • Ability to search and abstract the stores.

  11. How Good is HCI Today? • Surprisingly good. • Demo of making faces http://research.microsoft.com/research/pubs/view.aspx?pubid=290 • Demo of speech synthesis • Daisy, Hal • Synthetic voice • Speech recognition is improving fast, • Vision getting better • Pen computing finally a reality. • Displays improving fast (compared to last 30 years)

  12. Outline (scale graphic: Kilo Mega Giga Tera Peta Exa Zetta Yotta – “We are here”) Historical trends imply that in 20 years: • we can store everything in cyberspace. The personal petabyte. • computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR Implications: • The information avalanche will only get worse. • The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing • Organizing, summarizing, prioritizing information is a key technology.

  13. How much information is there? Everything! Recorded (scale graphic: Kilo…Yotta, with markers for A Book, A Photo, a Movie, All books (words), All Books MultiMedia) • Almost everything is recorded digitally. • Most bytes are never seen by humans. • Data summarization, trend detection, and anomaly detection are key technologies See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

  14. And >90% in Cyberspace. Because: • Low rent (min $/byte) • Shrinks time (now or later) • Shrinks space (here or there) • Automate processing (knowbots) • Point-to-Point OR Broadcast • Immediate OR Time-Delayed • Locate, Process, Analyze, Summarize

  15. MyLifeBits: The guinea pig • Gordon Bell is digitizing his life • Has now scanned virtually all: • Books written (and read when possible) • Personal documents (correspondence, memos, email, bills, legal, …) • Photos • Posters, paintings, photos of things (artifacts, … medals, plaques) • Home movies and videos • CD collection • And, of course, all PC files • Recording: phone, radio, TV, web pages… conversations • Paperless throughout 2002. 12” scanned, 12’ discarded. • Only 30 GB excluding videos • Video is 2+ TB and growing fast

  16. Capture and encoding

  17. I mean everything

  18. 25K-day life ~ Personal Petabyte (1 PB) Will anyone look at web pages in 2020? Probably new modalities & media will dominate by then.

  19. Challenges • Capture: get the bits in • Organize: index them • Manage: no worries about loss or space • Curate/Annotate: automate where possible • Privacy: keep safe from theft • Summarize: give thumbnail summaries • Interface: how to ask/anticipate questions • Present: show it in understandable ways.

  20. MemexAs We May Think, Vannevar Bush, 1945 “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility” “yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

  21. Too much storage? Try to fill a terabyte in a year. Petabyte volume has to be some form of video.
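The slide's claim is simple arithmetic: filling a terabyte in a year means only a few gigabytes per day, while a petabyte per year requires thousands. A back-of-envelope check (decimal units, 365 days; the rates are computed here, not taken from the talk):

```python
# Daily ingest rate needed to fill a given store in one year.
TB = 1e12
PB = 1e15

per_day_tb = TB / 365   # bytes/day to fill 1 TB in a year
per_day_pb = PB / 365   # bytes/day to fill 1 PB in a year

print(f"1 TB/year = {per_day_tb / 1e9:.1f} GB/day")   # ~2.7 GB/day
print(f"1 PB/year = {per_day_pb / 1e9:,.0f} GB/day")  # ~2,740 GB/day

# A day's photos, email, and documents fall far short of 2.7 TB/day;
# only continuous video gets anywhere near petabyte volumes.
```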

  22. How Will We Find Anything? • Need queries, indexing, pivoting, scalability, backup, replication, online update, set-oriented access • If you don’t use a DBMS, you will implement one! • Simple logical structure: • Blob and link is all that is inherent • Additional properties (facets == extra tables) and methods on those tables (encapsulation) • More than a file system • Unifies data and meta-data (SQL++ DBMS)
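The "blob and link" structure the slide describes can be sketched as three tables: blobs hold the bits, links relate blobs, and facet tables add typed properties. This is a hypothetical illustration in SQLite; the table and column names are invented here, not MyLifeBits' actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE blob (
        id      INTEGER PRIMARY KEY,
        content BLOB,
        mime    TEXT
    );
    -- links relate any two blobs (annotation, containment, story membership, ...)
    CREATE TABLE link (
        src  INTEGER REFERENCES blob(id),
        dst  INTEGER REFERENCES blob(id),
        kind TEXT
    );
    -- a "facet": extra properties for blobs that happen to be photos
    CREATE TABLE photo_facet (
        blob_id INTEGER REFERENCES blob(id),
        taken   TEXT,
        camera  TEXT
    );
""")
con.execute("INSERT INTO blob VALUES (1, x'00', 'image/jpeg')")
con.execute("INSERT INTO blob VALUES (2, x'00', 'text/plain')")
con.execute("INSERT INTO link VALUES (2, 1, 'annotates')")

# Set-oriented access: find every annotation attached to blob 1.
rows = con.execute(
    "SELECT src FROM link WHERE dst = 1 AND kind = 'annotates'").fetchall()
print(rows)  # [(2,)]
```

Because facets are plain tables, indexing, replication, and query pivoting come from the DBMS for free, which is the slide's point about not reimplementing one.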

  23. Photos

  24. Searching: the most useful app? • Challenge: What questions for useful results? • Many ways to present answers

  25. Detail view

  26. Resource explorer: ancestor (collections), annotations, descendant & preview panes turned on

  27. Synchronized timelines with histogram guide

  28. Value of media depends on annotations • “It’s just bits until it is annotated”

  29. System annotations provide base level of value • Date 7/7/2000

  30. Tracking usage – even better • Date 7/7/2000. Opened 30 times, emailed to 10 people (it’s valued by the user!)

  31. Getting the user to say a little something is a big jump • Date 7/7/2000. Opened 30 times, emailed to 10 people. “BARC dim sum intern farewell Lunch”

  32. Getting the user to tell a story is the ultimate in media value • A story is a “layout” in time and space • Most valuable content (by selection, and by being well annotated) • Stories must include links to any media they use (for future navigation/search – “transclusion”). • Cf: MovieMaker; Creative Memories PhotoAlbums • Example story: “Dapeng was an intern at BARC for the summer of 2000. We took him to lunch at our favorite Dim Sum place to say farewell. At table L-R: Dapeng, Gordon, Tom, Jim, Don, Vicky, Patrick, Jim”

  33. Value of media depends on annotations – “It’s just bits until it is annotated” • Auto-annotate whenever possible, e.g. GPS cameras • Make manual annotation as easy as possible: XP photo capture, voice, photos with voice, etc. • Support gang annotation • Make stories easy • Example annotation: “Dapeng was an intern at BARC for the summer of 2000. We took him to lunch at our favorite Dim Sum place to say farewell. At table L-R: Dapeng, Gordon, Tom, Jim, Don, Vicky, Patrick, Jim”

  34. 80% of data is personal / individual. But what about the other 20%? • Business • Wal-Mart online: 1 PB and growing… • Paradox: most “transaction” systems < 1 PB. • Have to go to image/data monitoring for big data • Government • Government is the biggest business. • Science • LOTS of data.

  35. Instruments: CERN – LHC, PetaBytes per Year (CERN Tier 0) Looking for the Higgs Particle • Sensors: 1000 GB/s (1 TB/s ~ 30 EB/y) • Events: 75 GB/s • Filtered: 5 GB/s • Reduced: 0.1 GB/s ~ 2 PB/y • Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB
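The slide's per-second rates convert to yearly volumes by multiplying by the seconds in a year. The sketch below assumes continuous operation; real accelerator duty cycles are lower, which is why the slide quotes ~2 PB/y for the reduced stream rather than the continuous-rate figure computed here.

```python
# Convert a sustained data rate (GB/s) to a yearly volume (PB/y).
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def per_year_pb(gb_per_s: float) -> float:
    """Yearly volume in PB for a sustained rate in GB/s (1 PB = 1e6 GB)."""
    return gb_per_s * SECONDS_PER_YEAR / 1e6

print(f"raw sensors, 1000 GB/s: {per_year_pb(1000):,.0f} PB/y")  # ~30 EB/y
print(f"reduced, 0.1 GB/s:      {per_year_pb(0.1):.1f} PB/y")    # ~3 PB/y continuous
```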

  36. Information Avalanche • Both better observational instruments and better simulations are producing a data avalanche • Examples • Turbulence: 100 TB simulation, then mine the information • BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information) • CERN: LHC will generate 1 GB/s, 10 PB/y • VLBA (NRAO) generates 1 GB/s today • NCBI: “only ½ TB” but doubling each year; very rich dataset. • Pixar: 100 TB/movie Image courtesy of C. Meneveau & A. Szalay @ JHU

  37. Q: Where will the Data Come From? A: Sensor Applications • Earth Observation • 15 PB by 2007 • Medical Images & Information + Health Monitoring • Potential 1 GB/patient/y → 1 EB/y • Video Monitoring • ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y filtered??? • Airplane Engines • 1 GB sensor data/flight • 100,000 engine hours/day • 30 PB/y • Smart Dust: ?? EB/y http://robotics.eecs.berkeley.edu/~pister/SmartDust/ http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html

  38. The Big Picture (diagram: Experiments & Instruments, Simulations, Literature, and Other Archives all supply facts; users pose questions and get answers) • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it • How to coexist with others The Big Problems • Query and Vis tools • Support/training • Performance • Execute queries in a minute • Batch query scheduling

  39. FTP – GREP • Download (FTP and GREP) are not adequate • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh, and 1 PB ~ 3,000 disks • At some point we need indices to limit search, plus parallel data search and analysis • This is where databases can help • Next generation technique: Data Exploration • Bring the analysis to the data!
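The GREP times scale linearly with a single sequential scan rate. The sketch below assumes roughly 10 MB/s (an assumed figure, approximately what the slide's "1 PB in 3 years" implies); the exact rate varies, but the conclusion does not: brute-force scanning stops working somewhere between a terabyte and a petabyte.

```python
# Time to sequentially scan a store at an assumed single-stream rate.
RATE = 10e6  # bytes/second, an assumption (~ what "1 PB in 3 years" implies)

def scan_time_s(size_bytes: float) -> float:
    """Seconds to scan size_bytes at RATE."""
    return size_bytes / RATE

print(f"1 GB: {scan_time_s(1e9) / 60:.1f} minutes")
print(f"1 TB: {scan_time_s(1e12) / 86400:.1f} days")          # about a day
print(f"1 PB: {scan_time_s(1e15) / 86400 / 365:.1f} years")   # ~3.2 years
```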

  40. The Speed Problem • Many users want to search the whole DB: ad hoc queries, often combinatorial • Want ~1 minute response • Brute force (parallel search): • 1 disk = 50 MBps => ~1M disks/PB ~ 300 M$/PB • Indices (limit search, do column store): • 1,000x less equipment: 1 M$/PB • Pre-compute answer: • No one knows how to do it for all questions.
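The brute-force figure can be checked directly: dividing a petabyte by the aggregate bandwidth needed for a one-minute scan gives the disk count. The computation below uses the slide's 50 MB/s per-disk rate; it lands in the hundreds of thousands of disks, the same order of magnitude as the slide's ~1M estimate (which presumably includes overheads the raw division ignores).

```python
# How many 50 MB/s disks does a one-minute full scan of 1 PB require?
PB = 1e15
DISK_RATE = 50e6   # bytes/second per disk, from the slide
TARGET_S = 60      # one-minute response

disks = PB / (DISK_RATE * TARGET_S)
print(f"{disks:,.0f} disks")  # ~333,333 — order 1e5-1e6, i.e. roughly the slide's ~1M
```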

  41. Next-Generation Data Analysis • Looking for • Needles in haystacks – the Higgs particle • Haystacks: dark matter, dark energy • Needles are easier than haystacks • Global statistics have poor scaling • Correlation functions are N², likelihood techniques N³ • As data and computers grow at the same rate, we can only keep up with N log N • A way out? • Relax notion of optimal (data is fuzzy, answers are approximate) • Don’t assume infinite computational resources or memory • Combination of statistics & computer science
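The scaling argument can be made concrete: if data size N and compute capacity both grow 100x, an N log N algorithm's workload grows only slightly faster than the hardware, while N² and N³ workloads fall hopelessly behind. A small illustration (the starting N is arbitrary):

```python
import math

def work_growth(work_fn, n_old: float, data_growth: float) -> float:
    """Factor by which total work grows when N grows by data_growth."""
    return work_fn(n_old * data_growth) / work_fn(n_old)

n = 1e9  # arbitrary starting problem size
for name, fn in [("N log N", lambda x: x * math.log(x)),
                 ("N^2    ", lambda x: x * x),
                 ("N^3    ", lambda x: x ** 3)]:
    growth = work_growth(fn, n, 100)
    # Compute also grew 100x, so anything much above 100 falls behind.
    print(f"{name}: work grows {growth:,.0f}x for 100x more data")
```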

  42. Analysis and Databases • Much statistical analysis deals with • Creating uniform samples • Data filtering • Assembling relevant subsets • Estimating completeness • Censoring bad data • Counting and building histograms • Generating Monte-Carlo subsets • Likelihood calculations • Hypothesis testing • Traditionally these are performed on files • Most of these tasks are much better done inside a database • Move Mohamed to the mountain, not the mountain to Mohamed.
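"Bring the analysis to the data" can be illustrated with the histogram task from the slide's list: one GROUP BY computed where the rows live, instead of shipping every row to a client-side file. A minimal sketch in SQLite (table and column names are invented for illustration):

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurement (value REAL)")
random.seed(0)
con.executemany("INSERT INTO measurement VALUES (?)",
                [(random.uniform(0, 10),) for _ in range(10000)])

# Histogram with unit-width bins, counted inside the database:
# only the bin counts cross the wire, not 10,000 rows.
hist = con.execute("""
    SELECT CAST(value AS INTEGER) AS bin, COUNT(*) AS n
    FROM measurement GROUP BY bin ORDER BY bin
""").fetchall()
for b, n in hist:
    print(b, n)
```

For a petabyte-scale store the same query shape lets the DBMS parallelize the scan and use indices, which is exactly the leverage FTP-and-GREP gives up.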

  43. Outline (scale graphic: Kilo Mega Giga Tera Peta Exa Zetta Yotta – “We are here”) Historical trends imply that in 20 years: • we can store everything in cyberspace. The personal petabyte. • computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR Implications: • The information avalanche will only get worse. • The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing • Organizing, summarizing, prioritizing information is a key technology.

  44. The Evolution of Science • Observational Science • Scientist gathers data by direct observation • Scientist analyzes data • Analytical Science • Scientist builds analytical model • Makes predictions • Computational Science • Simulate analytical model • Validate model and make predictions • Data Exploration Science • Data captured by instruments or generated by simulator • Processed by software • Placed in a database / files • Scientist analyzes database / files

  45. e-Science • Data captured by instruments or generated by simulator • Processed by software • Placed in files or a database • Scientist analyzes files / database • Virtual laboratories • Networks connecting e-Scientists • Strong support from funding agencies • Better use of resources • Primitive today

  46. e-Science is Data Mining • There are LOTS of data • People cannot examine most of it • Need computers to do analysis • Manual or Automatic Exploration • Manual: person suggests hypothesis, computer checks hypothesis • Automatic: computer suggests hypothesis, person evaluates significance • Given an arbitrary parameter space: • Data clusters • Points between data clusters • Isolated data clusters • Isolated data groups • Holes in data clusters • Isolated points Nichol et al. 2001 Slide courtesy of and adapted from Robert Brunner @ Caltech.

  47. TerraServer/TerraService http://terraService.Net/ • US Geological Survey Photo (DOQ) & Topo (DRG) images online • On Internet since June 1998 • Operated by Microsoft Corporation • Cross-indexed with: • Home sales • Demographics • Encyclopedia • A web service • 20 TB data source • 10M web hits/day

  48. USGS Image Data • Digital OrthoQuads: 18 TB, 260,000 files uncompressed; digitized aerial imagery; 88% coverage of conterminous US; 1 meter resolution; < 10 years old • Digital Raster Graphics: 1 TB compressed TIFF, 65,000 files; scanned topographic maps; 100% U.S. coverage; 1:24,000, 1:100,000 and 1:250,000 scale maps; maps vary in age

  49. User Interface Concept • Display Imagery: • 316M 200 x 200 pixel images • 7-level image pyramid • Resolution 1 meter/pixel to 64 meters/pixel • Navigation Tools: • 1.5M place names • “Click-on” coverage map • Longitude and latitude search • U.S. address search • External Geo-Spatial Links to: • USGS on-line stream flow gauges • Home Advisor Demographics • Home Advisor Real Estate • Encarta articles Concept: user navigates an ‘almost seamless’ image of earth • Click on image to zoom in • Buttons to pan NW, N, NE, W, E, SW, S, SE • Links to switch between Topo, Imagery, and Relief data • Links to Print, Download and view meta-data information
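The 7-level figure follows from the pyramid's construction: each level halves the resolution, so going from 1 meter/pixel to 64 meters/pixel takes exactly seven levels (1, 2, 4, 8, 16, 32, 64). A one-line check:

```python
# Enumerate pyramid levels by repeated doubling of meters-per-pixel.
levels = []
res = 1  # meters per pixel at the base level
while res <= 64:
    levels.append(res)
    res *= 2

print(levels)       # [1, 2, 4, 8, 16, 32, 64]
print(len(levels))  # 7 levels, matching the slide
```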
