Distributed Tera-Mining. R. L. Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify, Inc. Trend 1. Explosion of Data … All in the Wrong Format. With no one to analyze it. The Data Gap. Most data comes a GB and a TB at a time.
Distributed Tera-Mining. R. L. Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify, Inc.
Trend 1. Explosion of Data … All in the Wrong Format. With no one to analyze it.
The Data Gap. Most data comes a GB and a TB at a time. [Chart: total new disk (TB) since 1995 vs. new Ph.D.s]
Trend 2. SONET is dead. Lambda rules. Gigabytes can be moved in seconds.
Trend 3: Most Data is Distributed • Bush’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
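The square law on this slide can be made concrete with a short sketch. It assumes, purely for illustration, that usefulness is exactly proportional to the square of the number of columns compared; the function name `relative_usefulness` and the `baseline` parameter are hypothetical and do not come from the talk.

```python
def relative_usefulness(columns_compared: int, baseline: int = 1) -> float:
    """Usefulness relative to a baseline, assuming usefulness ~ n**2.

    columns_compared: number of columns the data is compared against.
    baseline: number of columns used as the reference point (assumed > 0).
    """
    if columns_compared < 0 or baseline <= 0:
        raise ValueError("columns_compared must be >= 0 and baseline > 0")
    return (columns_compared / baseline) ** 2


if __name__ == "__main__":
    # Doubling the number of comparable columns quadruples the usefulness
    # under the assumed square law.
    for n in (1, 2, 4, 8):
        print(n, relative_usefulness(n))
```

Under this assumption, a column that can be compared against columns held at other sites gains value much faster than the count of those columns grows, which is the argument for making distributed data joinable.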
Example 1: ENSO & Cholera. El Niño data at NCAR; cholera data at WHO.
Example 2: Voting. [Figure: Table 1 and Table 2]
DataSpace – One Approach to Making Data Useful. Complementary to the grid, which we view as a distributed computer.
Today’s Multi-media Web: • html • http • search by keyword • workstations, servers • 16 terabytes of documents • 4 billion documents
Tomorrow’s Data Web: • pmml & dtml • dstp • correlate & mine • data & compute clusters • petabytes of data • tens of billions to trillions of records
[Diagram: DSTP Server 1 serves k[i], x[i]; DSTP Server 2 serves k[i], y[j]. Labels: UCK [uckid], attributes [aid].]
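The diagram suggests two servers that each hold a column indexed by a shared key k. A minimal sketch of the correlate step follows, assuming each server's column has already been fetched into a dictionary keyed by k. This is an in-memory join, not the DSTP protocol; the function names and the sample values are made up for illustration.

```python
from math import sqrt
from typing import Dict, List, Tuple


def join_on_key(left: Dict[str, float],
                right: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Inner-join two columns on their shared keys, in sorted key order."""
    shared = sorted(left.keys() & right.keys())
    return [(k, left[k], right[k]) for k in shared]


def pearson(rows: List[Tuple[str, float, float]]) -> float:
    """Pearson correlation of the two value columns of the joined rows."""
    if len(rows) < 2:
        raise ValueError("need at least two joined rows")
    xs = [r[1] for r in rows]
    ys = [r[2] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up columns standing in for data held on two different servers.
server1_x = {"k1": 1.0, "k2": 2.0, "k3": 3.0, "k4": 4.0}
server2_y = {"k2": 4.0, "k3": 6.0, "k4": 8.0, "k5": 10.0}

joined = join_on_key(server1_x, server2_y)
```

Only keys present on both servers survive the join, so the shared key is what makes columns from independent sources comparable at all.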
Terra Mining Testbed. Optical testbed for distributed tera-mining of scientific data. A further goal is to serve as a testbed for broadband-based business services.
Lessons Learned • It’s the data, stupid. Cycles, cylinders & lambdas are all commodities. • The fundamental challenge: lower the cost to make data useful. • The emergence of internet infrastructure for data is inevitable. It opens up possibilities for new types of scientific discoveries.
For More Information • DataSpace http://www.dataspaceweb.net http://www.ncdm.uic.edu • DataSpace Standards http://www.dmg.org • Selected articles http://www.twocultures.net • Magnify http://www.magnify.com
Trend 2. Bandwidth is a Commodity. [Chart: OC-3, OC-12, OC-48]
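A rough calculation shows what the three optical carrier levels on this slide mean for moving data. The sketch below uses the standard SONET line rate of 51.84 Mbit/s per OC-1 and assumes, as a simplification, that the full line rate is available for payload (framing and protocol overhead are ignored). The function names are hypothetical.

```python
OC1_MBPS = 51.84  # SONET OC-1 line rate in Mbit/s; OC-n is n times this


def oc_rate_mbps(level: int) -> float:
    """Line rate of an OC-n link in Mbit/s (e.g. level=48 for OC-48)."""
    if level <= 0:
        raise ValueError("OC level must be positive")
    return level * OC1_MBPS


def transfer_seconds(size_bytes: float, rate_mbps: float) -> float:
    """Idealized time to move size_bytes over a link, ignoring overhead."""
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    return size_bytes * 8 / (rate_mbps * 1e6)


if __name__ == "__main__":
    one_gigabyte = 1e9  # decimal gigabyte, in bytes
    for level in (3, 12, 48):
        seconds = transfer_seconds(one_gigabyte, oc_rate_mbps(level))
        print(f"OC-{level}: {seconds:.1f} s per GB")
```

Under these idealized assumptions a gigabyte takes a little over three seconds at OC-48 and under a minute at OC-3; real transfers will be slower once protocol overhead and end-system limits are included.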
Distributed Exabytes (New Disks). [Chart: new disk capacity in petabytes, with a marker at 1 exabyte.] Source: IDC (1999), "1999 Winchester Disk Drive Market Forecast and Review".
Trend 3: Most Data is Distributed • W’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.