So Much Data
www.sims.berkeley.edu/research/projects/how-much-info
1-2 exabytes per year; 250 MB/yr per person on earth.
Phrased as “everyone on earth writes something the size of Moby Dick 250 times a year,” it makes no sense; phrased as “everyone on earth makes 15 minutes of video each year,” it doesn’t sound so bad.
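A quick sanity check of the per-person figure, as a minimal sketch. The world population of about 6 billion and the use of decimal units (1 EB = 10^18 bytes) are assumptions that are not stated on the slide; `per_person_mb` is just an illustrative helper name.

```python
# Back-of-the-envelope check of the "250 MB/yr per person" figure.
# Assumptions (not from the slides): about 6 billion people, decimal units.
EXABYTE = 10**18
MEGABYTE = 10**6
WORLD_POPULATION = 6_000_000_000  # assumed, roughly the year-2000 figure


def per_person_mb(total_exabytes: float, population: int = WORLD_POPULATION) -> float:
    """Megabytes per person per year for a yearly total given in exabytes."""
    return total_exabytes * EXABYTE / population / MEGABYTE


if __name__ == "__main__":
    # The slide's range is 1-2 EB/yr; 1.5 EB is the midpoint.
    for total in (1.0, 1.5, 2.0):
        print(f"{total} EB/yr -> {per_person_mb(total):.0f} MB per person per year")
```

Under these assumptions the midpoint of the 1-2 exabyte range lands on the slide's 250 MB per person, and the two ends of the range bracket it.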
What kind of media?
Paper: 23-240 TB/yr, mostly office documents
Film: 58-427 TB/yr, mostly home snapshots
Optical: 31-83 TB/yr, mostly music CDs
Magnetic: 577-1693 TB/yr, mostly for computers (300 TB of camcorder tape)
Disk drives: 2500 petabytes per year, 55% for desktops
(In 2000 they said disk was $10/GB and would reach $1/GB in 2005; I saw 76 cents/GB last week.)
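The disk-price forecast above ($10/GB in 2000, $1/GB in 2005) implies a yearly rate of decline, which can be sketched as follows. The constant-percentage (exponential) decline is an assumption made here for illustration, not something the forecast states, and the function names are hypothetical.

```python
import math


def annual_decline(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly fractional price drop that takes start_price to end_price.

    Assumes an exponential (constant-percentage) decline, which is an
    illustrative assumption, not part of the original forecast.
    """
    return 1 - (end_price / start_price) ** (1 / years)


def years_to_reach(start_price: float, target_price: float, decline: float) -> float:
    """Years needed to fall from start_price to target_price at the given yearly decline."""
    return math.log(target_price / start_price) / math.log(1 - decline)


if __name__ == "__main__":
    rate = annual_decline(10.0, 1.0, 5)  # forecast: $10/GB (2000) to $1/GB (2005)
    print(f"implied decline: {rate:.1%} per year")
    # How long the same curve would take to reach the observed 76 cents/GB.
    print(f"years from 2000 to reach $0.76/GB: {years_to_reach(10.0, 0.76, rate):.1f}")
```

Under that assumption the forecast corresponds to prices falling by a bit more than a third each year.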
How much online?
About 100M books have been published; perhaps 200K have been digitized, half available free and half for pay. (Half are in French, by the way.)
Very little music or video is online legally.
The Web is about 10-20 TB of text, with images 5X that; the “deep web” or “dark matter” may be 100X as much.
Strategies for finding things
1. Search engines: back-of-book indexes, now Google
2. Human guidance: once citations, now hyperlinks
3. Knowledge structures: encyclopedias; thesauri; someday we might see PRECIS, CYC, or the Semantic Web actually work
Ranking as a way of combining 1 and 2 seems useful.
As for the Semantic Web, Dave Parnas once wrote that “a data base is something that works, a knowledge base is something that doesn’t work.”
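The idea of ranking as a combination of strategies 1 and 2 can be sketched with a toy example: an inverted index (the electronic version of a back-of-book index) finds the candidate pages, and the number of incoming hyperlinks orders them. The pages, links, and function names below are hypothetical, and counting in-links is a deliberately crude stand-in for link-based ranking, not a description of how any real search engine works.

```python
from collections import defaultdict

# Hypothetical toy corpus: page name -> page text.
PAGES = {
    "a": "whale hunting in moby dick",
    "b": "moby dick full text",
    "c": "video of whale migration",
}
# Hypothetical hyperlinks: page -> pages it links to.
LINKS = {"a": ["b"], "b": [], "c": ["b", "a"]}


def build_index(pages):
    """Inverted index: word -> set of pages containing it (strategy 1)."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def inlink_counts(links):
    """Number of incoming links per page (a crude form of strategy 2)."""
    counts = defaultdict(int)
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


def search(query, index, counts):
    """Pages containing every query word, most-linked-to first (ties by name)."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches, key=lambda name: (-counts.get(name, 0), name))


if __name__ == "__main__":
    index = build_index(PAGES)
    counts = inlink_counts(LINKS)
    print(search("moby dick", index, counts))
    print(search("whale", index, counts))
```

Here the index alone cannot choose between pages "a" and "b" for the query "moby dick"; the links break the tie in favour of the page more people pointed to.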
What have you looked for?
Tell us something you searched for that you couldn’t find.
Was the problem (probably) that it (a) isn’t known, (b) isn’t digitized and online, (c) is restricted by legal or business rules, or (d) is out there but you couldn’t locate it?
How should things be found?
For something that you wanted to find, and that you believe was probably known and probably available, how would you have liked to phrase the query?
What prompted your interest?
How can you formalize that interest?
What kind of data description would you need?