Explore how Alexa, the web navigation service, mines the internet for quality content and important information. Compare large data repositories, such as the Library of Alexandria and the Library of Congress, with Alexa's Internet Archive. Learn about web stats, usage paths, and the importance of being right enough when answering users' questions.
Datamining the Internet: Alexa
Brewster Kahle
President, Alexa Internet
brewster@alexa.com
To Answer Any Question...
• Know a lot
• Know what is important
• Be right enough
Alexa: The web navigation service that learns from people
Know a Lot: Other Repositories
• Library of Alexandria: 800GB (400k scrolls @2MB)
• Library of Congress: 20TB (20M books, ASCII)
• Dialog Information Service: 3-5TB
• Video Store: 8TB (5k videos, 1GB/hr)
• Public Branch Library: 3TB (300k scanned books)
• Radio Station: 1TB (15k hrs of music)
• ...
Alexa’s Internet Archive: 10TB
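The repository sizes above can be cross-checked with simple arithmetic. The sketch below uses only the figures on the slide and assumes decimal units (1 TB = 1000 GB); the helper names are illustrative, and the per-item sizes are just what the slide's totals imply, not independent measurements.

```python
# Back-of-the-envelope check of the repository sizes on the slide.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).
MB, GB, TB = 10**6, 10**9, 10**12


def total_bytes(items, bytes_per_item):
    """Total size of a collection from an item count and a per-item size."""
    return items * bytes_per_item


def implied_bytes_per_item(total, items):
    """Per-item size implied by a total size and an item count."""
    return total / items


# Library of Alexandria: 400k scrolls at 2 MB each.
alexandria = total_bytes(400_000, 2 * MB)

# Per-item sizes implied by the slide's totals.
congress_per_book = implied_bytes_per_item(20 * TB, 20_000_000)
branch_per_scanned_book = implied_bytes_per_item(3 * TB, 300_000)
video_store_per_video = implied_bytes_per_item(8 * TB, 5_000)
radio_per_hour = implied_bytes_per_item(1 * TB, 15_000)

print(f"Library of Alexandria: {alexandria / GB:.0f} GB")
print(f"Library of Congress: {congress_per_book / MB:.1f} MB per ASCII book")
print(f"Public branch library: {branch_per_scanned_book / MB:.1f} MB per scanned book")
print(f"Video store: {video_store_per_video / GB:.1f} GB per video")
print(f"Radio station: {radio_per_hour / MB:.1f} MB per hour of music")
```

At the slide's 1GB/hr rate, the video store figure works out to about 1.6 hours per video.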
Know a Lot: Gathering

Protocol      Number of Sites   Total Data   New stuff (GB/month)
WWW (http)    640,000           4,000 GB     1,000
gopher        5,000             100 GB       ?
ftp           10,000            5,000 GB     ?
NetNews       33,000 groups     500 GB       30

• Web snapshot on a T3 in 20 days
• Users' paths essential as well
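The 20-day snapshot figure can be sanity-checked against the capacity of a T3 line. This is a rough sketch under two assumptions: that the snapshot corresponds to the 4,000 GB WWW total in the table, and that gigabytes are decimal. The nominal T3 (DS3) line rate of 44.736 Mbit/s is standard; the function name is illustrative.

```python
# Rough check: what sustained rate moves the WWW total in 20 days?
T3_BITS_PER_SECOND = 44.736e6   # nominal T3 (DS3) line rate
SNAPSHOT_BYTES = 4_000 * 10**9  # assumption: snapshot = 4,000 GB WWW total
SNAPSHOT_DAYS = 20


def required_mbps(total_bytes, days):
    """Average rate in Mbit/s needed to transfer total_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return total_bytes * 8 / seconds / 1e6


rate_mbps = required_mbps(SNAPSHOT_BYTES, SNAPSHOT_DAYS)
utilization = rate_mbps * 1e6 / T3_BITS_PER_SECOND

print(f"Required average rate: {rate_mbps:.1f} Mbit/s")
print(f"Share of a T3 line: {utilization:.0%}")
```

Under these assumptions the crawl needs roughly 18.5 Mbit/s sustained, a bit over 40% of the line, which leaves headroom for protocol overhead and uneven server response times.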
Web Stats
• 1 million sites, doubling every 6 months (millions of authors)
• More videos, dynamic pages, Java etc.
• 15 links on each page
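To see what "doubling every 6 months" implies, here is a minimal extrapolation sketch. It assumes the doubling rate stays constant, which the slide does not claim; the function name and time horizons are illustrative.

```python
# Extrapolate the site count, assuming a constant 6-month doubling time.
def projected_sites(initial_sites, months, doubling_months=6):
    """Site count after `months`, if the count doubles every `doubling_months`."""
    return initial_sites * 2 ** (months / doubling_months)


for months in (0, 6, 12, 24):
    print(f"after {months:2d} months: {projected_sites(1_000_000, months):,.0f} sites")
```

On this assumption, 1 million sites becomes 4 million after one year and 16 million after two.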
Storage
Snapshot of the Web on a tape jukebox costs $80k
Knowing What Is Important: Mining the WWW for Quality
• Content: 100 million pages
• Link Structure: 750 million links
• Usage paths: many 100 million hits
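As a toy illustration of how link structure might be mined for a quality signal, the sketch below counts inbound links per page. This is an assumed stand-in technique, not a description of Alexa's actual method; the URLs and function name are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count distinct pages linking to each target, ignoring self-links."""
    distinct = set(links)  # drop duplicate (source, target) pairs
    return Counter(target for source, target in distinct if source != target)


# Hypothetical (source, target) link pairs.
links = [
    ("a.example/", "b.example/"),
    ("c.example/", "b.example/"),
    ("a.example/", "b.example/"),  # duplicate, counted once
    ("b.example/", "b.example/"),  # self-link, ignored
    ("b.example/", "c.example/"),
]

counts = inbound_link_counts(links)
for page, n in counts.most_common():
    print(page, n)

# The slide's totals imply an average number of links per page.
print(750_000_000 / 100_000_000, "links per page on average")
```

A page with many distinct inbound links is treated here as more likely to be important; combining this with usage paths would need a weighting scheme the slides do not specify.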
Be Right Enough: Being Useful
• Competition
• Directories:
  • Biggest only links to < 1% of Web pages
• Search Engines:
  • Returning thousands of hits (sometimes millions)
• Trends:
  • Move to “channels” of less content, but good
  • Limit crawling (50M pages and holding)
Be Right Enough: Alexa
• Where am I?
• Where do I want to go?
• Alexa:
  • “Can I trust this information?”
  • What should I look at next?
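One simple way to answer "What should I look at next?" from usage paths is to count which page users most often visit immediately after the current one. The sketch below is an assumed illustration of that idea, not Alexa's actual algorithm; the paths, URLs, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(paths):
    """Count page-to-next-page transitions across all recorded usage paths."""
    transitions = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            transitions[current][following] += 1
    return transitions


def suggest_next(transitions, page, k=3):
    """Return up to k pages most often visited right after `page`."""
    return [p for p, _ in transitions.get(page, Counter()).most_common(k)]


# Hypothetical usage paths: each list is one user's sequence of page visits.
paths = [
    ["a.example/", "b.example/", "c.example/"],
    ["a.example/", "b.example/"],
    ["a.example/", "c.example/"],
]

transitions = build_transitions(paths)
print(suggest_next(transitions, "a.example/"))
```

Pages with no recorded onward visits simply yield an empty list, so a caller would need a fallback such as link-based suggestions.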