
Archive overview and projects too



Presentation Transcript


  1. Archive overview and projects too

  2. Important links
     • Need to sign up for “library cards”
       • http://www.archive.org/account/login.createaccount.php
     • Then you can access the following pages:
       • www.archive.org/web/researcher/researcher.php
       • www.archive.org/web/researcher/data_available.php
       • www.archive.org/web/researcher/parallel.php
       • www.archive.org/web/researcher/example_research_create_arc.php

  3. Machine overview
     • Data stored on ~200 desktop computers
     • Host names: ia00xxx (e.g., ia00660)
     • Initially, you’ll use ia0010[0-7]
     • Four 160 GB drives on each machine
       • /0, /1, /2, and /3
       • /1-/3 filled to capacity
       • /0 filled to half capacity
       • /0/tmp is “temp” space for computations
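The host-name pattern above can be expanded programmatically. This is a minimal Python sketch, not part of the original slides: the helper name `expand_hosts` is made up for illustration, and the `/net/<host>` path form is borrowed from the P2 slides later in this deck.

```python
def expand_hosts(prefix, first, last):
    """Return prefix + n for every n from first to last, inclusive."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


# ia0010[0-7] from the slide: eight machines, ia00100 through ia00107.
hosts = expand_hosts("ia0010", 0, 7)

# The /net/<host> form is the one used with "-p" in the P2 slides (assumed here).
net_paths = [f"/net/{h}" for h in hosts]

# Drive mount points as listed on the slide.
drives = ["/0", "/1", "/2", "/3"]

print(" ".join(net_paths))
```

The resulting list is the kind of value one would pass after `-p` when targeting only the initial eight machines.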

  4. Your account
     • Fill out the form at: http://www.soe.ucsc.edu/~raymie/290g-userinfo.html
     • I’ll take it from there
     • Expect an e-mail

  5. Files
     • ARC files -- contain raw data
       • Multiple documents per file, ~100 MB per file
     • DAT files -- contain commonly used fields
     • CDX files -- index of ARC and DAT files
       • /0/tmp/complete.cdx -- per machine
       • Archive-wide CDX files on 6 machines (Wayback)
     • All compressed (ARC on page boundaries)

  6. ARC format

  7. DAT format

  8. Programs
     • Unix tools
       • grep, join, cut, awk, perl, screen(!), ...
     • Alexa tools
     • P2

  9. Alexa tools
     • av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort

  10. P2
     • Based on the data-parallel programming model
       • SIMD: single instruction, multiple data
       • Thinking Machines
     • Idea: run the same command line on all machines

  11. P2
     • p2 program [-c combiner] -p machines
       • program: command line to be run
       • combiner: program to combine results
       • machines: machines to use
         • "-p /net/ia00100 /net/ia00101"
         • "-p $rack1"
         • $rack[1-5], $arcs

  12. P2 - example
     • p2 uptime -p $ARCS
       • Returns the result of uptime on all machines
     • p2 'zcat /0/tmp/complete.cdx.gz | wc -l' -p ..
       • Returns the length (in lines) of the indexes
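For readers who want to check an index length without a shell pipeline, the following Python sketch does the same job as `zcat file | wc -l` on a single machine. The function name `count_lines` is hypothetical; only the standard-library `gzip` module is used.

```python
import gzip


def count_lines(path):
    """Count newline characters in a gzip-compressed file, like `zcat path | wc -l`."""
    total = 0
    with gzip.open(path, "rb") as f:
        # Read in 1 MiB chunks so a large index never has to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total
```

On one of the storage machines this would be called as `count_lines("/0/tmp/complete.cdx.gz")`, assuming the index sits at the path shown in the slide.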

  13. P2
     • Output of “subprograms” is sent to the initiating “p2” program
     • This program “combines” these lines
       • By default, av_cat is used to get them to standard output
       • The -c option allows the user to set a combiner
     • But lines from subprograms can be interleaved
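The run-everywhere-then-combine model described in the P2 slides can be sketched in Python. This is an illustration of the idea, not the real p2 implementation: the names `run_everywhere`, `ssh_runner`, and `concat` are hypothetical, and using `ssh` as the transport is an assumption, since the slides do not say how p2 reaches each machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(machine, command):
    """Assumed transport: run command on machine over ssh, return its output lines."""
    result = subprocess.run(
        ["ssh", machine, command], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def concat(lines):
    """Default combiner: pass every line through unchanged (the role av_cat plays)."""
    return list(lines)


def run_everywhere(command, machines, combiner=concat, runner=ssh_runner):
    """Run the same command on every machine in parallel, then combine the lines."""
    with ThreadPoolExecutor(max_workers=max(1, len(machines))) as pool:
        # pool.map returns results in machine order, one list of lines per machine.
        per_machine = pool.map(lambda m: runner(m, command), machines)
        all_lines = [line for lines in per_machine for line in lines]
    return combiner(all_lines)
```

A custom combiner plays the part of `-c`: for example, `run_everywhere("zcat /0/tmp/complete.cdx.gz | wc -l", machines, combiner=lambda lines: sum(int(x) for x in lines))` would add up the per-machine line counts. Unlike p2 as described above, this sketch collects each machine's output as a whole, so lines from different machines are never interleaved.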

  14. Possible projects
     • Crawl catalog
     • Counts & histograms
     • Page-change
     • Word-change study
     • Language id
     • Table detection
     • RSS download/studies
     • Identify “soft” 404/30x’s
     • Mirror detection
     • JavaScript link extraction
     • Storage redundancy
     • URL database
     • Validating host counts
     • IP sampling vs. crawls
     • Correcting for virtual hosts
