
Classical Model: Web Harvesting

Presentation Transcript


  1. Classical Model: Web Harvesting [diagram] A crawler pulls a URI from the queue, issues GET /, receives an HTTP/1.0 200 OK response, and writes it to a W/ARC file; the labelled resource types are text/css, image/gif, image/jpg, video, and JavaScript.
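To make the classical model concrete, here is a minimal sketch of that fetch-and-store loop in Python, assuming the requests and warcio libraries; the frontier queue, seed URI, and output filename are illustrative choices, not part of the original slide.

```python
from io import BytesIO
from collections import deque

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Hypothetical frontier queue; a real crawler (e.g. Heritrix) manages this itself.
frontier = deque(["http://example.org/"])

with open("harvest.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    while frontier:
        uri = frontier.popleft()                      # pull from queue
        resp = requests.get(uri, timeout=30)          # GET /
        http_headers = StatusAndHeaders(
            f"{resp.status_code} {resp.reason}",      # e.g. "200 OK"
            list(resp.headers.items()),
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(           # write a WARC response record
            uri, "response",
            payload=BytesIO(resp.content),
            http_headers=http_headers,
        )
        writer.write_record(record)
```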

  2. Differences Between a Crawler and a Browser Browsers grab all embedded resources as soon as possible; typical behavior is a burst of traffic followed by long pauses. Crawlers have to play by different rules; typical behavior is sustained traffic, which can quickly overwhelm a website, so crawlers must apply intentional delays and must obey robots.txt rules.
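The politeness rules described on this slide are straightforward to sketch. Below is a hedged Python example using only the standard library plus requests; the delay value and user-agent string are assumptions for illustration.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

CRAWL_DELAY = 5.0                              # intentional delay between requests (assumed value)
USER_AGENT = "example-archival-crawler/0.1"    # hypothetical user-agent string

def allowed(url: str) -> bool:
    """Check the site's robots.txt before fetching, as a crawler must."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(urls):
    """Sustained, throttled traffic rather than a browser-style burst."""
    for url in urls:
        if not allowed(url):
            continue                           # obey robots.txt rules
        requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        time.sleep(CRAWL_DELAY)                # apply the intentional delay
```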

  3. In Your Browser: Behind the Scenes

  4. Browser Experiments Integrate a browser into the link extraction pipeline: log all requests and queue them in Heritrix. Headless browsers are smaller and faster because there is no need to render the page visually. PhantomJS is built on the WebKit engine (Safari, iOS, Chrome, Android) and supports auto-QA, snapshot generation, scripted navigation…
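The slide's PhantomJS behavior scripts are plain JavaScript; since the examples here are in Python, the sketch below uses Playwright, a modern headless-browser library, as a stand-in for the same request-logging idea. Loading a page and recording every resource it requests yields candidate URIs that could be queued back into Heritrix; the wait condition and seed URL are assumptions.

```python
from playwright.sync_api import sync_playwright

def log_page_requests(url: str) -> list[str]:
    """Load a page headlessly and record every URI the browser requests."""
    discovered: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Every resource fetch (CSS, images, XHR, script-generated links) is logged.
        page.on("request", lambda request: discovered.append(request.url))
        page.goto(url, wait_until="networkidle")
        browser.close()
    # These URIs would be handed back to the crawler's frontier for capture.
    return discovered

if __name__ == "__main__":
    for uri in log_page_requests("http://example.org/"):
        print(uri)
```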

  5. A Different Approach: Merging browsers & crawlers…

  6. Recent Use Cases & Implementations
  • Open Planets (browser extractor module for H3): https://www.github.com/openplanets/wap
  • INA (browser with an inline caching proxy; simulates user actions): https://github.com/davidrapin/fantomas
  • NDIIPP/NDSA (a different hybrid approach…): https://github.com/adam-miller/ExternalBrowserExtractorHTML and https://github.com/adam-miller/phantomBrowserExtractor (PhantomJS behavior scripts)

  7. How much do we gain?
  • Traditional link extraction (baseline test): 7444 URIs (200 response), 795 URIs (404 response)
  • Browser only (full instance or scripted headless): ~30% less content
  • PhantomJS combined with the traditional link extractor: +24%
  + Significant improvement in unique URI detection
  - Additional processing overhead… but the load can be distributed to dedicated browser nodes
  + Browser downloads run in a separate workflow, asynchronous from Heritrix
  + JavaScript analytics
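As a rough illustration of the "dedicated browser nodes" point, the sketch below fans browser-based extraction out to a worker pool so it runs asynchronously from the main crawl. Here extract_with_browser is a hypothetical stand-in for the real headless-browser step, and the pool size and seed URL are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract_with_browser(url: str) -> list[str]:
    # Stand-in: in the real workflow this would load the page in a headless
    # browser (as in the earlier sketch) and return every URI it requested.
    return []

def main():
    suspects = ["http://example.org/app"]   # hypothetical pages that need JavaScript execution
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(extract_with_browser, u) for u in suspects]
        # The main crawl continues elsewhere; discovered URIs are folded back
        # into the frontier as each browser job completes.
        for fut in as_completed(futures):
            for uri in fut.result():
                print("queue for crawling:", uri)

if __name__ == "__main__":
    main()
```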

  8. Other Strategies/Implementations
  • Data Mining & Analytics
  • Pre-Crawl Seed & Link Analysis
  • Link/Script Analysis during an Active Crawl
  • Post-Crawl Link/Script Analysis, Patching & Auto QA
  • Native Feeds & Alternate Capture Methods
    • Data format and context is as important as the content
    • E.g.
  • Snapshot Generation & Recording

  9. This is NOT Your Grandfather’s Web Kris Carpenter Negulescu, Director, Web, Internet Archive kcarpenter (at) archive (dot) org
