
Building Scalable Web Archives


Presentation Transcript


  1. Building Scalable Web Archives Florent Carpentier, Leïla Medjkoune Internet Memory Foundation IIPC GA, Paris, May 2014

  2. Internet Memory Foundation • Internet Memory Foundation (European Archive) • Established in 2004 in Amsterdam and then in Paris • Mission: Preserve Web content by building a shared web archiving platform • Actions: Dissemination, R&D and partnerships with research groups and cultural institutions • Open Access Collections: UK National Archives & Parliament, PRONI, CERN, The National Library of Ireland, etc. • Internet Memory Research • Spin-off of IMF, established in June 2011 in Paris • Mission: Operate large-scale or selective crawls & develop new technologies (processing and extraction)

  3. Internet Memory Foundation Focused crawling: • Automated crawls through the Archivethe.Net shared platform • Quality-focused crawls: • Video capture (YouTube channels), Twitter crawls, complex crawls Large-scale crawling: • In-house developed distributed software • Scalable crawler: MemoryBot • Also designed for focused crawls and complex scoping

  4. Research projects Web Archiving and Preservation • Living Web Archives (2007-2010) • Archives to Community MEMories (2010-2013) • SCAlable Preservation Environment (2010-2013) Web-scale data Archiving and Extraction • Living Knowledge (2009-2012) • Longitudinal Analytics of Web Archive data (2010-2013)

  5. MemoryBot design (1) • Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project • URL store designed for large-scale crawls (DRUM) • Built in Erlang, a language designed for distributed, fault-tolerant systems • Distributed (consistent hashing) • Robust: topology change adaptation, memory usage regulation, process isolation
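
To make the consistent-hashing point concrete, here is a minimal Python sketch of routing URLs to crawler nodes so that a topology change only remaps a small fraction of hosts. This illustrates the general technique only, not MemoryBot's Erlang implementation; the node names and replica count are invented.

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, nodes, replicas=100):
            self.replicas = replicas  # virtual points per node smooth the load
            self._ring = []           # sorted list of (hash, node)
            for node in nodes:
                self.add_node(node)

        def _hash(self, key):
            return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

        def add_node(self, node):
            for i in range(self.replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
            self._ring.sort()

        def node_for(self, key):
            # The first ring point clockwise from the key's hash owns the key.
            idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing([f"crawler-{i}" for i in range(9)])
    # Routing by host keeps all politeness state for a host on one node.
    print(ring.node_for("example.org"))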

  6. MemoryBot design (2)

  7. MemoryBot performance • Good throughput with only a slow decrease over time • 85 resources written per second, slowing to 55 after 4 weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
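
A quick back-of-the-envelope check of what those rates imply, assuming for simplicity a linear decline from 85 to 55 resources per second over the four weeks:

    seconds = 4 * 7 * 24 * 3600              # four weeks
    avg_rate = (85 + 55) / 2                 # assumed linear decline
    print(f"~{avg_rate * seconds / 1e6:.0f} million resources")  # ~169 million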

  8. MemoryBot counters

  9. MemoryBot counters

  10. MemoryBot – quality • Support of HTTPS, retries on server failure, configurable URL canonicalisation • Scope: domain suffixes, language, hop sequences, white lists, black lists • Priorities • Trap detection (URL pattern identification, duplicate detection within a PLD, i.e. pay-level domain)
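
To illustrate what configurable URL canonicalisation buys for trap avoidance and duplicate detection, here is a hypothetical rule set in Python; the stripped parameters (utm_*, sessionid) are example rules, not MemoryBot's actual configuration.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

    def canonicalise(url):
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        default = {"http": 80, "https": 443}.get(parts.scheme)
        netloc = host if parts.port in (None, default) else f"{host}:{parts.port}"
        # Drop session/tracking parameters and sort the rest, so URLs that
        # differ only in parameter order or tracking noise compare equal.
        query = urlencode(sorted((k, v) for k, v in parse_qsl(parts.query)
                                 if k.lower() not in TRACKING_PARAMS))
        return urlunsplit((parts.scheme, netloc, parts.path or "/", query, ""))

    assert canonicalise("HTTP://Example.org/a?b=2&utm_source=x&a=1") \
           == "http://example.org/a?a=1&b=2"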

  11. MemoryBot – multi-crawl • Easier management • Politeness observed across different crawls • Better resource utilisation
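
The idea behind cross-crawl politeness can be sketched as a single per-host gate shared by every running crawl. This is an assumed scheme for illustration only: the class name and the 2-second delay are invented, and thread synchronisation is omitted.

    import time
    from collections import defaultdict

    class PolitenessGate:
        """One gate shared by all crawls: a host is never fetched from
        faster than min_delay, no matter how many crawls target it."""

        def __init__(self, min_delay=2.0):
            self.min_delay = min_delay
            self.next_ok = defaultdict(float)   # host -> earliest fetch time

        def acquire(self, host):
            # Every crawl calls this before fetching from `host`.
            wait = self.next_ok[host] - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            self.next_ok[host] = time.monotonic() + self.min_delay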

  12. IM Infrastructure Green datacenters • Through a collaboration with NoRack • Designed for massive storage (petabytes of data) • Highly scalable / low consumption • Reduces storage and processing costs Repository: • HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system • HBase: a distributed key-value index (temporal archives) • MapReduce: a distributed execution framework
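
A common row-key layout for temporal archives in HBase, shown here as an assumed example (the slides do not give IM's actual schema): reversing the host clusters a domain's pages together, and appending the capture timestamp keeps the versions of one URL adjacent for fast temporal scans.

    def row_key(host, path, timestamp):
        # "www.example.org" -> "org.example.www" (SURT-style reversal)
        reversed_host = ".".join(reversed(host.split(".")))
        return f"{reversed_host}{path}:{timestamp}".encode()

    print(row_key("www.example.org", "/news", "20140527093000"))
    # b'org.example.www/news:20140527093000'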

  13. IM Platform (1) Data storage: • temporal aspect (versions) Organised data: • Fast and easy access to content • Easy processing distribution (Big Data) Several views on same data: • Raw, extracted and/or analysed Takes care of data replication: • No (W)ARC synchronisation required

  14. IM Platform (2) Extensive characterisation and data mining actions: • Process and reprocess information any time depending on needs/requests • Extract information such as MIME type, text resources, image metadata, etc.
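
As an example of the kind of reprocessing described above, the sketch below re-extracts MIME types from stored WARC files. Using the open-source warcio library is an assumption here (any WARC reader would do), and the file name is hypothetical.

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    def mime_types(warc_path):
        # Yield (URL, Content-Type) for every response record in the WARC.
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":
                    yield (record.rec_headers.get_header("WARC-Target-URI"),
                           record.http_headers.get_header("Content-Type"))

    for url, mime in mime_types("crawl-segment.warc.gz"):
        print(mime, url)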

  15. SCAlable Preservation Environment (SCAPE) QA/Preservation challenges: • Growing size of web archives • Ephemeral and heterogeneous content • Costly tools/actions Goals: • Develop scalable quality assurance tools • Enhance existing characterisation tools

  16. Visual automated QA: Pagelizer • Visual and structural comparison tool developed by UPMC as part of SCAPE • Trained and enhanced through a collaboration with IMF • Wrapped by the IMF team to be used at large scale within its platform • Allows comparison of two web page snapshots • Provides a similarity score as output
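
Pagelizer itself combines visual and structural features; purely to make "similarity score" concrete, here is a toy pixel-level stand-in (not Pagelizer's algorithm) built on the Pillow imaging library, with hypothetical file names.

    from PIL import Image, ImageChops

    def similarity(path_a, path_b):
        # Grey-scale both screenshots at a common size; 1.0 means identical.
        a = Image.open(path_a).convert("L").resize((256, 256))
        b = Image.open(path_b).convert("L").resize((256, 256))
        diff = ImageChops.difference(a, b)
        return 1.0 - sum(diff.getdata()) / (256 * 256 * 255)

    print(similarity("capture.png", "live.png"))  # e.g. 0.93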

  17. Visual automated QA: Pagelizer • Tested on 13 000 pairs of URLs (Firefox & Opera) • 75% correct assessments • Whole workflow runs in around 4 seconds per pair • 2 seconds for the screenshot (depends on the page rendered) • 2 seconds for the comparison • Runtime already halved since the initial tests (MapReduce)
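
For a sense of scale, 13 000 pairs at around 4 seconds each works out to:

    pairs, secs_per_pair = 13_000, 4
    print(f"~{pairs * secs_per_pair / 3600:.1f} hours if run serially")  # ~14.4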

  18. Next steps Improvements are to be made: • Performance • Robustness • Correctness New test in progress on a large-scale crawl: • Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!

  19. Thank you. Any questions? http://internetmemory.org - http://archivethe.net florent.carpentier@internetmemory.org leila.medjkoune@internetmemory.org
