Learn about Internet Memory Foundation's mission to preserve web content through a shared platform, their research projects, and the innovative MemoryBot design for large-scale crawls and extraction.
Building Scalable Web Archives Florent Carpentier, Leïla Medjkoune Internet Memory Foundation IIPC GA, Paris, May 2014
Internet Memory Foundation • Internet Memory Foundation (European Archive) • Established in 2004 in Amsterdam and then in Paris • Mission: Preserve Web content by building a shared web archiving (WA) platform • Actions: Dissemination, R&D and partnerships with research groups and cultural institutions • Open Access Collections: UK National Archives & Parliament, PRONI, CERN, The National Library of Ireland, etc. • Internet Memory Research • Spin-off of IMF established in June 2011 in Paris • Mission: Operate large-scale or selective crawls & develop new technologies (processing and extraction)
Internet Memory Foundation Focused crawling: • Automated crawls through the ArchiveThe.Net shared platform • Quality-focused crawls: • Video capture (YouTube channels), Twitter crawls, complex crawls Large-scale crawling: • In-house developed distributed software • Scalable crawler: MemoryBot • Also designed for focused crawls and complex scoping
Research projects Web Archiving and Preservation • Living Web Archives (2007-2010) • Archives to Community MEMories (2010-2013) • SCAlable Preservation Environment (2010-2013) Web-scale data Archiving and Extraction • Living Knowledge (2009-2012) • Longitudinal Analytics of Web Archive data (2010-2013)
MemoryBot design (1) • Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project • URL store designed for large-scale crawls (DRUM) • Built in Erlang, a language for distributed, fault-tolerant systems • Distributed (consistent hashing) • Robust: adapts to topology changes, regulates memory usage, isolates processes
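The slide names consistent hashing as the distribution mechanism without giving details. The following is a minimal Python sketch of the general technique, not MemoryBot's Erlang implementation. The node names, the choice of hashing host names, the SHA-1 hash and the number of virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a ring position (first 64 bits of its SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (replica count is an assumption)."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop all virtual points of the node (simulates a topology change)."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        """Return the node owning the first ring point at or after hash(key)."""
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical cluster of three crawler nodes and 1000 hosts to assign.
ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
hosts = [f"host{i}.example.org" for i in range(1000)]
before = {h: ring.node_for(h) for h in hosts}

ring.remove("crawler-3")
after = {h: ring.node_for(h) for h in hosts}
moved = [h for h in hosts if before[h] != after[h]]
```

Only the hosts that were assigned to the removed node change owner; every other assignment is untouched, which is what makes topology changes cheap with this scheme.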
MemoryBot performance • Good throughput with a slow decline • 85 resources written per second, slowing to 55 after 4 weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
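To put the per-second figures in perspective, the short Python calculation below converts them to daily volumes. The four-week total additionally assumes a linear decline from 85 to 55 resources per second, which the slide does not state, so that number is only a rough estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_day(rate_per_second):
    """Convert a per-second write rate into resources per day."""
    return rate_per_second * SECONDS_PER_DAY


start_rate, end_rate = 85, 55  # resources per second, from the slide
weeks = 4

daily_start = per_day(start_rate)  # 7,344,000 resources per day
daily_end = per_day(end_rate)      # 4,752,000 resources per day

# Assumption: the rate declines linearly, so the mean rate is the midpoint.
average_rate = (start_rate + end_rate) / 2
estimated_total = average_rate * weeks * 7 * SECONDS_PER_DAY  # about 169 million
```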
MemoryBot – quality • Support for HTTPS, retries on server failure, configurable URL canonicalisation • Scope: domain suffixes, language, hop sequences, whitelists, blacklists • Priorities • Trap detection (URL pattern identification, duplicate detection within a pay-level domain (PLD))
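As an illustration of what URL canonicalisation and suffix-based scoping can look like, here is a small Python sketch. The specific rules (lower-casing, dropping default ports and fragments, sorting query parameters) and the function names `canonicalise` and `in_scope` are assumptions chosen for the example; the slide does not list MemoryBot's actual rules.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url: str) -> str:
    """Normalise a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: user info is dropped
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path or "/"
    # Example rule: sort query parameters; a real crawler would make this configurable.
    query = "&".join(sorted(q for q in parts.query.split("&") if q))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment removed


def in_scope(url: str, suffixes, blacklist=()) -> bool:
    """Accept a URL if its host matches a domain suffix and no blacklisted domain."""
    host = urlsplit(url).hostname or ""

    def matches(domain):
        return host == domain or host.endswith("." + domain)

    if any(matches(b) for b in blacklist):
        return False
    return any(matches(s) for s in suffixes)


canonical = canonicalise("HTTP://Example.ORG:80/a?b=2&a=1#frag")
```

With these rules, `canonical` is `http://example.org/a?a=1&b=2`, and `in_scope("http://www.bnf.fr/", ["fr"])` accepts the URL while the same call with `["de"]` rejects it.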
MemoryBot – multi-crawl • Easier management • Politeness observed across different crawls • Better resource utilisation
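One way to observe politeness across several crawls is to keep a single per-host schedule that all crawls consult before fetching. The Python sketch below shows the idea under stated assumptions: the class name `PolitenessRegistry`, the fixed one-second delay and the explicit `now` argument (used to keep the example deterministic) are all hypothetical and not taken from MemoryBot.

```python
class PolitenessRegistry:
    """Per-host fetch schedule shared by every crawl running on the platform."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def try_acquire(self, host, now):
        """Return True and reserve the host if a fetch is allowed at time `now`."""
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.delay
        return True


# Two independent crawls share one registry, so they cannot hammer the same host.
registry = PolitenessRegistry(delay_seconds=1.0)
crawl_a_first = registry.try_acquire("example.org", now=0.0)   # allowed
crawl_b_early = registry.try_acquire("example.org", now=0.5)   # refused, too soon
crawl_b_later = registry.try_acquire("example.org", now=1.0)   # allowed again
other_host = registry.try_acquire("example.net", now=0.5)      # different host, allowed
```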
IM Infrastructure Green datacenters: • Through a collaboration with NoRack • Designed for massive storage (petabytes of data) • Highly scalable/low consumption • Reduces storage and processing costs Repository: • HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system • HBase: a distributed key-value index (temporal archives) • MapReduce: a distributed execution framework
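The idea of a key-value index over temporal archives can be sketched with a small in-memory model: each URL maps to a list of timestamped versions, and a lookup returns the latest version at or before a given date. This Python sketch only illustrates the access pattern. It does not use the HBase API, and the class name `TemporalIndex`, the integer timestamps and the payload strings are assumptions.

```python
import bisect
from collections import defaultdict


class TemporalIndex:
    """In-memory stand-in for a versioned key-value store keyed by URL."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # url -> sorted capture timestamps
        self._payloads = defaultdict(list)    # url -> payloads in the same order

    def put(self, url, timestamp, payload):
        """Store one captured version, keeping versions sorted by timestamp."""
        i = bisect.bisect_right(self._timestamps[url], timestamp)
        self._timestamps[url].insert(i, timestamp)
        self._payloads[url].insert(i, payload)

    def get(self, url, at=None):
        """Return the latest version at or before `at` (or the newest if None)."""
        timestamps = self._timestamps.get(url, [])
        if not timestamps:
            return None
        i = len(timestamps) if at is None else bisect.bisect_right(timestamps, at)
        return self._payloads[url][i - 1] if i else None


# Hypothetical captures of one page, with timestamps written as YYYYMMDD integers.
index = TemporalIndex()
index.put("http://example.org/", 20130101, "capture from 2013")
index.put("http://example.org/", 20140101, "capture from 2014")

newest = index.get("http://example.org/")                  # "capture from 2014"
mid_2013 = index.get("http://example.org/", at=20130601)   # "capture from 2013"
too_early = index.get("http://example.org/", at=20120101)  # None
```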
IM Platform (1) Data storage: • Temporal aspect (versions) Organised data: • Fast and easy access to content • Easy processing distribution (Big Data) Several views on the same data: • Raw, extracted and/or analysed Takes care of data replication: • No (W)ARC synchronisation required
IM Platform (2) Extensive characterisation and data mining actions: • Process and reprocess information at any time depending on needs/requests • Extract information such as MIME type, text resources, image metadata, etc.
SCAlable Preservation Environment (SCAPE) QA/Preservation challenges? • Growing size of web archives • Ephemeral and heterogeneous content • Costly tools/actions • Develop scalable quality assurance tools • Enhance existing characterisation tools
Visual automated QA: Pagelizer • Visual and structural comparison tool developed by the UPMC as part of SCAPE • Trained and enhanced through a collaboration with IMF • Wrapped by the IMF team to be used at large scale within its platform • Allows comparison of two web page snapshots • Provides a similarity score as an output
Visual automated QA: Pagelizer • Tested on 13,000 pairs of URLs (Firefox & Opera) • 75% correct assessments • Whole workflow runs in around 4 seconds/pair • 2 seconds for the screenshot (depends on the page rendered) • 2 seconds for the comparison • Processing time already halved since initial tests (MapReduce)
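The timings above translate into the following back-of-the-envelope figures. This Python calculation uses only the numbers on the slide; the `wall_clock_hours` helper assumes perfectly even parallelisation across workers, which is an idealisation and not a measured result.

```python
PAIRS = 13_000     # URL pairs tested
SCREENSHOT_S = 2   # seconds per pair for the screenshot
COMPARISON_S = 2   # seconds per pair for the comparison
ACCURACY = 0.75    # share of correct assessments

per_pair_seconds = SCREENSHOT_S + COMPARISON_S
sequential_hours = PAIRS * per_pair_seconds / 3600  # about 14.4 hours on one worker
correct_pairs = round(PAIRS * ACCURACY)             # 9,750 pairs assessed correctly


def wall_clock_hours(workers):
    """Idealised run time if the pairs are split evenly across `workers`."""
    return sequential_hours / workers
```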
Next steps Improvements are to be made: • Performance • Robustness • Correctness New test in progress on a large-scale crawl: • Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!
Thank you. Any questions? http://internetmemory.org - http://archivethe.net florent.carpentier@internetmemory.org leila.medjkoune@internetmemory.org