JSI News Crawler

Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik
JSI

JSI News Crawler
• The goal is to collect most of the world's news articles, including relevant blog posts
• Why collect data?
  • To be independent of commercial data providers, since commercial providers (like Spinn3r, GNIP, DataSift) are expensive and inflexible in terms of data sources and additional services
  • To provide a data stream free of charge for research
• What data is available?
  • Database dumps
  • Articles annotated with Enrycher metadata
  • Clusters of similar articles
  • Real-time feed

Architecture
• Content in the form of:
  • Clean text
  • Linguistics
  • Social graph
  • LOD links
  • Time
• Components shown in the diagram: Open Web, JSI Crawler, Enrycher, Database of Collected Articles, XML/RDF, Control Panel, Web Service API, Developers, Real-Time Analytics, Archive Explorer
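To make the five content forms listed on this slide concrete, here is a minimal sketch of how one enriched article could be held in memory. The class name `EnrichedArticle`, every field name, and the helper `populated_forms` are hypothetical and chosen only for illustration; the slides do not describe the crawler's actual schema or its XML/RDF layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EnrichedArticle:
    """Hypothetical record; field names are illustrative, not the crawler's schema."""
    url: str
    clean_text: str                                    # "Clean text"
    retrieved_at: datetime                             # "Time"
    linguistics: dict = field(default_factory=dict)    # "Linguistics", e.g. tokens or entities
    social_graph: list = field(default_factory=list)   # "Social graph", e.g. (name, name) pairs
    lod_links: list = field(default_factory=list)      # "LOD links", Linked Open Data URIs


def populated_forms(article: EnrichedArticle) -> list:
    """Return the names of the content forms that are non-empty for this article."""
    forms = ["clean_text", "linguistics", "social_graph", "lod_links"]
    present = [name for name in forms if getattr(article, name)]
    present.append("time")  # retrieved_at is a required field, so it is always present
    return present


# Made-up example values, for illustration only.
example = EnrichedArticle(
    url="http://example.org/story",
    clean_text="Example article body.",
    retrieved_at=datetime(2012, 1, 1, tzinfo=timezone.utc),
    lod_links=["http://dbpedia.org/resource/Ljubljana"],
)
```

Calling `populated_forms(example)` lists which forms an article carries, which is the kind of check a consumer of the stream might want before relying on a given annotation.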

Current statistics
• Data sources: ~110,000 unique websites
• Stream size: ~192,000 articles/day
• ~150 distinct languages
  • Good coverage of minority languages
• Current archive of ~35,000,000 articles
• Clear text and language identification available
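The figures on this slide can be related to each other with some quick arithmetic. The sketch below uses only the approximate numbers quoted above; the variable names are illustrative, and the derived values are rough because the inputs are rounded and the daily rate has presumably changed over time.

```python
# Approximate figures quoted on the slide.
SOURCES = 110_000            # unique websites
ARTICLES_PER_DAY = 192_000   # current stream size
ARCHIVE_SIZE = 35_000_000    # articles collected so far

# Average number of articles contributed by one source per day.
articles_per_source_per_day = ARTICLES_PER_DAY / SOURCES

# How many days of streaming at the CURRENT rate the archive corresponds to.
# The real collection period is likely longer if earlier volumes were lower.
archive_in_days_at_current_rate = ARCHIVE_SIZE / ARTICLES_PER_DAY

print(f"{articles_per_source_per_day:.2f} articles per source per day")
print(f"{archive_in_days_at_current_rate:.0f} days of stream at the current rate")
```

In other words, a typical source contributes fewer than two articles a day, and the archive equals roughly half a year of crawling at the current volume.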

Sample Article from the stream

Control Panel (charts shown)
• Download volume, yearly scale (2010)
• Today's download volume, after adding 3k new sources + 1 week of backlog
• Average and maximum number of articles in a story cluster (today)
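The last chart tracks the average and maximum cluster size. As a minimal sketch of how such numbers could be computed, assume clusters are available as a mapping from a cluster id to the list of article ids it contains. That representation and the function name `cluster_size_stats` are assumptions made for illustration; the slides do not say how the control panel derives its values.

```python
from statistics import mean


def cluster_size_stats(clusters: dict) -> tuple:
    """Return (average size, maximum size) over all clusters.

    `clusters` maps a cluster id to the list of article ids in that cluster.
    Returns (0.0, 0) when there are no clusters.
    """
    sizes = [len(article_ids) for article_ids in clusters.values()]
    if not sizes:
        return 0.0, 0
    return mean(sizes), max(sizes)


# Made-up data: three story clusters with 3, 1 and 2 articles.
sample_clusters = {
    "story-a": ["a1", "a2", "a3"],
    "story-b": ["b1"],
    "story-c": ["c1", "c2"],
}
average_size, largest_size = cluster_size_stats(sample_clusters)
```

Tracking both values is useful because the average stays low when most stories have only a few articles, while the maximum highlights the day's biggest story.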

Plans
• In the first half of 2012, the plan is to release the service for public use
• In the future, additional semantic annotation services will be added to provide additional value to the streamed data
