JSI News Crawler
Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik
JSI
JSI News Crawler
• The goal is to collect most of the world's news articles, including relevant blog posts
• Why collect data?
  • To be independent of commercial data providers
  • Commercial data providers (like Spinn3r, GNIP, DataSift) are expensive and inflexible in terms of data sources and additional services
  • To provide a data stream free of charge for research
• What data is available?
  • Database dumps
  • Articles annotated with Enrycher metadata
  • Clusters of similar articles
  • Real-time feed
Architecture
• Content in form:
  • Clean text
  • Linguistics
  • Social Graph
  • LOD Links
  • Time
[Architecture diagram; component labels: Database of Collected Articles, Open Web, JSI Crawler, Enrycher, XML/RDF, Control Panel, Web Service API, Developers, Real-Time Analytics, Archive Explorer]
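The slide lists the forms in which content is delivered but not a concrete schema. As a rough illustration only, one article record could be modeled as below. The class name `Article`, every field name, and the example values are hypothetical and are not taken from the crawler or from Enrycher.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class Article:
    """Hypothetical record mirroring the content forms named on the slide."""
    url: str                                           # where the article was found
    clean_text: str                                    # "Clean text"
    published: datetime                                # "Time"
    linguistics: dict = field(default_factory=dict)    # "Linguistics" annotations
    social_graph: list = field(default_factory=list)   # "Social Graph" relations
    lod_links: list = field(default_factory=list)      # "LOD Links" (Linked Open Data URIs)


# Made-up example values, for illustration only.
article = Article(
    url="https://example.org/news/1",
    clean_text="Example article body.",
    published=datetime(2012, 1, 15, 9, 30, tzinfo=timezone.utc),
    linguistics={"language": "en"},
    lod_links=["https://example.org/resource/Ljubljana"],
)

record = asdict(article)  # plain dict, ready to serialize
print(sorted(record))
```

A flat record like this is only one possible shape; the actual service exposes its data as XML/RDF, whose structure is not shown in the slides.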
Current statistics
• Data sources: ~110,000 unique websites
• Stream size: ~192,000 articles/day
• ~150 distinct languages
  • Good coverage of minority languages
• Current archive of ~35,000,000 articles
• Clear text and language identification available
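To put these figures in proportion, the short calculation below derives a few rates from them. It uses only the approximate numbers on the slide; the variable names are illustrative, and the results are order-of-magnitude estimates, not measurements.

```python
# Approximate figures quoted on the slide.
ARTICLES_PER_DAY = 192_000
UNIQUE_SOURCES = 110_000
ARCHIVE_SIZE = 35_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate of the stream.
articles_per_second = ARTICLES_PER_DAY / SECONDS_PER_DAY

# Average number of articles each source contributes per day.
articles_per_source_per_day = ARTICLES_PER_DAY / UNIQUE_SOURCES

# How many days of streaming at the current rate equal the archive size.
archive_in_days_of_stream = ARCHIVE_SIZE / ARTICLES_PER_DAY

print(f"{articles_per_second:.2f} articles/second")
print(f"{articles_per_source_per_day:.2f} articles/source/day")
print(f"{archive_in_days_of_stream:.0f} days of stream at the current rate")
```

The last figure is a scale comparison only: the stream rate was not constant over the archive's lifetime, so it should not be read as the archive's actual age.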
[Figure: Download volume, yearly scale (2010)]
Control Panel
[Figure: Today's download volume, after adding 3k new sources + 1 week of backlog]
[Figure: Average and maximum number of story articles in a cluster (today)]
Plans
• The plan is to release the service for public use in the first half of 2012
• In the future, additional semantic annotation services will be added to provide additional value to the streamed data