Innovation Acceleration by Public Data Analysis

Presentation Transcript

  1. Innovation Acceleration by Public Data Analysis Or, Big Data in Hungary - Archiving and Mining the Academic Web George Kampis, CEO PetaByte Nonprofit Research Ltd.

  2. PetaByte Nonprofit Research Ltd. www.dynanets.org www.textrend.org www.futurict.szte.hu www.petabyte-research.org www.hungarianscience.org

  3. What we do in futurict.hu • In the context of scientific research and higher education (in particular, in Hungary): • Investment and return („ROI“) analysis • „science of success“ • Structural analysis of institutions • www.hungarianscience.org • http://www.oktatas.hu/felsooktatas/projektek/tamop721_eszafejl/projekthirek/hazai_tudomanymetriai_felmeres • New forms of publication (e.g. data sharing in papers)

  4. Context in FuturICT • „Innovation Accelerator“ • .. to help (scientific) innovation with [...] social media as well as data services Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136. van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214. Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming). „BIG DATA“

  5. Partially similar developments • Mendeley • Reference manager and collaboration network • ResearchGate • Research network and publications portal w/ quality assessment • Altmetrics • Article-level online metrics • VIVO • Connect, share, discover

  6. Big (web) data is a key • Big Data in Google Trends • „deep data“ • controversy... • Massive Web Data: harvesting / archiving • Google itself... • The Internet Archive • UK Web Archive, British Library

  7. Web archiving in Hungary • None. Nope. • „MIA“ (Magyar Internet Archivum, HU Internet Archive) • Various documents, plans and small-scale pilots • Since 2006 • Our ambition: to archive and mine HU academia = „HUA“ • 500 NIIF institutions (NIIF = Nat‘l Information Infrastructure Dev‘t.) • 42 HAS (HU Acad Sci) research institutes • 47 higher education entities (universities and polytechnics) • Now in collaboration with: OSZK (National Library), NIIF...

  8. A running „HUA“ pilot in petabyte/Futurict.hu • Hardware: Dell T710 server (2x4 core Xeon E5520, 48GB RAM, 2TB HDD) • Software: Heritrix crawlers called from API and CURL, spawned from timed scripts... • Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip • Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)... • Operation since April 2013. • Longitudinal archiving in mirror format (2-weekly periods), using an in-house form of „diff“
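
The extension exclusion list and the snapshot „diff“ on this slide can be illustrated with a minimal sketch. The slides do not describe how the pilot implements either step, so the function names `should_download` and `diff_snapshots`, and the idea of representing a crawl as a mapping from path to content hash, are assumptions made for illustration only.

```python
import posixpath
from urllib.parse import urlparse

# Extensions the slide lists as "not downloaded".
SKIPPED_EXTENSIONS = {
    "exe", "gz", "iso", "jar", "mp3", "ogg",
    "ppt", "rar", "wav", "xls", "xlsx", "zip",
}


def should_download(url: str) -> bool:
    """Return False when the URL path ends in a skipped extension."""
    path = urlparse(url).path  # ignores query string and fragment
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension not in SKIPPED_EXTENSIONS


def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {path: content_hash} mappings from consecutive crawls.

    Hypothetical stand-in for the pilot's own diff, which is not
    described in the slides.
    """
    old_paths, new_paths = set(old), set(new)
    common = old_paths & new_paths
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(p for p in common if old[p] != new[p]),
    }
```

Storing only the added, removed and changed paths for each 2-weekly period would keep a longitudinal archive much smaller than a full mirror per period, which is presumably the motivation for a diff-based format.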

  9. The processing of results • Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ... terms) • „The Science of Success“ (A.-L. Barabási) • http://www.eccs13.eu/index.php/satellites • http://barabasilab.com/success/ • http://www.facebook.com/SuccessScience • Bottleneck: availability of public funding data, need to enforce open data initiatives • In this pilot phase: basic statistics, turnover rates etc.
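
The slides mention turnover rates without defining them. One possible reading, sketched below, is the share of a site's paths that were added, removed or changed between two consecutive crawls. This definition, the function name `turnover_rate`, and the path-to-hash representation are all assumptions, not the pilot's documented method.

```python
def turnover_rate(old: dict, new: dict) -> float:
    """Share of paths added, removed, or changed between two crawls.

    Both arguments map a path to a content hash. Dividing by the union
    of paths is an assumed definition chosen for illustration.
    """
    all_paths = set(old) | set(new)
    if not all_paths:
        return 0.0
    unchanged = sum(
        1 for p in all_paths
        if p in old and p in new and old[p] == new[p]
    )
    return 1.0 - unchanged / len(all_paths)
```

Under this definition a site that is identical in both crawls scores 0.0 and a site with no surviving unchanged file scores 1.0.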

  10. How big is big?

  11. Quick results, basic stats • All 89 HU academic institutions: 86GB total (text 42GB) • Rank distributions (total) [figure panels: HAS, Higher Ed.]

  12. Quick results, basic stats 2. • Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps) [figure panels: HAS, Higher Ed.]
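
A rank distribution of the kind named on the two slides above orders sites by size and pairs each with its rank. The sketch below shows the idea; the function name `rank_distribution` and the site names and sizes in the usage example are made up, since the per-site figures are not in the transcript.

```python
def rank_distribution(sizes_mb: dict) -> list:
    """Return (rank, site, size_mb) tuples, largest site first."""
    ordered = sorted(sizes_mb.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, site, size)
        for rank, (site, size) in enumerate(ordered, start=1)
    ]


# Hypothetical sizes in MB, for illustration only.
example = {"site-a": 120, "site-b": 4300, "site-c": 45}
for rank, site, size in rank_distribution(example):
    print(rank, site, size)
```

Plotting size against rank on logarithmic axes is a common way to make a few very large sites and a long tail of small ones visible in one chart.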

  13. Quick first insights • (Outliers are chem. catalogs viz. astronomy datasets) • Average size: 974 MB per site (median: 137 MB [!]) • Average text size: 474 MB per site (median: 47 MB [!]) • For comparison: • Kampis website @ ELTE = 180 MB (text only) • Hypothesis: useful comparisons and metrics possible • Add dynamic aspect...
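
The gap between the averages and the medians above is what a few very large outlier sites would produce. The sketch below first checks that the per-site averages agree with the totals reported earlier (using only figures from the slides), then shows the mean/median effect on a made-up list of sizes. The list `example_sizes_mb` is hypothetical and is not the pilot's data.

```python
from statistics import mean, median

# Consistency check using only figures from the slides.
SITES = 89               # 42 HAS institutes + 47 higher education entities
total_mb = 974 * SITES   # 86,686 MB, in line with "86GB total"
text_mb = 474 * SITES    # 42,186 MB, in line with "text 42GB"
print(total_mb, text_mb)

# Hypothetical site sizes in MB: seven modest sites and one large outlier.
example_sizes_mb = [20, 40, 60, 80, 100, 150, 200, 5000]
print(mean(example_sizes_mb))    # 706.25, pulled up by the outlier
print(median(example_sizes_mb))  # 90.0, close to a typical site
```

When a single outlier can move the mean this far, the median is the more robust summary of a typical site.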

  14. Conclusions, suggestions • Very first steps, only 2 months into the pilot • Data intensive, has natural timing • Big (web) data are important for research assessment • Big data are often small (also elsewhere...) • Suggests itself for readily available indexes and derivative measures • We have shown the simplest yet instructive case („size matters“) • Caveat: need normalizations!

  15. Future works... are left to the (near) future

  16. Thank you! • Coworkers: Laszlo Gulyas (PhD), Sandor Soos (PhD), Balazs Balint (MSc), Zsolt Juranyi (BSc), Attila Palmai (BSc student) • This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).