Innovation Acceleration by Public Data Analysis

Innovation Acceleration by Public Data Analysis. Or, Big Data in Hungary - Archiving and Mining the Academic Web. George Kampis, CEO, PetaByte Nonprofit Research Ltd. www.dynanets.org. www.textrend.org. www.futurict.szte.hu.


Presentation Transcript


  1. Innovation Acceleration by Public Data Analysis. Or, Big Data in Hungary - Archiving and Mining the Academic Web. George Kampis, CEO, PetaByte Nonprofit Research Ltd.

  2. PetaByte Nonprofit Research Ltd. • www.dynanets.org • www.textrend.org • www.futurict.szte.hu • www.petabyte-research.org • www.hungarianscience.org

  3. What we do in futurict.hu • In the context of scientific research and higher education (in particular, in Hungary): • Investment and return ("ROI") analysis • "Science of success" • Structural analysis of institutions • www.hungarianscience.org • http://www.oktatas.hu/felsooktatas/projektek/tamop721_eszafejl/projekthirek/hazai_tudomanymetriai_felmeres • New forms of publication (e.g. data sharing in papers)

  4. Context in FuturICT • "Innovation Accelerator" • ... to help (scientific) innovation with [...] social media as well as data services • Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136. • van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214. • Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming). • "BIG DATA"

  5. Partially similar developments • Mendeley: reference manager and collaboration network • ResearchGate: research network and publications portal w/ quality assessment • Altmetrics: article-level online metrics • VIVO: connect, share, discover

  6. Big (web) data is a key • Big Data in Google Trends • "deep data" • controversy... • Massive web data: harvesting / archiving • Google itself... • The Internet Archive • UK Web Archive, British Library

  7. Web archiving in Hungary • None. Nope. • "MIA" (Magyar Internet Archivum, HU Internet Archive) • Various documents, plans and small-scale pilots • Since 2006 • Our ambition: to archive and mine HU academia = "HUA" • 500 NIIF institutions (NIIF = National Information Infrastructure Development) • 42 HAS (Hungarian Academy of Sciences) research institutes • 47 higher education entities (universities and polytechnics) • Now in collaboration with: OSZK (National Library), NIIF...

  8. A running "HUA" pilot in PetaByte/FuturICT.hu • Hardware: Dell T710 server (2x4 core Xeon E5520, 48 GB RAM, 2 TB HDD) • Software: Heritrix crawlers called from API and cURL, spawned from timed scripts... • Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip • Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)... • Operation since April 2013. • Longitudinal archiving in mirror format (2-weekly periods), using a form of "diff" developed in-house
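
The exclusion list and the two-weekly "diff" on slide 8 can be sketched in Python as follows. This is a minimal illustration, not the pilot's actual code: the function names, the `{url: fingerprint}` snapshot format, and the use of SHA-256 hashes to detect changed pages are all assumptions; the slides only state that an in-house form of diff is used.

```python
import hashlib
from urllib.parse import urlparse

# Extensions listed on slide 8 as "not downloaded".
EXCLUDED_EXTENSIONS = {
    "exe", "gz", "iso", "jar", "mp3", "ogg",
    "ppt", "rar", "wav", "xls", "xlsx", "zip",
}


def should_download(url: str) -> bool:
    """Return False when the URL path ends in an excluded extension."""
    path = urlparse(url).path.lower()
    name = path.rsplit("/", 1)[-1]  # last path segment, e.g. "report.zip"
    if "." not in name:
        return True  # no extension at all (directory or clean URL)
    return name.rsplit(".", 1)[-1] not in EXCLUDED_EXTENSIONS


def fingerprint(content: bytes) -> str:
    """Content hash used to detect changes (assumption: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {url: fingerprint} snapshots (hypothetical format)."""
    old_urls, new_urls = set(old), set(new)
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }
```

Storing only the added, removed, and changed pages per period is one way to keep a longitudinal mirror small; whether the pilot does this is not stated in the slides.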

  9. The processing of results • Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ... terms) • "The Science of Success" (A.-L. Barabási) • http://www.eccs13.eu/index.php/satellites • http://barabasilab.com/success/ • http://www.facebook.com/SuccessScience • Bottleneck: availability of public funding data, need for enforcement of open data initiatives • In this pilot phase: basic statistics, turnover rates etc.
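
As a rough illustration of the "timed (dynamic) keyword nets" named among the future plans, the sketch below builds one keyword co-occurrence network per archiving period. The slides do not describe how such nets would be constructed; the co-occurrence definition (two keywords are linked when they appear in the same document), the function names, and the input format are assumptions, and keyword extraction itself is taken as already done.

```python
from collections import Counter
from itertools import combinations


def keyword_net(documents):
    """Count keyword co-occurrences in one period.

    `documents` is a list of keyword lists, one per document.
    The weight of edge (a, b) is the number of documents containing both.
    """
    edges = Counter()
    for keywords in documents:
        # Deduplicate and sort so each pair is counted once, in a fixed order.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges


def timed_keyword_nets(snapshots):
    """Build one net per period from {period_label: [keyword lists]}."""
    return {period: keyword_net(docs) for period, docs in snapshots.items()}


# Hypothetical input: two archiving periods with pre-extracted keywords.
snapshots = {
    "period-1": [["grant", "physics"], ["grant", "physics", "data"]],
    "period-2": [["grant", "data"]],
}
nets = timed_keyword_nets(snapshots)
```

Comparing edge weights between consecutive periods would then show which keyword pairs appear or disappear over time.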

  10. How big is big?

  11. Quick results, basic stats • All 89 HU academic institutions: 86 GB total (text 42 GB) • Rank distributions (total) [figure panels: HAS, Higher Ed.]

  12. Quick results, basic stats 2. • Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps) [figure panels: HAS, Higher Ed.]
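
A rank distribution of the kind plotted on slides 11 and 12 orders sites by archived size and pairs each size with its rank. The sketch below shows one way to compute it; the site names and sizes are made up for illustration and are not the pilot's data, and the function name is hypothetical.

```python
def rank_distribution(sizes):
    """Return (rank, site, size) tuples, largest site first, rank starting at 1."""
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return [(rank, site, size) for rank, (site, size) in enumerate(ordered, start=1)]


# Hypothetical archive sizes in MB (not taken from the slides).
example_sizes_mb = {
    "institute-a.example": 120,
    "institute-b.example": 4500,
    "institute-c.example": 35,
    "institute-d.example": 610,
}

ranked = rank_distribution(example_sizes_mb)
for rank, site, size in ranked:
    print(rank, site, size)
```

Plotting size against rank on logarithmic axes is a common way to inspect such distributions; the slides do not say which axes were used.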

  13. Quick first insights • (Outliers are chemistry catalogs and astronomy datasets) • Average size: 974 MB per site (median: 137 MB [!]) • Average text size: 474 MB per site (median: 47 MB [!]) • For comparison: Kampis website @ ELTE = 180 MB (text only) • Hypothesis: useful comparisons and metrics possible • Add dynamic aspect...
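
The gap between the average and the median on slide 13 is what a few very large outliers produce in a skewed distribution. The sketch below demonstrates the effect on made-up numbers (not the pilot's data) and then computes the mean-to-median ratio from the figures reported on the slide; the helper name is hypothetical.

```python
from statistics import mean, median


def mean_to_median_ratio(values):
    """Ratio well above 1 suggests a few large values dominate the mean."""
    return mean(values) / median(values)


# Hypothetical site sizes in MB: four small sites and one large outlier.
sizes_mb = [10, 20, 30, 40, 1000]
skew = mean_to_median_ratio(sizes_mb)

# Ratios computed from the averages and medians reported on slide 13.
reported_total_ratio = 974 / 137  # all content
reported_text_ratio = 474 / 47    # text only

print(round(skew, 2), round(reported_total_ratio, 2), round(reported_text_ratio, 2))
```

In the made-up sample the single outlier pulls the mean far above the median, which is why the median is the more representative "typical site" figure for data like this.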

  14. Conclusions, suggestions • Very first steps, only 2 months into the pilot • Data intensive, has natural timing • Big (web) data are important for research assessment • Big data are often small (also elsewhere...) • Suggests itself for readily available indexes and derivative measures • We have shown a simplest yet instructive case ("size matters") • Caveat: need normalizations!

  15. Future works... are left to the (near) future

  16. Thank you! • Coworkers: Laszlo Gulyas (PhD), Sandor Soos (PhD), Balazs Balint (MSc), Zsolt Juranyi (BSc), Attila Palmai (BSc student) • This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).
