This overview outlines the key discussions and strategic goals from the euroCRIS members meeting held in Bonn, May 2013. It emphasizes the need for improved integration of structured and unstructured data sources within an open environment. Key objectives include merging multiple data sources, applying entity resolution techniques to break down data silos, and developing a roadmap towards an operational architecture for enhanced knowledge management. The discussion also highlighted the importance of leveraging open FRIS data for better service delivery and policy monitoring.
Research Information Linked Open Data Store euroCRIS members meeting, Bonn, May 2013
Overview • Needs & Drivers • Information and data sources • Structured • Unstructured • Architecture • Planned • Realised • Tools
Project • Partners • Knowledge Management unit, EWI • IBM Belgium • Goals • Merge all sources into one open environment • Apply entity resolution techniques to remove data silos • Crawling and content analysis of full-text elements • Build and test the proposed Pilot Architecture • Information integration from structured and unstructured data in one container • Build a number of visualisations of the information • Develop a roadmap towards the Operational Architecture • Timing • 4 months starting from January 2013 • Cost • 124k euro
Needs & drivers • Better information: correct, up to date, complete • Open FRIS data for services and application development • Flemish government Open Data policy • Maximum reuse of components • Increase strategic intelligence • Maximum reuse of data • Policy monitoring: efficient & effective • Connect data silos • More information services • Reduce system costs
Information and Data sources • Structured Data • FRIS research portal database • Format: CERIF2006 database • Coverage: All universities, 1 university college • 4 university OARs • Format: MODS records • Coverage: X publication records, X full-text resources • VABB-SSH: publication monitoring data set on Social Sciences and Humanities • Format: MODS records • Coverage: All universities • Semantics and information model • Business Semantics Glossary • FRIS model: CERIF2006 • Semantics: Entity Classifications
Information and Data sources • Unstructured Data • All textual information from the structured data • Project Abstracts • Publication Abstracts • Organisation Activity descriptions • Full text of Publications • Websites • Project • Researcher • Organisation
Links and Locators • Access to unstructured data • Textual elements in CERIF model • Project Abstracts • Publication Abstracts • Organisation Activity descriptions • Websites • URI fields in CERIF entities • Links to full text • Resource links in MODS records
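The locators listed above can be illustrated with a minimal Python sketch that pulls the abstract text and the resource links out of a single MODS record. The sample record, its URL, and the function name `extract_locators` are hypothetical and only serve the illustration; the pilot's own extraction code is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace as published by the Library of Congress.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record, not taken from the actual harvest.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example publication</title></titleInfo>
  <abstract>Example abstract text.</abstract>
  <location>
    <url>http://example.org/fulltext/1.pdf</url>
  </location>
</mods>"""


def extract_locators(mods_xml):
    """Return the abstracts and resource URLs found in one MODS record."""
    root = ET.fromstring(mods_xml)
    abstracts = [
        el.text.strip()
        for el in root.iterfind(".//mods:abstract", MODS_NS)
        if el.text
    ]
    urls = [
        el.text.strip()
        for el in root.iterfind(".//mods:location/mods:url", MODS_NS)
        if el.text
    ]
    return {"abstracts": abstracts, "urls": urls}


print(extract_locators(SAMPLE_MODS))
```

The abstracts would feed the content analysis directly, while the URLs are candidates for the full-text crawler.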
Some numbers • CERIF records • Person: 22.006 (FRIS) + 1.454.208 (OAI without resolution) • Project: 24.634 (FRIS) • Organisation: 1.398 (OAI) + 2.022 (FRIS) • Publications: 3.596 (FRIS) • MODS records • OARs: 598.035 (OAI) + VABB database • Publication full text: 45.294 (OAI)
Planned Architecture (diagram components) • Identifiers & Entity Resolution • Content Analysis • Concept Extraction • Visualisation • Triple Store • Structured Data input • Operational Store • Semantic control
OAR Harvesting Architecture (diagram components) • Crawler management • OAI-PMH Crawler • UGent • MODS to CERIF conversion • CERIF database • D2R transformation • UHasselt • … • XML • VABB
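The OAI-PMH crawler step can be sketched in Python as follows. This is a minimal illustration of the protocol's `ListRecords` paging, not the crawler used in the pilot: the base URL, the sample response, and the `mods` metadata prefix are assumptions (the prefix a repository actually supports has to be checked per repository).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical response page, shortened to headers only.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="mods", token=None):
    """Build a ListRecords request URL.

    A resumptionToken is an exclusive argument in OAI-PMH, so
    metadataPrefix is only sent on the first request.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(response_xml):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(response_xml)
    identifiers = [
        el.text
        for el in root.iterfind(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


ids, next_token = parse_page(SAMPLE_PAGE)
print(ids, list_records_url("http://example.org/oai", token=next_token))
```

A harvester would fetch each URL in turn and stop once a page returns no resumption token.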
Architecture – Tools & Standards • BSG • SBVR • Jena • HTTP • D2R • TDB • REST • Java • SPARQL • SKOS • OWL • RDFS • WEB 2.0 • APACHE • FUSEKI • Oracle • TOMCAT • RDF • CERIF • SILK • R2R • SIEVE • ICA • ICC • HARVESTER • OAI-PMH • LDIF • UIMA • MODS
Some numbers • Entities • Projects: 24.634 (FRIS) • Persons: 22.006 (FRIS) + 1.454.208 (OAI, without resolution!) • Publications: 598.035 (OAI) + 3.596 (FRIS) • With full text: 45.294 (OAI) • OrgUnit: 1.398 (OAI) + 2.022 (FRIS) • Recognised author affiliations from full text: 55.662 • Triple Store • Triples FRIS + OAI: 57M • Triples text mining (author recognition + lemmas): 144M • Still without inference (no inferred triples included)
Visualisations • Two test visualisations built so far: • Word cloud for person • http://ewisclod3.vlaanderen.be/words/ • Persons related to Concepts • http://ewisclod3.vlaanderen.be/persons/ • New visualisations will be built on well-defined use cases • Tuning the content analytics to the case • Supervised learning for specific domains • Give a contextual overview of research from the last 10 years on social security issues in Belgium • Annual report on research in the domain of renewable energy
Entity resolution • A few tools tested • Silk Link Discovery Framework • Used to map authors from the OAR harvest onto Persons from the CERIF sources • Experimented with • Manual construction of matching rules via the Silk workbench • Active learning combined with the Silk generic algorithms • Several metrics on the text dimensions: Levenshtein, tf-idf, Jaro, Jaccard, in combination with numerical and temporal dimensions • Results still have to be validated in detail • Tests with OKKAM are planned
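Two of the text metrics named above, Levenshtein and Jaccard, can be sketched in plain Python to show how a simple name-matching rule might combine them. This is not Silk's implementation or rule language; the function names, the example names, and the 0.8 threshold are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance normalised to a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard similarity on lower-cased, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_match(name_a, name_b, threshold=0.8):
    """Hypothetical rule: accept if either metric reaches the threshold."""
    score = max(levenshtein_similarity(name_a.lower(), name_b.lower()),
                jaccard(name_a, name_b))
    return score >= threshold


print(is_match("Jan Peeters", "Peeters Jan"))   # token order differs
print(is_match("Jan Peeters", "Jan Peters"))    # one-character typo
print(is_match("Jan Peeters", "Marie Dubois"))  # unrelated names
```

Taking the maximum lets the token-based metric catch reordered names and the character-based metric catch typos; a production rule would also weigh the numerical and temporal dimensions mentioned above.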
Architecture Roadmap Elements • (optional) Replace D2R with the standard: R2RML • Full-CERIF automatic D2R template generation • Support incremental CERIF/RDF loading • Integration of Data Governance Center via the API • Complete modelling of CERIF and Semantics in Data Governance Center • Full-CERIF automatic ontology template generation • Legend: manual, automated
D2R Views • FRIS: http://ewisclod3.vlaanderen.be/d2rq/fris/ • OAI-PMH: http://ewisclod3.vlaanderen.be/d2rq/oai/ • Text Mining: http://ewisclod3.vlaanderen.be/d2rq/tm/ • SPARQL • Test page: http://ewisclod3.vlaanderen.be/ewilod/html/sparql-test.html • Endpoint (query only): http://ewisclod3.vlaanderen.be/ewilod/sparql • RESTful API (GET) • Resource base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/resource/ • Ontology base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/ontology • Triple store graph URIs • FRIS: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris • OAI-PMH: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#oai • Text Mining: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#tm • Mappings: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#ld • LDIF • Status monitor: http://ewisclod3.vlaanderen.be/ldif/status/ • Silk • Workbench: http://localhost:8080 (via SSH tunnel) • Visualisations • Index page: http://ewisclod3.vlaanderen.be/ewilod/html/vis/index.html • Persons: http://ewisclod3.vlaanderen.be/persons/ • Words: http://ewisclod3.vlaanderen.be/words/
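As a sketch of how the query-only SPARQL endpoint and the named graph URIs listed above fit together, the following Python builds a GET request URL in the style of the SPARQL protocol. It only constructs the URL and does not contact the server, whose availability is not assumed; the counting query and the function names are illustrative assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint and graph URI as listed on the slide.
ENDPOINT = "http://ewisclod3.vlaanderen.be/ewilod/sparql"
FRIS_GRAPH = "http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris"


def count_triples_query(graph_uri):
    """Illustrative query: count the triples in one named graph."""
    return "SELECT (COUNT(*) AS ?n) WHERE { GRAPH <%s> { ?s ?p ?o } }" % graph_uri


def query_url(endpoint, sparql):
    """Encode the query text into the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": sparql})


url = query_url(ENDPOINT, count_triples_query(FRIS_GRAPH))
print(url)

# Round trip: decoding the URL gives back the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```

Percent-encoding matters here because the graph URIs contain a `#`, which would otherwise be read as a URL fragment and cut the query short.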