IWIR-CRIS '06 • Data retrieval in PURE • Data retrieval in the 4-year-old PURE CRIS project at 9 universities
Agenda • Overview • Retrieval • Validated manual data gathering • Dynamic integration to local back-end systems • Aggregation, enrichment and import of historic data • Experiments with automated imports of historic data • Exposure • Two web services • OAI • Z39.50 • Reports • Portal framework • Archiving • Near future
Overview • Brief overview • … in order to discuss ingestion, integration, conversion and import in a specific context
Overview • Brief overview • History • Development began in 2002 • Users • 9 universities (DK+SE), several hospitals + other research institutions • Platform and architecture • J2EE enterprise application • Release management: All users have instances of same release version, same code-base • Business model • Commercial software licenses, powerful user group, shared budgets • Modular • Basic module, Reporting module, Student thesis module, External publications module, Bibliometrics module, Press module.
Retrieval • Manual data gathering • User roles/rights + workflow: • = de-centralized data gathering • = validated data gathering • = continuous data gathering • GUI example • Management focus is necessary • Reports and statistics, KPI-management, etc. • Adding value to researchers is necessary • Instantly in Google indexes, instantly updated personal websites, instantly updated CV, increased citations (source in paper), etc.
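The combination of user roles and workflow can be pictured as a small state machine in which only certain roles may move a record forward. The sketch below is illustrative only: the state names (`entered`, `for_validation`, `validated`) and role names (`researcher`, `editor`) are assumptions, not PURE's actual workflow.

```python
from dataclasses import dataclass, field

# Hypothetical workflow: (current state, acting role) -> next state.
# The slides do not name PURE's real states or roles.
TRANSITIONS = {
    ("entered", "researcher"): "for_validation",
    ("for_validation", "editor"): "validated",
}


@dataclass
class Record:
    title: str
    state: str = "entered"
    history: list = field(default_factory=list)


def advance(record: Record, role: str) -> Record:
    """Move the record one workflow step forward if the role is allowed to."""
    key = (record.state, role)
    if key not in TRANSITIONS:
        raise PermissionError(
            f"role {role!r} cannot advance a record in state {record.state!r}"
        )
    record.history.append(key)  # keep an audit trail of who moved what
    record.state = TRANSITIONS[key]
    return record


record = Record("Data retrieval in PURE")
advance(record, "researcher")  # de-centralized entry by the researcher
advance(record, "editor")      # validation by a second role
print(record.state)
```

Because every step is tied to a role, data entry can be spread across the organization while validation stays with a smaller group.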
Retrieval • Dynamic integration • Dynamic integration to local back-end systems: • Personnel systems, payroll systems (for data retrieval) • LDAPs, Active Directories (for data retrieval + authentication) • Single sign-on systems (for authentication) • … to automatically create object types such as “person” or “organization” • … and yes, PURE hosts data, too • We need complete objects according to the meta-data model • Plug-in architecture in PURE: • Pro = individually adapted integration • Con = individually programmed plug-in necessary • Future = GUI, standardized plug-ins
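A plug-in architecture of this kind could look roughly like the following sketch. PURE itself is a J2EE application, so this Python version only illustrates the idea; the class names (`SourcePlugin`, `PayrollPlugin`), the column names (`EMP_NO`, `FIRST`, `LAST`, `DEPT`) and the required fields are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Person:
    employee_id: str
    name: str
    organization: str


class SourcePlugin(ABC):
    """One plug-in per local back-end system (hypothetical interface)."""

    @abstractmethod
    def fetch_persons(self) -> Iterable[dict]:
        """Yield raw person records mapped to the common field names."""


class PayrollPlugin(SourcePlugin):
    """Example plug-in for an invented payroll export with its own column names."""

    def __init__(self, rows: List[dict]):
        self.rows = rows

    def fetch_persons(self) -> Iterable[dict]:
        for row in self.rows:
            first, last = row.get("FIRST", ""), row.get("LAST", "")
            yield {
                "employee_id": row.get("EMP_NO", ""),
                "name": f"{first} {last}".strip(),
                "organization": row.get("DEPT", ""),
            }


REQUIRED = ("employee_id", "name", "organization")


def sync(plugin: SourcePlugin) -> List[Person]:
    """Create Person objects only from records that are complete."""
    persons = []
    for raw in plugin.fetch_persons():
        if all(raw.get(key) for key in REQUIRED):
            persons.append(Person(**{key: raw[key] for key in REQUIRED}))
    return persons


rows = [
    {"EMP_NO": "1001", "FIRST": "Anna", "LAST": "Hansen", "DEPT": "Physics"},
    {"EMP_NO": "1002", "FIRST": "Bo", "LAST": "Larsen", "DEPT": ""},  # incomplete
]
persons = sync(PayrollPlugin(rows))
print(persons)
```

The trade-off named on the slide shows up directly: each back-end needs its own `fetch_persons` mapping, but the import logic in `sync` stays shared.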
Retrieval • Import • Historic data • Many sources • More or less useful data • More or less consistent use of formats :-) • The PXA format • PURE XML Archive format - .zip based • Meta-data, relations between entities, binary files • Aggregation > enrichment > conversion > import • The process is external to PURE
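To make the idea of a .zip-based archive holding meta-data, relations and binary files concrete, here is a minimal sketch. The real PXA schema is not shown in the slides, so the file layout (`metadata.xml`, `files/`) and every element and attribute name below are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_archive(publications, files):
    """Pack meta-data, relations and binary files into one zip archive.

    The layout and XML vocabulary are invented for illustration;
    they are not the actual PXA format.
    """
    root = ET.Element("archive")
    for pub in publications:
        node = ET.SubElement(root, "publication", id=pub["id"])
        ET.SubElement(node, "title").text = pub["title"]
        # Relations point at other entities by identifier.
        for author_id in pub["authors"]:
            ET.SubElement(node, "relation", type="author", target=author_id)
        if pub.get("file"):
            ET.SubElement(node, "file", href="files/" + pub["file"])

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.xml", ET.tostring(root, encoding="unicode"))
        for name, data in files.items():
            zf.writestr("files/" + name, data)
    return buffer.getvalue()


archive_bytes = build_archive(
    [{"id": "pub-1", "title": "Example title",
      "authors": ["person-1"], "file": "paper.pdf"}],
    {"paper.pdf": b"%PDF-1.4 placeholder"},
)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
    print(zf.namelist())
```

Keeping relations as identifier references inside the XML lets the importer resolve them against existing persons and organizations after the archive is unpacked.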
Retrieval • Experiments • Experiments with automated imports of historic data from specific, identified sources • [source format] > PXA conversion > import > enrichment/validation • Very poor data quality demands the concept of “draft objects” in PURE
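The "draft objects" concept can be sketched as an import step that never rejects a record but marks incomplete ones for later enrichment and validation. The required fields and status names below are assumptions chosen for illustration.

```python
# Hypothetical minimum for a valid object; PURE's real rules are not given.
REQUIRED_FIELDS = ("title", "year", "authors")


def import_record(raw: dict) -> dict:
    """Import a converted record, flagging it as a draft if fields are missing."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    return {
        "data": raw,
        "status": "draft" if missing else "valid",
        "missing": missing,  # tells the validator what still needs enrichment
    }


complete = import_record({"title": "Example title", "year": 2001, "authors": ["person-1"]})
poor = import_record({"title": "Untitled scan", "authors": []})
print(complete["status"], poor["status"], poor["missing"])
```

Importing poor records as drafts keeps them visible to the people who can fix them, instead of losing them at the conversion step.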
Exposure • Web services • RPC/encoded + document/literal • Rich libraries of methods • Including format-specific methods: APA, MLA, HARVARD, VANCOUVER and CBE • Free and near-instant adding of methods • WS code example (if time)
Exposure • OAI support • OAI-PMH data provider • OAI-PMH formats • DC • DDF-MXD (Danish national format) • SVEP (Swedish national format) • … more to come • Also used to harvest other PURE-repositories for “external publications”
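For readers unfamiliar with OAI-PMH, harvesting from a data provider comes down to HTTP requests with a `verb` parameter. The sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`); the base URL is a placeholder, and the metadata prefixes PURE uses for DDF-MXD and SVEP are not given in the slides, so they are not shown.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: it replaces the
        # other arguments when continuing a partial list.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, not a real PURE installation.
first_page = list_records_url("https://pure.example.edu/oai", from_date="2006-01-01")
next_page = list_records_url("https://pure.example.edu/oai", resumption_token="abc123")
print(first_page)
print(next_page)
```

The same request shape would serve when one PURE instance harvests another for "external publications", with the `from` date limiting each run to recent changes.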
Exposure • Z39.50 • Enabling of searches in PURE from library systems • SRW/SRU
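As an illustration of the SRU side, a library system would send a `searchRetrieve` request carrying a CQL query. This is a minimal sketch using the standard SRU 1.1 parameters; the endpoint URL and the example query are assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and query, for illustration only.
url = sru_search_url("https://pure.example.edu/sru", 'dc.title = "research"')
print(url)
```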
Exposure • Reports • PURE reporting module • GUI example
Exposure • Reference manager • Export of data to local Reference Manager installation • Using RM-formatted export file • Promotes registering to the repository rather than in RM • GUI example
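The slides do not say which file format "RM-formatted" refers to. Assuming it is the RIS tagged format that Reference Manager imports, an export step might look like the sketch below; the record fields and sample data are invented.

```python
def to_ris(record: dict) -> str:
    """Serialize one publication as a RIS record (assumed export format)."""
    lines = ["TY  - " + record.get("type", "JOUR")]  # TY must come first
    for author in record.get("authors", []):
        lines.append("AU  - " + author)              # one AU line per author
    lines.append("TI  - " + record["title"])
    if "year" in record:
        lines.append("PY  - " + str(record["year"]))
    lines.append("ER  - ")                           # ER closes the record
    return "\n".join(lines) + "\n"


ris_text = to_ris({
    "authors": ["Hansen, A."],
    "title": "Example title",
    "year": 2006,
})
print(ris_text)
```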
Exposure • Portal framework • PUREportal – free PURE-specific framework for custom development of research exhibition portals • Online example • Typical cost scenario € 20,000 • Typical delivery time 1 month • Little need for requirements specification • Automatic PURE-API maintenance
Archiving • Data archiving – 2 levels • SQL environment • Meta-data and relations • Binary files just stored in server file system • FEDORA via connector (not PURE-specific, Open Source) • Facilitates: • Higher quality archival of binary files • Long term preservation in general • Adoption of PURE in institutions’ general FEDORA strategies
Near future • The near future regarding data retrieval • More automated imports using increasingly advanced converters • Automated data delivery (push and harvest) to: • Industry-specific search services (e.g. PubMed, Nordicom) • Documentary data collections (such as clinicaltrials.org), and national collections (such as DDF (DK), ForskDok (NO), etc.) • Temporary import objects • When imported data are not of sufficient quality to create valid objects • When data cannot be properly related to other objects upon import