
Global social science data exchange – why do we need data and metadata standards?


Presentation Transcript


  1. Global social science data exchange – why do we need data and metadata standards? Louise Corti and Ken Miller UK Data Archive, University of Essex E-science, Manchester, June 2006

  2. UK Data Archive • an internationally-renowned centre of expertise in data acquisition, preservation, dissemination and promotion • curator of the largest collection of digital data in the social sciences and humanities in the UK • provides resource discovery and support for the secondary use of quantitative and qualitative data in research, learning and teaching • a lead partner of the Economic and Social Data Service (ESDS) • provides preservation services for other data organisations • facilitates international data exchange

  3. UKDA and e-science NCeSS Hub work: • contributed to early reports on grid-enabling data • membership of selection panel for NCeSS projects and nodes • metadata for social science data • identify and implement grid-enabled survey datasets • access and authentication systems for e-science • confidentiality with respect to linking data sources • text mining applications for textual data with NaCTeM • organised agenda setting workshop on simulation

  4. Preservation • UKDA currently preserves • approximately 4,600 studies • occupying about 650GB, with capacity for more than 3TB on the main system • 266,000 files, 56,000 directories (average file size 2.6MB) • growing by about 100GB per year • more than 40 years of electronic data preservation • has (so far) not lost any data!

  5. Best practice in data management and long-term archiving • data and documentation are converted to, and held in, stable formats that are as software- and hardware-independent as possible • files can be read freely and intelligibly on many platforms • allows easier conversion to a required format • simplifies migration to new portable formats • the preservation policy is revised as necessary to account for technological shifts, changes in perceived best practice and the nature of the holdings

  6. Data sharing and access UKDA provides: • ESDS online registration service - registered users access multiple data services through a one-stop Athens single sign-on • streamlined online ordering/access system with personalised web pages of requested datasets and an instant download service • flexible multi-layered access control that checks dataset access conditions against user and usage information and online agreement to special conditions • UKDA is implementing Shibboleth middleware within the ESDS registration and access system
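
The slide does not show the attribute exchange itself. Purely as a rough illustration, a Shibboleth identity provider might release a SAML attribute statement along the following lines, which the ESDS access layer could then check against a dataset's access conditions; the attribute choices and values are assumptions for illustration, not the actual ESDS or UK federation configuration:

    <saml:AttributeStatement xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
      <!-- identity released by the user's home institution (values invented) -->
      <saml:Attribute FriendlyName="eduPersonPrincipalName"
                      Name="urn:oid:1.3.6.1.4.1.5923.1.1.1.6">
        <saml:AttributeValue>jbloggs@essex.ac.uk</saml:AttributeValue>
      </saml:Attribute>
      <!-- affiliation checked against the dataset's access conditions -->
      <saml:Attribute FriendlyName="eduPersonScopedAffiliation"
                      Name="urn:oid:1.3.6.1.4.1.5923.1.1.1.9">
        <saml:AttributeValue>staff@essex.ac.uk</saml:AttributeValue>
      </saml:Attribute>
    </saml:AttributeStatement>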

  7. Data sharing and access • data catalogue – some 4600 collections • online data browsing • Nesstar - simple data analysis, visualisation, downloading and subsetting of survey and aggregate data • ESDS Qualidata online – exploring qualitative data • Beyond 20/20 – tabulating and graphing international macro databanks

  8. NESSTAR • NESSTAR is a Semantic Web application for statistical data and metadata that aims to streamline the process of finding, accessing and analysing statistical information • The NESSTAR software suite consists of a fully integrated tool set comprising Server, Publisher and WebView

  9. NESSTAR

  10. The Life Cycle Model

  11. The Life Cycle Model

  12. Value Added [diagram labels: Registry, Question Bank, Questionnaire, Dataset, Publications, Harmonisation, Comparative, Amazon, Security, variables, categories, concepts]

  13. Building the Data Web [diagram labels: User, Internet, Discover by Browsing and Searching, Explore, Analyse, Harvest and categorise using multilingual thesauri, Distributed Semantic Web (Meta)data Servers, Madiera Data Portal] (slide footer: NCeSS 2nd International Conference on e-Social Science, Workshop: A Semantic Grid for Social Science)

  14. MADIERA DEMO

  15. Qualidata Online System moves beyond catalogue searching and data download to allow: • free-text and filtered searching across multiple data collections • browsing and retrieval of textual data • increasingly, data in the system will include not only interview transcripts but also audio-visual data, links to field notes and annotations • the system draws on data in structured XML format

  16. Metadata used to display search results

  17. Metadata used to display search results

  18. Transcript with recommended XML mark-up

  19. Metadata standards in use • study description, data file description, data documentation: DDI • for data content and data annotation: the Text Encoding Initiative (TEI) • TEI - the standard for text mark-up in the humanities and social sciences • structural elements - speaker turns, transcription errors • context – micro and macro • named entities - people, organisations, places, dates, events • annotations – fieldwork or analytic • links to other data sources – audio, documents etc. • using a TEI consultant to help specify the schema
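
As a rough sketch only (the actual schema was still being specified with the TEI consultant), marked-up interview text of the kind described above might look like the following; the element choices follow common TEI conventions for spoken texts, and all names, dates and note content are invented for illustration:

    <div type="interview" xmlns="http://www.tei-c.org/ns/1.0">
      <!-- speaker turns (structural elements) -->
      <u who="#interviewer">Can you tell me about your first job?</u>
      <u who="#respondent">
        I started at <orgName>the local mill</orgName> in
        <placeName>Halifax</placeName> in <date when="1952">1952</date>,
        working for my foreman, <persName>Mr Ackroyd</persName>;
        it paid <unclear>three pounds</unclear> a week.
      </u>
      <!-- fieldwork annotation with a link to another data source -->
      <note type="fieldwork" resp="#interviewer">Respondent showed a photograph
        of the mill; see the audio recording
        <ref target="int023.wav">int023.wav</ref>.</note>
    </div>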

  20. Truly portable data formats? • important primary research data is created every day in the course of academic and policy research • data sharing policies are encouraging sharing and formalised archiving of data, BUT the ideal life cycle from data creation to re-use remains beset by obstacles • the main issues involve buying in to a dedicated analytic strategy and, typically, a particular software package – SPSS or NVivo • the UKDA has seen a number of such software packages quickly become obsolete

  21. Numeric data • the SPSS portable (.por) format in the 1980s enabled import and export between the major statistical analysis packages • proprietary translation software exists for certain types of conversion, e.g. StatTransfer and DBMSCopy for numeric data - SPSS to Stata • or rely on the built-in import and export functions of a given software package • both options are poorly documented and operate on the software's internal representations of data

  22. XML standard for data exchange/curation • a much-needed open standard for curating statistical data resulting from surveys • the UKDA wishes to research an XML standard for data exchange/curation • the standard will be defined and expressed as an XML schema • it would store ALL information in a statistical dataset, including the internal metadata (variable and code labels, missing value definitions, variable-level notes, variable formatting, etc.) • a logical extension to the DDI • DDI acts as the metadata standard for survey datasets; the new standard provides the standard for the data themselves • seeking funding for this R&D work
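
No such schema yet exists, so the fragment below is purely a hypothetical illustration of what the proposed standard would need to capture alongside a DDI description: variable and code labels, missing-value definitions, variable-level notes and the data values themselves. Every element name, variable name and value here is invented:

    <dataset>
      <variables>
        <variable name="econact" type="numeric" format="F2.0">
          <label>Economic activity last week</label>
          <categories>
            <category code="1">In paid employment</category>
            <category code="2">Unemployed and seeking work</category>
            <!-- missing-value definition carried with the variable -->
            <category code="-9" missing="true">Refused</category>
          </categories>
          <note>Derived from questions Q3a to Q3c.</note>
        </variable>
      </variables>
      <records>
        <record caseid="0001"><value var="econact">1</value></record>
        <record caseid="0002"><value var="econact">-9</value></record>
      </records>
    </dataset>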

  23. Qualitative data • in the qualitative data analysis software field there are no inter-software conversion tools • no dedicated import/export facilities exist between CAQDAS packages, e.g. NVivo, ATLAS.ti • open data exchange formats are necessary for maximising the opportunities for data sharing and long-term archiving • a standard is needed for data producers to store and publish data in multiple formats • e.g. UK Data Archive and ESDS Qualidata Online • it must meet the generic needs of varied data types

  24. Quali progress • some important progress through TEI and an Australian collaboration (here at this conference) • the ESRC SQUAD project is exploring: • universal (XML) standards and technologies • it proposes an XML community standard (schema) that will be applicable to most qualitative data • testing a TEI-conformant schema for structural mark-up of text • wish to test the proposed Australian QDIF standard • gained agreement from ALL major CAQDAS software developers • seeking money to formalise testing and development

  25. Source to Output Repositories (StORe project) • JISC-funded under the Digital Repositories Programme • examines interactions between output repositories of research publications and source repositories of primary research data • user surveys to determine required functionality in both types of repository • general principles researched for middleware development to link source and output repositories together • pilot demonstrator being developed • full and extensive evaluation of the project

  26. Metadata for social science data 40 years of metadata: • for data collections, describing the dataset in detail and at different levels, using the international standard, the Data Documentation Initiative (DDI), and Dublin Core • scoping report with TNA to assess METS and OAIS for collections held at both archives • working with TEI to build a schema for mark-up of multi-media qualitative data collections • data contributed to JORUM in UK LOM • data contributed to the JISC Registry using IESR metadata • maintain ontologies and a multilingual thesaurus for social science
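
For orientation, a minimal study-description fragment in the DDI Codebook vocabulary looks roughly like this; the study details are invented, and the fragment omits the file and variable descriptions that a full catalogue record would carry:

    <codeBook>
      <stdyDscr>
        <citation>
          <titlStmt>
            <titl>Example Household Survey, 2005</titl>
            <IDNo agency="example">0000</IDNo>
          </titlStmt>
          <rspStmt>
            <AuthEnty>Example Research Team</AuthEnty>
          </rspStmt>
        </citation>
        <stdyInfo>
          <abstract>Illustrative abstract describing the purpose, coverage
            and methods of the study.</abstract>
          <sumDscr>
            <timePrd event="single" date="2005">2005</timePrd>
            <nation>United Kingdom</nation>
          </sumDscr>
        </stdyInfo>
      </stdyDscr>
    </codeBook>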

  27. Contacts Louise Corti and Ken Miller UK Data Archive University of Essex Colchester, Essex CO4 3SQ Email: corti@essex.ac.uk millk@essex.ac.uk Tel: +44 (0)1206 872572 / 872974 / 872001 URL: www.data-archive.ac.uk
