Exploring NLP-Driven Solutions for Marking Up Qualitative Data in Social Science Research

SMART QUALITATIVE DATA: METHODS AND COMMUNITY TOOLS FOR DATA MARK-UP

WHAT IS SQUAD?

Main aim: to explore methodological and technical solutions for ‘exposing’ digital qualitative data to make them fully shareable and exploitable.

Main objectives:
• specify, test and propose an eXtensible Markup Language (XML) schema for storing and marking up qualitative data
• investigate requirements for contextualising qualitative data and developing standards for data documentation
• develop semi-automated preparation of marked-up qualitative data for sharing, using natural language processing tools
• research tools for publishing and interrogating data via the web – Qualitative Data Mark-Up Tools (QDMT)

Collaboration between:
• UK Data Archive, University of Essex (lead partner)
• Language Technology Group, Human Communication Research Centre, School of Informatics, University of Edinburgh

UK DATA ARCHIVE-NLP COLLABORATION

• new partnerships created, with new methods, tools and jargon to learn
• new area of application for NLP: social science data
• growing interest in the UK in applying NLP and text mining to social science texts, both data and research outputs such as publications’ abstracts

USING NLP TOOLS

Information Extraction (IE) is a sub-field of NLP which aims to identify key pieces of information in texts using 'shallow' analysis techniques. A typical IE system will perform Named Entity Recognition, in which particular kinds of proper names and terms are identified, classified and marked up.

ESDS Qualidata is applying semi-automated mark-up to some components of its data collections using natural language processing (NLP) and information extraction. This is a means of annotating documents with semantic metadata, enabling resource discovery and data exploration. The Java interface tool developed in SQUAD is called CME.

WHAT FEATURES DO WE NEED TO MARK-UP AND WHY?

Spoken interview texts provide the clearest and most common example of the types of encoding features needed.
There are three basic groups of structural features:
• utterance, specific turn taker, defining idiosyncrasies in transcription
• links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
• identifying information such as real names, company names, place names, occupations, temporal information

Information extraction identifies atomic elements of information in text:
• personal names
• company/organisation names
• locations
• dates
• times
• percentages
• occupations
• monetary amounts

Example: Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Andersen.

ANNOTATION TOOL - ANONYMISE

This tool imports marked-up data from the CME NLP system. Named entities are highlighted and co-reference chains (e.g. numerous references to a single person) are identified.

SQUAD project duration: 18 months, 1 March 2005 – 31 October 2006.

METADATA STANDARDS

The XML schema will specify a ‘reduced’ set of Text Encoding Initiative (TEI) elements:
• core tag set for transcription
• names, numbers, dates <persName>
• links and cross references <ref>
• notes and annotations <note>
• text structure <body>
• unique to spoken texts <kinesic>
• linking, segmentation and alignment <link>
• advanced pointing - XPointer framework
• text and AV synchronisation
• contextual information (participants, setting, text)

CAPTURING AND DEFINING DATA CONTEXT

Rich context enables informed re-use of data. But defining how to provide context for raw data to make it more ‘usable’ is complex. ESDS Qualidata has spent ten years working in the area of sharing qualitative data, and has done much to establish informal ways of documenting raw data.

Both micro- and macro-level features should be considered, including how the research question was framed, the research application process, project progress, fieldwork situations and analysis processes. Fieldwork observations are useful, as are timelines and political chronologies. Equally, when undertaking a replication or restudy, detailed information on sampling procedures, fieldwork approaches and question guides will be essential.

SQUAD has identified a minimal generic set of elements that represent a baseline for contextualising data. QUADS has produced an edited collection on this issue as a special edition of the journal Methodological Innovations Online: sirius.soc.plymouth.ac.uk/~andyp/.

In the annotation tool, names can be anonymised with chosen pseudonyms. The mapping from names to pseudonyms is saved. Annotations are explored in an XML format in the NITE NXT model. NXT uses ‘stand off’ annotation, where annotation is linked to or referenced by words.

DATA EXCHANGE STANDARDS

A uniform format for richly encoding qualitative research is necessary as it:
• enables preservation and re-use of metadata, data and annotation
• ensures consistency of presentation and description of data
• supports the development of common web-based publishing and search tools
• facilitates data interchange (e.g. CAQDAS packages) and comparison among datasets

Progress:
• limited formal definition of a common XML vocabulary and Document Type Definition (DTD) based on the Text Encoding Initiative (TEI)
• testing of a new Qualitative Data Interchange Format (QDIF)

AUDIOVISUAL ARCHIVING

Archiving and exposure of qualitative data in a way that faithfully represents its origins and context is important. Linking qualitative data to other distributed data sources, such as audio-visual or geo-coded sources (for example maps), can afford creative and exciting ways of visualising data.

XML: enabling web-enabled display, search and browse

The formalised and systematic archiving and sharing of digital audio-visual data from qualitative research is fairly new. SQUAD is helping to explore XML representation and display of audio-visual data.

XML: enabling a standardised format for interview transcripts

Information about interviewee
Date of birth: 1930
Gender: female
Marital status: married
Occupation: pharmacy assistant
Geographic region: Scotland

LP: There's just one or two factual things first of all do you mind my asking how old you are?
G24: 49.
LP: And what schools did you go to?
G24: King Street, Woodside and Hilton.
LP: Uh-huh .. and how old were you when you left the school?
G24: 14.
LP: And you work at the moment? What sort of work do you do?
G24: Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at ARI ... just pharmacy assistant. At least it was better than cleanin'! But then they've nae part-time workers there so..
LP: And did you work in the pharmacy long?

Interview text with XML tags embedded:
• There's just one or two factual things first of all do you mind my asking how old you are?
• 49.
• And what schools did you go to?
• <orgName>King Street</orgName>, <orgName>Woodside</orgName> and <orgName>Hilton</orgName>.
• Uh-huh .. and how old were you when you left the school?
• 14.
• And you work at the moment? What sort of work do you do?
• Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at <orgName>ARI</orgName> ... just <seg type="occupation">pharmacy assistant</seg>
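
As a minimal sketch of how the plain speaker-turn transcript and the embedded-tag version shown earlier relate, the Python below turns LP:/G24: lines into one element per turn and then swaps tagged names for pseudonyms while keeping the name-to-pseudonym table. It uses only Python's standard library, not CME or NXT. The <u who="..."> element follows the TEI convention for utterances but is an assumption about the SQUAD schema, and the function names and the pseudonyms "School A" and "School C" are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Speaker label (e.g. "LP", "G24"), a colon, then the utterance text.
TURN = re.compile(r"^([A-Z][A-Z0-9]*):\s*(.*)$")


def transcript_to_xml(lines):
    """Build a <body> element holding one <u who="..."> per speaker turn."""
    body = ET.Element("body")
    for line in lines:
        match = TURN.match(line.strip())
        if match is None:
            continue  # this sketch simply skips lines without a speaker label
        turn = ET.SubElement(body, "u", who=match.group(1))
        turn.text = match.group(2)
    return body


def pseudonymise(root, tag, pseudonyms):
    """Swap the text of each <tag> element for its pseudonym, in place.

    Returns the name -> pseudonym pairs that were applied, so the
    mapping can be saved alongside the anonymised file.
    """
    applied = {}
    for element in root.iter(tag):
        name = element.text
        if name in pseudonyms:
            element.text = pseudonyms[name]
            applied[name] = pseudonyms[name]
    return applied


# Two turns from the interview excerpt.
turns = [
    "LP: And what schools did you go to?",
    "G24: King Street, Woodside and Hilton.",
]
body = transcript_to_xml(turns)
print(ET.tostring(body, encoding="unicode"))

# The same answer with names already tagged, as in the marked-up excerpt.
tagged = ET.fromstring(
    '<u who="G24"><orgName>King Street</orgName>, '
    "<orgName>Woodside</orgName> and <orgName>Hilton</orgName>.</u>"
)
# Hypothetical pseudonyms chosen by the researcher.
mapping = pseudonymise(
    tagged, "orgName", {"King Street": "School A", "Hilton": "School C"}
)
print(ET.tostring(tagged, encoding="unicode"))
print(mapping)
```

Names without an entry in the table ("Woodside" here) are left untouched, so a reviewer can see at a glance which entities still need a pseudonym. Because only the element text changes, the surrounding words and punctuation of the utterance are preserved.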

TOOLS PROGRESS

• defined header metadata for a standardised transcript
• defined and tested generic XML models for qualitative data
• tested and refined NLP tools for qualitative data
• built front end to NLP named entity tools
• chosen software to enable annotation of data
• explored data export formats for longer-term archiving
• investigated powerful XML-based indexing tools for searching and retrieving data
• investigated web display of multimedia data and pointers to other resources using XML, extending the functionality of ESDS Qualidata

From Autumn 2006:
• formalising data exchange standard
• key word extraction systems to help conceptually index qualitative data (text mining collaboration)
• exploring grid-enabling data (e-social science collaboration)

CONTACT

Louise Corti and Claire Grover
UK Data Archive
University of Essex
Colchester, Essex CO4 3SQ
Email: quads@esds.ac.uk
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad
