Automated indexing of survey questionnaires and interviews

Louise Corti, UK Data Archive, University of Essex. NaCTeM, Manchester, 25 January 2008

Data collections at UKDA • social, economic and historical ‘data collections’ • approximately 5,000 data collections (= studies) • studies: primary data derived from social research methods • surveys, collated statistics, qualitative interviewing, fieldwork and observation

Resource discovery at UKDA • how do our users currently locate data? • at the highest level, a study level metadata record is compiled for each study following the DDI XML-based metadata schema • free text fields plus controlled vocabulary • DDI - all the national social science data archives in the world use this standard, so harvesting across collections is possible

Resource discovery at UKDA • the simplest form of resource discovery is a free-text search, or a search on some key fields from the catalogue records, e.g. ‘health’

Free text search

Use of key words • UKDA manually assign key words to (index) data for resource discovery purposes: • surveys: study description and question level • qualitative data: study level, methods • key words used thus describe the methodology but not the research data per se

Key word search Key words

Key words

Key words are manually assigned at the survey question level but captured in a database at the study level. This is very laborious, as there can be hundreds per study! BUT key words are NOT linked to questions, i.e. at UKDA there is NO correspondence between the two in the current metadata schema.

Key word searches can be refined by the user making use of the UKDA thesaurus of terms. Select a term and a new search is run.

UKDA Thesaurus • HASSET (Humanities and Social Science Electronic Thesaurus) is a subject thesaurus which has been developed by the UKDA over the past 20 years • initially based on the UNESCO thesaurus, it has been continuously expanded and updated for use in the UKDA’s online catalogue • display of the hierarchical relationships of terms can help users to broaden a search or make it more specific; cross-referencing to synonyms suggests alternative search terms, as does the provision of links to other conceptually related terms • it employs the conventional range of term relationships: equivalence (preferred and non-preferred terms), hierarchical relationships (broader and narrower terms) and associative relationships (related terms) • it is stored in SQL tables, and a multilingual version, ELSST, has been developed for the EC
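The three relationship types on this slide can be sketched as a single SQL table. This is only an illustration: the table layout, the `USE`/`BT`/`RT` codes and the demo rows are assumptions, not HASSET's actual tables or entries.

```python
import sqlite3

# Illustrative rows only: these are NOT real HASSET entries or table names.
DEMO_ROWS = [
    ("LONG-STANDING ILLNESS", "USE", "LONG-TERM ILLNESS"),  # non-preferred -> preferred
    ("LONG-TERM ILLNESS", "BT", "HEALTH"),                  # term -> broader term
    ("DISABILITIES", "BT", "HEALTH"),
    ("LONG-TERM ILLNESS", "RT", "DISABILITIES"),            # related terms
]


def build_thesaurus(rows):
    """Load (term, kind, target) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE relation (term TEXT, kind TEXT, target TEXT)")
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", rows)
    return conn


def _column(conn, sql, label):
    return [row[0] for row in conn.execute(sql, (label,))]


def preferred(conn, label):
    """Map a non-preferred term to its preferred term (equivalence)."""
    hits = _column(conn, "SELECT target FROM relation WHERE kind='USE' AND term=?", label)
    return hits[0] if hits else label


def broader(conn, label):
    """Terms one level up: used to broaden a search."""
    return _column(conn, "SELECT target FROM relation WHERE kind='BT' AND term=? ORDER BY target", label)


def narrower(conn, label):
    """Terms one level down: used to make a search more specific."""
    return _column(conn, "SELECT term FROM relation WHERE kind='BT' AND target=? ORDER BY term", label)


def related(conn, label):
    """Associative links, treated as symmetric."""
    sql = ("SELECT target FROM relation WHERE kind='RT' AND term=:t "
           "UNION SELECT term FROM relation WHERE kind='RT' AND target=:t ORDER BY 1")
    return [row[0] for row in conn.execute(sql, {"t": label})]


conn = build_thesaurus(DEMO_ROWS)
```

A search front end could call `preferred()` on the user's term first, then offer `broader()`, `narrower()` and `related()` as links that trigger a new search.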

Metadata • DDI allows for: • study level description • methodology and data description, authors, rights management and access etc. • file level • description of individual files, e.g. SPSS files, work file, audio file (but not currently used in-house) • question (variable) level • question description, text, variable names and values, groups PLUS key words (again not used)

Variable search

But key words are NOT linked to the variable!

Indexing survey questions • survey questions • Do you suffer from any long-standing limiting illness? • Key words assigned: long-term illness • Government survey questions are often standardised to provide comparability across surveys • Have large databases of individual questions

Semi-automated solutions • UKDA indexing... there must be an easier way! • first a database of questions linked to key terms (controlled vocab) must be built to test any automated assignment • methodology and coder reliability should also be investigated: • in-house guidelines are in place but assignment is still subjective • no stringent quality control on key word assignment
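One way such a test could look, as a minimal sketch: given a small database of already-indexed questions, suggest key words for a new question by word overlap (Jaccard similarity). The stopword list, the threshold, the second indexed question and the function names are all assumptions for illustration; this is not an existing UKDA procedure.

```python
import re

# Hypothetical stopword list, just large enough for the demo questions.
STOPWORDS = {"do", "you", "from", "any", "a", "the", "of", "have", "or",
             "in", "your", "how", "is", "are"}

# Already-indexed questions linked to key terms. The first pair comes from
# the slide above; the second is invented for the example.
INDEXED = [
    ("Do you suffer from any long-standing limiting illness?", ["long-term illness"]),
    ("How many hours do you usually work per week?", ["working hours"]),
]


def tokens(text):
    """Lower-case word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_terms(question, indexed, threshold=0.3):
    """Return key terms of similar indexed questions, best match first."""
    q = tokens(question)
    scores = {}
    for text, terms in indexed:
        score = jaccard(q, tokens(text))
        if score >= threshold:
            for term in terms:
                scores[term] = max(scores.get(term, 0.0), score)
    return sorted(scores, key=lambda t: (-scores[t], t))


suggestions = suggest_terms("Do you have any long-standing illness or disability?", INDEXED)
```

The suggestions are only candidates: a human indexer would still accept, edit or reject each one, which also gives a way to measure coder agreement against the automated output.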

Key words for qualitative data • a different challenge • indexing is done at study level and is largely conceptual • work needs to be done on how researchers assign key words to data and how they search for qualitative data • analysis of data processing methods in house • analysis of UKDA search logs: what terms do users enter? • can utilise named entity recognition, term extraction and document summarisation tools on these kinds of data (e.g. an unstructured transcribed interview)

What about using NaCTeM tools? • given that we can provide databases of terms linked to data (study, file and parts) • could test NaCTeM tools on these data to assign terms or concepts, or to summarise text • nice front-end processing tools are essential • processors must have the option to agree or edit any terms • terms should be output to DDI XML metadata at the study, file and variable level
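The last two bullets could be sketched as follows: suggested terms pass through a review step, and only the confirmed ones are written to variable-level XML. The element names (`var`, `qstn`, `qstnLit`, `concept` with a `vocab` attribute) follow my reading of DDI Codebook 2.x and should be checked against the schema; the `review` callback and the variable name `v1` are hypothetical.

```python
import xml.etree.ElementTree as ET


def variable_with_terms(name, question, suggested, review):
    """Build a variable-level element holding only reviewed key terms.

    `review` is a callable standing in for the human processor: it returns
    the term (possibly edited) to accept it, or None to reject it.
    """
    var = ET.Element("var", {"name": name})
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question
    for term in suggested:
        final = review(term)
        if final:
            ET.SubElement(var, "concept", {"vocab": "HASSET"}).text = final
    return var


def demo_review(term):
    """Stand-in for a processor: fixes one typo, rejects an over-broad term."""
    decisions = {"long-tem illness": "long-term illness", "health": None}
    return decisions.get(term, term)


var = variable_with_terms(
    "v1",
    "Do you suffer from any long-standing limiting illness?",
    ["long-tem illness", "health"],
    demo_review,
)
xml_text = ET.tostring(var, encoding="unicode")
```

Keeping the review step as a separate callable means the same output code could sit behind a GUI, a batch approval list or a fully manual workflow.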

Structural and content mark-up of textual interview data • for spoken interview texts, useful encoding features are: • utterance, specific turn taker, defining idiosyncrasies in transcription • links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher memos, maps, images, URLs etc.) • identifying information such as real names, company names, place names, occupations, temporal information

A sample interview
ID: 001 Sex: M YOB: 1921 Place: Oldham Finalocc: Postman
U id='1' who='interviewer' Right, it starts with your grandparents. So give me the names and dates of birth of both. Do you remember those sets of grandparents?
U id='2' who='subject' Yes.
U id='3' who='interviewer' Well, we'll start with your mum's parents? Where did they live?
U id='4' who='subject' They lived in Widness, Lancashire.
U id='5' who='interviewer' How do you remember them?
U id='6' who='subject' When we Mum used to take me to see them and me Grandma came to live with us in the end, didn't she?
U id='7' who='Welham' Welham: Yes, when Granddad died - '48.
U id='8' who='interviewer' So he died when he was 48?
U id='9' who='Welham' Welham: No, he was 52. He died in 1948.
U id='10' who='interviewer' But I remember it. How old would I be then?
U id='11' who='Welham' Welham: Oh, you would have been little then.
U id='12' who='subject' I remember him, he used to have whiskers. He used to put me on his knee and give me a kiss.
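To show why utterance-level mark-up is useful, here is a minimal sketch that splits a transcript marked up like the sample into speaker turns. It assumes the flattened `U id='…' who='…'` markers exactly as they appear on this slide; the `Turn` class and `parse_turns` function are illustrative names, not part of any UKDA tool.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    uid: int    # utterance id
    who: str    # turn taker
    text: str   # what was said


# One marker, then lazily everything up to the next marker or end of text.
TURN_RE = re.compile(r"U id='(\d+)' who='([^']+)'\s*(.*?)(?=\s*U id='|\Z)", re.S)


def parse_turns(transcript):
    """Return the utterances in document order."""
    return [Turn(int(uid), who, text.strip())
            for uid, who, text in TURN_RE.findall(transcript)]


SAMPLE = (
    "U id='3' who='interviewer' Well, we'll start with your mum's parents? "
    "Where did they live? "
    "U id='4' who='subject' They lived in Widness, Lancashire. "
    "U id='5' who='interviewer' How do you remember them?"
)

turns = parse_turns(SAMPLE)
subject_turns = [t.text for t in turns if t.who == "subject"]
```

Once turns are structured like this, searches and annotation can be restricted to one speaker, for example indexing only what the subject said rather than the interviewer's prompts.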

ESRC SQUAD project • developed and tested universal standards and technologies • long-term digital archiving • publishing • data exchange • investigated user-friendly tools for semi-automating processes already used to prepare qualitative data and materials • formatted text documents ready for output • mark-up of structural features of textual data • annotation and anonymisation tool • automated coding/indexing linked to a domain ontology

Identifying elements • Identify atomic elements of information in text • Person names • Company/Organisation names • Occupations • Locations • Dates and times • Example: • Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Anderson
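A toy rule-based tagger for the example sentence, to make the element types concrete. The patterns and word lists below are hand-written for this one sentence and are purely illustrative; real tools rely on much larger gazetteers, rules tuned to the domain, or trained models.

```python
import re

# Hypothetical rules written only for the slide's example sentence.
RULES = [
    ("ORGANISATION", r"Music Masters of Milan, Inc|Arthur Anderson"),
    ("PERSON", r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+"),
    ("OCCUPATION", r"\b(?:vice-president|operations director)\b"),
    ("LOCATION", r"\b(?:Italy|Milan)\b"),
    ("DATE", r"\blast (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b"),
]


def find_entities(text):
    """Return non-overlapping (start, end, label, surface) tuples in text order."""
    candidates = []
    for label, pattern in RULES:
        for m in re.finditer(pattern, text):
            candidates.append((m.start(), m.end(), label, m.group()))
    # Prefer longer matches, so an organisation name wins over a place
    # name nested inside it ("Milan" in "Music Masters of Milan, Inc").
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for cand in candidates:
        if all(cand[1] <= kept[0] or cand[0] >= kept[1] for kept in chosen):
            chosen.append(cand)
    return sorted(chosen)


SENTENCE = (
    "Italy's business world was rocked by the announcement last Thursday that "
    "Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc "
    "to become operations director of Arthur Anderson"
)

entities = find_entities(SENTENCE)
```

The longest-match rule is one simple way to resolve nested candidates; it is a design choice for this sketch, not a claim about how any particular NLP tool behaves.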

Testing NLP tools • UKDA have investigated some basic NLP tools to identify named entities with a nice GUI • part of ESRC SQUAD award • rules can be written but obviously geared to domain specificity. Individual interviews can cover almost any subject! • system tuned to a sample of routine interview data…not jargon-laden

XML schema - TEI • main aim = to tag data with key XML elements • work on an XML schema has specified a ‘reduced’ set of Text Encoding Initiative (TEI) elements: • core tag set for transcription • names, numbers, dates <persname> • links and cross references <ref> • text structure <body> • unique to spoken texts <kinesic> • contextual information (participants, setting, text) • new XML schema, called QuDEx, developed under JISC funding (DEXT) to describe annotation, linking, segmentation and alignment of qualitative data (www.data-archive.ac.uk/dext)

Transcript with manual XML mark-up

Automated XML mark-up: input data file for NLP tools

Data processed through Edinburgh LT-XML and CME tools. The main Graphical User Interface (GUI) invokes the SQUADCoder in NXT.

NXT tool. Locate the NXT metadata file, which must be set up with named entity types. The NXT generic window: running the SQUADCoder.

The SQUADCoder Window All the references to a particular entity The Named Entity Hierarchy Transcription view

Annotation tool - anonymise The Coreference Action Panel

Annotation tool Enter pseudonym

Anonymised data The Anonymised Transcription View

Annotated data in the NXT: what formats and how stored? • NXT uses ‘stand-off’ annotation: the annotation is linked to, or references, individual words • uses the NITE NXT XML model • creates a new anonymised version of the text • saves the original file • saves a matrix of references (names to pseudonyms) • outputs annotations, who worked on the file etc.
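The idea of stand-off annotation plus a names-to-pseudonyms matrix can be sketched in a few lines. This is a simplified illustration only: it does not reproduce the NITE NXT XML model, and the `Annotation` class, entity ids and pseudonyms are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int        # index of the first token covered
    end: int          # index one past the last token covered
    entity_type: str  # e.g. "location", "person"
    entity_id: str    # shared by all mentions of the same entity


def anonymise(tokens, annotations, pseudonyms):
    """Return a new token list with annotated spans replaced by pseudonyms.

    The original tokens are left untouched, so the original file and the
    anonymised version can both be kept.
    """
    out = list(tokens)
    # Work right to left so earlier token indices stay valid after a replacement.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if ann.entity_id in pseudonyms:
            out[ann.start:ann.end] = [pseudonyms[ann.entity_id]]
    return out


TOKENS = ["They", "lived", "in", "Widness", ",", "Lancashire", "."]

# The annotations live apart from the text and point at token positions.
ANNOTATIONS = [
    Annotation(3, 4, "location", "loc1"),
    Annotation(5, 6, "location", "loc2"),
]

# The "matrix of references": entity id -> pseudonym (made-up values).
PSEUDONYMS = {"loc1": "Northtown", "loc2": "Northshire"}

anonymised = anonymise(TOKENS, ANNOTATIONS, PSEUDONYMS)
```

Because every mention of an entity shares one `entity_id`, a single entry in the pseudonym matrix replaces all coreferent mentions consistently.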

Next steps • these are all demo tools; none has been taken any further, as project funding ended • I would like collaboration on annotation of data through semantic tagging and document mark-up • automatic term recognition and XML element tagging • automatic document classification (indexing) • automatic summarisation of text (document reduction) • possibly detecting structural relationships • can NaCTeM (ASSERT) tools be used to undertake: • key word assignment for survey questions and structured catalogue records • term extraction, summarisation and mark-up of spoken interview data • coreferencing in interviews?

Louise Corti UK Data Archive 01206 872145 corti@essex.ac.uk
