
Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers



Presentation Transcript


  1. Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers • Robert Gaizauskas (1), Patrick Herring (1), Michael Oakes (1), Micheline Beaulieu (2), Peter Willett (2), Helene Fowkes (2), and Anna Jonsson (2) • (1) Department of Computer Science, (2) Department of Information Studies, University of Sheffield

  2. Outline of Talk • Is Information Extraction Technology Useful? • Barriers to Deployment • Information Seeking in Large Enterprises • The TRESTLE System • System Overview • NEAT: Named Entity Access to Text • SCAT: Scenario Access to Text • Preliminary User Evaluation • Evaluation Methodology • Access Strategies • User Perceptions • Conclusions and Discussion HLT01, San Diego

  3. Is Information Extraction Technology Useful? • Information Extraction (IE) technology has led to impressive new abilities to extract structured information from texts • Named entity recognition • Template Element/Relation filling • Scenario Template filling • IE complements traditional Information Retrieval (IR) capabilities • However, unlike IR, IE has not found its way into widely used end-user systems, such as • Web search engines • Document indexing systems • Why not?

  4. Barriers to Deployment • Porting Cost • Moving to new domains requires considerable time + expertise • to create/modify domain-specific resources + rule bases • to annotate texts for supervised machine learning approaches • Sensitivity to inaccuracies in extracted data • MUC-7 results – F-measure scores 50-92% depending on task • Thus, IE only appropriate for applications where some error is tolerable/readily detectable by end users • Note: formal IR evaluation results comparable, but application contexts make error less significant • Complexity of integration into end-user systems • IE systems' outputs must be incorporated into larger application systems, if end users are to benefit from them
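
For readers unfamiliar with the F-measure quoted on the slide above, the standard definition combines precision P and recall R as shown below. The slide does not state which weighting was used for the 50-92% figures, so the balanced case (β = 1) is shown only as the most common choice, and the numbers in the worked line are purely illustrative, not MUC-7 results.

```latex
F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\,P\,R}{P + R}

% Illustrative values only: P = 0.90, R = 0.80
F_1 = \frac{2 \times 0.90 \times 0.80}{0.90 + 0.80} = \frac{1.44}{1.70} \approx 0.847
```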

  5. IE and Information Seeking in Large Enterprises • To investigate the utility of IE in a real setting, we have developed an advanced text access facility to support information workers at GlaxoSmithKline • TRESTLE – Text Retrieval Extraction and Summarisation Technology for Large Enterprises • Aim: increase effectiveness of employees in “industry watch” function – current awareness/tracking of • People • Companies • Products – particularly progress of new drugs through clinical trial/regulatory approval process • Approach: provide enhanced access to Scrip, the largest circulation pharmaceutical industry newsletter

  6. IE and Information Seeking in Large Enterprises • User requirements study at GSK (questionnaire, observation, interviews) revealed 2 key types of information seeking: • Current awareness • general updating (what's happened in the industry today/this week) • entity or event-based tracking (e.g. what's happened concerning a specific drug or what regulatory decisions have been made) • Retrospective search • historical tracking of entities or events of interest (e.g. where has a specific person been reported before, what is the clinical trial history of a particular drug) • search for a specific event or a remembered context in which a specific entity played a role • Note: both activities require identification of entities/events in the news, which is what IE systems do

  7. TRESTLE System Overview • The system consists of two components • Off-line component • LaSIE IE system • Input: Scrip texts delivered daily via the Internet • Output: IE results • Named entities: MUC-7 categories + drugs + diseases • Scenario templates: Person Tracking; Clinical Trials; Regulatory Announcements • Summary Writer • Input: Scenario templates • Output: Single sentence NL summaries of the templates • Entity/Scenario Indexer • Input: NE annotated texts; Scenario templates • Output: Indices keyed by NE + date with pointers to source texts
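
The slide describes the Entity/Scenario Indexer only by its input and output: indices keyed by NE + date, pointing back to source texts. A minimal sketch of such an index follows. Everything in it is an assumption made for illustration (the `Mention` record, the function names, case-insensitive keys, string document ids); the slides do not describe TRESTLE's actual data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    """One named-entity occurrence in one article (hypothetical record)."""
    doc_id: str      # pointer back to the source text
    published: date  # publication date of the article
    category: str    # e.g. "person", "company", "drug"
    text: str        # surface string of the entity


def build_entity_index(mentions):
    """Map (category, lower-cased entity) -> date -> set of doc ids."""
    index = defaultdict(lambda: defaultdict(set))
    for m in mentions:
        index[(m.category, m.text.lower())][m.published].add(m.doc_id)
    return index


def lookup(index, category, text, start, end):
    """Return doc ids mentioning the entity between start and end inclusive,
    ordered by date and then by doc id."""
    by_date = index.get((category, text.lower()), {})
    hits = []
    for day in sorted(by_date):
        if start <= day <= end:
            hits.extend(sorted(by_date[day]))
    return hits
```

Keying on the entity first and the date second keeps both access patterns from the requirements study cheap: current awareness is a lookup over a short recent date range, and retrospective search is the same lookup over the full archive.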

  8. TRESTLE System Overview (cont) • On-line component • Browser scripts • Input: User requests for information • Output: Results to requests returned from annotated Scrip DB • Entity/Scenario Index Search + Dynamic Page Generator • Input: User information requests forwarded from Web server + entity/scenario indices + NE annotated texts/summaries • Output: Relevant HTML pages with dynamically generated link information

  9. TRESTLE System Architecture • [Architecture diagram. Labelled components: User; Info Seeking; Web Browser; Internet; Web Server; Index Search + Dynamic Page Creator; Off-Line System; Scrip; LaSIE System; NE Tagged Texts; Scenario Templates; Indexer; Entity/Scenario Indices; Summary Writer; Scenario Summaries]

  10. TRESTLE Interface Overview • TRESTLE browser-based interface allows 4 routes to access texts: • by headline • by named entity (NEAT: Named Entity Access to Text) • by scenario summary (SCAT: Scenario Access to Text) • by free text search • For first 3 routes date range of accessed articles may be set to • current day • previous day • last week • last four weeks • full archive
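
The five date range settings listed above can be resolved to concrete start and end dates before the index is searched. The sketch below is one possible reading, not TRESTLE's implementation: it assumes "last week" and "last four weeks" mean trailing windows of 7 and 28 days that include today, and the function name and the `archive_start` parameter are hypothetical.

```python
from datetime import date, timedelta


def resolve_date_range(option, today, archive_start):
    """Translate a date range setting into an inclusive (start, end) pair.

    Assumption: week-based options are trailing windows ending today.
    """
    one_day = timedelta(days=1)
    if option == "current day":
        return today, today
    if option == "previous day":
        return today - one_day, today - one_day
    if option == "last week":
        return today - timedelta(days=6), today
    if option == "last four weeks":
        return today - timedelta(days=27), today
    if option == "full archive":
        return archive_start, today
    raise ValueError(f"unknown date range option: {option!r}")
```

If the newsletter's publication calendar mattered (for example, no weekend issues), "previous day" would more sensibly mean the previous publication day; the slides do not say which was intended.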

  11. TRESTLE Interface: Underlying Design • [Layout diagram: Head Frame, Access Frame, Index Frame, Text Frame] • Head Frame • User state • Date range selection • Access Frame • Choose access mode • NE/Scenario/free text search • Index Frame • Headline list, or • NE + headline list, or • Summary list • Text Frame • Full text of source text • embedded NE hyperlinks
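
The text frame shows the full source text with embedded NE hyperlinks. A minimal sketch of how NE annotations might be turned into such links is given below. The span format `(start, end, category)`, the `/neat` URL scheme, and the CSS class names are all hypothetical, and the sketch assumes the spans do not overlap; the slides do not describe how TRESTLE generates its pages.

```python
from html import escape
from urllib.parse import quote


def link_entities(text, spans):
    """Wrap each annotated span in an HTML link to an entity lookup.

    spans: iterable of (start, end, category) character offsets into text,
    assumed non-overlapping.
    """
    out = []
    pos = 0
    for start, end, category in sorted(spans):
        # Plain text between the previous entity and this one.
        out.append(escape(text[pos:start]))
        surface = text[start:end]
        # Hypothetical lookup URL; the class name allows per-category colours.
        href = f"/neat?cat={quote(category)}&q={quote(surface)}"
        out.append(
            f'<a class="ne-{escape(category)}" href="{escape(href)}">'
            f"{escape(surface)}</a>"
        )
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)
```

For a sentence such as "Mr Garcia joins Acme Pharma." (an invented example), passing the spans `(3, 9, "person")` and `(16, 27, "company")` yields two links, each leading to an index lookup for that entity, so a reader can move from the current article to earlier ones that mention the same entity.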

  12. NEAT: Named Entity Access to Text

  13. SCAT: Scenario Access To Text

  14. Preliminary User Evaluation: Methodology • Prelude to full end-user study: preliminary study with 8 Information Studies postgrad students • Aim: to gain insight into • ease of use and learnability of the system • preferred strategies for accessing text • problems in interpreting the interface • Instruments: usability questionnaire, verbal protocols, observational notes • Procedure: • brief verbal introduction to evaluation and system • undirected exploration of system, asking questions/providing comments • simulated real end-user tasks, e.g.: “You've heard that one of your colleagues, Mr Garcia, has recently accepted an appointment at another pharmaceutical company. You want to find out which company he will be moving to and what post he has taken up.”

  15. Preliminary User Evaluation: Access Strategies • NEAT: access to named entities was made available in three ways: • (1) by clicking directly on a list of NE categories in the access frame • (2) through the NE index look-up query box in the access frame • (3) through highlighted entries in a full article displayed in the text frame • Observation: users preferred (2) over (1) or (3), regardless of task • perhaps because users knew what they were looking for • perhaps because a query box is more familiar than browsing NEs • perhaps because of prominence of NE look-up box in interface • SCAT: Observation: for tasks where SCAT was appropriate users opted for NE index look-up • perhaps because of novelty of scenario tracking • perhaps because SCAT functionality not clear from interface

  16. Preliminary User Evaluation: User Perceptions • Colour coding + hyper-linking of NEs • Highly noticeable; some objections to colour choice • Disagreement about utility – distracting when reading full texts, but highly useful in leading to related previous Scrip articles • Integration of current awareness + retrospective searching via NEs highly appreciated • NE index look-up • Found very useful by all but one participant • Some confusion over scope – differences with respect to free-text search/only 5 searchable NE categories • Exact string matching limiting (limitation now removed) • Scenario Tracking • Function misunderstood from labelling in access frame • Confusion between SCAT summaries and headlines • Flag icons for summaries in headline lists not well understood

  17. Conclusions (I) • To date IE largely a “technology push” activity • For IE technology to become usable and influenced by end user requirements (“user pull”), end user prototypes must be built which: • exploit the significant achievement of the technology to date • acknowledge its limitations • TRESTLE attempts to do this by exploiting NE and scenario template IE technology to offer users • novel ways to access textual information • via a familiar text browsing interface

  18. Conclusions (II) • Preliminary user evaluation has revealed: • search options initially selected from the access frame were not always optimal for set tasks • on the whole colour-coded textual/iconic cues in headline index + full text enabled users to exploit the different functions seamlessly • interface supported interaction at procedural level, but some misunderstanding at the conceptual level – esp. scenario access • other studies report similar issues in introducing more complex interactive search functions • further investigation + modifications (e.g. to labelling) underway • Full evaluation in real end user environment now being organised • To answer question: can professional information workers use IE-based searching and awareness approaches effectively?

  19. The End
