Using semantic components to represent and search domain-specific documents: An evaluation of indexing accuracy and consistency. Marianne Lykke Royal School of Library and Information Science Susan L. Price and Lois M. L. Delcambre Portland State University. ISKO 2010 Conference

Presentation Transcript


  1. Using semantic components to represent and search domain-specific documents: An evaluation of indexing accuracy and consistency Marianne Lykke Royal School of Library and Information Science Susan L. Price and Lois M. L. Delcambre Portland State University ISKO 2010 Conference Sapienza University of Rome, Faculty of Philosophy February 23 - 26, 2010 Marianne Lykke

  2. Agenda • Problem and motivation • Semantic component model • Research questions • Test design • Results • Conclusions

  3. Problem and motivation • Challenges for information retrieval in domain-specific digital libraries: • Domain-specific libraries often contain large sets of similar documents about few topics • Important to be able to distinguish between topically similar documents • Domain experts often have specific information needs targeting a single “right answer”, specified by domain-specific facets • Important to be able to limit search to domain-specific dimensions • (e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)

  4. Problem and motivation • Little time for information retrieval • Important that relevant documents are ranked highly and retrieved by the first query • Distributed indexing, carried out by indexers with varying degrees of indexing competence • Important to address classical indexing problems: quality, exhaustivity, specificity, consistency • (e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)

  5. Semantic component model • The semantic component model was developed to facilitate the formulation of specific, structured queries that cover the search topic exhaustively along domain-specific dimensions • Two-level model dividing a given collection into a set of document classes, each class with an associated set of semantic components • Based on the assumptions that • Domain experts know the document genres within a certain domain: content and structure (Dillon, 1991; Orlikowski & Yates, 1994; Bishop, 1999; Vaughan & Dillon, 2005) • Domain-specific document content and structure correspond to domain-specific information needs (Ely et al., 1999, 2000; Price, Delcambre & Nielsen, 2006)
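
The two-level structure described on this slide can be sketched as a small data model. This is only an illustration under stated assumptions, not the authors' implementation: the class names `DocumentClass` and `IndexedDocument`, the field names, the document id and the sample text are hypothetical. The document class "Clinical method" and the component names are taken from the slides that follow, and grouping all three components under that one class is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DocumentClass:
    """Level 1: a document class and the semantic components defined for it."""
    name: str
    components: frozenset


@dataclass
class IndexedDocument:
    """Level 2: a document assigned to one class, with text segments
    marked as instances of that class's semantic components."""
    doc_id: str
    doc_class: DocumentClass
    segments: dict = field(default_factory=dict)  # component name -> text

    def mark(self, component: str, text: str) -> None:
        # Only components defined for the document's class may be used.
        if component not in self.doc_class.components:
            raise ValueError(
                f"{component!r} is not defined for class {self.doc_class.name!r}"
            )
        self.segments[component] = text


# Component names come from the slides; the grouping is an assumption.
clinical_method = DocumentClass(
    name="Clinical method",
    components=frozenset(
        {"General information", "Practical information", "Risk factors"}
    ),
)

doc = IndexedDocument("doc-1", clinical_method)  # hypothetical document id
doc.mark("Risk factors", "placeholder text about risk factors")
```

Restricting `mark` to the components of the document's own class mirrors the idea that each class carries its own set of semantic components.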

  6. Document class: Clinical method • SC: General information • SC: Practical information

  7. Document class: Clinical method • SC: General information • SC: Risk factors • After treatment

  8. Semantic component model

  11. Case study • sundhed.dk: the Danish national health portal • Active since 2001, 25,000 documents • Two main target groups: citizens and medical professionals • Combination of full-text indexing and controlled, assigned indexing: • ICPC, International Classification of Primary Care • ICD-10, International Classification of Diseases • Home-grown Citizens Thesaurus • Large and varied group of indexers • 5 regions • Up to 250 indexers per region • Specific target group: family doctors

  12. Test design • Comparative, experimental indexing study • Baseline: keyword indexing (controlled and free terms) • Experimental: semantic component indexing • Test participants: 16 sundhed.dk indexers (convenience sample) • Indexing task: 12 sundhed.dk documents • 6 documents were indexed with semantic components (SC) • 6 documents were indexed with keywords • Random assignment of documents and indexing methods • Training session • Evaluation measures: • Accuracy • Consistency • Indexing time • Easiness
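
The slides do not state which formulas were used for accuracy and consistency. As a rough sketch only, the code below assumes a common set-overlap (Jaccard-style) measure for inter-indexer consistency, and precision and recall against a reference standard for accuracy. The function names and the example terms are invented for illustration.

```python
def consistency(terms_a, terms_b):
    """Inter-indexer consistency: shared terms / all distinct terms used."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty indexings agree trivially
    return len(a & b) / len(a | b)


def accuracy(assigned, reference):
    """Precision and recall of one indexer's terms against a reference standard."""
    assigned, reference = set(assigned), set(reference)
    hits = len(assigned & reference)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Invented example terms, for illustration only.
indexer_1 = {"asthma", "treatment", "children"}
indexer_2 = {"asthma", "treatment", "inhaler"}
reference = {"asthma", "treatment", "children", "dosage"}

print(consistency(indexer_1, indexer_2))  # 2 shared / 4 distinct = 0.5
print(accuracy(indexer_1, reference))     # (1.0, 0.75)
```

The same two functions apply to either indexing method if a semantic component assignment is treated as a set of component labels, which is again an assumption rather than something the slides specify.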

  13. Research questions • Is semantic component indexing more accurate than keyword indexing compared to a reference standard? • Is semantic component indexing more consistent than keyword indexing? • Is semantic component indexing faster than keyword indexing? • Is semantic component indexing easier than keyword indexing?

  14. Accuracy

  15. Consistency

  16. Time to index

  17. Easiness

  18. Conclusions • Accuracy varied for both indexing methods, but the data suggest that semantic component indexing might be more accurate • Indications that the feasibility and easiness of the two indexing methods are similar • Semantic component indexing may be a preferable alternative if no appropriate controlled vocabulary is available, because it takes little time to develop and is easy to customize to a specific document collection • Limitations: • Small sample and a single domain • Evaluation measures not directly comparable • A retrieval test shows an improvement in document ranking of 25.6% by nDCG (normalized Discounted Cumulative Gain)
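
For readers unfamiliar with nDCG, the measure behind the 25.6% figure can be sketched as follows. The slides do not say which variant was used in the retrieval test; this sketch assumes the common form with linear gain and a log2 rank discount, and the relevance grades are invented.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Invented relevance grades for the same four documents in two rankings.
baseline = [0, 2, 1, 3]  # the most relevant document is ranked last
reranked = [3, 2, 0, 1]  # the most relevant document is ranked first

relative_gain = (ndcg(reranked) - ndcg(baseline)) / ndcg(baseline)
```

A ranking that moves highly relevant documents toward the top scores closer to 1.0, which is why the measure suits the stated goal of retrieving the right document with the first query.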

  19. Future research • Development of the model: • Simpler version • Markup by users (social tagging) • Automatic markup • Markup in XML • Larger-scale evaluation • Evaluation in other domains

  20. References
      Dillon, M. (1991). Reader’s model of text structures: the case of academic articles. International Journal of Man-Machine Studies, 35. 913-925.
      Ely, J., Osheroff, J., Ebell, M., Bergus, G., Levy, B., Chambliss, M. & Evans, E. (1999). Analysis of questions asked by family doctors regarding patient care. BMJ, 319 (7206). 358-361.
      Ely, J., Osheroff, J., Gorman, P., Ebell, M., Bergus, G., Levy, B., Chambliss, M., Pifer, E. & Stavri, P. (2000). A taxonomy of generic clinical questions: classification study. BMJ, 321 (7278). 429-432.
      Fagin, R., Kumar, R., McCurley, K. S., Novak, J., Sivakumar, D., Tomlin, J. A. & Williamson, D. P. (2003). Searching the workplace web. In: Proceedings of the 12th International World Wide Web Conference (WWW ’03), Budapest, Hungary, May 20-24, 2003. 366-375.
      Freund, L., Toms, E. & Waterhouse, J. (2005). Modeling the information behaviour of software engineers using a work-task framework. In: Grove, A. (ed.) ASIS&T ’05 Proceedings of the 68th Annual Meeting, Charlotte, NC, October 28-November 2, 2005.
      Hearst, M. & Plaunt, C. (1993). Subtopic structuring for full length document access. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. 59-69.
      Leckie, G. J., Pettigrew, K. E. & Sylvain, C. (1996). Modeling the information seeking of professionals. Library Quarterly, 66 (2). 161-193.
      Orlikowski, W. J. & Yates, J. (1994). Genre repertoire: the structuring of communicative practices in organizations. Administrative Science Quarterly, 39. 541-574.
      Price, S., Delcambre, L. & Nielsen, M. L. (2006). Using semantic components to express questions against document collections. Proceedings of the International Workshop on Health Information and Knowledge Management (HIKM 2006), Arlington (VA).
      Price, S., Nielsen, M. L., Delcambre, L. & Vedsted, P. (2007). Semantic components enhance retrieval of domain-specific documents. Proceedings of the ACM Sixteenth Conference on Information and Knowledge Management (CIKM), Lisboa, November 6-8, 2007.

  21. Search term: the search term should appear in the specified semantic component

  22. Semantic component: the semantic component should appear in the document
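
The two query conditions named on slides 21 and 22 can be illustrated with a toy filter. This is a hedged sketch, not the search engine used in the study: the function names, the dictionary layout (component name mapped to text) and the sample documents are all assumptions made for the example.

```python
def term_in_component(segments, term, component):
    """Slide 21 condition: the term must occur inside the named component."""
    return term.lower() in segments.get(component, "").lower()


def has_component(segments, component):
    """Slide 22 condition: the document must contain the named component."""
    return bool(segments.get(component, "").strip())


# Invented documents: component name -> text marked as that component.
docs = {
    "doc-1": {
        "General information": "Overview of asthma.",
        "Risk factors": "Smoking increases risk.",
    },
    "doc-2": {"General information": "Smoking cessation overview."},
    "doc-3": {"Risk factors": "Family history of allergy."},
}

hits_term = [d for d, seg in docs.items()
             if term_in_component(seg, "smoking", "Risk factors")]
hits_component = [d for d, seg in docs.items()
                  if has_component(seg, "Risk factors")]

print(hits_term)       # ['doc-1']
print(hits_component)  # ['doc-1', 'doc-3']
```

In this toy data, doc-2 mentions smoking only in its general information, so a query restricted to the risk-factor component leaves it out. This is the kind of distinction between topically similar documents that the model is meant to support.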

  23. Time to index: time required for indexing documents

  24. Research team
      General practice: Peter Vedsted, MD, Ph.D., Research Unit for General Practice, Århus University; Jens Rubak, MD, Praksis.dk, Region Midt
      Information and computer science: Lois Delcambre, Ph.D., Professor, and Susan Price, MD, Ph.D. student, Computer Science Department, Portland State University, USA; Marianne Lykke, Ph.D., Associate Professor, Information Interaction and Information Architecture, Danmarks Biblioteksskole (Royal School of Library and Information Science)
      sundhed.dk: Vibeke Luk, Information specialist, sundhed.dk; Frans la Cour, IT consultant, Autonomy
      Supported by grants from the National Science Foundation, grant numbers 0514238, 0511050 and 0534762, the National Library of Medicine Training Grant 5-T15-LM07088, and Kvalitetsudviklingsudvalget for Almen Praksis, Aarhus Amt
