Outline

Indexing and searching heterogeneous information. LLNL – Nov. 3, 2006. Edward A. Fox, Virginia Tech, fox@vt.edu, http://fox.cs.vt.edu. Outline: Acknowledgements, Publications; Introduction: Problem, Digital Libraries; New Efforts: Personalization, Superimposed Info; 5S, ETANA, Structure

Presentation Transcript


  1. Indexing and searching heterogeneous information • LLNL – Nov. 3, 2006 • Edward A. Fox, Virginia Tech • fox@vt.edu • http://fox.cs.vt.edu

  2. Outline • Acknowledgements, Publications • Introduction: Problem, Digital Libraries • New Efforts: Personalization, Superimposed Info • 5S, ETANA, Structure • Hybrid Partitioned Inverted Indices • Discovering Ranking Functions • Text + CBIR + Metadata + GIS • Meta-search, Union DLs • LinkFusion, SimFusion • Summary

  3. Acknowledgements: Students • Pavel Calado, William Cameron, Yuxin Chen, Fernando Das Neves, Robert France, Marcos Gonçalves, S.H. Kim, Aaron Krowne, Ming Luo, Paul Mather, Sanghee Oh, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Manas Tungare, Wensi Xi, Seungwon Yang, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …

  4. Acknowledgements: Faculty, Staff • Lillian Cassel, Lois Delcambre, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, David Maier, Gail McMillan, Claudia Medeiros, Manuel Perez-Quinones, Jeffrey Pomerantz, Naren Ramakrishnan, Layne Watson, Barbara Wildemuth, …

  5. Other Collaborators (Selected) • Brazil: FUA, UFMG, UNICAMP • Case Western Reserve University • Emory, Notre Dame, Oregon State • Germany: Univ. Oldenburg • Mexico: UDLA (Puebla), Monterrey • College of NJ, Hofstra, Penn State, Villanova • University of Arizona • University of Florida, Univ. of Illinois • University of Virginia

  6. Acknowledgements: Support • ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601, 0435059, 0532825), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS

  7. Publications – 1 of 2 • N. J. Belkin, P. Kantor, E. A. Fox and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing & Management, 31(3), 431-448, May-June 1995. • Fan, W., Luo, M., Wang, L., Xi, W., and Fox, E. A. Tuning before feedback: Combining ranking discovery and blind feedback for robust retrieval. SIGIR 2004, 27th Annual Int’l ACM SIGIR Conf. on R&D in Information Retrieval, Sheffield, England, 25-29 July • Weiguo Fan; Gordon, M.D.; Pathak, P.; Wensi Xi; Fox, E.A.; Ranking function optimization for effective web search by genetic programming: an empirical study, in the Proceedings of 37th Hawaii International Conf. on System Sciences (HICSS), 5-8 Jan. 2004, 105 - 112 • Edward A. Fox, Fernando Das Neves, Xiaoyan Yu, Rao Shen, Seonho Kim, and Weiguo Fan. Exploring the computing literature with visualization and stepping stones & pathways. CACM 49(4): 52-58, April 2006 • Edward A. Fox and Paul Mather. Scalable Storage for Digital Libraries. Chapter 12 in Multimedia Information Retrieval and Management: Technological Fundamentals and Applications, eds. D. Feng, W.C. Siu and H.J. Zhang, Berlin: Springer, 2003, pp. 265-288 • E. Fox and J. Shaw. Combination of Multiple Searches. In Proc. of The Second Text REtrieval Conference (TREC-2) (Aug. 30 - Sept. 1, 1993, NIST, Gaithersburg, MD), NIST Special Pub. 500-215, 1994, ed. D. K. Harman, 243-252 • Marcos Andre Goncalves, Robert K. France, and Edward A. Fox, MARIAN: Flexible Interoperability for Federated Digital Libraries. In Proc. 5th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'2001, September 4-8, 2001, Darmstadt, Germany, Springer, LNCS 2163 / 2001, pp. 173-186 • Ananth Raghavan, Naga Srinivas Vemuri, Rao Shen, Marcos Andre Goncalves, Weiguo Fan, and Edward A. Fox. Incremental, Semi-automatic, Mapping-Based Integration of Heterogeneous Collections into Archaeological Digital Libraries: Megiddo Case Study. In Proc. ECDL2005, Vienna, Sept. 18-23, 2005, 139-150

  8. Publications – 2 of 2 • Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, Ricardo da S. Torres, and Edward A. Fox. Exploring Digital Libraries: Integrating Browsing, Searching, and Visualization. In Proc. JCDL 2006, June 11-15, 2006, Chapel Hill, NC, 1-10 • Ricardo da Silva Torres, Alexandre X. Falcao, Baoping Zhang, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Pavel Calado. A new framework to combine descriptors for content-based image retrieval. In Proc. 14th Conf. Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005 Bremen, Germany, 335-336 • Li Wang, Weiguo Fan, Rui Yang, Wensi Xi, Ming Luo, Ye Zhou, Edward A. Fox, Ranking Function Discovery by Genetic Programming for Robust Retrieval, Text Retrieval Evaluation Conference-2003, Nov 17-23, NIST, Washington DC, 9 pages • Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, Dong Zhuang. SimFusion: Measuring Similarity using Unified Relationship Matrix. In Proc. SIGIR 2005, 28th Annual International ACM SIGIR Conf., Salvador, Brazil, August 15-19, 2005, 130-137, http://doi.acm.org/10.1145/1076034.1076059 • W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.Y. Ma, E.A. Fox. Link Fusion: A Unified Link Analysis Framework for Multi-type Inter-related Data Objects. In Proc. Thirteenth International World Wide Web Conf., WWW2004, NY, U.S.A. 19-22 May 2004, 10 pages • Wensi Xi, Ohm Sornil, Ming Luo, and Edward A. Fox. Hybrid Partition Inverted Files: Experimental Validation. In "Research and Advanced Technology for Digital Libraries, 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings", eds. Maristella Agosti and Constantino Thanos, LNCS 2458, Springer, pp. 422-431. • Wensi Xi, Ohm Sornil, and Edward A. Fox. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. Proc. Digital Library: IT Opportunities and Challenges in the New Millennium, July 9-11, 2002, Beijing Library Press, Beijing, China, 404-418 • Baoping Zhang, Yuxin Chen, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Marco Cristo, Pavel Calado. Intelligent GP Fusion from Multiple Sources for Text Classification. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31st October - 5 Nov 2005 Bremen, Germany, 477-484

  9. Outline • Acknowledgements, Publications • Introduction: Problem, Digital Libraries • New Efforts: Personalization, Superimposed Info • 5S, ETANA, Structure • Hybrid Partitioned Inverted Indices • Discovering Ranking Functions • Text + CBIR + Metadata + GIS • Meta-search, Union DLs • LinkFusion, SimFusion • Summary

  10. Problem Characterization • Distributed (space) • Content (streams) • Indexing (space, structure) • Features • Type/sub-type: Image, texture; link, citation • Descriptors: words or phrases or concepts • High dimensionality • Searching (scenario)

  11. Efficiency / Effectiveness • Effectiveness • Very common measures: Precision, Recall, F1, 10-precision, R-Precision • Usefulness, usability, task support, … • Efficiency • Time • Space • Performance, Resource use, …
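
The effectiveness measures named on this slide can be illustrated with a minimal Python sketch. It assumes binary relevance judgments and reads "10-precision" as precision over the top 10 ranked results, which is an interpretation rather than something the slide states. The function names and document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k positions that hold a relevant document."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant)) if relevant else 0.0


# Hypothetical ranked result list and relevance judgments.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p, r, f1 = precision_recall_f1(ranking, relevant)
```

With these hypothetical inputs, two of the five retrieved documents are relevant, so precision is 0.4 and recall is 2/3.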

  12. * Core components

  13. DL Curriculum Framework

  14. Outline • Acknowledgements, Publications • Introduction: Problem, Digital Libraries • New Efforts: Personalization, Superimposed Info • 5S, ETANA, Structure • Hybrid Partitioned Inverted Indices • Discovering Ranking Functions • Text + CBIR + Metadata + GIS • Meta-search, Union DLs • LinkFusion, SimFusion • Summary

  15. Personalizing A Course Website Using the NSDL • William Cameron², Boots Cassel², Edward Fox¹, Manuel Perez-Quinones¹, Manas Tungare¹, Xiaoyan Yu¹ • ¹Virginia Tech, ²Villanova

  16. Syllabus Collection: Towards an intelligent educational system [Architecture diagram; labels: Publisher, Recommender, Searcher, Editor, Services, Potential Syllabus Text, Other NSDL Resources, Unstructured Syllabus Text, Structured Syllabus Text, Syllabus Ontology, Syllabus Classifier, Classification Scheme, Extractor, Resource Classifier, Crawler]

  17. Search • With the collection in place, we offer full-text search • Results point to the local copy in our collection as well as to the original document • Try it out: http://doc.cs.vt.edu/search/

  18. Syllabus Ontology • Standard, machine understandable • Ontology Editor: Protégé • Syllabus Schema: SylVia • http://doc.cs.vt.edu/ontologies/

  19. Creating a new syllabus • Web-based application to support entry of syllabi into the collection • Moodle plug-in in the works • Uses CC 2001 to select topics for a course

  20. Information Extraction • Plans to automatically extract information from the collected syllabus documents • Rule-based approach • Statistics-based approach • Apply the best extractor to the unstructured syllabi
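
A rule-based extractor of the kind planned here might look like the following minimal Python sketch. The field names, the regular expressions, and the sample syllabus text are all assumptions made for illustration; the slides do not describe the project's actual rules.

```python
import re

# Hypothetical extraction rules: one regular expression per syllabus field.
RULES = {
    "course_code": re.compile(r"\b([A-Z]{2,4})\s?(\d{3,4})\b"),
    "instructor": re.compile(r"Instructor:\s*(.+)"),
    "textbook": re.compile(r"Textbook:\s*(.+)"),
}


def extract(text):
    """Apply each rule to the text and keep the first match per field."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            # Join captured groups, e.g. ("CS", "1234") becomes "CS 1234".
            fields[name] = " ".join(g.strip() for g in match.groups())
    return fields


# Made-up unstructured syllabus text.
sample = (
    "CS 1234 Introduction to Digital Libraries\n"
    "Instructor: Jane Doe\n"
    "Textbook: An Example Textbook\n"
)
fields = extract(sample)
```

A statistics-based extractor would replace the hand-written patterns with a model trained on labeled syllabi; comparing the two on the same documents is one way to pick "the best extractor".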

  21. Superimposed Tools for VT Uma Murthy and Edward A. Fox Department of Computer Science, Virginia Tech 18 October 2006

  22. Origin of SI • This basic need had been addressed in diverse ways, with varying degrees of success, for many years: • concordances, annotations, comments • bookmarks, concept maps, digital annotations, … • The term “SI” was coined in 1999 by researchers, currently collaborating with us, now at Portland State University • Lois Delcambre • David Maier

  23. Layers in an SI system * Source: ICDE04 presentation by Murthy et al.

  24. Annotating an image

  25. Searching over annotations

  26. Searching over images/sub-images

  27. Summary * Source: ICDE04 presentation by Murthy et al.

  28. Outline • Acknowledgements, Publications • Introduction: Problem, Digital Libraries • New Efforts: Personalization, Superimposed Info • 5S, ETANA, Structure • Hybrid Partitioned Inverted Indices • Discovering Ranking Functions • Text + CBIR + Metadata + GIS • Meta-search, Union DLs • LinkFusion, SimFusion • Summary

  29. Informal 5S & DL Definitions: DLs are complex systems that • help satisfy info needs of users (societies) • provide info services (scenarios) • organize info in usable ways (structures) • present info in usable ways (spaces) • communicate info with users (streams)

  30. 5Ss

  31. Taxonomy of DL Services

  32. 5S and DL formal definitions and compositions (April 2004 TOIS)

  33. A Minimal DL in the 5S Framework [Diagram; labels: Structures, Societies, Scenarios, hypertext, Streams, indexing, Spaces, searching, services, Collection, Repository, browsing, Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, Metadata Catalog, Digital Object, Minimal DL]

  34. ETANA-DL • Archaeological DL • Integrated DL • Heterogeneous data handling • Applies and extends the OAI-PMH • Open Archives Initiative Protocol for Metadata Harvesting • Design considerations • Componentized • Extensible • Portable
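
As background on OAI-PMH, a harvester issues HTTP requests whose query string carries a `verb` plus arguments such as `metadataPrefix` and `from`. The minimal Python sketch below only builds such a request URL. The base URL is a placeholder and the helper name is hypothetical; nothing here reflects ETANA-DL's actual endpoints or extensions.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, params=None):
    """Build an OAI-PMH request URL from a verb and optional arguments."""
    query = {"verb": verb}
    query.update(params or {})
    return f"{base_url}?{urlencode(query)}"


# Placeholder repository address; ListRecords and oai_dc are standard
# OAI-PMH protocol values (oai_dc is unqualified Dublin Core).
url = oai_request_url(
    "http://example.org/oai",
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2006-01-01"},
)
```

Arguments are passed as a dictionary rather than keyword arguments because `from` is a reserved word in Python.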

  35. Heterogeneous data handling

  36. ETANA Spaces • Geographic distribution of found artifacts • Temporal dimension (as inferred by archaeologists) • Metric or vector spaces • used to support retrieval operations, and to calculate distance (and similarity) • used to browse / constrain searches spatially • 3D models of the past, used to reconstruct and visualize archaeological ruins • 2D interfaces for human-computer interaction
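
To make the role of metric and vector spaces concrete, here is a minimal Python sketch of distance and similarity between feature vectors, plus a nearest-first ordering. The artifact names and two-dimensional vectors are invented for illustration and are not ETANA data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_distance(query, items):
    """Return item names ordered from nearest to farthest from the query."""
    return sorted(items, key=lambda name: euclidean(query, items[name]))


# Hypothetical feature vectors for three artifacts.
artifacts = {"jar": [1.0, 0.0], "bowl": [0.9, 0.1], "lamp": [0.0, 1.0]}
nearest_first = rank_by_distance([1.0, 0.0], artifacts)
```

The same ordering step can constrain a search spatially if the vectors hold coordinates instead of content features.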

  37. ETANA Structures • Site Organization • Region, site, partition, sub-partition, locus, … • Temporal orderings (ages, periods) • Taxonomies • for bones, seeds, building materials, … • Stratigraphic relationships • above, beneath, coexistent

  38. ETANA Streams • successive photos and drawings of excavation sites, loci, unearthed artifacts • audio and video recordings of excavation activities and discussions • textual reports • 3D models used to reconstruct and visualize archaeological ruins.

  39. Degree of Structure • Web: chaotic • DLs: organized • DBs: structured

  40. Digital Objects (DOs) • Born digital • Digitized version of “real” object • Is the DO version the same, better, or worse? • Decision for ETDs: structured + rendered • Surrogate for “real” object • Not covered explicitly in metamodel for a minimal DL • Crucial in metamodel for archaeology DL

  41. Metadata Objects (MDOs) • MARC • Dublin Core • RDF • IMS • OAI (Open Archives Initiative) • Crosswalks, mappings • Ontologies • Topic maps, concept maps
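
A crosswalk is essentially a lookup table from elements of one metadata scheme to elements of another. The minimal Python sketch below maps a few MARC field/subfield pairs to Dublin Core element names. The mapping is a simplified, illustrative subset chosen for this example, not an authoritative crosswalk, and the sample record is made up.

```python
# Illustrative subset only: (MARC tag, subfield code) -> Dublin Core element.
CROSSWALK = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_dc(fields):
    """Convert (tag, subfield, value) triples into a Dublin Core style dict.

    Unmapped fields are dropped; repeated elements accumulate in a list.
    """
    record = {}
    for tag, code, value in fields:
        element = CROSSWALK.get((tag, code))
        if element:
            record.setdefault(element, []).append(value)
    return record


# Hypothetical MARC-like input.
marc_fields = [
    ("245", "a", "An Example Title"),
    ("100", "a", "Doe, Jane"),
    ("650", "a", "Digital libraries"),
    ("650", "a", "Information retrieval"),
    ("999", "z", "local note with no mapping"),
]
dc_record = marc_to_dc(marc_fields)
```

Dropping unmapped fields shows why crosswalks are usually lossy: the target scheme may have no place for some source information.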

  42. Also Important: Epub, SGML, XML • 5S perspective: streams, structures, scenarios • Authoring • Rendering, presenting • Tagging, Markup, DOM • Semi-structured information • Dual-publishing, eBooks • Styles (XSL, XSLT) • Structured queries

  43. XML-based DL Log Standard • Log analysis • is a source of information on: • How patrons really use DL services • How systems behave while supporting user information seeking activities • Used to: • Evaluate and enhance services • Guide allocation of resources • Common practice in the web setting • Supported by web servers, proxy caches • DL Logging can be more detailed

  44. DL Logging Features • Captures high-level user and system behaviors • Organized according to the 5S framework • Hierarchical organization (XML-based) • Centered on the notion of events • Records only events related to initial user inputs and final system outputs • Helps in understanding user interactions and the perceived value of responses
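
An event-centered, hierarchical XML log can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names below (`log`, `event`, `timestamp`, `search`, `results`) are hypothetical and do not reproduce the schema of the DL log standard; the logged values are made up.

```python
import xml.etree.ElementTree as ET


def log_event(log, timestamp, kind, detail):
    """Append one event, with a single child naming the kind of event."""
    event = ET.SubElement(log, "event", {"timestamp": timestamp})
    ET.SubElement(event, kind).text = detail
    return event


# One initial user input (a query) and one final system output (a hit count).
log = ET.Element("log")
log_event(log, "2006-11-03T10:00:00", "search", "megiddo pottery")
log_event(log, "2006-11-03T10:00:01", "results", "42")

# Serialize the whole log to an XML string.
xml_text = ET.tostring(log, encoding="unicode")
```

Keeping each event as its own subtree is what lets later analysis select, for example, all `search` events without parsing free-form text lines.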