
Searching for Search Solutions Harvard IT Summit June 23, 2011



Presentation Transcript


  1. Searching for Search Solutions Harvard IT Summit June 23, 2011 Randy Stern | randy_stern@harvard.edu | HUL/OIS David Heitmeyer | david_heitmeyer@harvard.edu | HUIT

  2. Searching the Web

  3. Searching a Site

  4. Searching a Collection

  5. Searching Geospatially

  6. Search at Harvard – Web

  7. Search at Harvard – Web

  8. Search at Harvard – Collections People Courses Grants Libraries ....many other things…

  9. Search at Harvard – Libraries

  10. Search at Harvard – Federated

  11. Search Models • “To oversimplify, there's the Google model and the faceted navigation model.” – Morville & Callender in Search Patterns • Keyword (“Google”) • Keyword search against an index • Advanced Search • Searching or selecting specific fields • Faceted Search (“Guided Navigation”) • Integrated search and browse • Keyword search • Browse by category metadata • “No dead ends”

  12. Advanced Search

  13. Advanced Search

  14. Faceted Search

  15. Search Technologies – Summary

  16. Apache Lucene • Open source from Apache • High-performance, full-featured text search engine library written entirely in Java • Text-based inverted index • Documents of name/value pairs • Stemming and tokenizers for various applications and languages • Query syntax – and/or/not/near • Highlighter • **FAST**
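
The "text-based inverted index" over "documents of name/value pairs" can be illustrated with a toy sketch. This is not Lucene's implementation or API; it only shows the idea of mapping terms to the documents that contain them. The tokenizer does no stemming, and the second document id (`24680`) is made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose fields contain it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for value in fields.values():
            for term in tokenize(value):
                index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Documents as name/value pairs; the id 24680 is a hypothetical example.
docs = {
    "13579": {"title": "Mona Lisa", "genre": "painting"},
    "24680": {"title": "Bronze Horse", "genre": "sculpture"},
}
index = build_index(docs)
print(search_and(index, "mona lisa"))
```

A lookup touches only the posting sets for the query terms, which is why this structure stays fast as the collection grows.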

  17. Apache Solr http://lucene.apache.org/solr/ • “Solr is the popular, blazing fast open source enterprise search platform from Apache” • A REST Web Service on top of Lucene for indexing and querying • XML and JSON output • Caching for faster response • Faceting • Web management interface • XML schema configuration files • “did you mean?” and “more like this” support • Scalable server model • Very active development community

  18. Apache Solr/Lucene Ecology [Diagram: library catalogs, enterprise databases, and web archives served by Solr, Nutch/NutchWAX, and Lucene; labels: text, fielded data, highly scalable with Hadoop cluster]

  19. Solr Indexing • Indexing: HTTP POST to http://mysolrserver/solr/update <add> <doc> <field name="id">13579</field> <field name="title">Mona Lisa</field> <field name="creator">Leonardo DaVinci</field> <field name="year">1519</field> <field name="genre">painting</field> </doc> </add>
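
The update message on this slide could be generated and posted from a script roughly as follows. This is a minimal sketch using only the Python standard library; `mysolrserver` is the slide's placeholder host, the function names are hypothetical, and the POST function is defined but not called because no real server is assumed.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build an <add><doc>...</doc></add> update message from a dict of fields."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def post_update(xml_bytes, url="http://mysolrserver/solr/update"):
    """POST the update message; the URL is the slide's placeholder, not a real server."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


mona_lisa = {
    "id": "13579",
    "title": "Mona Lisa",
    "creator": "Leonardo DaVinci",
    "year": 1519,
    "genre": "painting",
}
print(build_add_xml(mona_lisa).decode("utf-8"))
```

Building the XML with a library rather than string concatenation means characters such as `&` or `<` in field values are escaped automatically.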

  20. Solr Searching • http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre <response> <result numFound="43" start="0"> <doc> <str name="title">Mona Lisa</str> <str name="genre">painting</str> </doc> <doc> <str name="title">Bronze Horse</str> <str name="genre">sculpture</str> </doc> </result> </response>

  21. Solr Searching • http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre&wt=json {"response" : { "numFound" : 43, "start" : 0, "docs" : [ {"title":"Mona Lisa", "genre":"painting"}, {"title":"Bronze Horse", "genre":"sculpture"} ] } }
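
A client for the query on this slide might look like the sketch below. It builds the select URL from the slide's parameters and parses the JSON response text shown on the slide; no request is actually sent, since `mysolrserver` is a placeholder, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_select_url(query, start=0, rows=2, fields=("title", "genre"),
                     base="http://mysolrserver/solr/select"):
    """Build a select URL with the q, start, rows, fl and wt parameters."""
    params = {
        "q": query,
        "start": start,
        "rows": rows,
        "fl": ",".join(fields),  # the comma is percent-encoded in the final URL
        "wt": "json",
    }
    return base + "?" + urlencode(params)


def parse_response(text):
    """Return (total hit count, list of returned documents) from a JSON response."""
    response = json.loads(text)["response"]
    return response["numFound"], response["docs"]


# Response body copied from the slide.
sample = """
{"response": {"numFound": 43, "start": 0, "docs": [
  {"title": "Mona Lisa", "genre": "painting"},
  {"title": "Bronze Horse", "genre": "sculpture"}]}}
"""

print(build_select_url("Davinci"))
total, docs = parse_response(sample)
for doc in docs:
    print(doc["title"], "-", doc["genre"])
```

Note that `numFound` is the total number of matches (43 here), while `rows` limits how many documents come back in one page.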

  22. Use of Solr Exploding • Source: http://wiki.apache.org/solr/PublicServers • Whitehouse.gov, FCC.gov, Comcast / xfinity, AT&T Interactive, AOL (Yellow Pages, Music, NFL Sports, Recipes), Sears, Ticketmaster, Digg, Netflix, Zappos.com, and many more • Open source library catalogs • Blacklight (Ruby), VuFind (PHP) • Open source digital repositories • Fedora, DSpace • Support available from Lucid Imagination (Solr creators)

  23. Harvard University Course Catalog coursecatalog.harvard.edu

  24. Solr & Course Catalog • 9,000+ courses from 13 schools/programs • 15 MB index size • Fields are indexed and stored • Search + Faceted Navigation • School, calendar period, term, department, day, time, cross-registration status, credit level • Updated daily • REST interface • HTTP POST of XML files • XSLT/XPath 2 processing of XML data from Solr
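
Faceted navigation over fields like those listed above is requested from Solr with the `facet=true` and repeated `facet.field` parameters. The sketch below shows how such a request could be built and how facet counts, which Solr's JSON output returns by default as a flat list of alternating values and counts, could be turned into a mapping. The field names `school` and `term`, the sample counts, and the host are assumptions for illustration, not the Course Catalog's actual schema or data.

```python
from urllib.parse import urlencode


def build_facet_url(query, facet_fields,
                    base="http://mysolrserver/solr/select"):
    """Build a select URL that asks for facet counts on each given field."""
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    return base + "?" + urlencode(params)


def pair_facet_counts(flat):
    """Convert ["value1", count1, "value2", count2, ...] into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical field names and made-up counts, for illustration only.
print(build_facet_url("statistics", ["school", "term"]))
print(pair_facet_counts(["School A", 12, "School B", 5]))
```

Each facet value and its count can then be rendered as a link that adds a filter to the current search, which is what gives faceted navigation its "no dead ends" property.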

  25. Course Catalog – Searching and Facets Search Terms Facets

  26. Course Catalog • Access to data to other applications • Open Search browser plugins

  27. iSites 5,500 course websites each year 20,000 websites 16,000 students 8 student portals 33,000 users on a peak day

  28. Search within iSites

  29. Solr & iSites • 4.5 million items • File, topic, forum, image, page, html, sign-up event, video, audio, site, link, wiki, announcement, podcast • Crawlers use database and file system • MS Office, PDF, Audio (metadata), OpenDocument, RTF, Text, HTML, XML • 35 GB index size • Updated hourly • Master and slave • Search Tool - Permissions

  30. Search – New Ways of Navigating

  31. Harvard Library Full Text Search Service

  32. Harvard Library Full Text Search Service

  33. Full Text Search Service • Uses Lucene directly • Full text index of OCR page text for digitized books and other page-turned objects • Relevance ranked searching • Hits in context • ~81,000 objects so far, 7.2 million pages • Index size 8.5 GB
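
"Hits in context" means showing the matched term with a window of surrounding page text. The sketch below illustrates the idea with plain string handling; it is not Lucene's Highlighter and not the service's code, and the function name, bracket markers, and sample sentence are assumptions.

```python
import re


def hit_in_context(text, term, width=30):
    """Return the first match of term with up to `width` characters on each side."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (prefix + text[start:match.start()]
            + "[" + match.group(0) + "]"
            + text[match.end():end] + suffix)


# Made-up page text for illustration.
page_text = "The library holds many digitized books and page images from its collections."
print(hit_in_context(page_text, "digitized", width=15))
```

Producing such snippets requires the original text to be available at query time, which is one reason a full text index can be much larger than a metadata-only index.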

  34. Harvard Library Web Archiving Service

  35. Harvard Library Web Archiving Service

  36. Web Archiving Service • Lucene plus NutchWAX full text index of harvested web pages and harvested resources • Indexing HTML, PDFs, Word docs, PPTs, etc. and collection metadata • Currently a “small” web archive • 265 web sites • 13M web pages • 100M web resources, 1 TB of archived web data • Index size 170 GB and growing • 80-90% of index size is full text required for “hit in context” search results • 3-5 sec search result times on ordinary dual core Linux box

  37. DRS 2 Web Administrator Facets to come!!

  38. DRS 2 Web Administrator • Solr for digital object management searching • Digital preservation objects have many fields that may be important for collection management or preservation planning • Faceted browse – by user tags, content type, owners, etc. • Full text searching for descriptions and process info • Easy to configure, update, and use (HTTP and simple URLs) • Indexing metadata plus full text embedded in object descriptors, rather than the content of files themselves • Scoped at release: • 152 fields • 30 million records, index size of 60 GB • master/slave configuration

  39. Email Archiving Service

  40. Email Archiving Service • Why Solr for email object management? • Relevance ranking • Facets • Full text searching of both email body and header fields • Indexing email header fields, rights and collection metadata, plus full text from emails
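
Indexing "email header fields ... plus full text from emails" implies turning each message into a set of named fields before it is sent to Solr. The sketch below does this for a simple single-part message with Python's standard `email` package. The field names, the sample message, and the function name are assumptions for illustration, not the service's actual schema or code.

```python
from email import message_from_string


def email_to_fields(raw):
    """Extract a few header fields and the plain body from a raw message."""
    msg = message_from_string(raw)
    # Multipart messages would need each part walked; this sketch skips them.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
        "body": body.strip(),
    }


# Made-up message for illustration.
raw_message = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Search summit\n"
    "Date: Thu, 23 Jun 2011 09:00:00 -0400\n"
    "\n"
    "See you at the IT Summit.\n"
)
fields = email_to_fields(raw_message)
print(fields["subject"], "|", fields["body"])
```

Keeping headers as separate fields is what allows faceting by sender or date while the body remains available for full text search.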

  41. Searching for Search Solutions • Integrating multiple forms of data (text, images, audio, maps, etc.) into single searchable indexes • Aggregating Indexes • Google, Google Books, Google Scholar • Licensed cloud services for articles, books, media, everything • Library Cloud • DPLA • Semantic Web • Linked Data, RDF, HTML 5’s Microdata, Microformats • Mobile (Localized) • Specialized search vs. general search – there’s an app for that

  42. Thank You Randy Stern | randy_stern@harvard.edu | HUL David Heitmeyer | david_heitmeyer@harvard.edu | HUIT
