London Amazon CloudSearch Meetup Jon Handler - PowerPoint PPT Presentation

Presentation Transcript

  1. London Amazon CloudSearch Meetup, Jon Handler

  2. Agenda • CloudSearch technical overview (Jon Handler, Amazon CloudSearch Solution Architect) • NakedWines and CloudSearch (Matt Reid, Developer at NakedWines) • Searching Wikipedia with Amazon CloudSearch (Iain Fletcher, Search Technologies) • Building UI with CloudSearch (Stefan Olafsson, Co-Founder, Twigkit)

  3. What is Search Shoes

  4. Do You Want Search With That? • Build your own – database, home-rolled, site search • Open source • Legacy enterprise search

  5. Search Challenges • Complex, expertise required • Costly, often with up-front expenditure • Long time to market, innovation and experimentation are slowed • Operational overhead is undifferentiated work

  6. Amazon CloudSearch • Pay for infrastructure you need when you need it • Low cost • No need to guess capacity • Experiment fast with low risk • We do the undifferentiated heavy lifting • Go global in minutes

  7. Amazon CloudSearch Architecture [diagram labels: AWS Query; DNS / Load Balancing; Search Domain; Document Service (Doc Svc API); Search Service (Search API); Config Service (Config API); Console; Command Line Tools]

  8. Automatic Scaling [diagram: a grid of search instances, each holding one copy of one index partition. Index Partitions 1 to n are labeled DATA (Document Quantity and Size); Copies 1 to n are labeled TRAFFIC (Search Request Volume and Complexity)]

  9. Compute • Storage • Load Balancing • Security [diagram: the same grid of search instances, Index Partitions 1 to n by Copies 1 to n]

  10. Text Search

  11. Highly Relevant Results

  12. Faceted Drilldown

  13. Integer Range Searching

  14. Complex Queries

  15. Search Query Processing [diagram: a Query passes through Matching, Filtering, Ranking, and Sorting; example document IDs 123, 564, and 726]

  16. Reference Architecture

  17. Create An Amazon CloudSearch Domain

  18. Text fields for matching user terms • Result enabled to retrieve source data

  19. Literal fields for faceting • Facet enabled to retrieve facets • Search enabled for narrowing

  20. Integer fields for ranking, narrowing

  21. Configure the Domain

  22. Data Preparation and Upload [diagram: Documents → Extract → SDF Batch → POST → Amazon CloudSearch → Search]

  23. CloudSearch SDF [{"type": "add", "id": "b007oznzg0", "version": 1, "lang": "en", "fields": { "title": "Kindle Paperwhite", "description": "World's most advanced e-reader", "category": ["Electronics", "eBook Readers"], "price": 11900 } }, ...]
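
The SDF batch on this slide can be assembled programmatically. The following is a minimal Python sketch, not code from the talk: the helper name `make_add` is hypothetical, and the field names and values are copied from the slide's example.

```python
import json

def make_add(doc_id, version, fields, lang="en"):
    """Build one SDF "add" operation in the shape shown on the slide.

    make_add is a hypothetical helper, not part of any CloudSearch SDK.
    """
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }

# One document, using the field names and values from the slide.
batch = [
    make_add(
        "b007oznzg0",
        1,
        {
            "title": "Kindle Paperwhite",
            "description": "World's most advanced e-reader",
            "category": ["Electronics", "eBook Readers"],
            "price": 11900,
        },
    )
]

# An SDF batch is a JSON array of operations.
sdf_body = json.dumps(batch)
print(sdf_body)
```

The price is stored as the integer 11900, matching the slide, which fits the integer fields described for ranking and narrowing.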

  24. Document Service API http(s)://< document service endpoint >/2011-02-01/documents/batch Accept: application/json Content-Length: 1176 Content-Type: application/json Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com [{"type": "add", "id": "b007oznzg0", "version": 1, "lang": "en", "fields": {"title": "Kindle Paperwhite", "description": "World's most advanced e-reader", "category": ["Electronics", "eBook Readers"], "price": 11900} }, { "type": "delete", "id": "tt0434409", "version": 1337648735 } ]
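
As an illustration of the request on this slide, here is a hedged Python sketch that builds (but does not send) the batch POST with the standard library. The endpoint `doc.example-domain.us-east-1.cloudsearch.amazonaws.com` is a placeholder, the function names are hypothetical, and `2011-02-01` is simply the API version printed on the slide.

```python
import json
import urllib.request

API_VERSION = "2011-02-01"  # version string shown on the slide

def build_batch_request(endpoint, batch, use_https=False):
    """Build a POST request for /documents/batch with the slide's headers."""
    body = json.dumps(batch).encode("utf-8")
    scheme = "https" if use_https else "http"
    url = f"{scheme}://{endpoint}/{API_VERSION}/documents/batch"
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Content-Length": str(len(body)),
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

def send(request):
    """Send the request; only meaningful against a real document endpoint."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")

# One add and one delete, mirroring the slide's example batch.
batch = [
    {"type": "add", "id": "b007oznzg0", "version": 1, "lang": "en",
     "fields": {"title": "Kindle Paperwhite", "price": 11900}},
    {"type": "delete", "id": "tt0434409", "version": 1337648735},
]

# Placeholder endpoint (an assumption); substitute your domain's document endpoint.
request = build_batch_request(
    "doc.example-domain.us-east-1.cloudsearch.amazonaws.com", batch
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from `send` lets the batch and headers be inspected or logged before anything goes over the network.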

  25. Search Service API http(s)://< search service endpoint>/2011-02-01/search? • Simple searches • q= text • Boolean combination of fields • bq= (or field:'value1' (and field:'value2' field:'value3')) • Faceting • facet= comma separated list of facet fields • Pagination • start=, size= • Customized ranking • rank= sort results based on the rank expression provided
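
The query parameters listed on this slide can be combined into a search URL. The sketch below is an assumption-laden illustration: `build_search_url` is a hypothetical helper, the endpoint is a placeholder, the default page size of 10 is chosen for the example, and the `category` and `title` fields are borrowed from the SDF example earlier in the deck.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_VERSION = "2011-02-01"  # version string shown on the slide

def build_search_url(endpoint, q=None, bq=None, facets=(), start=0, size=10, rank=None):
    """Assemble a search URL from the parameters named on the slide."""
    params = {}
    if q is not None:
        params["q"] = q            # simple text search
    if bq is not None:
        params["bq"] = bq          # Boolean combination of fields
    if facets:
        params["facet"] = ",".join(facets)  # comma separated list of facet fields
    params["start"] = start        # pagination offset
    params["size"] = size          # page size
    if rank is not None:
        params["rank"] = rank      # rank expression to sort by
    return f"http://{endpoint}/{API_VERSION}/search?" + urlencode(params)

# Placeholder endpoint (an assumption); substitute your domain's search endpoint.
url = build_search_url(
    "search.example-domain.us-east-1.cloudsearch.amazonaws.com",
    bq="(and category:'Electronics' title:'kindle')",
    facets=["category"],
    start=20,
    size=10,
    rank="-text_relevance",
)
print(url)

# Round-trip check: decode the query string back into parameters.
params_back = parse_qs(urlsplit(url).query)
```

`urlencode` percent-encodes the quotes, colons, and parentheses in the `bq` expression, so the Boolean query survives transport intact.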

  26. Search Results {"rank": "-text_relevance", "match-expr": "(label 'kindle paperwhite')", "hits": { "found": 204, "start": 0, "hit": [ { "id": "sontsst12cf5f88b42" }, { "id": "sopvopr12ab017f082" }, { "id": "sorzrpw12ac468a13b" } ] }, ... }
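
A response in this shape is straightforward to consume. The following Python sketch parses a trimmed copy of the slide's response (the elided parts are left out); the `summarize` helper is hypothetical.

```python
import json

# Trimmed copy of the slide's example response; elided parts are omitted.
response_text = """
{"rank": "-text_relevance",
 "match-expr": "(label 'kindle paperwhite')",
 "hits": {"found": 204, "start": 0,
          "hit": [{"id": "sontsst12cf5f88b42"},
                  {"id": "sopvopr12ab017f082"},
                  {"id": "sorzrpw12ac468a13b"}]}}
"""

def summarize(text):
    """Return (total matches, offset of this page, document IDs on this page)."""
    hits = json.loads(text)["hits"]
    ids = [hit["id"] for hit in hits["hit"]]
    return hits["found"], hits["start"], ids

found, start, ids = summarize(response_text)

# More pages remain if this page does not reach the end of the match set.
has_more = start + len(ids) < found
print(found, start, ids, has_more)
```

Comparing `start + len(ids)` with `found` is one simple way to decide whether to request another page with the `start=` and `size=` parameters.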

  27. Customizing Ranking • Rank expressions • Compute a score for each document • &rank=<function> • E.g. recency based

  28. Customizing Ranking With Queries • Define rank expressions in your query • &rank-recency=text_relevance + (1 / (2012 - year)) * 100 • &rank=-recency • Uses • A/B testing • User-customized searches • Geo-searching
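
To see what the `recency` expression on this slide does to result order, here is a small Python sketch that evaluates the same formula locally. It is not how CloudSearch evaluates rank expressions; the documents and their `text_relevance` values are made up for illustration, and the sketch assumes every `year` is earlier than 2012, since the expression as written divides by zero otherwise.

```python
def recency_score(text_relevance, year, current_year=2012):
    """Evaluate text_relevance + (1 / (2012 - year)) * 100 from the slide.

    Assumes year < current_year; the slide's expression is undefined otherwise.
    """
    return text_relevance + (1 / (current_year - year)) * 100

# Hypothetical documents: (id, text_relevance, year).
docs = [
    ("a", 200, 2011),
    ("b", 250, 2002),
    ("c", 150, 2010),
]

# Order by text relevance alone, highest first.
by_relevance = [d[0] for d in sorted(docs, key=lambda d: d[1], reverse=True)]

# Order by the recency expression, highest first (what rank=-recency asks for).
by_recency = [
    d[0]
    for d in sorted(docs, key=lambda d: recency_score(d[1], d[2]), reverse=True)
]

print(by_relevance)
print(by_recency)
```

In this made-up data, document `a` overtakes `b` once recency is added, because a 2011 document gains 100 points while a 2002 document gains about 10.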

  29. IMDb Data Demo

  30. Pricing • Get started for just $2.40/day; $75/month • AWS Calculator http://calculator.s3.amazonaws.com/calc5.html • Free Trial

  31. Wrap Up • Powerful search is a critical component of today's applications • Amazon CloudSearch makes adding search easy • Create a domain, POST documents, GET search results

  32. Resources and Q&A • Amazon CloudSearch Overview Page http://aws.amazon.com/cloudsearch/ • FAQs • Community Forum • Documentation & Getting Started Tutorial (IMDb) • Contact our EU business development team • http://aws.amazon.com/contact-us

  33. Thank You Jon Handler / handler@amazon.com

  34. Searching Wikipedia with Amazon CloudSearch • Iain Fletcher • ifletcher@searchtechnologies.com

  35. Search Engine Expertise • Microsoft SharePoint/FAST • Google Search Appliance • Solr • Amazon CloudSearch • LucidWorks • Attivio • Exalead • Autonomy • MarkLogic • elasticsearch • Vivisimo • Sinequa • Hadoop • Sphinx • …..

  36. 400+ Customers

  37. Searching Wikipedia with Amazon CloudSearch • Iain Fletcher • ifletcher@searchtechnologies.com

  38. Agenda • Project Background • High-level Architecture • Summary & Observations

  39. Project Background • Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch • Decision to use Wikipedia as a convenient data set for testing purposes

  40. High-level Architecture

  41. Indexing • Wikipedia provides content in a series of large xml files • Amazon CloudSearch ingests xml in a specified form • Various content processing tasks to perform • Splitting into individual documents • Date normalization • Metadata extraction & mapping • Cleanup, etc. • We used Aspire for these tasks
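
The splitting and date-normalization steps listed here can be illustrated in plain Python. This is not Aspire and not the project's actual pipeline: it is a minimal sketch using the standard library, run against a simplified, made-up sample (real Wikipedia dumps use XML namespaces and many more elements), with hypothetical output field names `title`, `updated`, and `body`.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime

# Simplified, made-up sample; real dump files are namespaced and far larger.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Search engine</title>
    <revision>
      <timestamp>2012-05-01T12:34:56Z</timestamp>
      <text>A search engine retrieves documents.</text>
    </revision>
  </page>
  <page>
    <title>Inverted index</title>
    <revision>
      <timestamp>2011-11-20T08:00:00Z</timestamp>
      <text>An inverted index maps terms to documents.</text>
    </revision>
  </page>
</mediawiki>"""

def split_pages(stream):
    """Stream the file and yield one dict per <page> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        timestamp = elem.findtext("revision/timestamp")
        yield {
            "title": elem.findtext("title"),
            # Date normalization: keep only the ISO calendar date.
            "updated": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date().isoformat(),
            "body": (elem.findtext("revision/text") or "").strip(),
        }
        elem.clear()  # release the page's subtree to keep memory flat

docs = list(split_pages(io.BytesIO(SAMPLE)))
for doc in docs:
    print(doc["title"], doc["updated"])
```

Using `iterparse` rather than loading the whole tree matters for large dump files, since each page can be emitted and discarded as soon as its closing tag is read.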

  42. Aspire in Brief • Based on Apache Felix / OSGi • Thread-safe, multi-threaded, distributable • Any number of pipelines, conditional branching • Plug-in components individually testable & upgradable • In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, and GSA • Tested with Elasticsearch and SharePoint 2013

  43. XML Input

  44. Indexing • Streaming Wikipedia Dump Files directly into CloudSearch • 500 docs/second achieved without much effort • Using 4 x XL instances of CloudSearch • 1 x XL EC2 instance for Aspire

  45. Searching • Amazon CloudSearch provides a RESTful/XML interface for search purposes • For the Wikipedia project, we needed a UI • Chose to use Twigkit • Wrote a Java API for CloudSearch • The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html

  46. Searching • Supports navigators and relevancy customization • E.g., a “PageRank”-style link analysis was performed • Limits are set high: e.g., retrieve 500,000 results in a single list, delivered in just a few seconds • Hugely useful for analysis applications • So, what does it look like?

  47. wikipedia.searchtechnologies.com

  48. wikipedia.searchtechnologies.com