Instant Indexing
Greg Lindahl, CTO, Blekko
October 21, 2010 - BCS Search Solutions 2010
Blekko Who?
- Founded in 2007, $24m in funding
- Whole-web search engine
- Currently in invite-only beta
- 3B page crawl
- Innovative UI
- … but this talk is about indexing
What whole-web search was
- Sort by relevance only
- News and blog search done with separate engines
- Main index updated slowly with a batch process
- Months-to-weeks update cycle
What web-scale search is now
- Relevance and date sorting
- Everything in a single index
- Incremental updating
- Live-crawled pages should appear in the main index in seconds
- All data stored as tables
Instant Search Indexing
(screenshot of a /date query)
Google’s take on the issue
- Daniel Peng and Frank Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- "Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency."
Percolator details
- ACID, with multi-row transactions
- Triggers ("observers"), which can be cascaded
- The crawler is a cascade of triggers (sketched below):
  - MapReduce writes new documents into BigTable
  - a trigger parses and extracts links
  - a cascaded trigger does clustering
  - a cascaded trigger exports changed clusters
- 10 triggers total in the indexing system
- At most 1 observer per column, for complexity reasons
- Message collapsing when there are multiple updates to a column
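Percolator's API isn't public as code, so the following is only a conceptual Python sketch of the observer cascade described in the paper; every name here (observe, write_cell, the toy parser and clustering) is made up for illustration and is not Percolator's actual interface.

```python
# Conceptual sketch only: made-up names, not Percolator's actual API.
# A write to a watched column fires its (single) observer, whose own
# writes can fire further observers, forming the cascade described above.

observers = {}          # column -> observer function (at most one per column)
table = {}              # row -> {column: value}

def observe(column):
    def register(fn):
        observers[column] = fn
        return fn
    return register

def write_cell(row, column, value):
    table.setdefault(row, {})[column] = value
    if column in observers:                  # notification fires the observer
        observers[column](row, value)

@observe("raw_document")
def parse_and_extract_links(row, html):
    links = [w for w in html.split() if w.startswith("http")]   # toy "parser"
    write_cell(row, "links", links)          # cascades to the next observer

@observe("links")
def cluster_document(row, links):
    write_cell(row, "cluster", hash(tuple(sorted(links))) % 100)   # toy clustering

# A batch loader (MapReduce in the paper) just writes raw documents;
# the observer cascade does the rest:
write_cell("doc1", "raw_document", "see http://example.com and http://blekko.com")
```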
Blekko’s take on this
- We want to run the same code in a mapjob or in an incremental crawler/indexer
- Our BigTable-like thingie shouldn’t need a Percolator-sized addition to do it
- Needs to be more efficient than other approaches
- OK with non-ACID, relaxed eventual consistency, etc.
Combinators
- Task: gather incoming links and anchortext
- Each crawled webpage has dozens of outlinks
- The crawler wants to write into dozens of inlink lists, each in a separate cell in a table
- TopN combinator: a list of the N highest-ranked items (sketched below)
- If a cell is frequently written, writes can be combined before hitting disk
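As a rough illustration, a topN combinator boils down to an associative merge that keeps only the N best (rank, key, data) entries. The names and layout below are assumptions for the sketch, not Blekko's actual code.

```python
import heapq

def topn_add(cell, items, n=100):
    """Merge new (rank, key, data) items into an existing topN cell.

    `cell` is the current list of (rank, key, data) tuples (or None for an
    empty cell); duplicate keys keep the highest-ranked entry, and only the
    n best survive. Illustrative sketch, not Blekko's implementation.
    """
    best = {}
    for rank, key, data in list(cell or []) + list(items):
        if key not in best or rank > best[key][0]:
            best[key] = (rank, key, data)
    return heapq.nlargest(n, best.values())

# A crawler handling one page emits one such update per outlink, into the
# inlinks cell of each target URL:
cell = topn_add(None, [(540, "example.com/a", "anchortext one")])
cell = topn_add(cell, [(1000, "example.com/b", "anchortext two")])
```

Because the merge is associative and order-insensitive, updates to the same cell can be combined anywhere along the write path, which is what the next slide describes.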
Combining combinators
- Combine within the writing process
- Combine within the local write daemon
- Combine within the 3 disk daemons, and the RAM daemon (sketched below)
- Highly contended cells result in 1 disk transaction per 30 seconds
- Combinators are represented as strings and can be used without the database
- Using combinators seems to be a significant reduction in RPCs over Percolator, but I have no idea what the relative performance is.
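A minimal sketch of what one of those combining stages might look like, assuming an associative merge function such as topn_add above; CombiningBuffer, sink, and the 30-second flush interval are illustrative stand-ins, not Blekko's actual daemons.

```python
import time

class CombiningBuffer:
    """Hypothetical write buffer: because combinator merges are associative,
    pending updates to the same cell can be merged locally and forwarded as
    one combined write, so a highly contended cell costs roughly one
    downstream transaction per flush interval."""

    def __init__(self, merge_fn, sink, flush_interval=30.0):
        self.merge_fn = merge_fn          # e.g. topn_add from the sketch above
        self.sink = sink                  # next stage: write daemon, disk daemon, ...
        self.flush_interval = flush_interval
        self.pending = {}                 # (table, row, column) -> partial cell
        self.last_flush = time.time()

    def write(self, table, row, column, items):
        key = (table, row, column)
        self.pending[key] = self.merge_fn(self.pending.get(key), items)
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        for (table, row, column), partial in self.pending.items():
            self.sink(table, row, column, partial)   # one combined downstream write
        self.pending.clear()
        self.last_flush = time.time()
```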
TopN example
- table: /index/32/url
- row: pbm.com/~lindahl/
- column: inlinks
- a list of (rank, key, data):
  - 1000, www.disney.com, "great website"
  - 540, britishmuseum.com/dance, "16th century dance manuals in facsimile"
  - 1, www.ehow.com/dance, "renaissance dance"
MapReduce from a combinator perspective
- MapReduce is really map, shuffle, reduce
- Input: a file/table; output: a file/table
- An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce
- Could write into BigTable cells…
MapJobs + Combinators
- The map function runs on shards
- All output is done by writing into a table, using combinators
- The same map function can also be run incrementally on individual inputs (see the sketch below)
- The shuffle+reduce is still there; it’s just done by the database + combinators
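The sketch below shows the idea under assumed names: the map function only ever emits combinator writes, so the same function can be driven by a batch mapjob over a shard or by the live crawler on a single page. None of the table paths or parameters are Blekko's real schema.

```python
# Assumed names and layout, not Blekko's API: the map function only emits
# combinator writes, so batch and incremental drivers can share it.

def map_page(url, page, write):
    """Emit combinator writes for one crawled page."""
    for link, anchortext in page["outlinks"]:
        write("/index/inlinks", row=link, column="inlinks",
              op="topn", value=[(page["rank"], url, anchortext)])
    write("/crawl/stats", row=page["site"], column="pages_crawled",
          op="sum", value=1)

def run_batch(shard, write):
    """Batch mapjob: iterate over a whole shard of the crawl table."""
    for url, page in shard.items():
        map_page(url, page, write)

def run_incremental(url, page, write):
    """Incremental path: the live crawler calls this for each fresh page."""
    map_page(url, page, write)

# Toy sink that just records the writes; in the real system the
# database + combinators play the role of shuffle + reduce.
writes = []
write = lambda table, row, column, op, value: writes.append((table, row, column, op, value))
run_incremental("pbm.com/~lindahl/",
                {"rank": 540, "site": "pbm.com",
                 "outlinks": [("example.com/a", "anchor")]},
                write)
```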
Combinator types
- topN
- lastN = topN, using time as the rank
- sum, avg, eavg, min, max
- counting things: logcount, a ±50% count of strings in 16 bytes (sketched below)
- set -- everything is a combinator
- Cells in our tables are native Perl/Python data structures
- Hence: atomic updates on a sub-cell level
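For the less obvious ones: sum is plain addition and lastN is topN with the timestamp as the rank. logcount is presumably a probabilistic distinct-count sketch; below is an illustrative Flajolet-Martin-style stand-in that fits in 16 bytes and merges with a bitwise OR. It is an assumption for illustration, not Blekko's actual algorithm.

```python
import hashlib

def logcount_add(bitmap, strings):
    """Add strings to a 16-byte Flajolet-Martin-style bitmap (illustrative
    stand-in for logcount, not Blekko's algorithm)."""
    bits = int.from_bytes(bitmap or bytes(16), "big")
    for s in strings:
        h = int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
        rank = (h & -h).bit_length() - 1 if h else 127   # trailing-zero count
        bits |= 1 << min(rank, 127)
    return bits.to_bytes(16, "big")

def logcount_merge(a, b):
    """Merging two sketches is a bitwise OR, so logcount combines like any
    other combinator."""
    return (int.from_bytes(a, "big") | int.from_bytes(b, "big")).to_bytes(16, "big")

def logcount_estimate(bitmap):
    """Rough estimate of the number of distinct strings seen."""
    bits = int.from_bytes(bitmap, "big")
    r = 0
    while (bits >> r) & 1:          # index of the lowest unset bit
        r += 1
    return int(2 ** r / 0.77351)

sketch = logcount_add(None, [f"url{i}" for i in range(1000)])
print(logcount_estimate(sketch))    # roughly the right order of magnitude
```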
Combinators for indexing
- The basic data structure for search is the posting list: for each term, a list of (docid, rank) rows
- Sounds like a custom topN to us (sketched below)
- rank = rank or date or …
- Lists are heavily compressed
- Each posting list has N shards
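A sketch of that layout under assumed names: each (term, shard) pair gets a topN-style cell keyed by rank and a companion keyed by date; the shard count, table paths, and the omission of compression are all simplifications for illustration.

```python
import heapq

NUM_SHARDS = 16    # illustrative; the slide only says each posting list has N shards

def posting_add(cell, sort_key, docid, n=1000):
    """Merge one (sort_key, docid) entry into a posting-list cell, keeping the
    n best. sort_key is the rank for relevance lists or a timestamp for date
    lists. Compression is left out of this sketch."""
    entries = {d: k for k, d in (cell or [])}
    entries[docid] = max(sort_key, entries.get(docid, sort_key))
    return heapq.nlargest(n, ((k, d) for d, k in entries.items()))

def index_page(store, docid, rank, date, terms):
    """Write one document into the rank- and date-sorted posting lists of
    every term chosen for indexing (assumed table layout)."""
    for term in terms:
        shard = hash(docid) % NUM_SHARDS
        row = f"/index/postings/{term}/{shard}"
        store[row, "by_rank"] = posting_add(store.get((row, "by_rank")), rank, docid)
        store[row, "by_date"] = posting_add(store.get((row, "by_date")), date, docid)

store = {}
index_page(store, "pbm.com/~lindahl/", rank=540, date=20101021,
           terms=["dance", "renaissance"])
```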
Combinators for crawling
- Pick a site, crawl the most important uncrawled pages
  - that’s stored as a topN (the "livecrawl" uses other criteria; sketched below)
- Crawl, parse, and spew writes:
  - outlinks into inlinks cells
  - page ip/geo into incoming ips, geos
  - page hashes into the duptext detection table
  - count everything under the sun
- 100s of writes total
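A small sketch of the frontier and counter parts, with assumed names: the per-site "most important uncrawled pages" list lives in a topN cell keyed by priority, so picking the next URLs is just reading the head of that cell, and the post-crawl spew is the map_page-style fan-out shown earlier plus a pile of sum-combinator counters.

```python
# Assumed layout, not Blekko's schema: the per-site crawl frontier is a topN
# cell of (priority, url) pairs, so "crawl the most important uncrawled pages"
# is a read of the head of that cell.

def next_urls_to_crawl(store, site, batch=10):
    frontier = store.get((f"/crawl/frontier/{site}", "uncrawled")) or []
    return [url for priority, url in frontier[:batch]]

def record_crawl(store, site, counters):
    """After crawling a page, bump a few of the 'count everything under the
    sun' counters (sum combinators); the real crawler issues hundreds of such
    writes per page."""
    for name, amount in counters.items():
        key = (f"/crawl/counters/{site}", name)
        store[key] = store.get(key, 0) + amount   # sum combinator, sketched

store = {("/crawl/frontier/pbm.com", "uncrawled"): [(1000, "pbm.com/~lindahl/")]}
for url in next_urls_to_crawl(store, "pbm.com"):
    record_crawl(store, "pbm.com", {"pages_crawled": 1, "bytes_fetched": 48213})
```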
Instant index step
- The crawler does the indexing
- Decides which terms to index based on page contents and incoming anchortext
- Writes into posting lists
  - if indexed before, use the list of previously indexed terms to delete any obsolete terms (sketched below)
- Heavily contended posting lists are not a problem, due to combining -- that’s how a naked [/date] query works.
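A sketch of the obsolete-term cleanup under assumed names, using a single unsharded posting list per term for brevity:

```python
# Assumed, simplified layout: postings maps term -> list of (rank, docid),
# doc_terms maps docid -> set of terms indexed for it last time.

def reindex(postings, doc_terms, docid, new_terms, rank, n=1000):
    old_terms = doc_terms.get(docid, set())
    for term in old_terms - set(new_terms):        # terms no longer on the page
        postings[term] = [(r, d) for r, d in postings.get(term, []) if d != docid]
    for term in new_terms:                         # (re)write the current terms
        entries = [(r, d) for r, d in postings.get(term, []) if d != docid]
        postings[term] = sorted(entries + [(rank, docid)], reverse=True)[:n]
    doc_terms[docid] = set(new_terms)              # remember for the next recrawl

postings, doc_terms = {}, {}
reindex(postings, doc_terms, "pbm.com/~lindahl/", ["dance", "renaissance"], rank=540)
reindex(postings, doc_terms, "pbm.com/~lindahl/", ["dance", "galliard"], rank=540)
# 'renaissance' postings no longer contain this docid; 'galliard' now does.
```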
Supporting date queries
- /date queries fetch about 3X the posting lists of a relevance query
- To support [/health /date], we keep a posting list of the most recent dated pages for each website
- Date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles
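One way to read the "about 3X" figure, with hypothetical column names: a date-sorted query pulls the date-sorted list, its highly-relevant companion, and the relevance-sorted list for blending, where a plain relevance query pulls only the last. This is an interpretation for illustration, not a description of Blekko's query planner.

```python
# Hypothetical column names, sketching why a /date query touches roughly
# three times as many posting lists as a relevance-sorted query.

def posting_lists_for(term, date_sort=False):
    if not date_sort:
        return [f"/index/postings/{term}:by_rank"]
    return [f"/index/postings/{term}:by_date",           # most recent pages
            f"/index/postings/{term}:by_date_highrel",   # recent AND highly relevant
            f"/index/postings/{term}:by_rank"]           # for blending in relevance

print(posting_lists_for("obama"))
print(posting_lists_for("obama", date_sort=True))
```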
Example: [obama /date]
- The term posting list for ‘obama’ has overflowed -- moderately relevant dated pages are probably smushed out
- The date posting list for ‘obama’ has overflowed
- The date posting list for highly-relevant dated ‘obama’ is not full
To Sum Up
- There’s more than one way to do it
  - yes, we use Perl
- I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different
- See me if you’d like an invite to our beta-test