Search Technologies


Examples • Fast Google Enterprise • Google Search Solutions for business • Page Rank • Lucene • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java • Solr • Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.

Search Engine Ranking Criteria

Yahoo! • has been in the search game for many years • is better than MSN but nowhere near as good as Google at determining whether a link is a natural citation • has a ton of internal content and a paid inclusion program, both of which give it an incentive to bias search results toward commercial results • things like cheesy off-topic reciprocal links still work great in Yahoo!

MSN (Bing) • new to the search game • is bad at determining whether a link is natural or artificial • because its link analysis is weak, it places too much weight on page content • its poor relevancy algorithms cause a heavy bias toward commercial results • likes bursty recent links • new sites that are generally untrusted in other systems can rank quickly in MSN Search • things like cheesy off-topic reciprocal links still work great in MSN Search

Google • has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph • is much better than the other engines at determining whether a link is a true editorial citation or an artificial link • looks for natural link growth over time • heavily biases search results toward informational resources • trusts old sites way too much • a page on a site or subdomain with significant age or link-related trust can rank much better than it should, even with no external citations • has aggressive duplicate content filters that filter out many pages with similar content • if a page is obviously focused on a term, Google may filter the document out for that term • on-page variation and link anchor text variation are important • a page with a single reference or a few references to a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier • crawl depth is determined not only by link quantity but also by link quality; excessive low-quality links may make your site less likely to be crawled deeply or even included in the index • things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost

Ask • looks at topical communities • due to their heavy emphasis on topical communities they are slow to rank sites until they are heavily cited from within their topical community • due to their limited market share they probably are not worth paying much attention to unless you are in a vertical where they have a strong brand that drives significant search traffic

History • SMART • System for the Mechanical Analysis and Retrieval of Text, nicknamed “Salton’s Magic Automatic Retriever of Text” • Vector Space Model • Relevance feedback algorithm (customization) • Latent Semantic Indexing (LSI)

Basic Vector Space Algo • Vanilla Search Algo • Keyword search (ignore search modifiers and stop words, e.g. not, and, this, their, is, or, of) • Remove punctuation marks • Reduce words to their root form (stemming) • Combination of suffix and prefix • E.g.: students → student, swam → swim • Lemmatization → stochastic algorithm • science, scientist??
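The preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the algorithm any particular engine uses: the stop-word list and the suffix rules in `stem` are assumptions made up for this example, and the naive suffix stripping cannot handle irregular forms such as swam → swim, which need lemmatization.

```python
import string

# Hypothetical stop-word list for illustration; real systems use longer lists.
STOP_WORDS = {"the", "of", "and", "or", "not", "is", "this", "their", "a", "to"}


def stem(word):
    """Very naive suffix stripping (an assumption, not a real stemmer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # technologies -> technology
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # students -> student
    return word


def preprocess(text):
    """Lowercase, strip ASCII punctuation, drop stop words, stem the rest."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


tokens = preprocess("The promise of search technologies!")
print(tokens)
```

A production system would more likely rely on an established stemmer (for example a Porter-style stemmer) than on hand-written suffix rules like these.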

Documents to be indexed • Document 1 • Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.

Document 2 • Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.

Document 3 • Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

Stop words for removal • Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone. • Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science. • Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

Stemming Changes Identified • search technology around forty years time user base expanded first science technology information professionals finally information professionals pretty much everyone • math physics students familiar challenge finding unambiguous right answer information retrieval finding right document much art science • many serial killers suffer psychosis appear normal search killers take years latest police technology results shocking

Unique words identified • Search[1] technology[2] around[3] forty[4] year[5] time[6] user[7] base[8] expand[9] first[10] science[11] technology[2] information[12] professional[13] final[14] information[12] professional[13] pretty[15] much[16] everyone[17] • math[18] physics[19] student[20] familiar[21] challenge[22] find[23] unambiguous[24] right[25] answer[26] information[12] retrieval[27] find[23] right[25] document[28] much[16] art[29] science[11] • many[30] serial[31] killer[32] psychosis[33] appear[34] normal[35] search[1] killer[32] take[36] year[5] latest[37] police[38] technology[2] result[39] shock[40]

Search Dictionary [1] search [2] technology [3] around [4] forty [5] year [6] time………[40] shock

Representing documents as 40-dimensional vectors • Values are in the form <dictionary ref>:<no of occurrences> • Doc1(1:1, 2:2, 3:1, 4:1, 5:1, 6:1, 7:1, …, 13:2, 14:1, 15:1, …, 17:1, 18:0, 19:0, …, 40:0) • Doc2(1:0, 2:0, 3:0, …, 11:1, 12:1, …, 16:1, 17:0, 18:1, 19:1, 20:1, …, 29:1, 30:0, 31:0, …, 40:0) • Doc3(1:1, 2:1, 3:0, 4:0, 5:1, 6:0, 7:0, 8:0, …, 29:0, 30:1, 31:1, 32:2, 33:1, …, 40:1)
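As a rough sketch of how the dictionary and the vectors above could be built, the Python below starts from the stemmed word lists on the "Unique words identified" slide. The function names `build_dictionary` and `to_vector` are hypothetical, chosen for this example only.

```python
# Stemmed tokens for each document, copied from the "Unique words identified" slide.
DOCS = {
    "Doc1": ("search technology around forty year time user base expand first "
             "science technology information professional final information "
             "professional pretty much everyone").split(),
    "Doc2": ("math physics student familiar challenge find unambiguous right "
             "answer information retrieval find right document much art "
             "science").split(),
    "Doc3": ("many serial killer psychosis appear normal search killer take "
             "year latest police technology result shock").split(),
}


def build_dictionary(docs):
    """Assign each unique word a 1-based reference in order of first appearance."""
    dictionary = {}
    for tokens in docs.values():
        for token in tokens:
            if token not in dictionary:
                dictionary[token] = len(dictionary) + 1
    return dictionary


def to_vector(tokens, dictionary):
    """Count occurrences per dictionary reference; unknown words are skipped."""
    vector = [0] * len(dictionary)
    for token in tokens:
        ref = dictionary.get(token)
        if ref is not None:
            vector[ref - 1] += 1
    return vector


dictionary = build_dictionary(DOCS)
vectors = {name: to_vector(tokens, dictionary) for name, tokens in DOCS.items()}
print(len(dictionary), vectors["Doc3"])
```

The dense list form mirrors the slide; since most entries are zero, a sparse mapping from reference to count would be a common alternative.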

Handling the Query • “the promise of search technologies” • After stop-word removal and stemming: promise search technology • “search” and “technology” are present in the dictionary, but “promise” is not, so it is ignored • Hence the query becomes search technology, which is equivalent to the new vector (1:1, 2:1) • Converting it to a 40-dimensional vector: (1:1, 2:1, 3:0, 4:0, …, 40:0) • Finally, find the shortest distance (best match) between the query vector and the previously stored document vectors.
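The matching step can be sketched as follows. The slide only says "shortest distance", so the choice of cosine similarity here is an assumption: it is a common measure in the vector space model and is less sensitive to document length than raw Euclidean distance. The helper names `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter

# Stemmed tokens for each document, copied from the "Unique words identified" slide.
DOCS = {
    "Doc1": ("search technology around forty year time user base expand first "
             "science technology information professional final information "
             "professional pretty much everyone").split(),
    "Doc2": ("math physics student familiar challenge find unambiguous right "
             "answer information retrieval find right document much art "
             "science").split(),
    "Doc3": ("many serial killer psychosis appear normal search killer take "
             "year latest police technology result shock").split(),
}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs):
    """Drop query words missing from the dictionary, then score every document."""
    vocabulary = {t for tokens in docs.values() for t in tokens}
    query = Counter(t for t in query_tokens if t in vocabulary)
    scores = {name: cosine(query, Counter(tokens)) for name, tokens in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "promise" is not in the dictionary, so only "search" and "technology" count.
results = rank(["promise", "search", "technology"], DOCS)
for name, score in results:
    print(name, round(score, 3))
```

Under this measure Document 1 ranks first, Document 3 second, and Document 2 scores zero because it shares no terms with the query.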

Enhancements • Weighting multiple occurrences • (1:1000, 2:1000) • Weighting for phrases • Search technology • Police technology • Information professional • Information retrieval • Word clustering • Search/retrieval/find • Technology/science/math/physics • First/final/latest • Custom biases

Google PageRank • PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. • Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. • PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

PageRank • A PageRank results from a mathematical algorithm based on the graph formed by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs like Wikipedia. • The rank value indicates the importance of a particular page. • A hyperlink to a page counts as a vote of support.

Google Page ranking • PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • A → the page in question • T1…Tn → documents that reference A • PR → page rank • C(Ti) → total number of links to outside resources on page Ti • d → heuristic damping factor, usually set to 0.85
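The formula above can be evaluated iteratively. The sketch below is a minimal illustration on a tiny hypothetical link graph, not Google's implementation: the graph, the fixed iteration count, and the starting value of 1.0 for every page are assumptions made for this example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) over all pages.

    links maps each page to the set of pages it links to, so C(Ti) is
    len(links[Ti]). Using sets means repeated links count only once.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}          # assumed starting value
    for _ in range(iterations):
        updated = {}
        for page in pages:
            votes = sum(pr[src] / len(targets)
                        for src, targets in links.items()
                        if page in targets)
            updated[page] = (1 - d) + d * votes
        pr = updated
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this graph C ends up ranked above B, because C collects votes from both A and B while B only receives half of A's vote. With this form of the formula, and when every page has at least one outgoing link, the values sum to the number of pages.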

Content is not taken into account when PageRank is calculated. • Not all links weigh the same when it comes to PR. • If you had a web page with a PR8 and had 1 link on it, the site linked to would get a fair amount of PR value. But if you had 100 links on that page, each individual link would only get a fraction of the value. • Bad incoming links have no impact on PageRank. • Ranking popularity considers site age, backlink relevancy and backlink duration. PageRank doesn’t. • PageRank does not rank web sites as a whole, but is determined for each page individually. • Each inbound link is important to the overall total, except links from banned sites, which don’t count. • PageRank values don’t range from 0 to 10. PageRank is a floating-point number. • Each PageRank level is progressively harder to reach. PageRank is believed to be calculated on a logarithmic scale. • Google recalculates page PRs continuously, but we see the update only once every few months (Google Toolbar).

Frequent content updates don’t improve PageRank automatically. Content is not part of the PR calculation. • High PageRank doesn’t mean high search ranking. • DMOZ and Yahoo! listings don’t improve PageRank automatically. • .edu and .gov sites don’t improve PageRank automatically. • Sub-directories don’t necessarily have a lower PageRank than root directories. • Wikipedia links don’t improve PageRank automatically (update: but pages which extract information from Wikipedia might improve PageRank). • Links marked with the nofollow attribute don’t contribute to Google PageRank. • Efficient internal onsite linking has an impact on PageRank. • Links from and to high quality related sites have an impact on PageRank. • Multiple votes to one link from the same page count as much as a single vote.

Web Spiders • Selection policy • Re-visit policy
