search engines fdm 20c introduction to digital media lecture 04.12.2008

search engines fdm 20c introduction to digital media lecture 04.12.2008 warren sack / film & digital media department / university of california, santa cruz

outline • what’s the difference between a web directory and a search engine? • ontologies • clusters • what does a search engine do? • crawling • indexing • retrieving

what do search engines do? • crawl the web • index the pages found • retrieve pages from the index in response to user queries

crawling/spidering the web

depth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html

breadth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html

indexing web pages found by the crawler • tokenization: break the document into words, punctuation, tags, and uris • stop words: remove any that are too frequently used to be informational (these are called stop words) • stemming: find the stem of each word • inverted index: create an index linking each word found to one or more documents

tokenization • T’was was a dark and stormy night in the country manor. ==> • T single-quote was a dark and stormy night in the country manor period

stop words • T single-quote was a dark and stormy night in the country manor period ==> • T single-quote was adarkandstormynightin thecountrymanorperiod

stemming • dark stormy night country manor ==> • dark stormy night country manor

document vectors • documents (i.e., web pages) are stored as vectors or lists; e.g., country dark manor night storm • note that syntactic structure is lost

inverted index • dictionary: make a file listing all of the words found in all of the web pages crawled • postings: for each word, make a list of all of the web pages in which the word was found • encoding: remove any redundancies found in the dictionary and postings files so that they are as small as possible

inverted index / dictionary • create a document vector for each web page • combine all of the vectors together • remove any duplicate words • alphabetize the resulting list

inverted index / dictionary / example • webpage 1: T’was a dark and stormy night in the country manor. ==> country dark manor night storm • webpage 2: Now is the time for all good men to come to the aid of their country. ==> aid all come country good men time • dictionary: aid all come country dark good manor men night storm time

inverted index / postings / example • postings: aid (2) all (2) come (2) country (1&2) dark (1) good (2) manor (1) men (2) night (1) storm (1) time (2) • the postings file just annotates the dictionary file with the documents in which the words appear

retrieval • the postings file can support boolean queries • for example, searching for webpages that contain the word “country” returns 1&2. • for example, searching for the webpages that contain the words “country” and “manor” returns webpage 1

better retrieval • how should the webpages retrieved be ordered? • e.g., if webpage 1 and webpage 2 both contain the word “country” should 1 be listed before or after 2?

social networks and centrality Diane is central; Jane is not. See www.orgnet.com/sna.html

google’s page rank algorithm • basic idea: if a webpage has a lot of other webpages that link to it, then it is central and probably an important page. • If page A has pages T1...Tn which link to it (i.e., are citations) and C(A) is defined as the number of links going out of page A, then PageRank (PR) of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • where the parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85.

the difference is in the details • The Boolean approach to information retrieval is the most widely used. Links to documents are returned only if they contain exactly the same words as your query. Even if a site is related in some way, it would never be retrieved unless it contains the user's search string. In order to get better results from this, most search engines finesse things a little. The results from the inverted file are modified by factors such as: where the search term appears in the item - whether the string of characters appears in specially tagged areas such as titles or head, or whether it appears early in the body text of the html document; how close the search words are together, or whether they form an 'exact phrase'; the frequency of occurrence of search terms in the inverted file; whether the search terms appear as keywords in the metatags in the html file's head; how frequently the terms appear in relation to the document's length (greater frequency indicating a probable greater relevance). These techniques gradually move over into semantic analysis. For instance selection by the frequency of occurrence of words in general in which very uncommon words get heavier weight. Matthew Fuller

search engines fdm 20c introduction to digital media lecture 04.12.2008