1 / 20

search engines fdm 20c introduction to digital media lecture 04.12.2008

search engines fdm 20c introduction to digital media lecture 04.12.2008. warren sack / film & digital media department / university of california, santa cruz. outline. what’s the difference between a web directory and a search engine? ontologies clusters what does a search engine do?

Télécharger la présentation

search engines fdm 20c introduction to digital media lecture 04.12.2008

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. search engines fdm 20c introduction to digital media lecture 04.12.2008 warren sack / film & digital media department / university of california, santa cruz

  2. outline • what’s the difference between a web directory and a search engine? • ontologies • clusters • what does a search engine do? • crawling • indexing • retrieving

  3. what do search engines do? • crawl the web • index the pages found • retrieve pages from the index in response to user queries

  4. crawling/spidering the web

  5. depth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html

  6. breadth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html

  7. indexing web pages found by the crawler • tokenization: break the document into words, punctuation, tags, and uris • stop words: remove any that are too frequently used to be informational (these are called stop words) • stemming: find the stem of each word • inverted index: create an index linking each word found to one or more documents

  8. tokenization • T’was was a dark and stormy night in the country manor. ==> • T single-quote was a dark and stormy night in the country manor period

  9. stop words • T single-quote was a dark and stormy night in the country manor period ==> • T single-quote was adarkandstormynightin thecountrymanorperiod

  10. stemming • dark stormy night country manor ==> • dark stormy night country manor

  11. document vectors • documents (i.e., web pages) are stored as vectors or lists; e.g., country dark manor night storm • note that syntactic structure is lost

  12. inverted index • dictionary: make a file listing all of the words found in all of the web pages crawled • postings: for each word, make a list of all of the web pages in which the word was found • encoding: remove any redundancies found in the dictionary and postings files so that they are as small as possible

  13. inverted index / dictionary • create a document vector for each web page • combine all of the vectors together • remove any duplicate words • alphabetize the resulting list

  14. inverted index / dictionary / example • webpage 1: T’was a dark and stormy night in the country manor. ==> country dark manor night storm • webpage 2: Now is the time for all good men to come to the aid of their country. ==> aid all come country good men time • dictionary: aid all come country dark good manor men night storm time

  15. inverted index / postings / example • postings: aid (2) all (2) come (2) country (1&2) dark (1) good (2) manor (1) men (2) night (1) storm (1) time (2) • the postings file just annotates the dictionary file with the documents in which the words appear

  16. retrieval • the postings file can support boolean queries • for example, searching for webpages that contain the word “country” returns 1&2. • for example, searching for the webpages that contain the words “country” and “manor” returns webpage 1

  17. better retrieval • how should the webpages retrieved be ordered? • e.g., if webpage 1 and webpage 2 both contain the word “country” should 1 be listed before or after 2?

  18. social networks and centrality Diane is central; Jane is not. See www.orgnet.com/sna.html

  19. google’s page rank algorithm • basic idea: if a webpage has a lot of other webpages that link to it, then it is central and probably an important page. • If page A has pages T1...Tn which link to it (i.e., are citations) and C(A) is defined as the number of links going out of page A, then PageRank (PR) of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • where the parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85.

  20. the difference is in the details • The Boolean approach to information retrieval is the most widely used. Links to documents are returned only if they contain exactly the same words as your query. Even if a site is related in some way, it would never be retrieved unless it contains the user's search string. In order to get better results from this, most search engines finesse things a little. The results from the inverted file are modified by factors such as: where the search term appears in the item - whether the string of characters appears in specially tagged areas such as titles or head, or whether it appears early in the body text of the html document; how close the search words are together, or whether they form an 'exact phrase'; the frequency of occurrence of search terms in the inverted file; whether the search terms appear as keywords in the metatags in the html file's head; how frequently the terms appear in relation to the document's length (greater frequency indicating a probable greater relevance). These techniques gradually move over into semantic analysis. For instance selection by the frequency of occurrence of words in general in which very uncommon words get heavier weight. Matthew Fuller

More Related