
Presentation Transcript


1. http://comet.lehman.cuny.edu/jung/presentation/presentation.html
Introduction to Modern Information Retrieval and Search Engines And Some Research Issues
Professor Gwang Jung
Department of Mathematics and Computer Science
Lehman College, CUNY
November 10, Fall 05

2. Outline
• Introduction to Information Retrieval
• Introduction to Search Engines (IR Systems for the Web)
• Search Engine Example: Google
• Brief Introduction to Semantic Web
• Useful Tools for IR System Building and Resources for Advanced Research
• Research Issues

3. Introduction to Information Retrieval

4. Information Age

5. IR in General
• Information Retrieval in general deals with:
  • Retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
• User query:
  • Structured (e.g., a Boolean expression of keywords or terms)
  • Unstructured (e.g., terms, a sentence, a document)
• In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
• Efficiency with respect to: algorithms, query processing, data organization/structure
• Effectiveness with respect to: retrieval results

6. IR Systems

7. Formal Definition of an IR System
• IRS = (T, D, Q, F, R)
  • T: set of index terms (terms)
  • D: set of documents in a document database
  • Q: set of user queries
  • F: D × Q → R (retrieval function)
  • R: real numbers (RSV: Retrieval Status Value)
• Relevance judgment is given by users.
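As a concrete reading of the tuple, here is a minimal Python sketch (not from the slides); the toy documents and the overlap-count scoring used for F are illustrative assumptions standing in for a real retrieval function:

```python
# A minimal sketch of IRS = (T, D, Q, F, R), assuming a toy
# overlap-count retrieval function (illustrative, not from the slides).

T = {"computer", "database", "science", "system"}          # index terms
D = {"d1": {"computer", "science"}, "d2": {"database"}}    # documents as term sets

def F(doc_terms: set, query_terms: set) -> float:
    """Retrieval function F: D x Q -> R. Here the RSV is simply
    the number of index terms shared by document and query."""
    return len(doc_terms & query_terms & T)

q = {"computer", "database"}                               # a query from Q
rsv = {doc_id: F(terms, q) for doc_id, terms in D.items()}
print(rsv)   # {'d1': 1, 'd2': 1} -- real-valued retrieval status values
```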

8. IRS versus DBMS

9. IR Systems Focus on Retrieval Effectiveness
• The effective retrieval of relevant information depends on:
  • User task (formulating an effective query for the information need)
  • Indexing
    • IR systems in general adopt index terms to represent documents and queries.
    • Indexing is the process of developing document representations by assigning index terms to documents (information items).
  • Retrieval model (often called the IR model) and the logical view of documents
    • The logical view of documents (their logical representation) depends on the IR model.

10. Indexing
• The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
  • Descriptors = index terms = terms
  • Descriptors also lead users to participate in formulating information requests.
• Two types of index terms:
  • Objective: author name, publisher, date of publication
  • Subjective: keywords selected from the full text
• Two types of indexing methods:
  • Manual: performed by human experts (for very effective IR systems); may use an ontology
  • Automatic: performed by computer hardware and software

11. Indexing Aims (1)
• Recall: the proportion of relevant items (documents) retrieved.
  • R = # of relevant items retrieved / total # of relevant items in the db
• Precision: the proportion of retrieved documents that are relevant.
  • P = # of relevant items retrieved / total # of items retrieved
• Effectiveness of indexing is mainly controlled by term specificity:
  • Broader terms may retrieve both useful (relevant) and useless (non-relevant) information items for the user.
  • Narrower (more specific) index terms favor precision at the expense of recall.
• Index language (the set of well-selected index terms): T = {index term t}
  • Pre-specified (controlled): easy maintenance; poor adaptability
  • Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries.
  • Synonymous terms can be added to T via a thesaurus, an e-dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
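A quick worked example of the two measures, as a minimal Python sketch; the document IDs are made-up illustrative data:

```python
# A minimal sketch of the recall/precision definitions above,
# using made-up document IDs.

retrieved = {"d1", "d2", "d3", "d4"}        # items returned by the system
relevant  = {"d2", "d4", "d7"}              # all relevant items in the db

hits = retrieved & relevant                 # relevant items retrieved
recall    = len(hits) / len(relevant)       # R = 2/3 ~ 0.67
precision = len(hits) / len(retrieved)      # P = 2/4 = 0.50
print(f"R={recall:.2f}  P={precision:.2f}")
```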

12. Indexing Aims (2)
• Recall and precision values vary from 0 to 1.
• Average users want to have high recall and high precision.
• In practice, a compromise must be reached (a middle point).
[Figure: recall-precision trade-off curve, with the R and P axes each ranging from 0 to 1.0]

13. Steps for Indexing
• Objective attributes of a document are extracted (e.g., title, author, URL, structure).
• Grammatical function words (stop words) in general are not considered as index terms (e.g., of, then, this, and, etc.).
• Case folding might be performed.
• Stemming might be used.
• Frequencies of non-function words are used to specify term importance.
  • Term frequency weighting fulfils only one of the indexing aims, i.e., recall.
  • Terms that occur rarely in the document database may be used to distinguish documents in which they occur from those in which they do not occur, which could improve precision.
• Document frequency: the number of documents in the collection in which a term tj ∈ T occurs.
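A minimal Python sketch of these steps; the stop-word list and the crude suffix-stripping stemmer are illustrative assumptions (real systems use, e.g., the Porter stemmer):

```python
# A minimal sketch of the indexing steps above: stop-word removal,
# case folding, a crude suffix-stripping "stemmer", and term-frequency
# counts. The stop-word list and stemmer are toy assumptions.
from collections import Counter
import re

STOP_WORDS = {"of", "then", "this", "and", "the", "a", "in", "is"}

def crude_stem(term: str) -> str:
    # Toy stemming: strip a few common suffixes (not the Porter algorithm).
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text: str) -> Counter:
    tokens = re.findall(r"[a-z]+", text.lower())            # case folding
    terms = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)                                   # term frequencies

print(index_terms("The computer indexes documents and this indexing is fast"))
# Counter({'index': 2, 'computer': 1, 'document': 1, 'fast': 1})
```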

14. Inverted Index File
Inverted index entries map each index term to its document frequency (df) and a list of (Dj, tfj) postings, optionally augmented with the positions of the term in each document:

  Index term    df    (Dj, tfj)
  computer       3    (D7, 4)
  database       2    (D1, 3)
  science        4    (D2, 4)
  system         1    (D5, 2)
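A minimal Python sketch of building such entries; the toy documents are assumptions, and term positions are omitted:

```python
# A minimal sketch of building inverted index entries: for each term,
# the document frequency (df) and a postings list of (doc_id, tf) pairs.
from collections import Counter, defaultdict

docs = {
    "D1": "database systems and database design",
    "D2": "computer science",
    "D5": "system software",
}

index = defaultdict(list)                  # term -> [(doc_id, tf), ...]
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term].append((doc_id, tf))

for term, postings in sorted(index.items()):
    print(f"{term}: df={len(postings)} postings={postings}")
```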

15. Retrieval Models (1)
• Set-theoretic IR models
  • Documents are represented by a set of terms.
  • Well-known set-theoretic models:
    • Boolean IR model
      • Retrieval function is based on Boolean operations (e.g., AND, OR, NOT).
      • Query is formulated in Boolean logic.
    • Fuzzy set IR model
      • Retrieval function is based on fuzzy set operations.
      • Query is formulated in Boolean logic.
    • Rough set IR model
      • Various set operations were examined.
      • Ad hoc Boolean query
• Probabilistic IR model
  • Mainly used for probabilistic index term weighting
  • Provides a mathematical framework for the well-known tf*idf indexing scheme
  • Language-model based: treats retrieval as inferring the query from a document's language model
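Since the slide mentions the tf*idf scheme, here is a minimal Python sketch of one common log-based formulation over a toy collection; the exact formula is an assumption, as many variants exist:

```python
# A minimal sketch of tf*idf weighting: terms frequent in a document
# but rare across the collection get high weights. Toy data; the
# log-based idf is one common variant, not the only one.
import math
from collections import Counter

docs = {
    "d1": ["computer", "science", "computer"],
    "d2": ["database", "system"],
    "d3": ["computer", "database"],
}

N = len(docs)
df = Counter(term for terms in docs.values() for term in set(terms))

def tf_idf(term: str, doc_id: str) -> float:
    tf = docs[doc_id].count(term)
    idf = math.log(N / df[term])        # rarer terms -> larger idf
    return tf * idf

print(tf_idf("computer", "d1"))         # 2 * ln(3/2) ~ 0.81
print(tf_idf("science", "d1"))          # 1 * ln(3/1) ~ 1.10
```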

16. Retrieval Models (2)
• Vector space model
  • Queries and documents are represented as weighted vectors.
  • Vectors in the basis are called term vectors and are assumed to be semantically independent.
  • A document (or query) is represented as a linear combination of the vectors in the generating set.
  • Retrieval function is based on the dot product or the cosine measure between document and query vectors.
• Extended Boolean IR model
  • Combines characteristics of the vector space IR model with properties of Boolean algebra.
  • Retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
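A minimal Python sketch of the cosine measure between sparse document and query vectors; the weights shown are illustrative tf*idf-style values:

```python
# A minimal sketch of the vector space model's cosine measure between
# a document vector and a query vector, stored as sparse dicts.
import math

def cosine(d: dict, q: dict) -> float:
    """Cosine of the angle between sparse term-weight vectors d and q."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc   = {"computer": 0.8, "science": 1.1}       # tf*idf-style weights
query = {"computer": 1.0, "database": 1.0}
print(cosine(doc, query))                       # ~0.42
```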

17. The Retrieval Process

18. The Retrieval Process in an IR System

19. Introduction to Search Engines (IR Systems for the Web)

20. World Wide Web History
• 1965: Hypertext
  • Ted Nelson developed the idea of hypertext in 1965.
• Late 1960s
  • Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
• Early 1970s
  • ARPANET was developed in the early 1970s.
• 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
• 1989: WWW
  • Developed by Tim Berners-Lee and others at CERN (first running in 1990) to organize research documents available on the Internet.
  • Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  • Developed the initial HTTP network protocol, URLs, HTML, and the first web server.

21. Search Engine (Web-Based IR System) History
• By the late 1980s, many files were available by anonymous FTP.
• In 1990, Alan Emtage of McGill Univ. developed Archie (short for "archives"):
  • Assembled lists of files available on many FTP servers.
  • Allowed regular-expression search of these file names.
• In 1993, Veronica and Jughead were developed to search the names of text files available through Gopher servers.
• In 1993, early web robots (spiders) were built to collect URLs:
  • Wanderer
  • ALIWEB (Archie-Like Index of the WEB)
  • WWW Worm (indexed URLs and titles for regex search)
• In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

22. Search Engine History (cont'd)
• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U. Washington (it eventually became part of Excite and AOL).
• A few months later, Michael "Fuzzy" Mauldin, a professor at CMU, developed Lycos with his graduate students:
  • First to use a standard IR system as developed for the DARPA Tipster project.
  • First to index a large set of pages.
• In late 1995, DEC developed AltaVista:
  • Used a large farm of Alpha machines to quickly process large numbers of queries.
  • Supported Boolean operators, phrases, and "reverse pointer" queries.
• In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford U:
  • Use of link analysis to rank documents

23. How Do Web SEs Work?
• Search engines for the general web:
  • Search a database of the full text of web pages selected from billions of web pages.
  • Searching is based on inverted index entries.
• Search engine databases:
  • Full-text documents are collected by software robots (also called softbots or spiders), which navigate the web to collect pages.
  • The web can be viewed as a graph structure.
  • The navigation can be based on DFS (Depth-First Search), BFS (Breadth-First Search), or some combined navigation heuristics (see the crawler sketch after this list).
  • How to detect cycles? A research issue.
  • The indexer then builds inverted index entries and stores them in inverted files.
  • If necessary, the inverted files may be compressed.
• Some types of pages and links are excluded from the search engine:
  • These form the invisible web (maybe many times bigger than the visible web).
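A minimal Python sketch of a breadth-first crawler with cycle detection via a visited set, as described above; the toy link graph stands in for real page fetching and link extraction:

```python
# A minimal sketch of a breadth-first crawler. Cycles and duplicates
# are detected with a visited set. extract_links() is a placeholder
# assumption: a real robot would fetch the page and parse its hrefs.
from collections import deque

def extract_links(url: str) -> list[str]:
    toy_web = {
        "a": ["b", "c"],
        "b": ["a", "c"],      # back-link: would loop without the cycle check
        "c": ["d"],
        "d": [],
    }
    return toy_web.get(url, [])

def bfs_crawl(seed: str, limit: int = 100) -> list[str]:
    visited, order = {seed}, []
    frontier = deque([seed])
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)                 # a real crawler would index here
        for link in extract_links(url):
            if link not in visited:       # cycle/duplicate detection
                visited.add(link)
                frontier.append(link)
    return order

print(bfs_crawl("a"))                     # ['a', 'b', 'c', 'd']
```

Popping from the end of the frontier (a stack instead of a queue) turns this into the depth-first variant shown on the next two slides.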

24. [Figure: Breadth-First Crawling]

25. [Figure: Depth-First Crawling]

26. Web Search Engine System Architecture

27. [Diagram: search engine architecture. Components: Robot, Internet websites, Temporary Storage, Parser, Stopper/Stemmer, Indexer, Inverted Files (can be based on different physical data structures), Logical Document Representation (based on IR models), Retrieval Mechanism, and User Interface.]

28. Distributed Architecture (Example)
• Harvest (http://harvest.sourceforge.net/)
  • A distributed web search engine
  • Distributes the load among different machines
  • The indexer doesn't run on the same machine as the broker or web server.

29. What Makes a SE Good?
• Database of web documents
  • Size of the database
  • Freshness (recency, or up-to-dateness)
  • Types of documents offered
  • Retrieval speed
• The search engine's capabilities
  • Search options
  • Effectiveness of the retrieval mechanism
  • Support for concept-based search (the semantic web)
    • Concept-based search systems try to determine what you mean, not just what you say.
    • Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
• Presentation of the results
  • Keywords highlighted in context
  • Showing a summary of each matching web page

30. Search Engine Example (Google)

31. Google
• The most popular web search engine:
  • Crawls the web (by robots) and stores a local cache of found pages
  • Builds a lexicon of common words
  • For each word, creates an index list of the pages containing it
  • Also uses human-compiled information from the Open Directory
  • Cached links let you see older versions of recently changed pages
• Link analysis system:
  • PageRank heuristic
• Estimated size of the index:
  • 580 million pages visited and recorded
  • Uses link data to reach another 500 million pages (via the link analysis system)
  • A recent estimate is around 4 billion pages (??)
• Index refresh:
  • Updated monthly/weekly, or daily for popular pages
• Serves queries from three data centres (service replication):
  • Service updates are synchronized.
  • Two on the West Coast of the US, one on the East Coast.

32. Google Founders
[Chart: search engine market share, 2001-2004, for Google, Yahoo!, MSN, Lycos, AOL, and AltaVista. Source: WebSideStory]
• Larry Page, Co-founder & President, Products
• Sergey Brin, Co-founder & President, Technology
• PhD students at Stanford
• Google became a public company last year.

33. Google Architecture Overview

34. Google Indexer (term frequencies)

35. Google Lexicon

36. Google Searcher

37. Google Features
• Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
  • Other services also use link popularity, but none to the extent that Google does.
  • Traditional IR (lightly used)
  • Link popularity (heavily used)
• Citation importance ranking (quality of the links pointing at a page)
• Relevancy:
  • Similarity between the query and a page
  • Number of links
  • Link quality
  • Link content
  • Ranking boosts on text styles
• PageRank:
  • Usage simulation & citation importance ranking
  • A user randomly navigates; the process is modelled by a Markov chain.

38. Collecting Links in Google
• Submission (by web promotion):
  • Add-URL page (you may not need to do a "deep" submit).
  • The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
• Crawling and index depth:
  • Google aims to refresh its index on a monthly basis.
  • Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks.
  • This text is associated with the pages the link points at, making it possible for Google to find matching pages even when those pages cannot themselves be indexed.

39. Google Guidelines for Web-submission

40. Deep SubmitPro

41. Link Analysis for Relevancy (1)
• Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project
  • CiteSeer…
  • http://www.almaden.ibm.com/cs/k53/clever.html
• Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
• Number of links:
  • All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
• Link quality:
  • Numbers aren't everything. A single link from an important site might be worth more than many links from relatively unknown sites.
  • Weights page importance: links from important pages are weighted higher.

42. Link Analysis for Relevancy (2)
• Link content:
  • The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need to have many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
• Ranking boosts on text styles:
  • The appearance of terms in bold text, in header text, or in a large font size is all taken into account. None of these are dominant factors, but they do figure into the overall equation.

43. PageRank
• Usage simulation & citation importance ranking:
  • Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
• The user randomly navigates:
  • Jumps to a random page with probability p
  • Follows a random hyperlink from the current page with probability 1 - p
  • Does not go back to a previously visited page by following a previously traversed link backwards
• Google thereby finds a type of universally important page intuitively:
  • Locations that are heavily visited in a random traversal of the web's link structure.

44. PageRank Heuristics
• The process is modelled by the following heuristic: the probability of being at each page is computed, with p set by the system.
  • wj = PageRank of page j
  • ni = number of outgoing links on page i
  • m = number of nodes in G (the number of web pages in the collection)

45. PageRank Illustration
wj = p/m + (1 - p) * Σ (wi / ni), where the sum is over all pages i that link to page j.
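A minimal Python sketch of this heuristic computed by fixed-point iteration; the tiny link graph and the choice p = 0.15 are illustrative assumptions:

```python
# A minimal sketch of the PageRank heuristic above:
#   w_j = p/m + (1 - p) * sum over pages i linking to j of (w_i / n_i)
# computed by simple fixed-point iteration over a toy web graph.

links = {                                  # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
p, m = 0.15, len(links)
w = {page: 1.0 / m for page in links}      # uniform starting weights

for _ in range(50):                        # iterate toward the fixed point
    new_w = {page: p / m for page in links}
    for i, outgoing in links.items():
        n_i = len(outgoing)
        for j in outgoing:
            new_w[j] += (1 - p) * w[i] / n_i
    w = new_w

print(w)   # the most heavily linked-to page (here "c") gets the largest weight
```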

46. Google Spamming
• The link-popularity ranking system leaves Google relatively immune to traditional spamming techniques.
  • It goes beyond the text on pages to decide how good they are. No links, low rank.
• Common spam idea:
  • Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages out across a network of sites.
• The (Evil) Genius of Comment Spammers, by Steven Johnson, WIRED 12.03: http://www.wired.com/wired/archive/12.03/google.html?pg=7

47. http://www.wired.com/wired/archive/12.03/google.html?pg=7

48. Topic Search http://www.google.com/options/index.html

49. Brief Introduction to Semantic Web

50. Machine-Processable Knowledge on the Web
• Unique identity of resources and objects: URIs
• Metadata annotations:
  • Data describing the content and meaning of resources
  • But everyone must speak the same language…
• Terminologies:
  • Shared and common vocabularies
  • But everyone must mean the same thing…
• Ontologies:
  • Shared and common understanding of a domain
  • Essential for the exchange and discovery of knowledge
• Inference:
  • Apply the knowledge in the metadata and the ontology to create new metadata and new knowledge.
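A minimal Python sketch of such machine-processable metadata, using the rdflib library; the example.org namespace and the resource names are hypothetical, purely for illustration:

```python
# A minimal sketch of metadata annotation: a resource gets a unique
# URI, and triples describe its content against a shared vocabulary.
# The example.org namespace and resource names are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/ontology#")    # hypothetical shared vocabulary

g = Graph()
doc = EX.paper42                                  # URI: unique identity of the resource
g.add((doc, RDF.type, EX.Document))               # metadata annotation
g.add((doc, EX.topic, Literal("information retrieval")))

# Query the annotations; an inference engine would derive new triples here.
for subj, pred, obj in g.triples((None, EX.topic, None)):
    print(subj, "has topic", obj)
```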
