Information Retrieval

Information Retrieval • Yu Hong and Heng Ji • jih@rpi.edu • October 15, 2014

Outline • Introduction • IR Approaches and Ranking • Query Construction • Document Indexing • IR Evaluation • Web Search • INDRI

Information

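Basic Function of Information • Information = transmission of thought • [Figure: the sender encodes thoughts into words and words into sounds; the recipient decodes sounds into words and words into thoughts. Sounds travel by speech, words by writing, thoughts by telepathy?]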

Information Theory • Better called “communication theory” • Developed by Claude Shannon in the 1940s • Concerned with the transmission of electrical signals over wires • How do we send information quickly and reliably? • Underlies modern electronic communication: • Voice and data traffic… • Over copper, fiber optic, wireless, etc. • Famous result: Channel Capacity Theorem • Formal measure of information in terms of entropy • Information = “reduction in surprise”
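
As a small illustration of "information in terms of entropy", the sketch below computes Shannon entropy in bits for a discrete distribution. The function name `entropy_bits` and the example distributions are assumptions made for illustration; they are not part of the slides.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution, in bits."""
    # Outcomes with probability 0 contribute nothing, so they are skipped.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])    # hardest two-way outcome to predict
biased_coin = entropy_bits([0.9, 0.1])  # mostly predictable, so less surprise
certain = entropy_bits([1.0])           # a known outcome carries no information
```

The more predictable the source, the lower the entropy, which matches the reading of information as "reduction in surprise".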

The Noisy Channel Model • Information transmission = producing at the destination the same message that was sent at the source • The message must be encoded for transmission across a medium (called the channel) • But the channel is noisy and can distort the message • [Figure: Source → message → Transmitter → channel (with noise) → Receiver → message → Destination]

A Synthesis • Information retrieval as communication over time and space, across a noisy channel • [Figure: the noisy channel diagram relabeled for IR: Sender (Source) → message → Encoding (Transmitter; indexing/writing) → storage (channel, with noise) → Decoding (Receiver; acquisition/reading) → message → Recipient (Destination)]

What is Information Retrieval? • Most people equate IR with web-search • highly visible, commercially successful endeavors • leverage 3+ decades of academic research • IR: finding any kind of relevant information • web-pages, news events, answers, images, … • “relevance” is a key notion

Interesting Examples • Google image search: http://images.google.com/ • Google video search: http://video.google.com/ • People search: http://www.intelius.com • Social network search: http://arnetminer.org/

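IR System • [Figure: a document corpus and a query string enter the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). In the communication view, the sender's message is encoded by indexing/writing into storage, a noisy channel, and decoded by acquisition/reading for the recipient]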

The IR Black Box • [Figure: Documents and Query go into the black box; Results come out]

Inside the IR Black Box • [Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; the Comparison Function matches the Query Representation against the Index and produces Results]

Building the IR Black Box • Fetching model • Comparison model • Representation Model • Indexing Model

Building the IR Black Box • Fetching models • Crawling model • Gentle Crawling model • Comparison models • Boolean model • Vector space model • Probabilistic models • Language models • PageRank • Representation Models • How do we capture the meaning of documents? • Is meaning just the sum of all terms? • Indexing Models • How do we actually store all those words? • How do we access indexed terms quickly?

Fetching model: Crawling • [Figure labels: Documents, Search Engines, Web pages]

Crawling • [Figure: the IR black box with a Fetching Function in front: World Wide Web → Fetching Function → Documents → Representation Function → Document Representation → Index; Query → Representation Function → Query Representation; the Comparison Function produces Results]

Fetching model: Crawling • Q1: How many web pages should we fetch? • As many as we can: more web pages = richer knowledge = a more intelligent search engine • [Figure: document corpus and query string → IR system → ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

Fetching model: Crawling • Q1: How many web pages should we fetch? • As many as we can. • The fetching model enriches the knowledge in the brain of the search engine • [Figure: the Fetching Function feeds the IR system, which says: “I know everything now, hahahahaha!”]

Fetching model: Crawling • Q2: How to fetch the web pages? • First, we should know the basic network structure of the web • Basic structure: nodes and links (hyperlinks) • [Figure: the World Wide Web and its basic structure of nodes and links]

Fetching model: Crawling • Q2: How to fetch the web pages? • A crawling program (crawler) visits each node in the web by following hyperlinks. • [Figure: the IR system and the basic network structure]

Fetching model: Crawling • Q2: How to fetch the web pages? • Q2-1: What are the known nodes? • Nodes whose addresses the crawler knows • The nodes are web pages • So the addresses are URLs (URL: Uniform Resource Locator) • Such as: www.yahoo.com, www.sohu.com, www.sina.com, etc. • Q2-2: What are the unknown nodes? • Nodes whose addresses the crawler does not know • The seed nodes are the known ones • Before dispatching the crawler, the search engine gives it the addresses of some web pages. These pages are the earliest known nodes (the so-called seeds)

Fetching model: Crawling • Q2: How to fetch the web pages? • Q2-3: How can the crawler find the unknown nodes? • [Figure, three animation steps: the crawler (“I can do this. Believe me.”) starts at a known node whose document links to unknown nodes; a parser reads the document; the linked nodes become known (“Good news for me.”)]

Fetching model: Crawling • Q2: How to fetch the web pages? • Q2-3: How can the crawler find the unknown nodes? • If you introduce a web page to the crawler (let it know the web address), the crawler will use a source code parser to mine many new web pages from it. Of course, the crawler then knows their addresses. • But if you don’t tell the crawler anything, it will go on strike because it can do nothing. • That is why we need the seed nodes (seed web pages) to awaken the crawler. • [Crawler: “Give me some seeds.”]

Fetching model: Crawling • Q2: How to fetch the web pages? • To traverse the whole network of the web, the crawler needs some auxiliary equipment. • A register with a FIFO (first in, first out) data structure, such as a queue • An Access Control Program (ACP) • A Source Code Parser (SCP) • Seed nodes • [Figure: the crawler (“I need some equipment.”) with its FIFO register, ACP, and SCP]

Fetching model: Crawling • Q2: How to fetch the web pages? • Robotic crawling procedure (only five steps) • Initialization: push the seed nodes (known web pages) into the empty queue • Step 1: Take a node out of the queue (FIFO) and visit it (ACP) • Step 2: Steal the necessary information from the source code of the node (SCP) • Step 3: Send the stolen text information (title, text body, keywords, and language) back to the search engine for storage (ACP) • Step 4: Push the newly found nodes into the queue • Step 5: Repeat Steps 1-4 • [Crawler: “I am working now.”]
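
The crawling procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a real crawler: the "web" is a hypothetical in-memory dictionary `TOY_WEB` instead of HTTP access, the class `LinkAndTitleParser` stands in for the SCP, only the page title is stored, and `max_visits` is an added safety budget that is not part of the five steps.

```python
from collections import deque
from html.parser import HTMLParser

class LinkAndTitleParser(HTMLParser):
    """Stand-in for the Source Code Parser (SCP): collects the title and hyperlinks."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical toy web: address -> HTML source. A real ACP would fetch over the network.
TOY_WEB = {
    "seed.example": '<title>Seed</title><a href="a.example">A</a><a href="b.example">B</a>',
    "a.example": '<title>Page A</title><a href="c.example">C</a>',
    "b.example": "<title>Page B</title>",
    "c.example": "<title>Page C</title>",
}

def crawl(seeds, fetch, max_visits=100):
    queue = deque(seeds)                    # Initialization: push the seed nodes
    store = {}                              # what gets sent back to the search engine
    visits = 0
    while queue and visits < max_visits:    # Step 5: repeat while the queue is not empty
        url = queue.popleft()               # Step 1: take a node out of the queue (FIFO)
        source = fetch(url)                 #         and visit it
        visits += 1
        if source is None:                  # address that cannot be fetched
            continue
        parser = LinkAndTitleParser()
        parser.feed(source)                 # Step 2: extract information from the source
        store[url] = parser.title           # Step 3: send the text back for storage
        queue.extend(parser.links)          # Step 4: push the newly found nodes
    return store

index = crawl(["seed.example"], TOY_WEB.get)
```

Because the queue is FIFO, pages are visited breadth-first from the seeds. The sketch keeps no record of pages it has already visited, so on a web with cycles only the `max_visits` budget stops it.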

Fetching model: Crawling • Q2: How to fetch the web pages? • Through these steps, the number of known nodes grows continuously • This is the underlying reason why the crawler can traverse the whole web • The crawler keeps working until the register is empty • By the time the register is empty, the information of all nodes in the web has been stolen and stored on the server of the search engine. • [Figure: register slots that start with three seeds and fill up with new nodes; caption: “I control this.”]

Fetching model: Crawling • Problems • 1) In practice, the crawler cannot traverse the whole web. • For example, it enters an infinite loop when it falls into a partially closed circle of links (a snare) in the web • [Figure: nodes linked in a closed loop]

Fetching model: Crawling • Problems • 2) Crude crawling. • A portal web site puts a series of homologous nodes (pages of the same site) into the register. Following the FIFO rule, the crawler visits these nodes one after another, so it keeps hitting the server they share. This is crude crawling. • [Figure: a class of homologous web pages linking to a portal site, shown in the network of the web and as register slots: https://www.yahoo.com, https://screen.yahoo.com/live/, https://games.yahoo.com/, https://mobile.yahoo.com/, https://groups.yahoo.com/neo, https://answers.yahoo.com/, http://finance.yahoo.com/, https://weather.yahoo.com/, https://autos.yahoo.com/, https://shopping.yahoo.com/, https://www.yahoo.com/health, https://www.yahoo.com/food, https://www.yahoo.com/style]

Fetching model: Crawling • Homework • 1) How can we overcome the infinite loop caused by a partially closed circle of links in the web? • 2) Please find a way to crawl the web like a gentleman (not crudely). • Please select one of the problems as the topic of your homework. A short paper is required, no more than 500 words. Please include at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm. • Send it to me. Email: tianxianer@gmail.com • Thanks.

[Figure: the IR black box again (Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Comparison Function → Results), with part of the diagram marked “Ignore Now”]

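A heuristic formula for IR (Boolean model) • Rank docs by similarity to the query • suppose the query is “spiderman film” • Relevance = # query words in the doc • favors documents with both “spiderman” and “film” • mathematically: $\mathrm{sim}(D,Q) = \sum_{q \in Q} 1_{q \in D}$, where $1_{q \in D}$ is 1 if $q$ occurs in $D$ and 0 otherwise • Logical variations (set-based) • Boolean AND (require all words): $\mathrm{sim}(D,Q) = \prod_{q \in Q} 1_{q \in D}$ • Boolean OR (any of the words): $\mathrm{sim}(D,Q) = 1 - \prod_{q \in Q} (1 - 1_{q \in D})$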
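
The count-based score and its two Boolean variants can be sketched as follows. The function names and the three toy documents are assumptions made for illustration; tokenization is plain whitespace splitting.

```python
def count_score(query_terms, doc_terms):
    """Relevance = number of distinct query words that appear in the document."""
    doc = set(doc_terms)
    return sum(1 for q in set(query_terms) if q in doc)

def boolean_and(query_terms, doc_terms):
    """1 if every query word appears in the document, else 0."""
    doc = set(doc_terms)
    return int(all(q in doc for q in query_terms))

def boolean_or(query_terms, doc_terms):
    """1 if at least one query word appears in the document, else 0."""
    doc = set(doc_terms)
    return int(any(q in doc for q in query_terms))

query = "spiderman film".split()
docs = {  # hypothetical documents
    "d1": "the new spiderman film opens friday".split(),
    "d2": "a documentary film about spiders".split(),
    "d3": "stock markets fell today".split(),
}
scores = {name: count_score(query, words) for name, words in docs.items()}
and_hits = [name for name, words in docs.items() if boolean_and(query, words)]
or_hits = [name for name, words in docs.items() if boolean_or(query, words)]
```

Here `d1` contains both query words, `d2` only “film”, and `d3` neither, so the count score orders them `d1`, `d2`, `d3`, while AND keeps only `d1` and OR keeps `d1` and `d2`.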

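Term Frequency (TF) • Observation: • key words tend to be repeated in a document • Modify our similarity measure: • give more weight if a word occurs multiple times: $\mathrm{sim}(D,Q) = \sum_{q \in Q} tf_D(q)$, where $tf_D(q)$ is the number of occurrences of $q$ in $D$ • Problem: • biased towards long documents • spurious occurrences • normalize by length: $\mathrm{sim}(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|}$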

Inverse Document Frequency (IDF) • Observation: • rare words carry more meaning: cryogenic, apollo • frequent words are linguistic glue: of, the, said, went • Modify our similarity measure: • give more weight to rare words … but don’t be too aggressive (why?): $\mathrm{sim}(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)}$ • tf_D(q) … number of occurrences of q in D; |D| … length of D • |C| … total number of documents • df(q) … total number of documents that contain q
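
A minimal sketch of a length-normalized TF score weighted by IDF follows. It assumes the natural logarithm, whitespace tokenization, and a hypothetical four-document collection; the function name `tf_idf_score` is illustrative.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, collection):
    """Sum over query words of (tf / document length) * log(|C| / df)."""
    tf = Counter(doc_terms)
    n_docs = len(collection)                           # |C|
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in collection if q in d)      # documents that contain q
        if df == 0 or tf[q] == 0:
            continue                                   # the word contributes nothing
        score += (tf[q] / len(doc_terms)) * math.log(n_docs / df)
    return score

collection = [  # hypothetical documents
    "the cryogenic labs said the test went well".split(),
    "the crew of the apollo mission".split(),
    "the senator said the vote went ahead".split(),
    "the weather".split(),
]
rare = [tf_idf_score(["cryogenic"], d, collection) for d in collection]
glue = [tf_idf_score(["the"], d, collection) for d in collection]
```

The word “the” occurs in every document, so its IDF is log(1) = 0 and it never affects the ranking, while “cryogenic” occurs in one document and receives the largest weight. The logarithm keeps that weight from growing linearly with rarity.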

TF normalization • Observation: • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic} • which document is more relevant? • which one is ranked higher? (df(labs) > df(cryogenic)) • Correction: • first occurrence more important than a repeat (why?) • “squash” the linearity of TF, for example with a saturating weight such as $\frac{tf}{tf + K}$ for a constant $K$ • [Figure: plot over tf = 1, 2, 3]
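
The D1/D2 observation can be checked numerically. The sketch below assumes the query is “cryogenic labs”, a hypothetical collection of 1000 documents with the document frequencies shown, and one possible saturating function, tf / (tf + K) with K = 1. None of these values come from the slides.

```python
import math
from collections import Counter

N_DOCS = 1000                           # hypothetical collection size |C|
DF = {"cryogenic": 10, "labs": 100}     # hypothetical: df(labs) > df(cryogenic)
K = 1.0                                 # hypothetical saturation constant

def score(query, doc, squash):
    """IDF-weighted score with either linear TF or squashed TF = tf / (tf + K)."""
    tf = Counter(doc)
    total = 0.0
    for q in query:
        weight = tf[q] / (tf[q] + K) if squash else tf[q]
        total += weight * math.log(N_DOCS / DF[q])
    return total

query = ["cryogenic", "labs"]
d1 = ["cryogenic", "labs"]
d2 = ["cryogenic", "cryogenic"]

linear = {"D1": score(query, d1, squash=False), "D2": score(query, d2, squash=False)}
squashed = {"D1": score(query, d1, squash=True), "D2": score(query, d2, squash=True)}
```

With linear TF, the repeated rare word lets D2 outrank D1 even though D1 matches both query words. With the squashed TF, the second occurrence of “cryogenic” adds less than the first, and D1 ranks higher.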

State-of-the-art Formula • More query words → good • Repetitions of query words → good • Common words less important • Penalize very long documents
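
One formula with all four properties is a BM25-style score. The version below is an assumption offered for illustration, not necessarily the formula on the slide; the constant $K$ and the average document length $\mathrm{avgdl}$ are introduced here.

```latex
\mathrm{sim}(D,Q) = \sum_{q \in Q \cap D}
  \frac{tf_D(q)}{tf_D(q) + K \cdot \frac{|D|}{\mathrm{avgdl}}}
  \cdot \log \frac{|C|}{df(q)}
```

The sum over $Q \cap D$ rewards documents that match more query words, the $tf_D(q)$ fraction rewards repetitions with diminishing returns, $|D| / \mathrm{avgdl}$ in the denominator penalizes very long documents, and the logarithm down-weights common words.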

Strengths and Weaknesses • Strengths • Precise, if you know the right strategies • Precise, if you have an idea of what you’re looking for • Implementations are fast and efficient • Weaknesses • Users must learn Boolean logic • Boolean logic insufficient to capture the richness of language • No control over size of result set: either too many hits or none • When do you stop reading? All documents in the result set are considered “equally good” • What about partial matches? Documents that “don’t quite match” the query may be useful also

Vector-space approach to IR • Assumption: Documents that are “close together” in vector space “talk about” the same things • Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”) • [Figure: documents made of the words cat, pig, and dog drawn as vectors on cat, pig, and dog axes, with angle θ between two vectors]

Some formulas for Similarity • Dot product • Cosine • Dice • Jaccard • [Figure: document vector D and query vector Q in the plane of terms t1 and t2]
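
For reference, these are the standard definitions of the four measures for weighted term vectors. The notation $D = (d_1, \dots, d_t)$ and $Q = (q_1, \dots, q_t)$ is assumed here.

```latex
\begin{aligned}
\text{Dot product:}\quad & \mathrm{sim}(D,Q) = \sum_{i=1}^{t} d_i q_i \\
\text{Cosine:}\quad & \mathrm{sim}(D,Q) =
  \frac{\sum_{i} d_i q_i}{\sqrt{\sum_{i} d_i^2}\,\sqrt{\sum_{i} q_i^2}} \\
\text{Dice:}\quad & \mathrm{sim}(D,Q) =
  \frac{2 \sum_{i} d_i q_i}{\sum_{i} d_i^2 + \sum_{i} q_i^2} \\
\text{Jaccard:}\quad & \mathrm{sim}(D,Q) =
  \frac{\sum_{i} d_i q_i}{\sum_{i} d_i^2 + \sum_{i} q_i^2 - \sum_{i} d_i q_i}
\end{aligned}
```

Cosine, Dice, and Jaccard all normalize the dot product by the vector lengths, so a long document does not score higher simply because it has larger weights.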

An Example • A document space is defined by three terms: • hardware, software, users • the vocabulary • A set of documents is defined as: • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1) • If the query is “hardware and software” • what documents should be retrieved?

An Example (cont.) • In Boolean query matching: • document A4, A7 will be retrieved (“AND”) • retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) • In similarity matching (cosine): • q=(1, 1, 0) • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 • Document retrieved set (with ranking)= • {A4, A7, A1, A2, A5, A6, A8, A9}
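
The Boolean and cosine results for this example can be reproduced with a short script. It is a sketch that uses the document vectors and query from the example; the helper name `cosine` and the rounding to two decimals are choices made here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term order: (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # "hardware and software"

# Boolean matching on the two query terms
and_hits = [name for name, vec in docs.items() if vec[0] and vec[1]]
or_hits = [name for name, vec in docs.items() if vec[0] or vec[1]]

# Cosine matching; sorted() is stable, so ties keep the listing order
scores = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}
ranking = sorted(scores, key=lambda name: -scores[name])
retrieved = [name for name in ranking if scores[name] > 0]
```

Unlike the Boolean sets, the cosine ranking orders the partial matches: A1 and A2, which contain one query term and nothing else, rank above A5, A6, A8, and A9, which contain one query term plus an unrelated one.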

Probabilistic model • Given D, estimate P(R|D) and P(NR|D) • P(R|D) = P(D|R) * P(R) / P(D); since P(D) and P(R) are constant, it suffices to estimate P(D|R) • D = {t1=x1, t2=x2, …}
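
One common way to turn these quantities into a ranking is to compare the odds of relevance. The derivation below is a sketch; the term-independence factorization in the second line is an assumption added here and is not stated on the slide.

```latex
\begin{aligned}
\frac{P(R \mid D)}{P(NR \mid D)}
  &= \frac{P(D \mid R)\,P(R)}{P(D \mid NR)\,P(NR)}
  \;\propto\; \frac{P(D \mid R)}{P(D \mid NR)} \\
P(D \mid R) &= \prod_{i} P(t_i = x_i \mid R)
  \qquad \text{(assumed term independence)}
\end{aligned}
```

$P(D)$ cancels in the ratio and $P(R)/P(NR)$ is the same for every document, so only the likelihoods $P(D \mid R)$ and $P(D \mid NR)$ affect the order of the results.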
