This document outlines the fundamentals of Information Retrieval (IR) using the Vector Space Model, which matches a user's information need against the concepts in a document collection. It gives a roadmap of the three stages of IR: indexing, query construction, and retrieval. Documents and queries are represented as vectors of word occurrence counts, similarity is computed as the dot product of those vectors, and the document with the highest score is returned as the best match.
CMSC 11500 Introduction to Computer Programming
November 27, 2002
Information Retrieval: aka “Google-lite”
Roadmap
• Information Retrieval (IR)
  • Goal: Match Information Need to Document Concept
• Solution: Vector Space Model
  • Representation of Documents and Queries
  • Computing Similarity
• Implementation:
  • Indexing: Documents -> Vectors
  • Query Construction: Query -> Vector
  • Retrieval: Finding “Best” match: Query/Document
The Information Retrieval Task
• Goal:
  • Match the information need expressed by the user (the Query)
  • With concepts in documents (the Document collection)
• Issues:
  • How do we represent documents and queries?
  • How do we know if they're “similar”? A match?
Vector Space Model
• Represent documents and queries with
  • Pattern of words
    • i.e., a query matches documents that share many of its words
  • Vector of word occurrences:
    • Each position in vector = word
    • Value of position x in vector = # times word x occurs
• Similarity:
  • Dot product of document vector & query vector
  • Largest dot product wins
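The later slides call dot-product without defining it. A minimal sketch of one way to write it is given here, assuming Racket (the successor to the PLT Scheme dialect that define-struct suggests) and assuming both vectors have the same length; the named-let loop is an illustrative choice, not necessarily the course's version.

```scheme
#lang racket
;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the position-by-position products of two equal-length vectors.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

(dot-product (vector 1 1 0) (vector 1 1 0)) ; => 2 (two shared words)
(dot-product (vector 0 1 1) (vector 1 1 0)) ; => 1 (one shared word)
```

A position contributes to the sum only when the word occurs in both vectors, which is why a larger dot product indicates more shared words.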
Vector Space Model
[Figure: vector space with one axis per word: tv, program, computer]
• Two documents: “computer program”, “tv program”
• Query “computer program”: matches 1st doc exactly: distance = 2 vs. 0
• Query “educational program”: matches both equally: distance = 1
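The numbers in this example can be checked by hand. The sketch below makes three assumptions that the slide does not state: the word order is tv, program, computer; “educational” is not in the dictionary and is therefore ignored; and “distance” means squared Euclidean distance, which happens to reproduce the slide's 2 vs. 0 and 1. The dot products are shown alongside for comparison. The code assumes Racket.

```scheme
#lang racket
;; Assumed word order: position 0 = tv, 1 = program, 2 = computer.
(define d1 (vector 0 1 1)) ; "computer program"
(define d2 (vector 1 1 0)) ; "tv program"
(define q1 (vector 0 1 1)) ; query "computer program"
(define q2 (vector 0 1 0)) ; query "educational program", unknown word dropped

;; dot-product: sum of position-by-position products.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1) (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; squared-distance: sum of squared position-by-position differences.
(define (squared-distance v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (let ((diff (- (vector-ref v1 i) (vector-ref v2 i))))
          (loop (+ i 1) (+ sum (* diff diff)))))))

(list (squared-distance q1 d1) (squared-distance q1 d2)) ; => (0 2)
(list (squared-distance q2 d1) (squared-distance q2 d2)) ; => (1 1)
(list (dot-product q1 d1) (dot-product q1 d2))           ; => (2 1)
(list (dot-product q2 d1) (dot-product q2 d2))           ; => (1 1)
```

Under either measure the first query prefers the first document and the second query ties, so the two readings agree on the ranking in this small case.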
Information Retrieval in Scheme
• Representation:
  • A vector-rep is (vectorof number)
  • (define-struct doc-rep (id vec))
  • A doc is (make-doc-rep id vec)
    • Where id: symbol; vec: vector-rep
  • A doc-index is (listof doc)
  • A query is vector-rep
  • A simple-web-page (swp) is:
    • (make-swp h b)
    • Where (define-struct swp (h b)); h: symbol; b: (listof symbol)
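As an illustration of these data definitions, the sketch below builds one web page and one document representation and reads their fields back. It assumes Racket, where define-struct generates the make- constructor and the field accessors; the example page, its id, and the word order behind the vector (tv, program, computer) are made up for illustration.

```scheme
#lang racket
(define-struct doc-rep (id vec)) ; id: symbol; vec: vector-rep
(define-struct swp (h b))        ; h: symbol; b: (listof symbol)

;; A hypothetical page whose body is the two words "computer program".
(define page (make-swp 'page1 '(computer program)))

;; Its document representation, assuming positions tv, program, computer.
(define doc (make-doc-rep 'page1 (vector 0 1 1)))

(swp-h page)      ; => page1
(swp-b page)      ; => (computer program)
(doc-rep-id doc)  ; => page1
(doc-rep-vec doc) ; => #(0 1 1)
```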
Three Steps to IR
• Three phases:
  • Indexing: Build collection of document representations
    • Convert web pages to doc-rep
    • Vectors of word counts
  • Query construction:
    • Convert query text to vector of word counts
  • Retrieval:
    • Compute similarity between query and doc representation
    • Return closest match
Words-to-vector
;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
;; Adds one to the count of each word in wlist; mutates and returns wvec.
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (let ((cur-count (vector-ref wvec wpos)))
             (vector-set! wvec wpos (+ cur-count 1))
             (words-to-vector (cdr wlist) wvec))))))

;; posn: looks up the vector position of wd in dict;
;; signals an error if wd is not in the dictionary.
(define (posn wd dict)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
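The slides do not show dict, map-wd, map-num, or dict-size. The sketch below fills them in so the conversion can be run. It assumes Racket, and it represents each dictionary entry as a (word . position) pair with map-wd and map-num written as plain functions; that representation and the three-word dictionary are assumptions for illustration.

```scheme
#lang racket
;; Hypothetical dictionary: each entry pairs a word with its vector position.
(define dict (list (cons 'tv 0) (cons 'program 1) (cons 'computer 2)))
(define dict-size (length dict))

;; Assumed accessors for a dictionary entry.
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))

;; posn: position of wd in dict, or an error if wd is absent.
(define (posn wd dict)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: bumps the count at each word's position in wvec.
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (vector-set! wvec wpos (+ (vector-ref wvec wpos) 1))
           (words-to-vector (cdr wlist) wvec)))))

;; "program" occurs twice and "tv" once, so the counts are 1, 2, 0.
(define counts (words-to-vector '(program tv program) (make-vector dict-size 0)))
(vector->list counts) ; => (1 2 0)
```

Because words-to-vector mutates the vector it is given, each document and query needs its own fresh vector from make-vector.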
Indexing
;; build-index: (listof swp) -> (listof doc-rep)
;; Convert text of web pages to list of vector document reps
(define (build-index swp-list)
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-h (car swp-list))
                             (words-to-vector (swp-b (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))
Query Construction
;; build-query: (listof symbol) -> vector-rep
;; Convert query text to vector of word occurrence counts
(define (build-query wlist)
  (words-to-vector wlist (make-vector dict-size 0)))
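Retrieval
;; retrieve: vector-rep (listof doc-rep) -> symbol
;; Finds id of document with best match with query
;; Note: max is used here as a helper that takes a scoring function and a list
;; and returns the highest-scoring element; Scheme's built-in max accepts only numbers.
(define (retrieve query index)
  (doc-rep-id
   (max (lambda (doc) (dot-product (doc-rep-vec doc) query))
        index)))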
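The retrieve function relies on a max that takes a scoring function and a list, and on dot-product; neither is defined in the slides. The sketch below puts the three phases together so the pipeline can be run end to end. It assumes Racket. The helper name best-doc, the three-word dictionary, the assq-based posn, and the use of map in build-index are illustrative choices rather than the course's code.

```scheme
#lang racket
(define-struct doc-rep (id vec))
(define-struct swp (h b))

;; Hypothetical dictionary of (word . position) pairs.
(define dict '((tv . 0) (program . 1) (computer . 2)))
(define dict-size (length dict))

;; posn: position of wd in dict (assq-based variant of the slide's lookup).
(define (posn wd)
  (let ((entry (assq wd dict)))
    (if entry (cdr entry) (error "missing word"))))

(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist))))
           (vector-set! wvec wpos (+ (vector-ref wvec wpos) 1))
           (words-to-vector (cdr wlist) wvec)))))

;; Indexing: one doc-rep per page, each with its own fresh count vector.
(define (build-index swp-list)
  (map (lambda (page)
         (make-doc-rep (swp-h page)
                       (words-to-vector (swp-b page) (make-vector dict-size 0))))
       swp-list))

;; Query construction: same conversion as for documents.
(define (build-query wlist)
  (words-to-vector wlist (make-vector dict-size 0)))

(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1) (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; best-doc: hypothetical helper standing in for the slide's max.
;; Returns the highest-scoring element of a non-empty list; ties keep the earlier one.
(define (best-doc score docs)
  (let loop ((best (car docs)) (remaining (cdr docs)))
    (cond ((null? remaining) best)
          ((> (score (car remaining)) (score best))
           (loop (car remaining) (cdr remaining)))
          (else (loop best (cdr remaining))))))

;; Retrieval: id of the document whose vector has the largest dot product with query.
(define (retrieve query index)
  (doc-rep-id
   (best-doc (lambda (doc) (dot-product (doc-rep-vec doc) query)) index)))

(define index
  (build-index (list (make-swp 'doc1 '(computer program))
                     (make-swp 'doc2 '(tv program)))))

(retrieve (build-query '(computer program)) index) ; => doc1
(retrieve (build-query '(tv program)) index)       ; => doc2
```

With a raw dot product, longer documents tend to score higher simply because they contain more words; this sketch, like the slides, does not normalize for length.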