Digital Days
Presentation Transcript
Digital Days: Information Retrieval
Bassiou Nikoletta, Artificial Intelligence and Information Analysis Lab, Aristotle University of Thessaloniki, Informatics Department
Information Retrieval
• Introduction
• Models / Techniques
• Evaluation of Results
• Clustering
• References
Introduction
• Research in developing algorithms and models for retrieving information from document repositories (document / text retrieval)
• Main activities:
  • Indexing: representation of documents
  • Searching: how documents are examined to determine whether they are relevant
Models / Techniques
• Logical
• Vector Processing
• Probabilistic
• Cognitive
Models / Techniques (cont.)
• Logical (Boolean) Model
  • Documents: represented by index terms or keywords
  • Requests: logical combinations (AND, OR, NOT) of these terms
  • A document is retrieved when it satisfies the logical expression of the request
  • Example: D1 = {A, B}, D2 = {B, C}, D3 = {A, B, C}; Q = A AND B AND NOT C; answer = {D1}
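The Boolean example above can be sketched in a few lines (a minimal illustration; only the document and query names come from the slide):

```python
# Minimal Boolean retrieval over the slide's example collection.
docs = {
    "D1": {"A", "B"},
    "D2": {"B", "C"},
    "D3": {"A", "B", "C"},
}

# Query Q = A AND B AND NOT C as a predicate on a document's term set.
def q(terms):
    return "A" in terms and "B" in terms and "C" not in terms

answer = sorted(name for name, terms in docs.items() if q(terms))
print(answer)  # ['D1']
```

D1 satisfies all three conditions; D2 lacks A and D3 contains C, so only D1 is retrieved.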
Models / Techniques (cont.)
• Logical (Boolean) Model (cont.)
  • Drawbacks:
    • Formulating the query is difficult; trained intermediaries often have to search on behalf of the user
    • Results: the database is partitioned into two disjoint subsets, with no mechanism for ranking documents by decreasing probability of relevance
    • All query terms are considered equal: they are either present or absent
    • Closed World Assumption: the absence of an index term from a document is taken to mean that the term is false for that document
    • These drawbacks motivated the development of fuzzy set models
Models / Techniques (cont.)
• Vector Processing Model
  • Documents and queries are represented in a high-dimensional space
  • Each dimension corresponds to a word in the document collection
  • Most relevant documents for a query: the documents represented by the vectors closest to the query
Models / Techniques (cont.)
• Vector Processing Model (cont.)
  • Document: t-dimensional vector Di = (di1, di2, …, dit), where dij is the weight of the j-th term; dij = 0 when the j-th term is absent from document Di
  • Indexing of documents: number of term occurrences in a document, number of documents in which each term is present, or other measures
  • Query: Qj = (qj1, qj2, …, qjt)
Models / Techniques (cont.)
• Vector Processing Model (cont.)
  • Similarity Computation:
    • Inner Product
    • Cosine
  • When applied to normalized vectors, the cosine measure yields the same ranking of similarities as the Euclidean distance
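The two similarity measures named above can be sketched as follows (the weight vectors are hypothetical toy values, not taken from the slides):

```python
import math

# Inner product of a document vector Di and a query vector Qj.
def inner(d, q):
    return sum(di * qi for di, qi in zip(d, q))

# Cosine: the inner product normalized by the vector lengths.
def cosine(d, q):
    return inner(d, q) / (math.sqrt(inner(d, d)) * math.sqrt(inner(q, q)))

doc = [2.0, 1.0, 0.0]    # term weights di1..di3
query = [1.0, 1.0, 1.0]  # query weights qj1..qj3

print(inner(doc, query))             # 3.0
print(round(cosine(doc, query), 4))  # 0.7746
```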
Models / Techniques (cont.)
• Vector Processing Model (cont.)
  • Drawbacks
    • Using the indexing terms to define the dimensions of the space involves the incorrect assumption that the terms are orthogonal
    • Practical limitations: several query terms are needed to obtain a discriminating ranking, while in Boolean models two or three ANDed terms are enough
    • Difficulty of explicitly specifying synonym and phrasal relationships
Models / Techniques (cont.)
• Probabilistic Model
  • Probability Ranking Principle: rank documents in order of decreasing probability of relevance to the user's information need
  • Term-Weight Specification: selectivity; what makes a good term good is whether it can pick out the few relevant documents from the many non-relevant ones
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Collection Frequency: terms occurring in few documents are more valuable
    n: the number of documents term t(i) occurs in
    N: the number of documents in the collection
  • Term Frequency: terms occurring more often in a document are more likely to be important for that document
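The collection frequency weight implied here is conventionally written as follows; this is the standard inverse-document-frequency form used by Robertson and Sparck Jones (cited in the references), reconstructed because the slide's own formula was lost in transcription:

```latex
% Collection frequency weight of term t(i):
% n = number of documents t(i) occurs in, N = collection size.
CFW(i) = \log \frac{N}{n}
```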
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Term Frequency
  • Document Length
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Normalized Document Length: used in the normalization of Term Frequency
  • Combined Weight: a combination of the above weight measures, used for score calculation
    k1 (= 2): controls the extent of Term Frequency's influence
    b (= 0.75): controls the extent of Document Length's influence
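The formulas lost from this slide can be reconstructed from Robertson and Sparck Jones's "Simple, proven approaches to text retrieval" (cited in the references), which uses exactly these tuning constants k1 = 2 and b = 0.75:

```latex
% Normalized document length, with DL(j) the length of document d(j):
NDL(j) = \frac{DL(j)}{\text{average } DL \text{ over the collection}}

% Combined weight of term t(i) in document d(j):
CW(i,j) = \frac{CFW(i) \cdot TF(i,j) \cdot (k_1 + 1)}
               {k_1 \cdot \left((1-b) + b \cdot NDL(j)\right) + TF(i,j)}
```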
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Term-Weighting Components
  • Typical term-weighting formulas
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Iterative searching: Term Reweighting / Query Expansion
  • Relevance weighting: relates the relevant and the non-relevant documents for a search term
    r: the number of known relevant documents term t(i) appears in
    R: the number of known relevant documents for a request
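The relevance weight referred to here has the following standard form, again following Robertson and Sparck Jones (the slide's formula image was lost):

```latex
% Relevance weight of term t(i):
% N = collection size, n = documents containing t(i),
% R = known relevant documents, r = known relevant documents containing t(i).
RW(i) = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}
                  {(n - r + 0.5)(R - r + 0.5)}
```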
Models / Techniques (cont.)
• Probabilistic Model (cont.)
  • Iterative Combination
  • Query Expansion: adding to a query new search terms taken from documents assessed as relevant
Models / Techniques (cont.)
• Cognitive Model
  • Focus on:
    • the user's information-seeking behaviour
    • the ways in which IR systems are used in operational environments
  • Experiments on the way in which a user's information needs may change during interaction with the IR system, leading to more flexible interfaces
Evaluation of Results
• Precision: proportion of retrieved documents that are relevant
• Recall: proportion of relevant documents that are retrieved
• Fallout: proportion of non-relevant documents that are retrieved
• Generality: proportion of relevant documents within the entire collection
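The four measures can be illustrated on a small hypothetical retrieval run (the document IDs and counts below are invented for the example):

```python
# Hypothetical retrieval run over a 10-document collection.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}       # relevant documents in the collection
retrieved = {2, 3, 5}         # documents returned for the query

hits = relevant & retrieved   # relevant AND retrieved

precision = len(hits) / len(retrieved)                            # 2/3
recall = len(hits) / len(relevant)                                # 2/4
fallout = len(retrieved - relevant) / len(collection - relevant)  # 1/6
generality = len(relevant) / len(collection)                      # 4/10

print(precision, recall, fallout, generality)
```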
Evaluation of Results (cont.)
• Example
Evaluation of Results (cont.)
• Precision-Recall graph
Evaluation of Results (cont.)
• Three-point average precision: averaging the precision at three fixed recall levels
• Eleven-point average precision: averaging the precision at eleven recall levels (0.0, 0.1, …, 1.0)
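A common way to compute the eleven-point figure uses interpolated precision, where the precision at recall level r is taken as the maximum precision observed at any recall >= r (the (recall, precision) points below are hypothetical):

```python
# Hypothetical (recall, precision) points from a ranked retrieval run.
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.38)]

def interp_precision(level):
    # Interpolated precision: best precision at any recall >= level.
    candidates = [p for r, p in points if r >= level]
    return max(candidates) if candidates else 0.0

levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
avg11 = sum(interp_precision(l) for l in levels) / len(levels)
print(round(avg11, 3))  # 0.635
```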
Clustering
• Non-Exclusive (Overlapping): e.g. fuzzy clustering, where each object has a degree of belongingness to each cluster
• Exclusive
  • Extrinsic (Supervised)
  • Intrinsic (Unsupervised): Agglomerative / Divisive
    • Hierarchical: nested sequence of partitions
    • Partitional: single partition
Clustering (cont.)
• Hierarchical: transformation of the proximity matrix (similarity/dissimilarity indices) into a sequence of nested partitions
• Threshold graph G(v) for each dissimilarity level v: an edge is inserted between nodes i and j if objects i and j are less dissimilar than v, i.e. (i, j) ∈ G(v) if and only if d(i, j) ≤ v
Clustering (cont.)
• Single-Link Clustering Algorithm
  • Place every object in a unique cluster (G(0)). Set k ← 1.
  • Form G(k): if the number of components (maximally connected subgraphs) in G(k) is less than the number of clusters in the current clustering, redefine the current clustering by naming each component of G(k) as a cluster.
  • If G(k) consists of a single connected graph, stop. Else set k ← k + 1 and go to the previous step.
Clustering (cont.)
• Complete-Link Clustering Algorithm
  • Place every object in a unique cluster (G(0)). Set k ← 1.
  • Form G(k): if two of the current clusters form a clique (maximally complete subgraph) in G(k), redefine the current clustering by merging these two clusters into a single cluster.
  • If k = n(n-1)/2, so that G(k) is the complete graph on the n nodes, stop. Else set k ← k + 1 and go to the previous step.
Clustering (cont.)
• Example
Clustering (cont.)
• Other Algorithms:
  • Hubert's Algorithm for Single-Link and Complete-Link
  • Graph Theory Algorithm for Single-Link
Clustering (cont.)
• Matrix Updating Algorithms for Single-Link and Complete-Link
  • Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
  • Find the least dissimilar pair of clusters in the current clustering, {(r), (s)}, according to d[(r), (s)] = min {d[(i), (j)]}.
  • Set m ← m + 1. Merge clusters (r) and (s). Set the level of this clustering to L(m) = d[(r), (s)].
  • Update the proximity matrix by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster.
Clustering (cont.)
• Matrix Updating Algorithms for Single-Link and Complete-Link (cont.)
  • The proximity between the new cluster (r, s) and an old cluster (k) is defined as follows:
    d[(k), (r, s)] = min {d[(k), (r)], d[(k), (s)]} (single-link)
    d[(k), (r, s)] = max {d[(k), (r)], d[(k), (s)]} (complete-link)
  • Generalized (Lance-Williams) formula:
    d[(k), (r, s)] = αr d[(k), (r)] + αs d[(k), (s)] + β d[(r), (s)] + γ |d[(k), (r)] - d[(k), (s)]|
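The matrix-updating procedure above can be sketched as follows; using min for the update gives single-link and max gives complete-link (the four-object dissimilarity matrix is a hypothetical example):

```python
from itertools import combinations

def agglomerate(d, link="single"):
    """Matrix-updating agglomerative clustering.

    d maps frozenset({i, j}) over single objects to a dissimilarity.
    Returns the merge sequence as (level, cluster_r, cluster_s) tuples.
    """
    objects = sorted({x for pair in d for x in pair})
    clusters = [frozenset([o]) for o in objects]
    # initial proximity matrix between singleton clusters
    prox = {frozenset([a, b]): d[a | b] for a, b in combinations(clusters, 2)}
    combine = min if link == "single" else max
    merges = []
    while len(clusters) > 1:
        pair = min(prox, key=prox.get)   # least dissimilar pair {(r), (s)}
        r, s = tuple(pair)
        level = prox[pair]
        merged = r | s
        merges.append((level, set(r), set(s)))
        clusters = [c for c in clusters if c not in (r, s)]
        # row/column for the new cluster: min (single) or max (complete) update
        new_entries = {frozenset([c, merged]):
                       combine(prox[frozenset([c, r])], prox[frozenset([c, s])])
                       for c in clusters}
        # delete the rows/columns of (r) and (s), then add the new ones
        prox = {k: v for k, v in prox.items() if not k & {r, s}}
        prox.update(new_entries)
        clusters.append(merged)
    return merges

d = {frozenset(p): v for p, v in [
    (("A", "B"), 2.0), (("A", "C"), 6.0), (("A", "D"), 10.0),
    (("B", "C"), 5.0), (("B", "D"), 9.0), (("C", "D"), 4.0),
]}
print(agglomerate(d, "single"))    # merges at levels 2.0, 4.0, 5.0
print(agglomerate(d, "complete"))  # merges at levels 2.0, 4.0, 10.0
```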
Clustering (cont.)
• Coefficient Values for Matrix Updating Algorithms
Clustering (cont.)
• Main characteristics of the methods
  • Single-link: merges on the closest pair of objects; can produce clusters with little homogeneity
  • Complete-link (more conservative): merges on the most distant pair; produces clusters that are not necessarily well separated
  • UPGMA: weights the contribution of each object equally, taking into account the sizes of the clusters
  • WPGMA: weights objects in small clusters more heavily than objects in large clusters
  • UPGMC / WPGMC:
    • the proximity measure is the Euclidean distance
    • geometric interpretation: the distance between cluster centroids
Clustering (cont.)
• Example
References
• Manning C.D. and Schütze H., Foundations of Statistical Natural Language Processing, MIT Press, 1999.
• Sparck Jones K. and Willett P., Readings in Information Retrieval, Morgan Kaufmann Publishers, San Francisco, California, 1997.
• Salton G., Wong A., and Yang C.S., “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, pp. 613–620, 1975.
• Salton G. and Buckley C., “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, pp. 513–523, 1988.
References (cont.)
• Salton G., “The SMART environment for retrieval system evaluation: advantages and problem areas,” in K. Sparck Jones (Ed.), Information Retrieval Experiment, pp. 316–329, 1981.
• Robertson S.E., “The probability ranking principle in IR,” Journal of Documentation, vol. 33, pp. 126–148, 1977.
• Robertson S.E. and Sparck Jones K., “Simple, proven approaches to text retrieval,” TR 356, Cambridge University Computer Laboratory, May 1997.
• Jain A.K. and Dubes R.C., Algorithms for Clustering Data, Prentice-Hall, 1988.