Chapter 2 Information Retrieval Part-1
Modern Information Retrieval
• Document representation
  • Using keywords
  • Relative weight of keywords
• Query representation
  • Keywords
  • Relative importance of keywords
• Retrieval model
  • Similarity between document and query
  • Rank the documents
• Performance evaluation of the retrieval process
Document Representation
Transforming a text document into a weighted list of keywords
Stopwords Figure 2.2 A partial list of stopwords
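Figure 2.2 is not reproduced in this text, so the following is only a minimal sketch of stopword removal. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names, and the set is a tiny stand-in for the longer list in the figure.

```python
import re

# Hypothetical, deliberately tiny stopword set; the list in Figure 2.2 is longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}

def remove_stopwords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

keywords = remove_stopwords("The retrieval of documents is a ranking problem")
print(keywords)
```

The remaining tokens are the candidate keywords that later steps (stemming, counting, weighting) operate on.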
Activity: Document Representation Transform the text in the document given into a weighted list of keywords.
Stemming
A given word may occur in a variety of syntactic forms:
• plurals
• past tense
• gerund forms (nouns derived from verbs)
Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.
Stemming
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection, postconnection
• Stem: connect
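As a rough illustration of affix removal on the connect example, the sketch below strips one prefix and one suffix from fixed lists. This is a toy, not Porter's algorithm: `toy_stem`, `PREFIXES`, and `SUFFIXES` are hypothetical names, and the lists cover only the affixes that occur in the example.

```python
# Hypothetical affix lists covering only the "connect" example.
PREFIXES = ("pre", "post")
# Longer suffixes come first so "ions" is tried before "ion" and "s".
SUFFIXES = ("ions", "ion", "ing", "or", "ed", "s")

def toy_stem(word):
    """Strip at most one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[:-len(suffix)]
            break
    return word

forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
print({toy_stem(w) for w in forms})
```

Blind stripping like this mangles many ordinary words (for example, any word that merely happens to start with "pre"), which is why a rule-based stemmer attaches conditions to each rule.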
Porter’s Algorithm
• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant
• For example, Y in synopsis is a vowel, while in toy it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v
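The vowel and consonant rules above can be sketched as a small classifier that maps each letter of a word to c or v. The function names `is_consonant` and `cv_pattern` are illustrative choices, not part of the algorithm's published description.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Return True if the letter at position i counts as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant;
        # at the start of a word, or after a vowel, it is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    word = word.lower()
    return "".join("c" if is_consonant(word, i) else "v"
                   for i in range(len(word)))

print(cv_pattern("synopsis"))  # Y follows the consonant S, so it is a vowel
print(cv_pattern("toy"))       # Y follows the vowel O, so it is a consonant
```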
Porter’s Algorithm, Step 1: plurals and past participles
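The rule table for Step 1 is not reproduced in this text. As a partial sketch, the plural-handling rules below follow Step 1a as given in Porter's published description (SSES → SS, IES → I, SS → SS, S → nothing). The past-participle rules of Step 1 are omitted, and `step_1a` is a hypothetical function name.

```python
def step_1a(word):
    """Apply the first matching plural rule; rules are tried longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

Only the first rule that matches is applied, which is why the SS rule appears even though it leaves the word unchanged: it stops the plain S rule from firing on words such as caress.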
Porter’s Algorithm, Step 2 (Steps 2–4: straightforward stripping of suffixes)
Porter’s Algorithm, Step 3 (Steps 2–4: straightforward stripping of suffixes)
Porter’s Algorithm, Step 4 (Steps 2–4: straightforward stripping of suffixes)
Porter’s Algorithm, Step 5: tidying up
Porter’s algorithm Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)
For the Tutorial
• Bring your laptop to the lab
• Make sure you have Java installed
• Bring any English-language text document; the file extension must be .txt
• The document should contain no more than 1000 words
Term-Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent the documents.
• Columns correspond to the index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
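A minimal sketch of building such a matrix from raw frequencies, with one row per document and one column per index term as described above. The function name `term_document_matrix` and the two sample documents are made up for illustration.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs is a list of token lists. Returns (terms, matrix), where
    matrix[i][j] is the frequency of terms[j] in document i."""
    terms = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in terms])
    return terms, matrix

# Hypothetical, already stemmed and stopword-free documents.
docs = [["connect", "network", "connect"],
        ["network", "retrieval"]]
terms, tdm = term_document_matrix(docs)
print(terms)
for row in tdm:
    print(row)
```

Sorting the terms fixes the column order, so the same column always refers to the same index term across all rows.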
Normalization
• Raw frequency values are not useful for a retrieval model.
• Prefer normalized weights, usually between 0 and 1, for each term in a document.
• A simple method of normalization is to divide all the keyword frequencies by the largest frequency in the document.
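The max-frequency normalization described above can be sketched as follows. The function name `normalize` and the sample frequencies are illustrative assumptions, not values from the course material.

```python
def normalize(freqs):
    """Divide every keyword frequency by the largest frequency in the document."""
    if not freqs:
        return {}
    largest = max(freqs.values())
    return {term: count / largest for term, count in freqs.items()}

# Hypothetical keyword frequencies for a single document.
freqs = {"connect": 4, "network": 2, "retrieval": 1}
weights = normalize(freqs)
print(weights)
```

With this scheme the most frequent keyword always receives weight 1, and every other keyword receives a weight greater than 0 and at most 1.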
Vector Representation of document d1 (word, frequency, normalized frequency)
Mini Project (Survey): Arabic language stemmer design
• Survey and compare existing Arabic language stemmers and write a research paper.
• Design an Arabic language stemmer.
Reading: Hints on writing technical reports and papers