Information Retrieval

Information Retrieval. February 10, 2003. Handout #3. Course Information. Instructor: Dragomir R. Radev (radev@si.umich.edu) Office: 3080, West Hall Connector Phone: (734) 615-5225 Office hours: M&F 11-12 Course page: http://tangra.si.umich.edu/~radev/650/


Presentation Transcript


  1. Information Retrieval February 10, 2003 Handout #3

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M&F 11-12 • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Mondays, 1-4 PM in 409 West Hall

  3. TF*IDF (cont’d)

  4. Vector-based matching • The cosine measure: $\mathrm{sim}(D,C) = \dfrac{\sum_k d_k \, c_k \, \mathrm{idf}(k)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$
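
A minimal Python sketch of the cosine measure on this slide, assuming the usual reading of the formula: an idf-weighted dot product in the numerator and the product of the two vector lengths (with a square root) in the denominator. The function name, the dictionary representation of the vectors, and the default idf of 1.0 for unlisted terms are illustrative choices, not part of the handout.

```python
import math


def idf_weighted_cosine(d, c, idf):
    """Similarity between two term-weight vectors d and c.

    d, c: dicts mapping term -> weight (d_k and c_k on the slide).
    idf:  dict mapping term -> idf(k); terms missing from it default
          to 1.0 (an assumption made for this sketch).
    """
    shared = set(d) & set(c)  # only shared terms contribute to the sum
    numerator = sum(d[k] * c[k] * idf.get(k, 1.0) for k in shared)
    norm = math.sqrt(sum(w * w for w in d.values()) *
                     sum(w * w for w in c.values()))
    return numerator / norm if norm else 0.0


doc = {"deng": 2.0, "china": 1.0}
centroid = {"deng": 2.0, "china": 1.0}
unrelated = {"shuttle": 3.0}
print(idf_weighted_cosine(doc, centroid, {}))   # identical vectors
print(idf_weighted_cosine(doc, unrelated, {}))  # no shared terms
```

With all idf values equal to 1 this reduces to the ordinary cosine, so identical vectors score 1 and vectors with no shared terms score 0.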

  5. IDF: Inverse document frequency • TF * IDF is used for automated indexing and for topic discrimination • $N$: number of documents • $d_k$: number of documents containing term $k$ • $f_{ik}$: absolute frequency of term $k$ in document $i$ • $w_{ik}$: weight of term $k$ in document $i$ • $\mathrm{idf}_k = \log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$
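
The idf formula above can be checked with a few lines of Python. This is a sketch only: the function names are made up for illustration, and computing the weight of a term as its raw frequency times its idf is an assumption based on the "TF * IDF" name, since the slide does not spell out the weight formula.

```python
import math


def idf(n_docs, doc_freq):
    """idf_k = log2(N / d_k) + 1, with N documents and d_k containing term k."""
    return math.log2(n_docs / doc_freq) + 1


def tf_idf(term_freq, n_docs, doc_freq):
    """Assumed weight of term k in document i: f_ik * idf_k."""
    return term_freq * idf(n_docs, doc_freq)


# A term found in 2 of 8 documents is weighted more than one found in all 8.
print(idf(8, 2))        # log2(4) + 1
print(idf(8, 8))        # log2(1) + 1
print(tf_idf(3, 8, 2))  # 3 occurrences of the rarer term
```

The "+ 1" keeps the idf of a term that occurs in every document at 1 rather than 0, so such a term still contributes its raw frequency.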

  6. Asian and European news • 622.941 deng, 306.835 china, 196.725 beijing, 153.608 chinese, 152.113 xiaoping, 124.591 jiang, 108.777 communist, 102.894 body, 85.173 party, 71.898 died, 68.820 leader, 43.402 state, 38.166 people • 97.487 nato, 92.151 albright, 74.652 belgrade, 46.657 enlargement, 34.778 alliance, 34.778 french, 33.803 opposition, 32.571 russia, 14.095 government, 9.389 told, 9.154 would, 8.459 their, 6.059 which

  7. Other topics • 120.385 shuttle, 99.487 space, 90.128 telescope, 70.224 hubble, 59.992 rocket, 50.160 astronauts, 49.722 discovery, 47.782 canaveral, 47.782 cape, 40.889 mission, 35.778 florida, 27.063 center • 74.652 compuserve, 65.321 massey, 55.989 salizzoni, 29.996 bob, 27.994 online, 27.198 executive, 15.890 interim, 15.271 chief, 11.647 service, 11.174 second, 6.781 world, 6.315 president

  8. Semantic networks

  9. Semantic Networks • Used to represent relationships between words • Example: WordNet - created by George Miller’s team at Princeton • Based on synsets (synonyms, interchangeable words) and lexical matrices

  10. Lexical matrix

  11. Synsets • Disambiguation • {board, plank} • {board, committee} • Synonyms • substitution • weak substitution • synonyms must be of the same part of speech

  12. $ ./wn board -hypen Synonyms/Hypernyms (Ordered by Frequency) of noun board 9 senses of board Sense 1 board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping Sense 2 board => sheet, flat solid => artifact, artefact => object, physical object => entity, something Sense 3 board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something

  13. Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface => artifact, artefact => object, physical object => entity, something Sense 6 board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something

  14. Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

  15. Antonymy • “x” vs. “not-x” • “rich” vs. “poor”? • {rise, ascend} vs. {fall, descend}

  16. Other relations • Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”. • Hyponymy: {tree} is a hyponym of {plant}. • Hierarchical structure based on hyponymy (and hypernymy).

  17. Other features of WordNet • Index of familiarity • Polysemy

  18. Familiarity and polysemy • board used as a noun is familiar (polysemy count = 9) • bird used as a noun is common (polysemy count = 5) • cat used as a noun is common (polysemy count = 7) • house used as a noun is familiar (polysemy count = 11) • information used as a noun is common (polysemy count = 5) • retrieval used as a noun is uncommon (polysemy count = 3) • serendipity used as a noun is very rare (polysemy count = 1)

  19. Compound nouns • advisory board, appeals board, backboard, backgammon board, baseboard, basketball backboard, big board, billboard, binder's board, binder board, blackboard, board game, board measure, board meeting, board member, board of appeals, board of directors, board of education, board of regents, board of trustees

  20. Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

  21. Top-level concepts • {act, action, activity} {animal, fauna} {artifact} {attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time}

  22. Properties of words

  23. Word distributions • Negative binomial distribution • In the Brown corpus • the word “said” has p = 9.24 and α = 0.42

  24. Vocabulary growth • Heaps’ Law • V = vocabulary size, n = text size in words • $V = K n^{\beta}$, where K and β depend on the text • K is typically between 10 and 100, and β is less than 1 (for TREC-2 it’s between 0.4 and 0.6)
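
To get a feel for Heaps' Law, the short sketch below evaluates $V = K n^{\beta}$ for a few text sizes. The values K = 50 and β = 0.5 are illustrative picks from within the ranges quoted on the slide, not measurements from any collection, and the function name is hypothetical.

```python
def heaps_vocabulary(n_words, k=50.0, beta=0.5):
    """Predicted vocabulary size V = K * n**beta for a text of n_words words.

    k and beta are illustrative defaults, not fitted values.
    """
    return k * n_words ** beta


for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n)))
```

Because β is below 1, the vocabulary keeps growing with the text but ever more slowly: with β = 0.5, a text 100 times longer has only 10 times as many distinct words.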

  25. Word length • In TREC-2, word length is 5 characters on average. • If stop words are removed, average length increases to a range from 6 to 7.

  26. Word similarity • Hamming distance - when words are of the same length • Levenshtein distance - number of edits (insertions, deletions, replacements) • color --> colour (1) • survey --> surgery (2) • com puter --> computer ? • Longest common subsequence (LCS) • lcs (survey, surgery) = surey
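
The three measures on this slide can be sketched in Python as follows. These are textbook dynamic-programming versions written for illustration (the function names are made up here); they reproduce the slide's examples for color/colour and survey/surgery.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs words of the same length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # replace or match
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    out, i, j = [], m, n
    while i and j:  # walk back through the table to recover the letters
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2
print(lcs("survey", "surgery"))          # surey
```

Under plain Levenshtein distance the "com puter" case costs one edit (deleting the space), since the space is treated like any other character.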

  27. Approximate string matching • The Soundex algorithm (Odell and Russell) • Uses: • spelling correction • hash function • non-recoverable

  28. The Soundex algorithm 1. Retain the first letter of the name, and drop all occurrences of a,e,h,i,o,u,w,y in other positions 2. Assign the following numbers to the remaining letters after the first: b,f,p,v : 1 c,g,j,k,q,s,x,z : 2 d,t : 3 l : 4 m,n : 5 r : 6

  29. The Soundex algorithm 3. If two or more letters with the same code were adjacent in the original name, omit all but the first 4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits • Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300; these are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively • Some problems: similar-sounding names that receive different codes, such as Rogers and Rodgers, or Sinclair and StClair
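
The four Soundex steps can be sketched in Python as below. This follows the steps as worded on the slides, where adjacency is judged in the original name, so a vowel or h, w, y between two letters with the same code keeps both; other Soundex variants treat h and w differently. The function and table names are illustrative.

```python
# Step 2 table: letter -> digit. Letters not listed (a,e,h,i,o,u,w,y) have no code.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(name):
    """Four-character code: first letter followed by three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0])  # the first letter's code counts for adjacency
    for ch in letters[1:]:
        code = CODES.get(ch)      # None for the dropped letters (step 1)
        if code is not None and code != prev:
            digits.append(code)   # step 3: skip repeats of an adjacent code
        prev = code
    # Step 4: pad with zeros or truncate to the form LDDD.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ("Euler", "Ellery", "Gauss", "Ghosh", "Hilbert", "Heilbronn",
             "Knuth", "Kant", "Lloyd", "Ladd", "Rogers", "Rodgers"):
    print(name, soundex(name))
```

Running it reproduces the codes listed on the slide and shows the Rogers/Rodgers problem: the extra "d" yields a different code even though the names sound alike.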

  30. Compression

  31. Compression • Huffman coding (prefix property) • Ziv-Lempel codes (better)

  32. Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
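
The tree-building step can be sketched in Python with a priority queue. This is one common way to implement it, not necessarily the construction used for the tree in the handout; the function names and the tie-breaking order are choices made for this sketch, so the exact bit patterns may differ from another implementation's while the code lengths stay optimal.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each symbol in text to a bit string, built from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    # Heap entries are (weight, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left branch
            walk(tree[1], prefix + "1")  # right branch
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def huffman_decode(bits, codes):
    """Decode a bit string; works because no code is a prefix of another."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "peter_piper_picked_a_peck_of_pickled_peppers"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
print(len(bits), "bits instead of", 8 * len(text))
print(huffman_decode(bits, codes) == text)
```

Frequent symbols end up near the root and get short codes; the prefix property is what lets the decoder read the bit string left to right without separators.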

  33. [Figure: Huffman code tree with branches labeled 0 and 1 and leaves for the symbols a through j]

  34. Exercise 1 • Consider the bit string: 01101101111000100110001110100111000110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.

  35. Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment

  36. <0,0,p> p • <0,0,e> pe • <0,0,t> pet • <2,1,r> peter • <0,0,_> peter_ • <6,1,i> peter_pi • <8,2,r> peter_piper • <6,3,c> peter_piper_pic • <0,0,k> peter_piper_pick • <7,1,d> peter_piper_picked • <7,1,a> peter_piper_picked_a • <9,2,e> peter_piper_picked_a_pe • <9,2,_> peter_piper_picked_a_peck_ • <0,0,o> peter_piper_picked_a_peck_o • <0,0,f> peter_piper_picked_a_peck_of • <17,5,l> peter_piper_picked_a_peck_of_pickl • <12,1,d> peter_piper_picked_a_peck_of_pickled • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
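
Decoding the triples above is mechanical, as the following Python sketch shows. It assumes the triple layout from the previous slide (distance back, number of characters to copy, new character); the function name is illustrative, and real LZ77 implementations such as the one in GZIP use a more compact encoding.

```python
def lz77_decode(triples):
    """Rebuild text from <a, b, c> triples: look back a, copy b, append c."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # copy from already decoded text
        out.append(char)                # then add the new character
    return "".join(out)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers
```

The 44-character text is described by 20 triples; the savings come from triples such as <17,5,l>, which stands for six characters at once.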

  37. Markup languages

  38. Markup languages • HTML • SGML • XML

  39. HTML • Focus on presentation, not content

  40. <!SGML "ISO 8879:1986" CHARSET BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0" DESCSET 0 9 UNUSED 9 2 9 11 2 UNUSED 13 1 13 14 18 UNUSED 32 95 32 127 1 UNUSED BASESET "ISO Registration Number 109//CHARSET ECMA-94 Right Part of Latin-1 Alphabet Nr.3//ESC 2/9 4/3" DESCSET 128 32 UNUSED -- no such characters -- 160 1 UNUSED -- nbs character -- 161 94 161 -- 161 through 254 inclusive -- 255 1 UNUSED CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN" SCOPE DOCUMENT SYNTAX SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255 BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0" DESCSET 0 128 0 FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9 NAMING LCNMSTRT "" UCNMSTRT "" LCNMCHAR "_-." UCNMCHAR "_-." NAMECASE GENERAL NO ENTITY NO DELIM GENERAL SGMLREF SHORTREF SGMLREF NAMES SGMLREF QUANTITY SGMLREF ATTCNT 99999999 ATTSPLEN 99999999 DTEMPLEN 24000 ENTLVL 99999999 GRPCNT 99999999 GRPGTCNT 99999999 GRPLVL 99999999 LITLEN 24000 NAMELEN 99999999 PILEN 24000 TAGLEN 99999999 TAGLVL 99999999 FEATURES MINIMIZE DATATAG NO OMITTAG YES RANK YES SHORTTAG YES LINK SIMPLE YES 1000 IMPLICIT YES EXPLICIT YES 1 OTHER CONCUR NO SUBDOC YES 99999999 FORMAL YES APPINFO NONE> <!DOCTYPE DOCSET [ <!-- File: asr.dtd Author: Jon Fiscus, NIST Desc: This DTD is intended to parse a TDT2 .tkn file. --> <!ELEMENT DOCSET - O (X|W)+> <!ELEMENT X - O EMPTY > <!ELEMENT W - O CDATA > <!ATTLIST DOCSET type (ASRTEXT|NEWSWIRE|CAPTION|TRANSCRIPT|SYSTRAN|ASR_SYSTRAN) #REQUIRED fileid CDATA #REQUIRED collect_date CDATA #REQUIRED collect_src CDATA #REQUIRED src_lang CDATA #REQUIRED content_lang CDATA #REQUIRED proc_remarks CDATA #IMPLIED > <!ATTLIST W recid CDATA #REQUIRED Bsec CDATA #IMPLIED Dur CDATA #IMPLIED Clust CDATA #IMPLIED Conf CDATA #IMPLIED tr (Y|N) #IMPLIED > <!ATTLIST X Bsec CDATA #IMPLIED Dur CDATA #IMPLIED Conf (NA) #IMPLIED > ]> SGML

  41. <?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE DOCSENT SYSTEM "../../../../../dtd/docsent.dtd" > <DOCSENT DID='D-20000408_011.e' DOCNO='17706' LANG='ENG' CORR-DOC='D-20000408_017.c'> <BODY> <HEADLINE><S PAR="1" RSNT="1" SNO="1"> Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects </S></HEADLINE> <TEXT> <S PAR='2' RSNT='1' SNO='2'>The Governing Committee of the Beat Drugs Fund , chaired by the Secretary for Security , has approved grants of $16 .39 million for 29 anti-drug projects this year .</S> <S PAR='3' RSNT='1' SNO='3'>The Commissioner for Narcotics , Mrs Clarie Lo , who is also a member of the Governing Committee , said , "The number of drug abusers aged below 21 dropped by 13 .6 per cent from 2829 in 1998 to 2 443 in 1999 .</S> <S PAR='3' RSNT='2' SNO='4'>Despite the continuing drop in recent years , we recognise that youths-at-risk are a highly vulnerable group and deserve the full attention of all those working in the anti-drug field . "</S> <S PAR='4' RSNT='1' SNO='5'> "To prevent our younger generation from abusing drugs , education and publicity is an on-going campaign; and any relaxation in efforts might have adverse consequences , " Mrs Lo added .</S> <S PAR='5' RSNT='1' SNO='6'>In considering this year 's applications for the Fund , the Governing Committee attached importance to those aiming to steer youths-at-risk away from drugs .</S> <S PAR='6' RSNT='1' SNO='7'>Amongst the 29 projects approved this year , 22 are related to drug prevention education and publicity ($10 .72 million) , five to treatment and rehabilitation ($2 .98 million)and two to research ($2 .69 million) .</S> <S PAR='7' RSNT='1' SNO='8'>An amount of $2 .08 million was granted to conduct a pioneering longitudinal research on the development and validation of a drug prevention programme in Hong Kong .</S> <S PAR='8' RSNT='1' SNO='9'>Youths-at-risk aged between 10 to 15 in selected areas including Tuen Mun and Kwun Tong will be invited to take part in the project .</S> <S PAR='8' RSNT='2' SNO='10'>Participants will be taught on the adverse effect of drug abuse , social and personal skills to help them identify and resist peer influence to use drugs .</S> </TEXT> </BODY> </DOCSENT> <!-- DTD for sentence-segmented text --> <!ELEMENT DOCSENT (EXTRACTION-INFO?, BODY)> <!ATTLIST DOCSENT DID CDATA #REQUIRED DOCNO CDATA #IMPLIED LANG (CHIN|ENG) "ENG" CORR-DOC CDATA #IMPLIED> <!-- DID : documentid LANG: language --> <!ELEMENT EXTRACTION-INFO EMPTY> <!ATTLIST EXTRACTION-INFO SYSTEM CDATA #REQUIRED RUN CDATA #IMPLIED COMPRESSION CDATA #REQUIRED QID CDATA #REQUIRED> <!ELEMENT BODY (HEADLINE?,TEXT)> <!ELEMENT HEADLINE (S)*> <!ELEMENT TEXT (S)*> <!ELEMENT S (#PCDATA)> <!ATTLIST S PAR CDATA #REQUIRED RSNT CDATA #REQUIRED SNO CDATA #REQUIRED> <!-- PAR: paragraph no RSNT: relative sentence no (within paragraph) SNO: absolute sentence no --> docsent.dtd example.docsent XML
