SET (4) PowerPoint Presentation

Presentation Transcript

  1. SET(4) Prof. Dragomir R. Radev radev@cs.columbia.edu

  2. SET Fall 2013 … 6. Automated indexing/labeling Compression …

  3. Indexing methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic: e.g., TF*IDF based

  4. LOC subject headings
     • A -- GENERAL WORKS
     • B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
     • C -- AUXILIARY SCIENCES OF HISTORY
     • D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
     • E -- HISTORY: AMERICA
     • F -- HISTORY: AMERICA
     • G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
     • H -- SOCIAL SCIENCES
     • J -- POLITICAL SCIENCE
     • K -- LAW
     • L -- EDUCATION
     • M -- MUSIC AND BOOKS ON MUSIC
     • N -- FINE ARTS
     • P -- LANGUAGE AND LITERATURE
     • Q -- SCIENCE
     • R -- MEDICINE
     • S -- AGRICULTURE
     • T -- TECHNOLOGY
     • U -- MILITARY SCIENCE
     • V -- NAVAL SCIENCE
     • Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
     http://www.loc.gov/catdir/cpso/lcco/lcco.html

  5. Medicine
     CLASS R - MEDICINE
     Subclass R
     R5-920 Medicine (General)
       R5-130.5 General works
       R131-687 History of medicine. Medical expeditions
       R690-697 Medicine as a profession. Physicians
       R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
       R711-713.97 Directories
       R722-722.32 Missionary medicine. Medical missionaries
       R723-726 Medical philosophy. Medical ethics
       R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
       R727-727.5 Medical personnel and the public. Physician and the public
       R728-733 Practice of medicine. Medical practice economics
       R735-854 Medical education. Medical schools. Research
       R855-855.5 Medical technology
       R856-857 Biomedical engineering. Electronics. Instrumentation
       R858-859.7 Computer applications to medicine. Medical informatics
       R864 Medical records
       R895-920 Medical physics. Medical radiology. Nuclear medicine

  6. Automatic methods • TF*IDF: pick terms with the highest TF*IDF scores • Centroid-based: pick terms that appear in the centroid with high scores • The maximal marginal relevance principle (MMR) • Related to summarization, snippet generation
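
The TF*IDF bullet can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, raw term counts for TF, and log(N/df) for IDF; the slides do not fix any of these choices, and the name `tfidf_index_terms` is hypothetical.

```python
import math
from collections import Counter


def tfidf_index_terms(docs, k=3):
    """Return the k highest-scoring TF*IDF terms for each document.

    Assumptions (not from the slides): lowercase whitespace tokens,
    TF = raw count, IDF = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    index_terms = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        index_terms.append(ranked[:k])
    return index_terms
```

A term that occurs in every document gets an IDF of zero under this weighting, so it is never preferred as an index term.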

  7. Compression • Methods • Fixed length codes • Huffman coding • Ziv-Lempel codes

  8. Fixed length codes • Binary representations • ASCII • Representational power (2^k symbols, where k is the number of bits)
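
As a rough illustration of the representational-power bullet, the following Python sketch computes how many bits a fixed-length code needs for an alphabet of a given size and encodes a string with it. The helper names `bits_needed` and `fixed_length_encode` are hypothetical.

```python
def bits_needed(alphabet_size):
    """Smallest k such that 2**k >= alphabet_size (at least 1 bit)."""
    return max(1, (alphabet_size - 1).bit_length())


def fixed_length_encode(text, alphabet):
    """Encode text by giving every symbol the same number of bits.

    Each symbol is mapped to its position in `alphabet`, written as a
    zero-padded binary number of width k.
    """
    k = bits_needed(len(alphabet))
    position = {symbol: i for i, symbol in enumerate(alphabet)}
    return "".join(format(position[symbol], "0{}b".format(k)) for symbol in text)
```

With 26 letters, `bits_needed(26)` is 5, since 2^4 = 16 is too small and 2^5 = 32 is enough.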

  9. Variable length codes • Alphabet (Morse code):
     A .-      N -.      0 -----
     B -...    O ---     1 .----
     C -.-.    P .--.    2 ..---
     D -..     Q --.-    3 ...--
     E .       R .-.     4 ....-
     F ..-.    S ...     5 .....
     G --.     T -       6 -....
     H ....    U ..-     7 --...
     I ..      V ...-    8 ---..
     J .---    W .--     9 ----.
     K -.-     X -..-
     L .-..    Y -.--
     M --      Z --..
     • Demo: • http://www.scphillips.com/morse/
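
A small Python sketch of the letter part of this alphabet as a variable-length code. The function name `morse_encode` is hypothetical, and the space between letters is an assumption of this sketch: Morse code is not prefix-free (E is "." and I is ".."), so some separator is needed to decode it.

```python
# Letter codes from the alphabet table above (digits omitted for brevity).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def morse_encode(text):
    """Encode letters as Morse, separating letters with a single space."""
    return " ".join(MORSE[ch] for ch in text.upper())
```

The frequent letters E and T get one-symbol codes, while rarer letters such as Q and Z need four symbols.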

  10. Most frequent letters in English • Most frequent letters: • E T A O I N S H R D L U • Demo: • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm • Also: bigrams: • TH HE IN ER AN RE ND AT ON NT

  11. Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
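
The tree-building step can be sketched in Python as follows. This is one possible implementation using a priority queue, not necessarily the construction shown in class; the names `huffman_codes` and `huffman_encode` are hypothetical, and the tie-breaking order is an arbitrary choice of this sketch.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code table {symbol: bit string} from symbol counts."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries are (frequency, unique id, partial code table).
    # The unique id breaks ties so the tables are never compared.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(text, codes):
    """Concatenate the code words for each symbol of text."""
    return "".join(codes[sym] for sym in text)
```

Frequent symbols end up near the root and so receive short code words; rare symbols sit deeper in the tree.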

  12. [Figure: example Huffman code tree with branches labeled 0 and 1 and leaves a through j]

  13. Exercise • Consider the bit string: 01101101111000100110001110100111000110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.
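
To experiment with the second part of the exercise, a prefix-code decoder can be sketched as below. The code table here is made up for illustration and is not the tree from the slides; `TOY_CODE` and `decode_prefix_code` are hypothetical names.

```python
# Hypothetical prefix code for illustration only (not the code from the slides).
TOY_CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def decode_prefix_code(bits, code=TOY_CODE):
    """Decode a bit string with a prefix code.

    Returns (decoded_text, leftover_bits); leftover_bits is non-empty when
    the input ends in the middle of a code word.
    """
    symbol_for = {word: sym for sym, word in code.items()}
    decoded = []
    buffer = ""
    for bit in bits:
        buffer += bit
        # In a prefix code, the first match is the only possible match.
        if buffer in symbol_for:
            decoded.append(symbol_for[buffer])
            buffer = ""
    return "".join(decoded), buffer
```

With this toy table, "0100111" decodes to "abad", while flipping only the first bit gives "1100111", which decodes to "cad": a single bit error shifts the code word boundaries and changes more than one symbol.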

  14. Extensions • Word-based • Domain/genre dependent models

  15. Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment

  16. Example
      • <0,0,p> p
      • <0,0,e> pe
      • <0,0,t> pet
      • <2,1,r> peter
      • <0,0,_> peter_
      • <6,1,i> peter_pi
      • <8,2,r> peter_piper
      • <6,3,c> peter_piper_pic
      • <0,0,k> peter_piper_pick
      • <7,1,d> peter_piper_picked
      • <7,1,a> peter_piper_picked_a
      • <9,2,e> peter_piper_picked_a_pe
      • <9,2,_> peter_piper_picked_a_peck_
      • <0,0,o> peter_piper_picked_a_peck_o
      • <0,0,f> peter_piper_picked_a_peck_of
      • <17,5,l> peter_piper_picked_a_peck_of_pickl
      • <12,1,d> peter_piper_picked_a_peck_of_pickled
      • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
      • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
      • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
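
The triples in this example can be replayed with a short Python decoder. This is a sketch of the <a,b,c> scheme as described on the previous slide (look back a characters, copy b characters, append c); it is not the actual GZIP format, and the names `lz77_decode` and `PETER_PIPER` are hypothetical.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into text."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        # Copy one character at a time so overlapping copies also work.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)


# The triples from the example above.
PETER_PIPER = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
```

Calling `lz77_decode(PETER_PIPER)` rebuilds the full phrase, and decoding only a prefix of the list reproduces the intermediate strings shown next to each triple.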

  17. Links on text compression • Data compression: • http://www.data-compression.info/ • Calgary corpus: • http://en.wikipedia.org/wiki/Calgary_Corpus • Huffman coding: • http://www.compressconsult.com/huffman/ • http://en.wikipedia.org/wiki/Huffman_coding • LZ • http://en.wikipedia.org/wiki/LZ77

  18. SIDEBAR: 100 alternative search engines • http://www.readwriteweb.com/archives/top_100_alternative_search_engines.php

  19. SET Fall 2013 … 7. Approximate string matching …

  20. Levenshtein edit distance • Examples: • Theatre -> theater • Ghaddafi -> Qadafi • Computer -> counter • Edit distance (inserts, deletes, substitutions) • Edit transcript • Done through dynamic programming

  21. Recurrence relation • Three dependencies • D(i,0) = i • D(0,j) = j • D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)] • Simple edit distance: • t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise
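
A minimal Python sketch of this dynamic program, filling the table D from the base cases and the three dependencies D(i-1,j), D(i,j-1), and D(i-1,j-1). The function name `edit_distance` is hypothetical, and comparison is case-sensitive in this sketch.

```python
def edit_distance(s1, s2):
    """Simple (unit-cost) Levenshtein distance between s1 and s2."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and first j of s2.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
    return D[n][m]
```

For the lowercase versions of two of the earlier examples, this gives a distance of 2 for "theatre" and "theater" and 3 for "computer" and "counter".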

  22. Example Gusfield 1997

  23. Example (cont’d) Gusfield 1997

  24. Tracebacks Gusfield 1997

  25. Weighted edit distance • Used to emphasize the relative cost of different edit operations • Useful in bioinformatics • Homology information • BLAST • BLOSUM
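
One way to sketch the weighted variant in Python is to make the operation costs parameters of the same dynamic program. The signature below (`ins_cost`, `del_cost`, `sub_cost`) is an assumption of this sketch; bioinformatics tools typically take substitution costs from a matrix such as BLOSUM, which is not reproduced here.

```python
def weighted_edit_distance(s1, s2, ins_cost=1.0, del_cost=1.0, sub_cost=None):
    """Edit distance with configurable costs per operation.

    sub_cost is a function (a, b) -> cost of replacing a with b;
    by default it is 0 for equal characters and 1 otherwise.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * del_cost
    for j in range(1, m + 1):
        D[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[n][m]


def cheap_vowel_swap(a, b):
    """Hypothetical cost function: vowel-for-vowel substitutions cost 0.5."""
    if a == b:
        return 0.0
    return 0.5 if a in "aeiou" and b in "aeiou" else 1.0
```

With the default costs, "cat" and "cut" are at distance 1.0; with `cheap_vowel_swap` the same pair is at distance 0.5.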

  26. Links • Web site: • http://odur.let.rug.nl/~kleiweg/lev/ • Demo: • /home/cs6998/tools/editDistance/dp/l.pl theater theatre

  27. Other methods • Cosine • Generation probabilities (language modeling) • (exp)KL-divergence
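
Of the methods listed, cosine is the simplest to sketch. The Python below assumes sparse vectors stored as dictionaries of term counts, which is only one possible representation; the name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_a * norm_b)
```

Vectors with no terms in common score 0, and vectors pointing in the same direction score 1 up to floating-point rounding.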

  28. SET Fall 2013 … 8. Query expansion Relevance feedback …

  29. Query expansion

  30. Query expansion • Corpus-based: mine query logs • NLP-based • Vector-space relevance feedback

  31. Relevance feedback • Problem: initial query may not be the most appropriate to satisfy a given information need. • Idea: modify the original query so that it gets closer to the right documents in the vector space

  32. Relevance feedback • Automatic • Manual • Method: identifying feedback terms • Q’ = a1Q + a2R - a3N • Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

  33. Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • R = ? • N = ? • Q’ = ?
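
One possible way to work through this example in Python is sketched below. It assumes raw term-count vectors and reads R and the non-relevant term of the formula as sums over the relevant and non-relevant document vectors, with a1 = 1, a2 = 1/|R| and a3 = 1/|N|; other readings are possible, and the function name `relevance_feedback` is hypothetical.

```python
from collections import Counter, defaultdict


def relevance_feedback(query, relevant, nonrelevant, a1=1.0):
    """Return the modified query vector as a dict: term -> weight.

    Assumptions of this sketch: whitespace tokens, raw counts as weights,
    a2 = 1/|relevant| and a3 = 1/|nonrelevant|.
    """
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = defaultdict(float)
    for term, count in Counter(query.split()).items():
        new_query[term] += a1 * count
    for doc in relevant:
        for term, count in Counter(doc.split()).items():
            new_query[term] += a2 * count  # move toward relevant documents
    for doc in nonrelevant:
        for term, count in Counter(doc.split()).items():
            new_query[term] -= a3 * count  # move away from non-relevant ones
    return dict(new_query)


D1 = "car safety minivans tests injury statistics"
D2 = "liability tests safety"
D3 = "car passengers injury reviews"
new_q = relevance_feedback("safety minivans", [D1, D2], [D3])
```

Under these assumptions the weight of "safety" rises to 2.0 and "minivans" to 1.5, while "car" ends up at -0.5. Many systems clip negative weights to zero; this sketch leaves them as computed.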

  34. Pseudo relevance feedback • Automatic query expansion • Thesaurus-based expansion (e.g., using latent semantic indexing – later…) • Distributional similarity • Query log mining

  35. Examples
      Lexical semantics (Hypernymy):
      • Book: publication, product, fact, dramatic composition, record
      • Computer: machine, expert, calculator, reckoner, figurer
      • Fruit: reproductive structure, consequence, product, bear
      • Politician: leader, schemer
      • Newspaper: press, publisher, product, paper, newsprint
      Distributional clustering:
      • Book: autobiography, essay, biography, memoirs, novels
      • Computer: adobe, computing, computers, developed, hardware
      • Fruit: leafy, canned, fruits, flowers, grapes
      • Politician: activist, campaigner, politicians, intellectuals, journalist
      • Newspaper: daily, globe, newspapers, newsday, paper

  36. Examples (query logs) • Book: booksellers, bookmark, blue • Computer: sales, notebook, stores, shop • Fruit: recipes cake salad basket company • Games: online play gameboy free video • Politician: careers federal office history • Newspaper: online website college information • Schools: elementary high ranked yearbook • California: berkeley san francisco southern • French: embassy dictionary learn

  37. [Otterbacher et al. HLT EMNLP 2005]

  38. Final projects • Two formats: • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community. • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences. • Deliverables: • System (code + documentation + examples) or Paper (+ code, data) • Poster (to be presented in class) • Web page that describes the project.

  39. Readings • 4: MRS15, MRS16 • 5: MRS17 • 6: MRS18, MRS19