
INF 141: Information Retrieval



Presentation Transcript


  1. INF 141: Information Retrieval • Discussion Session, Week 2, Winter 2011 • TA: Sara Javanmardi

  2. Contact Information Sara Javanmardi • Email: sjavanma {at} uci.edu

  3. Policies • Discussion • Attendance at discussion sessions is not mandatory but is highly recommended, since we will discuss good practices for doing the projects. Attendance is mandatory for the quiz sessions (Q1: 1/26, Q2: 2/9, Q3: 2/23, Q4: 3/9). • Questions • Post your questions to the course forum. • Follow the course blog and comment on the posts. • Assignments • Late assignments lose 1% per hour. • Bring questions about the assignment to the discussion session. • Questions sent in the last 24 hours before an assignment’s deadline might not receive answers from the teaching staff.

  4. Policies • Grading • If you have questions, please talk to the TA first, then to the instructor. • Re-grade • Double-check your work before requesting a re-grade. • Submit the request within 1 week, accompanied by a clear explanation of what needs to be reconsidered and why.

  5. Course Material • In addition to the book • Search Engines: Information Retrieval in Practice

  6. Assignment 2 • Goals: • General and Extra Credit Questions • Read different sources and cite them • Programming Questions • Write the code in Java • Work on the efficiency of the code • Extracting the palindromes or lipograms should take less than 0.5 sec

  7. Finding Longest Palindrome & Lipogram • Palindrome • Lipogram • The sample input and output show a palindrome, a lipogram, and a rhopalic, respectively. • Submit your output for the test input.

  8. Finding The Longest Rhopalic • A rhopalic is a sequence of words in which each word is one character longer than the word before it. • Example: “I do not know where family doctors acquired illegibly perplexing handwriting; nevertheless, extraordinary pharmaceutical intellectuality, counterbalancing indecipherability, transcendentalises intercommunications incomprehensibleness” (word lengths 1 through 20). • A rhopalic can start with a word of any length. • Words are separated by at least one space, other white space, or punctuation character. • Sample Java Code

  9. Extracting Rhopalics Starting from index 0, how many tokens after index 0 constitute a rhopalic?
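The question on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the linked sample code: the class name `RhopalicSketch`, its method names, and the tokenizing rule (split on any run of characters that are not letters or digits) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class RhopalicSketch {

    // Assumed delimiter rule: any run of non-letter, non-digit characters separates words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {          // split() can yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of tokens in the rhopalic run that begins at index `start`.
    // The start token itself is counted, so the result is at least 1.
    static int runLength(List<String> tokens, int start) {
        int count = 1;
        while (start + count < tokens.size()
                && tokens.get(start + count).length()
                   == tokens.get(start + count - 1).length() + 1) {
            count++;
        }
        return count;
    }

    // Returns {startIndex, tokenCount} of the longest run, or {-1, 0} for no tokens.
    static int[] longest(List<String> tokens) {
        int bestStart = -1;
        int bestCount = 0;
        int i = 0;
        while (i < tokens.size()) {
            int count = runLength(tokens, i);
            if (count > bestCount) {
                bestStart = i;
                bestCount = count;
            }
            i += count;                  // a run starting inside this one would be shorter
        }
        return new int[] {bestStart, bestCount};
    }
}
```

`runLength(tokens, 0) - 1` answers the slide's question of how many tokens after index 0 belong to the rhopalic. Because every token is visited once, the scan is linear in the number of tokens.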

  10. Longest Palindrome • Example: • Input1: uka abb aacd ab ba • Output1: start=2, end=9; start=13, end=17 • Which one is the longest? (Length = end - start + 1, which gives 8 and 5 here.) • [Diagram labels: Input String; Start & end index; Is Palindrome]

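  11. Is palindrome • Keep two pointers, i at the start index and j at the end index of the string. • While i <= j: if input[i] == input[j], do i++ and j--; otherwise the string is not a palindrome. • If the loop finishes without a mismatch, the string is a palindrome. • [Diagram: the string abbbbba with pointers i and j]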
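A minimal Java sketch of the two-pointer check follows. The class name `PalindromeCheck` and the `skipWhitespace` flag are assumptions for illustration. The flag is there because the sample output on slide 10 (start=2, end=9 for "uka abb aacd ab ba") appears to be a palindrome only when spaces are ignored; whether the assignment intends that is not stated in the slides.

```java
final class PalindromeCheck {

    // Returns true if input[start..end] (inclusive) reads the same in both directions.
    // When skipWhitespace is true, white-space characters are stepped over and never compared.
    static boolean isPalindrome(String input, int start, int end, boolean skipWhitespace) {
        int i = start;
        int j = end;
        while (i < j) {
            if (skipWhitespace && Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            if (skipWhitespace && Character.isWhitespace(input.charAt(j))) {
                j--;
                continue;
            }
            if (input.charAt(i) != input.charAt(j)) {
                return false;            // first mismatch settles the answer
            }
            i++;
            j--;
        }
        return true;                     // pointers met or crossed without a mismatch
    }
}
```

For example, `PalindromeCheck.isPalindrome("abbbbba", 0, 6, false)` compares at most three pairs of characters before the pointers meet.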

  12. Is palindrome • Alternatively, loop over the string and, at each index m, assume that m is the middle of a palindrome: move one pointer forward and the other backward from m for as long as the characters they point to match. • [Diagram: the string abbbbba with pointers i, j, and m]
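The expand-from-the-middle idea can be sketched in Java as below. The class and method names are hypothetical, and the sketch makes two assumptions the slide does not spell out: every character (including spaces) is compared literally, and each index is also tried as the left half of an even-length center so that palindromes such as "abba" are found.

```java
final class CenterExpansion {

    // Widens [left, right] while the characters at both ends match.
    // Returns {start, end} (inclusive) of the widest palindrome around that center.
    static int[] expand(String s, int left, int right) {
        while (left >= 0 && right < s.length() && s.charAt(left) == s.charAt(right)) {
            left--;
            right++;
        }
        return new int[] {left + 1, right - 1};
    }

    // Returns {start, end} (inclusive) of the first longest palindrome, or {0, -1} for "".
    static int[] longestPalindrome(String s) {
        int bestStart = 0;
        int bestEnd = -1;
        for (int m = 0; m < s.length(); m++) {
            int[] odd = expand(s, m, m);        // center on character m
            int[] even = expand(s, m, m + 1);   // center between m and m + 1
            for (int[] span : new int[][] {odd, even}) {
                if (span[1] - span[0] > bestEnd - bestStart) {
                    bestStart = span[0];
                    bestEnd = span[1];
                }
            }
        }
        return new int[] {bestStart, bestEnd};
    }
}
```

This takes quadratic time in the worst case (for example, a string of identical characters), but it avoids checking every start and end pair separately.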

  13. Suffix Tree • With the methods above, you have to run the Is Palindrome check for every candidate substring. • A more efficient way is to use a suffix tree.

  14. Lowest common ancestors • Much more can be gained from the suffix tree if we preprocess it so that we can answer lowest common ancestor (LCA) queries on it.

  15. Why? • The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes. • [Figure: example suffix tree; edges are labeled with the characters a, b, $, and #, and the leaves are numbered]

  16. Let s = cbaaba$; then sr = abaabc#. • [Figure: generalized suffix tree for s and sr, with numbered leaves]

  17. Finding maximal palindromes • Examples of palindromes: caabaac, cbaabc • We want to find all maximal palindromes in a string s. • Let s = cbaaba. • The maximal palindrome with its center between positions i-1 and i is determined by the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr.

  18. Maximal palindromes algorithm • Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#. • For every i, find the LCA of suffix i of s and suffix m-i+1 of sr.
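To make the reduction concrete, here is a small Java sketch that computes the same LCP values by direct character comparison. It deliberately swaps out the generalized suffix tree and LCA queries for a naive loop, so it is quadratic in the worst case rather than linear and only illustrates which two suffixes get compared. The class name, the 0-based indexing (suffix `i` of `s` against suffix `n - i` of the reversed string, with no terminator characters), and the restriction to even-length palindromes are assumptions of this sketch and should be checked against the slide's own indexing convention.

```java
final class MaximalPalindromes {

    // Length of the longest common prefix of a[i..] and b[j..], by direct comparison.
    static int lcp(String a, int i, String b, int j) {
        int k = 0;
        while (i + k < a.length() && j + k < b.length()
                && a.charAt(i + k) == b.charAt(j + k)) {
            k++;
        }
        return k;
    }

    // radius[i] = half-length of the maximal even-length palindrome whose center
    // lies between positions i-1 and i of s (0-based). radius[0] is always 0.
    static int[] evenRadii(String s) {
        int n = s.length();
        String sr = new StringBuilder(s).reverse().toString();
        int[] radius = new int[n];
        for (int i = 1; i < n; i++) {
            // s[i..] reads rightward from the center; sr[n-i..] starts at s[i-1]
            // and reads leftward from the center.
            radius[i] = lcp(s, i, sr, n - i);
        }
        return radius;
    }
}
```

For s = cbaaba, the only nonzero entry is `radius[3] == 2`, which corresponds to the palindrome `s.substring(3 - 2, 3 + 2)`, that is, "baab". A suffix-tree implementation would return the same values while answering each LCP query in constant time after preprocessing.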

  19. Read/Write From Files • FileReader and FileWriter always use the system’s default character encoding. If this default is not appropriate (for example, when reading an XML file that specifies its own encoding), the text will be read or written incorrectly. • See more: Reading and writing text files • Scanner vs BufferedReader

  20. Read/Write From Files • You can call these methods from IO.java • To write: StringBuffer sb = new StringBuffer(); IO.writeToFile(sb.toString(), outputFilename, "UTF-8"); • To read: String content = IO.getFileContent(inputFilename, "UTF-8"); StringTokenizer stLine = new StringTokenizer(content, "\n");
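The course's IO.java is not reproduced in these slides, so the following is only a stand-in sketch of how helpers with the same call shape might pass an explicit encoding. The class name `EncodingAwareIO` is hypothetical, and the method bodies are an assumption about one reasonable implementation, not the contents of IO.java. It relies on `InputStreamReader` and `OutputStreamWriter`, which accept a charset name, instead of `FileReader` and `FileWriter`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

final class EncodingAwareIO {

    // Writes `content` to `filename` using the named encoding (for example "UTF-8").
    static void writeToFile(String content, String filename, String encoding)
            throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(filename), encoding))) {
            out.write(content);
        }
    }

    // Reads the whole file into one String, decoding it with the named encoding.
    static String getFileContent(String filename, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), encoding))) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);    // keep line breaks exactly as stored
            }
        }
        return sb.toString();
    }
}
```

Reading in fixed-size chunks rather than line by line preserves the original line terminators, so the returned string can still be split on "\n" as the slide does with StringTokenizer.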
