ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH

Lecture 1: Introduction. S. M. Vahidipour, Vahidipour@kashanu.ac.ir


Presentation Transcript


  1. ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
     Lecture 1: Introduction
     S. M. Vahidipour
     Vahidipour@kashanu.ac.ir

  2. Outline
     • Introduction to the Course
     • Overview of the Semester

  3. Textbooks
     Search Engines: Information Retrieval in Practice
     W. Bruce Croft, Donald Metzler, Trevor Strohman
     Pearson Education, 2010

  4. Textbooks
     Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition)
     Ricardo Baeza-Yates, Berthier Ribeiro-Neto
     ACM Press Books, 2010

  5. Textbooks
     Introduction to Information Retrieval
     C. Manning, P. Raghavan, and H. Schütze
     Cambridge University Press, 2008

  6. Search and Information Retrieval
     • Search on the Web is a daily activity for many people throughout the world
     • Google: 40,000 searches per second (3.5 billion per day; 1.2 trillion per year)
     • Yahoo: 3,200 searches per second (280 million per day; 8.4 billion per month)
     • Bing: 927 searches per second (80 million per day; 2.4 billion per month)
     10^6: million, 10^9: billion, 10^12: trillion, 10^15: quadrillion, 10^18: quintillion, …
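
The per-day figures on this slide follow from the per-second rates by simple arithmetic. A minimal sketch of that conversion, assuming a constant rate over a full 24-hour day (the function name `per_day` is illustrative, not from the lecture):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day(searches_per_second: float) -> float:
    """Scale a per-second search rate to a per-day total (constant rate assumed)."""
    return searches_per_second * SECONDS_PER_DAY


# Per-second rates as quoted on the slide.
rates = {"Google": 40_000, "Yahoo": 3_200, "Bing": 927}

for engine, rate in rates.items():
    print(f"{engine}: {per_day(rate) / 1e6:,.0f} million searches per day")
```

This gives roughly 3,456 million (about 3.5 billion) for Google, 276 million for Yahoo, and 80 million for Bing, which agrees with the rounded numbers on the slide.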

  7. Search and Information Retrieval
     • Search and communication are the most popular uses of the computer.
     • Applications involving search are everywhere.
     • The field of computer science that is most involved with R&D for search is information retrieval (IR).

  8. Information Retrieval
     • “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
     • General definition that can be applied to many types of information and search applications
     • Still appropriate after 40 years.
     • Primary focus of IR since the 50s has been on text and documents

  9. Data/Information
     • Storage
     • Search

  10. Data/Information
      • Structured
      • Unstructured

  11. Structured vs. Unstructured Data

  12. What is a Document?
      • Examples:
        • Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM (instant message) sessions, etc.
      • Common properties
        • Significant text content
        • Some structure (≈ attributes in DB)
          • Papers: title, author, date
          • Email: subject, sender, destination, date

  13. Comparing Text
      • Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval.
      • Exact matching of words is not enough
        • Many different ways to write the same thing in a “natural language” like English
        • Does a news story containing the text “Karl Benz built the first automobile in 1886” match the query “car inventor”?
      • Defining the meaning of a word, a sentence, a paragraph, or a story is more difficult than defining the meaning of a database field.
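
To make the point about exact matching concrete, here is a small sketch that compares the slide's news sentence with a query by shared words only. It assumes the intended query wording is "car inventor"; the tokenizer and function names are illustrative choices, not part of the lecture.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match_overlap(query: str, document: str) -> set[str]:
    """Return the words that appear verbatim in both query and document."""
    return tokenize(query) & tokenize(document)


document = "Karl Benz built the first automobile in 1886"
query = "car inventor"  # assumed wording of the slide's query

# No word is shared, although the document clearly answers the query.
print(exact_match_overlap(query, document))

# A query that happens to reuse the document's vocabulary does match.
print(exact_match_overlap("first automobile", document))
```

The first call returns an empty set because "car" and "automobile" are different strings, which is exactly the gap between word matching and meaning that the slide describes.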

  14. Dimensions of IR
      • IR is more than just text, and more than just web search
        • although these are central
      • People doing IR work with different media, different types of search applications, and different tasks
      • Three dimensions of IR
        • Content
        • Applications
        • Tasks

  15. The Content Dimension
      • Textual data, but…
      • New applications increasingly involve new media
        • Video, photos, music, speech
        • Scanned documents (for legal purposes)
      • Like text, content is difficult to describe and compare
        • Text may be used to represent them (e.g., tags)
      • IR approaches to search and evaluation are appropriate

  16. The Application Dimension
      • Web search
        • Most common
      • Vertical search
        • Restricted domain/topic
        • Books, movies, suppliers
      • Enterprise search
        • Corporate intranet
        • Databases, emails, web pages, documentation, code, wikis, tags, directories, presentations, spreadsheets
      • Desktop search
        • Personal enterprise search
        • See above plus recent web pages
      • P2P search
        • No centralized control
        • File sharing, shared locality
      • Literature search
      • Forum search
      • …

  17. The Task Dimension
      • User queries / ad-hoc search
        • Range of queries is enormous, not pre-specified
      • Filtering
        • Given a profile (interests), notify about interesting news stories
        • Identify relevant user profiles for a new document
      • Classification / categorization
        • Automatically assign text to one or more classes of a given set
        • Identify relevant labels for documents
      • Question answering
        • Similar to search
        • Automatically answer a question posed in natural language
        • Provide a concrete answer, not a list of documents.

  18. Main Issues in IR
      • Relevance
        • A relevant document contains the information a user was looking for when he/she submitted the query
      • Evaluation
        • How well does the ranking meet the expectation of the user?
      • Users and information needs
        • Users of a search engine are the ultimate judges of quality

  19. IR and Search Engines
      • A search engine is the practical application of information retrieval techniques to large-scale text collections
      • Big issues include main IR issues but also some others…
      Information Retrieval
        • Relevance: Effective ranking
        • Evaluation: Testing and measuring
        • Information needs: User interaction
      Search Engines (additional)
        • Performance: Efficient search and indexing
        • Incorporating new data: Coverage and freshness
        • Scalability: Growing with data and users
        • Adaptability: Tuning for applications
        • Specific problems: e.g., spam

  20. Outline
      • Introduction to the Course
      • Overview of the Semester

  21. Search Engine
      • Basic architecture
      • Main issues
      • Indexing
        • Text acquisition
        • Text transformation
        • Index creation
      • Querying
        • User interaction
        • Ranking
        • Evaluation
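
The indexing and querying steps listed on this slide can be sketched with a tiny in-memory inverted index. This is only an illustration under simple assumptions: documents are short strings already in memory (standing in for text acquisition), transformation is lowercasing plus word splitting, and querying is a conjunctive lookup with no ranking or evaluation. All names are hypothetical.

```python
import re
from collections import defaultdict


def transform(text: str) -> list[str]:
    """Text transformation: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index creation: map each term to the set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Querying: return ids of documents that contain every query term."""
    terms = transform(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Text acquisition is simulated by a small hand-written collection.
docs = {
    1: "Search engines index web pages",
    2: "An index maps terms to documents",
    3: "Web search is a daily activity",
}
index = build_index(docs)
print(search(index, "web index"))  # [1]
print(search(index, "index"))      # [1, 2]
```

A real engine would add compression, positional information, and a ranking function on top of this lookup; the sketch only shows how the index connects the two halves of the architecture.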

  22. Overview of Traditional Retrieval Models
      • Boolean retrieval
      • Vector space model
      • Probabilisticmodels
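
As a preview of the vector space model named on this slide, the sketch below ranks documents by cosine similarity between TF-IDF vectors. The weighting used here (raw term frequency times `log(N/df) + 1`) is one common variant chosen for illustration; the lecture does not specify a formula, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, docs: list[str]) -> list[tuple[float, int]]:
    """Return (score, doc index) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumed IDF variant: log(N / df) + 1, so shared terms keep some weight.
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q_vec, tfidf_vector(toks, idf)), i)
              for i, toks in enumerate(tokenized)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "information retrieval and web search",
    "database systems store structured data",
    "web search engines rank web pages",
]
ranking = rank("web search", docs)
for score, i in ranking:
    print(f"{score:.3f}  {docs[i]}")
```

Unlike Boolean retrieval, which only says whether a document matches, this produces a graded ordering: the third document ranks first because it repeats a query term, and the database document scores zero.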

  23. Overview of Evaluation Metrics
      • Effectiveness metrics
      • Efficiency metrics
      • Training, testing, and statistics
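
Precision, recall, and average precision are standard examples of effectiveness metrics. The sketch below computes them for one ranked list, assuming binary relevance judgments; the document ids and judgments are made up for illustration, and efficiency metrics are not covered.

```python
def precision_recall(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall over the top-k results of a ranking."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p, r = precision_recall(ranked, relevant, k=5)
ap = average_precision(ranked, relevant)
print(f"P@5={p:.2f}  R@5={r:.2f}  AP={ap:.2f}")
```

Here two of the five returned documents are relevant (precision 2/5), two of the three relevant documents are found (recall 2/3), and average precision is (1/2 + 2/4) / 3 = 1/3 because the relevant document `d9` is never retrieved.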

  24. Advanced Retrieval Models
      • Language model-based retrieval
      • Learning to rank
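
One widely used form of language model-based retrieval is query likelihood with Dirichlet smoothing. The sketch below is an illustration of that idea, not necessarily the formulation used later in the course; the parameter `mu`, the tiny collection, and the choice to skip query terms unseen in the collection are all assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, mu=2000.0):
    """log P(Q|D) with Dirichlet smoothing:
    P(q|D) = (count(q, D) + mu * P(q|C)) / (|D| + mu)
    """
    doc_counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # assumption: ignore terms unseen in the whole collection
        p_doc = (doc_counts[term] + mu * p_coll) / (len(doc_terms) + mu)
        score += math.log(p_doc)
    return score


# Hypothetical two-document collection, already tokenized.
docs = [
    "web search engines rank web pages".split(),
    "database systems store structured data".split(),
]
collection_counts = Counter(t for d in docs for t in d)
collection_len = sum(len(d) for d in docs)

query = ["web", "search"]
# A small mu suits this toy collection; real collections use larger values.
scores = [query_log_likelihood(query, d, collection_counts, collection_len, mu=10.0)
          for d in docs]
print(scores)
```

Smoothing gives the second document a nonzero probability for terms it does not contain, so it still receives a finite score, but the first document scores higher because it actually contains both query terms.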

  25. Word Mismatch Problem
      • Language model-based approaches
      • Translation model
      • Topic model
      • Word cluster model
      • WordNet
      • Dependency model
      • Query expansion approaches
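
A very simple way to picture query expansion is to add related words to the query before matching. The sketch below uses a hand-made table of related terms, which is purely hypothetical; real systems would derive such terms from resources like WordNet, query logs, or statistical models.

```python
import re

# Hypothetical, hand-made table of related terms for illustration only.
RELATED_TERMS = {
    "laptop": {"notebook"},
    "price": {"cost"},
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expand(query_terms: set[str], related: dict[str, set[str]]) -> set[str]:
    """Return the query terms plus any related terms listed for them."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= related.get(term, set())
    return expanded


document = tokenize("Notebook computers cost less this year")
query = tokenize("laptop price")

print(query & document)                         # no shared words before expansion
print(expand(query, RELATED_TERMS) & document)  # shared words after expansion
```

Before expansion the query and document share no words; after expansion they share "notebook" and "cost", which is the effect query expansion approaches aim for.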

  26. Advanced/Specific IR Tasks
      • Query log and query suggestion
      • Personalized search
      • Information extraction
      • Cross-language IR
      • Question answering
      • Recommendation systems
      • Enterprise search
      • Digital library
      • Structured text retrieval
      • Multimedia retrieval

  27. Query Log and Query Suggestion

  28. Personalized Search

  29. Information Extraction

  30. Cross-language Retrieval

  31. Question Answering

  32. Recommendation Systems

  33. Enterprise Search

  34. Digital Library

  35. Structured Text Retrieval

  36. Multimedia Retrieval

  37. Questions?
