ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Lecture 1: Introduction
S. M. Vahidipour
Vahidipour@kashanu.ac.ir
Outline
• Introduction to the Course
• Overview of the Semester
Text Books
• Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, Trevor Strohman. Pearson Education, 2010
Text Books
• Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition). Ricardo Baeza-Yates, Berthier Ribeiro-Neto. ACM Press Books, 2010
Text Books
• Introduction to Information Retrieval. C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008
Search and Information Retrieval
• Search on the Web is a daily activity for many people throughout the world
• Google: 40,000 searches per second (3.5 billion per day; 1.2 trillion per year)
• Yahoo: 3,200 searches per second (280 million per day; 8.4 billion per month)
• Bing: 927 searches per second (80 million per day; 2.4 billion per month)
10^6: million, 10^9: billion, 10^12: trillion, 10^15: quadrillion, 10^18: quintillion, …
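The per-day, per-month, and per-year figures follow from the per-second rates by simple scaling. A minimal sketch of that arithmetic, assuming a 30-day month and a 365-day year (the function name `scale_rate` is illustrative, not from the lecture):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def scale_rate(per_second):
    """Scale a searches-per-second rate to longer periods.

    Assumes a 30-day month and a 365-day year (rough conventions).
    """
    per_day = per_second * SECONDS_PER_DAY
    return {
        "per_day": per_day,
        "per_month": per_day * 30,
        "per_year": per_day * 365,
    }

# Per-second rates as quoted on the slide.
google = scale_rate(40_000)
yahoo = scale_rate(3_200)
bing = scale_rate(927)

for name, rates in [("Google", google), ("Yahoo", yahoo), ("Bing", bing)]:
    print(f"{name}: {rates['per_day']:.3e} per day, "
          f"{rates['per_month']:.3e} per month, "
          f"{rates['per_year']:.3e} per year")
```

The computed values agree with the rounded figures on the slide to within rounding (for example, 40,000 per second is about 3.46 × 10^9 per day).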
Search and Information Retrieval
• Search and communication are the most popular uses of the computer.
• Applications involving search are everywhere.
• The field of computer science that is most involved with R&D for search is information retrieval (IR).
Information Retrieval
• “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
• General definition that can be applied to many types of information and search applications
• Still appropriate after 40 years.
• Primary focus of IR since the 1950s has been on text and documents
Data/Information
• Storage
• Search
Data/Information
• Structured
• Unstructured
What is a Document?
• Examples:
  • Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM (instant messaging) sessions, etc.
• Common properties
  • Significant text content
  • Some structure (≈ attributes in DB)
    • Papers: title, author, date
    • Email: subject, sender, destination, date
Comparing Text
• Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval.
• Exact matching of words is not enough
  • Many different ways to write the same thing in a “natural language” like English
  • Does a news story containing the text “Karl Benz built the first automobile in 1886” match the query “car inventor”?
• Defining the meaning of a word, a sentence, a paragraph, or a story is more difficult than defining the meaning of a database field.
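To make the limits of exact matching concrete, here is a minimal sketch that compares word sets, using the Karl Benz sentence from the slide and a query about the car's inventor. Lowercasing and whitespace tokenization are simplifying assumptions, and the function names are illustrative.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())

def exact_match_overlap(query, document):
    """Return the set of words shared by the query and the document."""
    return tokenize(query) & tokenize(document)

document = "Karl Benz built the first automobile in 1886"

# A query that is about the same topic but shares no words with the document.
topical_overlap = exact_match_overlap("car inventor", document)

# A query that happens to reuse the document's vocabulary.
literal_overlap = exact_match_overlap("first automobile", document)

print(topical_overlap)   # no shared words
print(literal_overlap)   # shared words found
```

Although the document is clearly relevant to the first query, exact matching finds no common words, because "car" and "automobile" (and "inventor" and "built") are different strings. This is the vocabulary mismatch that retrieval models have to address.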
Dimensions of IR
• IR is more than just text, and more than just web search
  • although these are central
• People doing IR work with different media, different types of search applications, and different tasks
• Three dimensions of IR
  • Content
  • Applications
  • Tasks
The Content Dimension
• Textual data, but…
• New applications increasingly involve new media
  • Video, photos, music, speech
  • Scanned documents (for legal purposes)
• Like text, content is difficult to describe and compare
  • Text may be used to represent them (e.g., tags)
• IR approaches to search and evaluation are appropriate
The Application Dimension
• Web search
  • Most common
• Vertical search
  • Restricted domain/topic
  • Books, movies, suppliers
• Enterprise search
  • Corporate intranet
  • Databases, emails, web pages, documentation, code, wikis, tags, directories, presentations, spreadsheets
• Desktop search
  • Personal enterprise search
  • See above plus recent web pages
• P2P search
  • No centralized control
  • File sharing, shared locality
• Literature search
• Forum search
• …
The Task Dimension
• User queries / ad-hoc search
  • Range of queries is enormous, not pre-specified
• Filtering
  • Given a profile (interests), notify about interesting news stories
  • Identify relevant user profiles for a new document
• Classification / categorization
  • Automatically assign text to one or more classes of a given set
  • Identify relevant labels for documents
• Question answering
  • Similar to search
  • Automatically answer a question posed in natural language
  • Provide a concrete answer, not a list of documents.
Main Issues in IR
• Relevance
  • A relevant document contains the information a user was looking for when he/she submitted the query
• Evaluation
  • How well does the ranking meet the expectation of the user
• Users and information needs
  • Users of a search engine are the ultimate judges of quality
IR and Search Engines
• A search engine is the practical application of information retrieval techniques to large-scale text collections
• Big issues include main IR issues but also some others…
Information Retrieval:
• Relevance: Effective ranking
• Evaluation: Testing and measuring
• Information needs: User interaction
Search Engines (additional):
• Performance: Efficient search and indexing
• Incorporating new data: Coverage and freshness
• Scalability: Growing with data and users
• Adaptability: Tuning for applications
• Specific problems: e.g., spam
Outline
• Introduction to the Course
• Overview of the Semester
Search Engine
• Basic architecture
• Main issues
• Indexing
  • Text acquisition
  • Text transformation
  • Index creation
• Querying
  • User interaction
  • Ranking
  • Evaluation
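As a rough illustration of the indexing and querying steps listed above, the following sketch builds a tiny inverted index and answers a conjunctive (AND) query against it. The toy documents, the whitespace tokenizer, and the function names are assumptions made for this example, not part of the course material.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Index creation: map each term to the set of document ids containing it.

    `docs` is a dict of {doc_id: text}. Text transformation here is just
    lowercasing and whitespace splitting (no stemming or stopword removal).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Querying: return ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Text acquisition is assumed to have produced this small collection.
docs = {
    1: "information retrieval and web search",
    2: "web search engines",
    3: "database storage",
}
index = build_inverted_index(docs)

both = search_and(index, "web search")
only_engines = search_and(index, "web engines")
missing = search_and(index, "web quantum")
print(both, only_engines, missing)
```

The index is built once and each query only touches the posting sets of its own terms, which is what makes lookup cheap compared with scanning every document. Ranking and evaluation are left out of this sketch.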
Overview of Traditional Retrieval Models
• Boolean retrieval
• Vector space model
• Probabilistic models
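As a hedged illustration of the vector space model, the sketch below represents documents and the query as TF-IDF weighted term vectors and ranks documents by cosine similarity. The weighting variant (raw term frequency times `log(N/df) + 1`), the toy documents, and all function names are assumptions chosen for brevity. Many other weighting schemes exist.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (list of sparse TF-IDF vectors, idf table) for a list of texts."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (document indices sorted by decreasing score, raw scores)."""
    doc_vecs, idf = tf_idf_vectors(docs)
    q_counts = Counter(query.lower().split())
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scores = [cosine(q_vec, d) for d in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores

docs = [
    "web search engines rank web pages",
    "database systems store structured records",
    "search engines index documents",
]
order, scores = rank("web search", docs)
print(order, [round(s, 3) for s in scores])
```

Unlike Boolean retrieval, which only says whether a document matches, this produces a graded score, so the document that mentions both query terms ranks above the one that mentions only one of them.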
Overview of Evaluation Metrics
• Effectiveness metrics
• Efficiency metrics
• Training, testing, and statistics
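Two widely used effectiveness metrics are precision and recall at a rank cutoff, along with average precision over a ranked list. The sketch below computes them for one query. The document ids and relevance judgments are made up for illustration, and the helpers assume `k > 0` and a non-empty set of relevant documents.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k results of a ranked list.

    Assumes k > 0 and that `relevant` is a non-empty set.
    """
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k, hits / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

# Hypothetical ranking returned by a system, and hypothetical judgments.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p3, r3 = precision_recall_at_k(ranked, relevant, 3)
p5, r5 = precision_recall_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
print(p3, r3, p5, r5, ap)
```

Here one of the top three results is relevant, so precision and recall at 3 are both 1/3. Averaging `average_precision` over many queries gives mean average precision (MAP).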
Advanced Retrieval Models
• Language model-based retrieval
• Learning to rank
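One common form of language model-based retrieval is query likelihood: each document is scored by the probability that its (smoothed) unigram language model generates the query. The sketch below uses Jelinek-Mercer smoothing with a mixing weight `lam`. The choice of smoothing method, the value 0.5, the toy documents, and the function name are all assumptions for illustration.

```python
import math
from collections import Counter

def query_likelihood(query, docs, lam=0.5):
    """Return log P(query | document) for each document.

    P(t | d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    Assumes every document is non-empty.
    """
    tokenized = [d.lower().split() for d in docs]
    collection = Counter(t for toks in tokenized for t in toks)
    collection_len = sum(collection.values())
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            p_doc = tf[term] / len(toks)
            p_coll = collection[term] / collection_len
            p = (1 - lam) * p_doc + lam * p_coll
            if p == 0.0:
                # Term never occurs in the collection: probability is zero.
                score = float("-inf")
                break
            score += math.log(p)
        scores.append(score)
    return scores

docs = [
    "web search engines rank web pages",
    "database systems store structured records",
    "search engines index documents",
]
lm_scores = query_likelihood("web search", docs)
print([round(s, 3) for s in lm_scores])
```

Smoothing with collection statistics keeps a document from getting zero probability just because it is missing one query term, so the second document still receives a finite (but low) score.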
Word Mismatch Problem
• Language model-based approaches
• Translation model
• Topic model
• Word cluster model
• WordNet
• Dependency model
• Query expansion approaches
Advanced/Specific IR Tasks
• Query log and query suggestion
• Personalized search
• Information extraction
• Cross-language IR
• Question answering
• Recommendation systems
• Enterprise search
• Digital library
• Structured text retrieval
• Multimedia retrieval