
CS344: Introduction to Artificial Intelligence




  1. CS344: Introduction to Artificial Intelligence Vishal Vachhani, M.Tech, CSE Lecture 34-35: CLIR and Ranking, Crawling and Indexing in IR

  2. Road Map • Cross Lingual IR • Motivation • CLIA architecture • CLIA demo • Ranking • Various ranking methods • Nutch/Lucene ranking • Learning a ranking function • Experiments and results

  3. Cross Lingual IR • Motivation • Information unavailability in some languages • Language barrier • Definition: • Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia) • Example: • A user may issue a query in Hindi but retrieve relevant documents written in English.

  4. Why CLIR? [Figure labels: Query in Tamil; System; search; English Document; Marathi Document; Snippet Generation and Translation]

  5. Cross Lingual Information Access • Cross Lingual Information Access (CLIA) • A web portal supporting monolingual and cross-lingual IR in 6 Indian languages and English • Domain: Tourism • It supports: • Summarization of web documents • Snippet translation into the query language • Template based information extraction • The CLIA system is publicly available at • http://www.clia.iitb.ac.in/clia-beta-ext

  6. CLIA Demo

  7. Various Ranking methods • Vector Space Model • Lucene, Nutch, Lemur, etc. • Probabilistic Ranking Model • Classical Spärck Jones ranking (log-odds ratio) • Language Model • Ranking using machine learning algorithms • SVM, learning to rank, SVM-MAP, etc. • Link analysis based ranking • PageRank, Hubs and Authorities, OPIC, etc.

  8. Nutch Ranking • CLIA is built on top of Nutch, an open-source web search engine. • Its ranking is based on the vector space model
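
To make the vector space model concrete, here is a minimal sketch that ranks documents by tf-idf cosine similarity to a query. It is an illustration only, not Nutch's or Lucene's actual scoring formula (which includes further factors); the function name `rank_by_cosine` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def rank_by_cosine(query, docs):
    """Rank documents (lists of tokens) by tf-idf cosine similarity to the query."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    def vector(tokens):
        # tf-idf weight per term; terms unseen in the collection are ignored.
        tf = Counter(t for t in tokens if t in df)
        return {t: count * math.log(n / df[t]) for t, count in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(query)
    scores = [cosine(q, vector(doc)) for doc in docs]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores

# Hypothetical tourism-style documents, already tokenized.
docs = [s.split() for s in [
    "goa beach resort",
    "temple tour in madurai",
    "beach temple in mahabalipuram",
]]
order, scores = rank_by_cosine("beach temple".split(), docs)
print(order)  # document indices, best match first
```

The third document contains both query terms, so it ranks first; the other two each match only one term.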

  9. Link analysis • Calculates the importance of pages using the web graph • Node: page • Edge: hyperlink between pages • Motivation: a link analysis based score is hard to manipulate using spamming techniques • Plays an important role in the web IR scoring function • PageRank • Hubs and Authorities • Online Page Importance Computation (OPIC) • The link analysis score is used along with the tf-idf based score • We use the OPIC score as a factor in CLIA.
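
As an illustration of link analysis on a web graph, the sketch below computes PageRank by power iteration. Note that CLIA uses OPIC, not PageRank; PageRank is shown here only because it is the simplest of the listed methods to write down. The damping factor 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # Split this page's score equally among its outlinks.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # Dangling page: spread its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph: A, B and D all link to C; C links back to A.
web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(web)
best = max(scores, key=scores.get)
print(best)
```

Page C collects links from three pages and ends up with the highest score; a page cannot raise its own score just by repeating keywords, which is the motivation stated on the slide.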

  10. Learning a ranking function • How much weight should be given to the different parts of a web document when ranking? • A ranking function can be learned using the following method • Machine learning algorithms: SVM, Max-entropy • Training • A set of queries, each with some relevant and non-relevant documents • A set of features to capture the similarity between documents and query • In short, learn the optimal weights of the features • Ranking • Use the trained model to generate a score by combining the feature scores for the set of documents in which the query words appear • Sort the documents by score and display them to the user
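
The training and ranking steps above can be sketched with a much simpler learner than the SVM or max-entropy models named on the slide: a pairwise perceptron that adjusts feature weights until each relevant document outscores its non-relevant counterpart. The three features (title match, body score, link score), their values, and the function names are all hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """pairs: list of (relevant_features, non_relevant_features) for the same query."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for relevant, non_relevant in pairs:
            # Update only when the relevant document is not ranked strictly higher.
            if dot(w, relevant) <= dot(w, non_relevant):
                w = [wi + lr * (r - nr)
                     for wi, r, nr in zip(w, relevant, non_relevant)]
    return w

def rank(w, docs):
    """Sort document names by the weighted combination of their feature scores."""
    return sorted(docs, key=lambda name: dot(w, docs[name]), reverse=True)

# Hypothetical features: [title_match, body_score, link_score]
pairs = [
    ([1.0, 0.2, 0.5], [0.0, 0.6, 0.4]),
    ([1.0, 0.5, 0.1], [0.0, 0.9, 0.3]),
]
weights = train_pairwise(pairs, n_features=3)

candidates = {"d1": [1.0, 0.3, 0.2], "d2": [0.0, 0.8, 0.9]}
ranking = rank(weights, candidates)
print(ranking)
```

On this toy data the learner ends up giving most weight to the title-match feature, so `d1` is ranked ahead of `d2`.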

  11. Extended Features for Web IR • Content based features • tf, idf, length, coord, etc. • Link analysis based features • OPIC score • Domain-based OPIC score • Standard IR algorithm based features • BM25 score • Lucene score • LM based score • Language category based features • Named entity • Phrase based features
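
One of the listed features, the BM25 score, can be computed as in the sketch below. It uses the common Okapi BM25 form with a non-negative idf variant; the parameter defaults `k1=1.2` and `b=0.75` and the toy documents are assumptions, and the exact variant used in the CLIA experiments is not stated in the slides.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document; query and docs are lists of tokens."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher idf; the +1 keeps it non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    ["temple", "tour", "temple"],
    ["beach", "resort"],
    ["temple", "beach", "tour", "guide"],
]
bm25 = bm25_scores(["temple"], docs)
print(bm25)
```

The first document mentions the query term twice in fewer words, so it scores above the third; the second does not contain the term and scores zero.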

  12. Content based Features

  13. Details of features

  14. Details of features (cont.)

  15. Experiments and results

  16. Crawling, Indexing

  17. Outline • Nutch Overview • Crawler in CLIA system • Data structure • Crawler in CLIA • Indexing • Types of index and indexing tools • Searching • Command line API • Searching through GUI • Demo

  18. Crawler • The crawler system is driven by the Nutch crawl tool, and a family of related tools to build and maintain several types of data structures, including the web database, a set of segments, and the index.

  19. Crawler Data Structure • Web Database (webdb) • persistent data structure for web graph being crawled. • Stores pages and links • Segment • A collection of pages fetched and indexed by the crawler in a single run • Index • Inverted index of all of the pages the system has retrieved

  20. Web Crawler [Figure labels: Initial URLs; Injector; CrawlDB; CrawlDBTool; Generator; Segment; Fetcher; Parser; Webpages/files; arrows labelled update, get, generate, read/write]

  21. Crawl command • Aimed at intranet-scale crawling • A front end to other, lower-level tools • It performs crawling and indexing • Create a URLs directory and put the URL list in it. • Command • $NUTCH_HOME/bin/nutch crawl urlDir [Options] • Options • -dir: the directory to put the crawl in. • -depth: the link depth from the root page that should be crawled. • -threads: the number of threads that will fetch in parallel. • -topN: the maximum number of pages to fetch at each level of depth. • Example: bin/nutch crawl urls -dir crawldir -depth 3 -topN 10

  22. Inject command • Injects root URLs into the WebDB • Command • $NUTCH_HOME/bin/nutch inject <crawldb> <urldir> • <crawldb>: Path to the Crawl Database directory • <urldir>: Path to the directory containing flat text URL files

  23. Generate command • Generates a new Fetcher Segment from the Crawl Database • Command: • $NUTCH_HOME/bin/nutch generate <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>] • <crawldb>: Path to the crawldb directory. • <segments_dir>: Path to the directory where the Fetcher Segments are created. • [-topN <num>]: Selects the top <num> ranking URLs for this segment. • [-numFetchers <fetchers>]: The number of fetch partitions.

  24. Fetch command • Runs the Fetcher on a segment • Command: • $NUTCH_HOME/bin/nutch fetch <segment> [-threads <n>] [-noParsing] • <segment>: Path to the segment to fetch • [-threads <n>]: The number of fetcher threads to run • [-noParsing]: Disables automatic parsing of the segment's data

  25. Parse command • Runs ParseSegment on a segment. • Command • $NUTCH_HOME/bin/nutch parse <segment> • <segment>: Path to the segment to parse.

  26. Updatedb command • Updates the Crawl DB with information obtained from the Fetcher • Command: • $NUTCH_HOME/bin/nutch updatedb <crawldb> <segment> • <crawldb>: Path to the crawl database. • <segment>: Path to the segment that has been fetched.
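
The inject, generate, fetch, parse and updatedb commands form a loop that is repeated once per crawl depth. The following toy simulation sketches that loop entirely in memory; it does not call Nutch. The dictionary `web` stands in for the real web, and selecting URLs in insertion order is a simplification (Nutch's generate step selects the top-scoring URLs).

```python
def crawl(web, seeds, depth, top_n):
    """Simulate the Nutch crawl cycle on an in-memory web graph.

    web maps each URL to the list of URLs it links to.
    Returns the final crawldb: URL -> "fetched" or "unfetched".
    """
    # inject: seed the crawldb with the root URLs
    crawldb = {url: "unfetched" for url in seeds}
    for _ in range(depth):
        # generate: pick up to top_n unfetched URLs as the next segment
        segment = [u for u, status in crawldb.items() if status == "unfetched"][:top_n]
        if not segment:
            break
        # fetch + parse: retrieve each page and extract its outlinks
        parsed = {url: web.get(url, []) for url in segment}
        # updatedb: mark pages as fetched and add newly discovered URLs
        for url, outlinks in parsed.items():
            crawldb[url] = "fetched"
            for link in outlinks:
                crawldb.setdefault(link, "unfetched")
    return crawldb

# Hypothetical four-page site.
web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
crawldb = crawl(web, seeds=["a"], depth=2, top_n=10)
print(crawldb)
```

With depth 2, the seed and the pages it links to are fetched, while page `d`, discovered in the second round, stays unfetched until another round is run.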

  27. Index and Indexing • Sequential Search is bad (Not Scalable) • Indexing – the creation of a data structure that facilitates fast, random access to information stored in it. • Types of Index • Forward Index • Inverted Index • Full Inverted Index

  28. Forward Index • It stores a list of words for each document • Example: D1=“it is what it is.” D2=“what is it.” D3=“it is a banana”
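
A forward index for the three example documents can be sketched as a mapping from document id to its word list. The tokenizer (lowercasing and dropping punctuation) is an assumption; the slides do not say how words are extracted.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

docs = {
    "D1": "it is what it is.",
    "D2": "what is it.",
    "D3": "it is a banana",
}

# Forward index: document id -> list of words in that document.
forward_index = {doc_id: tokenize(text) for doc_id, text in docs.items()}
print(forward_index["D1"])
```

Answering "which documents contain banana?" with this structure still requires scanning every document, which is why the inverted index is introduced next.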

  29. Inverted Index • It stores a list of documents for each word
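
For the same three documents (D1=“it is what it is.”, D2=“what is it.”, D3=“it is a banana”), an inverted index can be sketched as a mapping from each word to the set of documents containing it. The helper names and the tokenizer are assumptions for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_all(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    "D1": "it is what it is.",
    "D2": "what is it.",
    "D3": "it is a banana",
}
inverted_index = build_inverted_index(docs)
print(sorted(inverted_index["what"]))
```

A query is answered by intersecting the document sets of its words, so only the relevant posting sets are touched instead of every document.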

  30. Full Inverted Index • It is used to support phrase search. • Query: “What is it”
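
A full inverted index also records the position of each word occurrence, which is what makes phrase search possible. The sketch below builds such an index for the example documents (D1=“it is what it is.”, D2=“what is it.”, D3=“it is a banana”) and runs the phrase query from the slide; the function names and tokenizer are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    """Return the documents in which the words of the phrase occur consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    postings = [index.get(word, {}) for word in words]
    # Candidates must contain every word of the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            # The i-th word must appear exactly i positions after the first.
            if all(start + i in postings[i][doc_id] for i in range(1, len(words))):
                hits.add(doc_id)
                break
    return hits

docs = {
    "D1": "it is what it is.",
    "D2": "what is it.",
    "D3": "it is a banana",
}
positional_index = build_positional_index(docs)
result = phrase_search(positional_index, "What is it")
print(result)
```

D1 contains all three words but not in this order, so only D2 matches the phrase; a plain inverted index could not tell the two apart.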

  31. Invertlinks command • Updates the Link Database with linking information from a segment • Command: • $NUTCH_HOME/bin/nutch invertlinks <linkdb> (-dir segmentsDir | segment1 segment2 ...) • <linkdb>: Path to the link database. • <segment>: Path to the segment that has been fetched. A directory or more than one segment may be specified.

  32. Index command • Creates an index of a segment using information from the crawldb and the linkdb to score pages in the index • Command: • $NUTCH_HOME/bin/nutch index <index> <crawldb> <linkdb> <segment> ... • <index>: Path to the directory where the index will be created • <crawldb>: Path to the crawl database directory • <linkdb>: Path to the link database directory • <segment>: Path to the segment that has been fetched. More than one segment may be specified.

  33. Dedup command • Removes duplicate pages from a set of segment indexes • Command: • $NUTCH_HOME/bin/nutch dedup <indexes> • <indexes>: Path to directories containing indexes

  34. Merge command • Merges several segment indexes • Command: • $NUTCH_HOME/bin/nutch merge <outputIndex> <indexesDir> ... • <outputIndex>: Path to a directory where the merged index will be created. • <indexesDir>: Path to a directory containing indexes to merge. More than one directory may be specified.

  35. Configuring CLIA crawler • Configuration file: $NUTCH/conf/nutch-site.xml • Required user parameters • http.agent.name • http.agent.description • http.agent.url • http.agent.email • Optional user parameters • http.proxy.host • http.proxy.port

  36. Configuring CLIA crawler • Configuration file: $NUTCH/conf/crawl-urlfilters.txt • Regular expressions to filter URLs during crawling • E.g. • To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$ • To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/ • Change the following line (at line 26 of crawl-urlfilters.txt): • # skip everything else +.
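
How such filter rules are applied can be sketched as follows. The sketch assumes the usual Nutch regex filter semantics: rules are tried in file order, the first pattern that matches anywhere in the URL decides (`+` accepts, `-` rejects), and a URL matching no rule is rejected. The function names are hypothetical, and this is not Nutch's own implementation.

```python
import re

def parse_rules(lines):
    """Turn filter lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # In this sketch, any line not starting with "+" is treated as a reject rule.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Apply the first matching rule; reject the URL if no rule matches."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules([
    r"-\.(gif|exe|zip|ico)$",
    r"+^http://([a-z0-9]*\.)*apache.org/",
    "# skip everything else",
    "-.",
])

print(accepts(rules, "http://lucene.apache.org/nutch/"))
```

Because the suffix rule comes first, an image under an accepted domain is still rejected; replacing the final `-.` with `+.` would accept every URL not rejected earlier.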

  37. Searching and Indexing [Figure labels: CrawlDB; LinkDB; Segments; Indexer (Lucene); Index; Searcher (Lucene); GUI (Tomcat)]

  38. Crawl Directory Structure • Crawldb • Contains the information about every URL known to Nutch • Linkdb • Contains the list of known links to each URL • Segment • crawl_generate names a set of URLs to be fetched • crawl_fetch contains the status of fetching each URL • content contains the content of each URL • parse_text contains the parsed text of each URL • parse_data contains outlinks and metadata parsed from each URL • crawl_parse contains the outlink URLs, used to update the crawldb • Index • Contains Lucene-format indexes.

  39. Searching • Configuration file: $NUTCH/conf/nutch-default.xml • Change the following property: • searcher.dir: complete path to your crawl folder • Command line searching API • $NUTCH_HOME/bin/nutch org.apache.nutch.searcher.NutchBean queryString

  40. Searching • Create the clia-alpha-test.war file using “ant war” • Deploy the clia-alpha-test.war file in the Tomcat webapps directory • Open http://localhost:8080/clia-alpha-test/

  41. Thanks
