
The Marathi Portal with a Search Engine Center for Indian Language Technology Solutions, IIT Bombay


Lucy




Presentation Transcript


  1. The Marathi Portal with a Search Engine • Center for Indian Language Technology Solutions, IIT Bombay

  2. A Search Engine • To promote the use of information available on the web in the Marathi language • Locate the right pages that you need • Present the pages to the user in order of importance

  3. Types of Searches • Based on user queries • Category based search • Browse through pre-classified categories • Search selected literature which will be hosted on the Marathi Portal

  4. Search Engine: Performance Criteria • Coverage • Cover as many pages as possible. A study has revealed that a large part of the web remains unindexed • Response time • The user should be presented with the results as quickly as possible • Relevance • The information presented should be relevant and ranked in order of importance

  5. Main Components of a Search Engine • Crawling unit • Indexing unit • Searching unit • Ranking unit

  6. A Prototype • A prototype has been developed to gauge the complexity and architectural issues involved in developing the complete Marathi Portal

  7. About the Prototype • A search engine prototype has been built with manually selected sites in different categories • It indexes about 1800 pages consisting of over 10,14,000 words (about 1.01 million) • The engine is developed on the Windows platform using MS Access • Monolingual ISFOC pages are covered

  8. Ranking Criteria used in the prototype • Number of words in the query string that appear in the document • In an OR search, documents containing the largest number of words from the query string are ranked higher • Proximity between words • Number of query words that occur together within a distance of 5 words • Context of the word • Is it in the title or the body? • Frequency of the desired word in the document • Number of occurrences of the word
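The four criteria on this slide can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the prototype combines the criteria, so the lexicographic ordering of the score tuple, the whitespace tokenizer, and the names `Document`, `score`, and `rank` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    body: str


def tokenize(text):
    # Assumption: whitespace tokenization; the prototype's tokenizer is not described.
    return text.lower().split()


def score(query, doc, window=5):
    """Return (coverage, proximity, in_title, frequency) for one document."""
    q = set(tokenize(query))
    title = tokenize(doc.title)
    body = tokenize(doc.body)

    # 1. Number of distinct query words that appear in the document.
    coverage = len(q & (set(title) | set(body)))

    # 2. Proximity: pairs of different query words at most `window` positions apart.
    positions = [(i, w) for i, w in enumerate(body) if w in q]
    proximity = 0
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            i, wi = positions[a]
            j, wj = positions[b]
            if j - i > window:
                break
            if wi != wj:
                proximity += 1

    # 3. Context: how many query words occur in the title.
    in_title = sum(1 for w in q if w in title)

    # 4. Frequency: total occurrences of query words in the body.
    frequency = sum(1 for w in body if w in q)

    return (coverage, proximity, in_title, frequency)


def rank(query, docs):
    # Assumption: criteria are compared in the order listed on the slide.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)
```

For example, `rank("tukaram abhang", docs)` places a document that contains both words close together ahead of one that contains only one of them.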

  9. A Fast Engine is under Development • A fast Linux-based prototype covering the same number of pages is being developed • It takes 2 minutes to build the dictionary, 2 hours to build the index, and less than a second to search

  10. What if the Machine that hosts the engine fails? • The index must be in main memory while a search is being performed • You cannot afford to lose the index, since it would take days (even months for large engines) to rebuild it over a large number of pages • Dumping the index of the Linux prototype through traversal takes around 35 minutes • But loading it back into main memory took only 2 minutes!
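The dump-and-reload idea on this slide can be illustrated with a minimal sketch. The slides do not describe the on-disk format of the Linux prototype, so the use of Python's `pickle` module, the write-to-temporary-file-then-rename step, and the function names `save_index` and `load_index` are assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_index(index, path):
    """Dump the in-memory index to disk without leaving a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
        # Replace the old dump only once the new one is complete.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_index(path):
    """Reload a previously dumped index into main memory after a restart."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Writing to a temporary file first means a crash during the dump leaves the previous copy of the index intact, which matters when rebuilding takes hours or days.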

  11. Requirements from the Infrastructure for the actual Portal • High RAM – in GBs • High Computing Power: Parallel Processing through a network of workstations • Parallel IO • As the number of users increases, more and more parallelism will have to be employed to guarantee the same performance criteria to each user

  12. Representations and Fonts • Currently only ISFOC is supported • There are sites in Marathi with different types of encodings which need to be integrated • Converters • Input/Display technology for Linux

  13. Crawling • Crawling and meta-crawling techniques • Some interesting facts: • E.g. it was found that the word ‘Aahe’ is one of the most widely occurring words • The words ‘Aahe’ and ‘Aani’ together span most of the documents • There are specific words that occur most widely and most frequently in different categories
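Observations like these can be measured with a simple document-frequency count. The sketch below is illustrative only: the function names `document_frequency` and `span`, the pre-tokenized input, and the romanized sample words are assumptions, not part of the prototype.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return df


def span(docs, words):
    """Fraction of documents containing at least one of the given words."""
    wanted = set(words)
    return sum(1 for d in docs if wanted & set(d)) / len(docs)


# Toy corpus of already-tokenized documents (hypothetical data).
docs = [
    ["he", "pustak", "aahe"],
    ["ram", "aani", "sham"],
    ["paus"],
]
```

Here `document_frequency(docs).most_common()` lists the most widely occurring words, and `span(docs, ["aahe", "aani"])` reports how much of the collection those two words cover together.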

  14. Indexing and Searching • Incremental • Dynamic • Fast Search • In Memory
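As a rough illustration of incremental, in-memory indexing and searching, the sketch below uses an inverted index, a common structure for this purpose. The slides do not name the data structure used in the engine, so the inverted index itself, the class name `InMemoryIndex`, and the whitespace tokenizer are assumptions.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index: maps each word to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental: a new page can be added without rebuilding the index.
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query, mode="or"):
        # Look up the posting set of each query word, then combine them.
        sets = [self.postings.get(w, set()) for w in query.lower().split()]
        if not sets:
            return set()
        if mode == "and":
            return set.intersection(*sets)
        return set.union(*sets)


index = InMemoryIndex()
index.add(1, "tukaram abhang")
index.add(2, "tukaram gatha")
```

With this data, `index.search("tukaram abhang")` returns both documents in OR mode, while `index.search("tukaram abhang", mode="and")` returns only the first.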

  15. Relevancy • What the user really wants • Heuristics for ranking results • Query modification

  16. Selected Texts • Saint Tukarama’s Abhangs will be made searchable and will be hosted on this website • Search on other selected texts will also be hosted on this website
