
FilipinianaWeb

FilipinianaWeb is a research project aiming to develop a grid-based search engine that focuses on Philippine-related web content. It incorporates intelligent document discovery mechanisms through general-purpose and focused web crawlers. The system includes filters for domain, language, geolocation, and Bayesian analysis. Future plans include integrating focused crawling and supporting other object formats like documents, images, and XML.

Presentation Transcript


  1. FilipinianaWeb • Nestor Michael C. Tiglao, Computer Networks Lab (CNL), University of the Philippines • 17th APAN Meetings & Joint Techs Workshop, Jan. 30, 2004

  2. World Wide Web • Enormous growth (10 billion pages) • Imagine the Web without search engines • Need for intelligent document discovery mechanisms

  3. Web Crawlers • Programs that retrieve Web pages • Two kinds: general-purpose crawlers and focused crawlers
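
A general-purpose crawler of the kind named on this slide can be sketched as a breadth-first traversal of links. This is a minimal illustration, not the presenter's implementation: `extract_links`, `crawl`, and the caller-supplied `fetch(url)` function (returning HTML text or `None`) are hypothetical names introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) link targets from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) is an assumed callable that
    returns the page HTML as a string, or None if retrieval failed."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In practice `fetch` would wrap an HTTP client and respect robots.txt and politeness delays; those concerns are left out of the sketch.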

  4. Sample Query: anthrax

  5. Result 1

  6. Result 2

  7. Focused Crawler • Selectively seek out pages that are relevant to a pre-defined set of topics • Topics are specified by sample documents
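
The idea of steering a crawl with sample documents can be sketched as follows. The slides do not say which classifier the project used, so this example substitutes a simple cosine similarity between each page and the pooled sample documents; the names `focused_crawl` and `fetch`, the `(text, links)` return shape, and the `threshold` value are all assumptions made for illustration.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def focused_crawl(seeds, fetch, sample_docs, threshold=0.2, max_pages=50):
    """Visit pages in order of estimated relevance to the sample documents.

    fetch(url) is an assumed callable returning (text, links) or None.
    Links are expanded only from pages scoring at or above the threshold,
    and inherit the score of the page they were found on.
    """
    topic = Counter()
    for doc in sample_docs:
        topic.update(term_vector(doc))

    # heapq is a min-heap, so scores are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _priority, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        fetched += 1
        text, links = page
        score = cosine(term_vector(text), topic)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant
```

The difference from a general-purpose crawler is the frontier: a priority queue ordered by relevance replaces the plain FIFO queue, and off-topic pages act as dead ends.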

  8. Research on Search Engines • Implemented the focused crawler on a Linux cluster using Beowulf and MPI (2002) • Philippine-specific search engine using the openMosix platform (2003)

  9. Focused Crawler Architecture (diagram components: User Interface, Results, Sample Document, Classifier, Crawl Tables, Distiller, Crawler)

  10. Focused Crawler Design

  11. Flowchart

  12. Performance (Crawl Time)

  13. Why another search engine? • Existing Philippine search engines: Yehey.com, Alleba, Tanikalang Ginto, Pugad.com and EdsaWorld • These are actually web directories • We need a better search engine

  14. Unique Situation • Many Philippine-related sites are not registered under the .ph domains • Many sites are hosted outside the Philippines • English as the de facto language

  15. System Design (Gagambot)

  16. Filters • .ph domain filter • gov.ph, edu.ph • Language filter • ISO 639, ISO-8859-1/Latin-1 and Windows-1252 • subset of Unicode characters (UTF-8 and US-ASCII)
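
The two filters on this slide might look roughly like the sketch below. It assumes the domain filter tests the URL's host name and the language filter inspects the HTTP `Content-Type` and `Content-Language` header values; the slides do not describe the actual mechanism. The function names and the accepted language codes (`en`, `tl`, `fil`) are illustrative assumptions, while the charset list is taken from the slide.

```python
from urllib.parse import urlsplit

# Charsets listed on the slide, lowercased for comparison.
ACCEPTED_CHARSETS = {"iso-8859-1", "latin1", "windows-1252", "utf-8", "us-ascii"}

# Assumed ISO 639 codes: English, Tagalog, Filipino. Not from the slides.
ACCEPTED_LANGUAGES = frozenset({"en", "tl", "fil"})


def in_ph_domain(url):
    """True if the URL's host is under the .ph top-level domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host == "ph" or host.endswith(".ph")


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip("\"'").lower()
    return None


def passes_language_filter(content_type, content_language=None,
                           accepted_languages=ACCEPTED_LANGUAGES):
    """Accept a page whose charset is on the list and whose declared
    language, if any, has an accepted primary subtag."""
    if charset_of(content_type) not in ACCEPTED_CHARSETS:
        return False  # also rejects pages that declare no charset
    if content_language is None:
        return True
    primary = content_language.split(",")[0].strip().split("-")[0].lower()
    return primary in accepted_languages
```

As slide 14 points out, many Philippine-related sites are not under .ph, so a domain check like `in_ph_domain` can only be one signal among several rather than a hard requirement.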

  17. Filters 2 • GeoURL filter • Location-to-URL reverse directory • Finds URLs by their proximity to a given location (www.geourl.org) • Bayesian filter • Analyzes the textual content of the HTML document
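
Both filters on this slide can be illustrated with a short sketch. The proximity test uses the haversine great-circle distance, and the text filter is a multinomial naive Bayes classifier with add-one smoothing. Neither is confirmed as the project's actual method: the names `haversine_km`, `within_radius`, and `NaiveBayesFilter`, the reference coordinates, and the radius are assumptions, and the sketch does not query the GeoURL service itself.

```python
import math
import re
from collections import Counter

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(site, centre, radius_km):
    """site and centre are (latitude, longitude) pairs in degrees."""
    return haversine_km(*site, *centre) <= radius_km


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Two-class (relevant / not relevant) multinomial naive Bayes."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        self.word_counts[relevant].update(tokenize(text))
        self.doc_counts[relevant] += 1

    def is_relevant(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior, then smoothed word likelihoods;
            # the extra +1 reserves probability mass for unseen words.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for w in words:
                log_p += math.log((self.word_counts[label][w] + 1)
                                  / (total_words + len(vocab) + 1))
            scores[label] = log_p
        return scores[True] > scores[False]
```

A page hosted outside the Philippines and outside .ph could still be accepted if its declared coordinates fall near a Philippine location or its text scores as relevant, which is the situation slide 14 describes.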

  18. FilipinianaWeb

  19. Current Plans • Develop FilipinianaWeb on a grid platform • Better filtering techniques • Integrate focused crawling • Support for other object formats: documents, images, XML, etc.

  20. Conclusion • FilipinianaWeb is a work in progress and a proof-of-application • Grid infrastructure will help meet the computational and resource requirements of a production-level search engine
