A Cross Platform Application for Searching the Web

Linux Bangalore/2001. Manu Konchady. December 12th, 2001.

Presentation Transcript


  1. A Cross Platform Application for Searching the Web
     Linux Bangalore/2001
     Manu Konchady
     December 12th, 2001

  2. Problem
     - The Web consists of more than 1.5 billion pages as of June 2001 and grows by over a million pages a day (excluding the ‘hidden web’).
     - 99% of these pages may not be of interest to any single individual.
     - How do we locate the valuable (relevant) pages in the least time?

  3. What does the web look like?
     Terminology:
     - In-links: links to a page
     - Out-links: links from a page
     - Hub: a page with many out-links
     - Authority: a page with many in-links
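The four terms above can be illustrated with a minimal sketch that counts in-links and out-links over a list of links. The page names and the `link_counts` helper are hypothetical, chosen only for illustration; the deck does not show how the application itself identifies hubs and authorities.

```python
from collections import Counter

def link_counts(edges):
    """Return (in_links, out_links) counters for a list of (source, target) links."""
    in_links, out_links = Counter(), Counter()
    for source, target in edges:
        out_links[source] += 1   # the source page gains an out-link
        in_links[target] += 1    # the target page gains an in-link
    return in_links, out_links

# Hypothetical toy web: 'portal' links out to many pages, 'docs' is linked to by many.
edges = [
    ("portal", "docs"), ("portal", "news"), ("portal", "blog"),
    ("news", "docs"), ("blog", "docs"),
]
in_links, out_links = link_counts(edges)
hub = out_links.most_common(1)[0][0]        # page with the most out-links
authority = in_links.most_common(1)[0][0]   # page with the most in-links
print(hub, authority)
```

In this toy graph the hub is simply the page with the largest out-link count and the authority the page with the largest in-link count; a real ranking would likely weigh link quality as well.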

  4. What does the web look like? Contd.
     An experiment by AltaVista and IBM analysed 200 million web pages and 1.5 billion links:
     - SCC: a Strongly Connected Core is the heart of the Web (56 million pages)
     - IN: pages that can reach the SCC but cannot be reached from it (43 million)
     - OUT: pages that can be reached from the SCC but do not link back (44 million)
     - Tendrils: pages that can neither reach the SCC nor be reached from it (44 million)
     - Disconnected pages (16 million)
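The SCC, IN, and OUT regions can be sketched with two reachability searches from a page known to be in the core: one following links forward and one following them backward. This is an illustrative sketch on a toy graph, not the method used in the experiment; the `bow_tie` and `reachable` names and the sample pages are assumptions.

```python
from collections import defaultdict

def reachable(start, graph):
    """Return every node reachable from start by following graph edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges, core_page):
    """Split pages into SCC, IN, OUT, and OTHER relative to a known core page."""
    forward, backward = defaultdict(set), defaultdict(set)
    pages = set()
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)
        pages.update((source, target))
    downstream = reachable(core_page, forward)   # pages the core page can reach
    upstream = reachable(core_page, backward)    # pages that can reach the core page
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": pages - upstream - downstream,  # tendrils and disconnected pages
    }

# Hypothetical toy graph: a <-> b form the core.
edges = [("a", "b"), ("b", "a"), ("in1", "a"), ("b", "out1"), ("in1", "t"), ("x", "y")]
print(bow_tie(edges, "a"))
```

The sketch lumps tendrils and disconnected pages together as `OTHER`; separating them would need one more pass over the undirected graph.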

  5. What does the web look like? Contd.

  6. What does the web look like? Contd.
     Observations:
     - The fraction of web pages with x in-links is proportional to 1 / (x ** 2.1) (a power law)
     - A similar observation holds for pages with out-links
     - Hub pages are useful to navigate the web and increase the connectivity of the web
     - Authority pages should be easier to find than other pages (multi-topic or single topic)
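To get a feel for the power law above, the following sketch evaluates 1 / (x ** 2.1) relative to pages with a single in-link. The function name is hypothetical and the values are relative proportions, not measured page counts.

```python
def relative_fraction(x, exponent=2.1):
    """Fraction of pages with x in-links, relative to pages with one in-link."""
    return 1.0 / (x ** exponent)

# Each doubling of the in-link count cuts the fraction by a factor of 2 ** 2.1.
for x in (1, 2, 10, 100):
    print(x, relative_fraction(x))
```

Under this law, pages with twice as many in-links are a little over four times rarer, which is why heavily linked authority pages are few.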

  7. Current Products to Search the Web
     - These products are also known as agents, bots, or spiders.
     - Most of the available products are Windows-based: BullsEye, Copernicus, Lexibot, and others.
     - These products collect results from hundreds of search engines and perform some limited organization and analysis.

  8. Cross Platform Tools
     Why cross platform?
     - While Linux grows in popularity, a majority of apps are written for the Windows platform
     Which development tools?
     - MySQL, Perl, and Java
     - All 3 tools are Open Source
     MySQL tables store and manage the information, Perl collects and processes it, and Java displays it.

  9. MySQL
     MySQL evolved from a database written in 1979 and today runs on multiple platforms including Linux, Windows, and Solaris. It was developed at TcX, a Swedish company, and recently made Open Source.
     - It is lightweight and fast compared to other relational databases
     - Comes with extensive online documentation (an O’Reilly book on MySQL is available as well)
     - Installed on over 0.5 million servers worldwide
     - Works with several gigabytes of data
     - Supports interfaces to a variety of languages including Perl, Java, Python, C, and C++
     - Available for free download from www.mysql.com

  10. Perl
     Perl (Practical Extraction and Report Language) was first developed by Larry Wall in 1987. It was created to overcome problems with awk, shell scripts, and C.
     - It is a scripting language (interpreted rather than precompiled)
     - Easy to include many of the UNIX tools without shelling out, i.e. it combines tools such as sed, tr, awk, grep, and others
     - Originally designed for fast text manipulation
     - Available on Linux and other UNIX platforms as well as Windows (over 29 ports of Perl)
     - Data structures are not bounded by prebuilt limitations
     - Applications of Perl include database access, file management, CGI scripts, client/server processing, data formatting, disk management, process management, and a cross platform GUI based on the Tk toolkit
     - Hundreds of freely available modules perform a variety of functions
     - Very popular with system administrators and some developers
     - It has a reputation for being difficult to read, since there are many ways to implement the same function
     - Leaves programming discipline to the discretion of the developer

  11. Java
     Java was developed by Sun in 1995. Major additions of APIs and other functions to the language have made it a popular language for developers. Our interest in the language is to build a sophisticated user interface for the applications. The Swing API (part of the Java Foundation Classes, JFC) was an enhancement to the existing Abstract Window Toolkit (AWT) API.
     - A variety of components such as buttons, check boxes, radio buttons, scroll bars, text panes, slider bars, and other complex widgets
     - Supports dynamic tables and periodic updates of tables without user intervention
     - Many options for setting boundaries, colors, scrolling, and controlling the behaviour of components

  12. How do they work together?

  13. How do they work together?
     JDBC and DBI are standards for issuing database calls. The use of JDBC and DBI makes it easier to change databases, if necessary.
     Parallel processing differs across platforms:
     - Linux uses threads or processes to run in parallel; a process has its own memory space, while a thread runs in the same memory space and uses fewer resources
     - Windows also provides threads and processes
     - In Perl, parallel code is harder to implement with threads than with processes
     - The fork or system call is the easiest way to start an independent process in Linux
     - In Windows, the Win32 API can be used to start independent processes
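The idea of starting independent processes can be sketched in a platform-neutral way. The example below is a Python analogue, not the Perl fork or Win32 code the talk refers to; `start_workers` and `collect` are hypothetical names, and each child is a fresh interpreter with its own memory space.

```python
import subprocess
import sys

def start_workers(count):
    """Start `count` independent child processes and return their Popen handles."""
    workers = []
    for worker_id in range(count):
        # Each child runs in a separate interpreter with its own memory space.
        child_code = f"print('worker {worker_id} done')"
        workers.append(subprocess.Popen(
            [sys.executable, "-c", child_code],
            stdout=subprocess.PIPE, text=True,
        ))
    return workers

def collect(workers):
    """Wait for every worker and return its output in start order."""
    return [worker.communicate()[0].strip() for worker in workers]

print(collect(start_workers(3)))
```

Because the children share no memory with the parent, they have to exchange data through pipes, files, or a database, which is one reason a common database table is a natural coordination point.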

  14. A Perl Spider - Intelligent pruning

  15. A Perl Spider - Architecture

  16. A Perl Spider contd.
     - Create multiple independent spider processes
     - Query search engines such as Google or Wisenut
     - Select URLs to process from a common table
     - Use DB locks to synchronize access to the table
     - Use fork or system calls in Linux and the Win32 API in Windows

  17. A Perl Spider contd.
     Assign relevancy to a web page based on user queries and additional keywords:
     - Frequency of occurrence of keywords
     - Location of keywords
     - Word distance between keywords
     Prioritize domains and process URLs from high priority domains before other URLs.
     Block certain sites or restrict access to a few sites.
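The three relevancy signals above can be combined as in the following sketch. The weights, the treatment of the title as the favoured location, and the `relevancy` function itself are illustrative assumptions; the talk does not give the actual formula.

```python
import re

def relevancy(title, body, keywords):
    """Score a page from keyword frequency, location, and word distance."""
    keywords = list(dict.fromkeys(k.lower() for k in keywords))  # drop duplicates
    title_words = re.findall(r"\w+", title.lower())
    body_words = re.findall(r"\w+", body.lower())

    # Frequency: how often the keywords occur in the body.
    frequency = sum(body_words.count(k) for k in keywords)

    # Location: a keyword in the title counts extra (the weight 3 is an assumption).
    location = 3 * sum(1 for k in keywords if k in title_words)

    # Word distance: bonus when two different keywords occur close together.
    positions = {k: [i for i, w in enumerate(body_words) if w == k] for k in keywords}
    proximity = 0.0
    for i, first in enumerate(keywords):
        for second in keywords[i + 1:]:
            gaps = [abs(a - b) for a in positions[first] for b in positions[second]]
            if gaps:
                proximity += 1.0 / min(gaps)
    return frequency + location + proximity

print(relevancy("Perl spider", "a perl spider crawls the web", ["perl", "spider"]))
```

A page that mentions the keywords in its title and next to each other scores well above one where they appear far apart in the body.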

  18. A Perl Spider contd.
     Evaluating links to follow:
     - Rank a link to an external site higher than a link to the same site
     - Check if any of the query keywords occur in the anchor text or the link itself
     - Assign a higher weight to links from a very relevant page
     - Follow the link if its score exceeds a threshold (low, medium, or high)
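The link evaluation rules above could be sketched as a small scoring function. The individual weights, the numeric threshold values, and the function names are assumptions made for illustration, and `page_relevancy` is taken to be a value between 0 and 1.

```python
from urllib.parse import urlparse

# Assumed numeric values for the low, medium, and high thresholds.
THRESHOLDS = {"low": 1.0, "medium": 2.0, "high": 3.0}

def link_score(page_url, link_url, anchor_text, keywords, page_relevancy):
    """Score a link found on page_url using the three rules from the slide."""
    score = 0.0
    # External links rank higher than links within the same site.
    if urlparse(link_url).netloc != urlparse(page_url).netloc:
        score += 1.0
    # Query keywords in the anchor text or in the link itself.
    haystack = (anchor_text + " " + link_url).lower()
    score += sum(1.0 for k in keywords if k.lower() in haystack)
    # Links from a very relevant page get extra weight.
    score += page_relevancy
    return score

def should_follow(score, level="medium"):
    """Follow the link only if its score exceeds the chosen threshold."""
    return score > THRESHOLDS[level]

score = link_score(
    "http://a.example/index.html", "http://b.example/perl-spider.html",
    "Perl spider tutorial", ["perl", "spider"], 0.5,
)
print(score, should_follow(score, "high"))
```

Raising the threshold from low to high makes the spider more selective, which trades coverage for a shorter crawl.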

  19. A Perl Spider contd.
     A spider terminates when:
     - No more URLs can be processed
     - The time limit is exceeded
     - The URL limit is exceeded
     - The user decides to stop
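The four termination conditions map naturally onto checks at the top of a crawl loop. In this sketch `fetch_links` stands in for the real download-and-parse step, and all names and default limits are hypothetical.

```python
import time
from collections import deque

def run_spider(seed_urls, fetch_links, max_urls=100, max_seconds=60.0,
               stop_requested=lambda: False):
    """Crawl breadth-first until one of the four stop conditions is met."""
    queue, seen, processed = deque(seed_urls), set(seed_urls), []
    started = time.monotonic()
    while True:
        if not queue:
            return processed, "no more URLs"
        if time.monotonic() - started > max_seconds:
            return processed, "time limit exceeded"
        if len(processed) >= max_urls:
            return processed, "URL limit exceeded"
        if stop_requested():
            return processed, "stopped by user"
        url = queue.popleft()
        processed.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)

# Hypothetical three-page web used in place of real HTTP requests.
web = {"a": ["b", "c"], "b": ["c"], "c": []}
print(run_spider(["a"], lambda url: web.get(url, [])))
```

Returning the reason alongside the processed URLs lets the caller report why a crawl ended.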

  20. Results
     - List of hubs
     - List of authorities
     - List of pages ordered by relevance
     - List of sites with highest average relevancy
     - Export the link structure to a link analysis tool
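As one example of these result lists, sites can be ranked by the average relevancy of their pages. The sketch assumes page scores are already available as a mapping from URL to score; the function name and sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def sites_by_average_relevancy(page_scores):
    """Return (site, average score) pairs, highest average first."""
    totals = defaultdict(lambda: [0.0, 0])   # site -> [score sum, page count]
    for url, score in page_scores.items():
        entry = totals[urlparse(url).netloc]
        entry[0] += score
        entry[1] += 1
    averages = {site: total / count for site, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

scores = {
    "http://a.example/1": 4.0,
    "http://a.example/2": 2.0,
    "http://b.example/1": 5.0,
}
print(sites_by_average_relevancy(scores))
```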

  21. Summary
     - Application to address the searching problem on the Web
     - Use of cross platform tools (MySQL, Perl, and Java) to build the application
     - Architecture of the solution (parallel processing, user interface, and NLP)

  22. Backup Slides
