
Search Engines A Test Report Wolfgang Dalitz Zuse Institute Berlin (ZIB)

Search Engines A Test Report Wolfgang Dalitz Zuse Institute Berlin (ZIB). 16th International Congress of the Austrian Mathematical Society (ÖMG) Annual Meeting of the German Mathematical Society (DMV) MATHEMATIK 2005 KLAGENFURT September 18 – 23, 2005. Contents. History and Motivation


Presentation Transcript


  1. Search Engines: A Test Report. Wolfgang Dalitz, Zuse Institute Berlin (ZIB). 16th International Congress of the Austrian Mathematical Society (ÖMG), Annual Meeting of the German Mathematical Society (DMV), MATHEMATIK 2005, KLAGENFURT, September 18 – 23, 2005

  2. Contents • History and Motivation • Test Scenario • Search Engines • Results • Outlook

  3. Math-Net • Concept for a distributed service for mathematics • Service by and for a community … but … • "Give and take" does not work properly today

  4. Community Driven Services • The concept of cooperative, open, and public domain-oriented services has some limits: • Manpower and resources • No scientific merit • People do not consider this an important service • There is insufficient backing from your (scientific) environment

  5. Nevertheless • Math-Net has been a successful project for a long time, at least in Germany: • Personal infrastructure • Combination of decentralized and central components has been working for a long time with small resources • Spin-off for other networks • Internationalization

  6. Own Services? „We do have Google!“

  7. tagesspiegel Apr 23, 2005 "We were able to fully benefit from the growth of online advertising." (Eric Schmidt) Investors were able … to realize a capital gain of nearly 160 percent. (Issue price of $85, now: $216, forecast: $270)

  8. tagesspiegel Jul 23, 2005

  9. www.heute.de Sep 16, 2005

  10. c't 9/2005, Apr 18, 2005 Manipulation attempts to upgrade ranking "… AltaVista was loaded with keyword-packed spam to such an extent that it was hardly of any use any longer at the end of 1997 – a problem that AltaVista was never able to totally overcome since, in 1998, another venture appeared on the scene that rapidly advanced to become the number one: Google." "Link farms" are one example of attempts to influence Google's ranking

  11. What do we learn from this? • Search engines are important tools to find relevant information. • To run a "good" search engine is a billion dollar business. • There are many attempts to manipulate important search engines.

  12. Fundamental • Science must be independent of services that • are mainly driven by commercial interests. • do not produce verifiable results.

  13. Completeness? • Many (60-80%) of the HTML pages at ZIB could not be found in Google, AlltheWeb, …

  14. Paradigms • Science has to determine what tools are necessary and important. • Science has to run and control certain techniques and services that are needed for scientific work. • There must be a (financial and organizational) framework to ensure the implementation of these activities.

  15. Google's new idea • 15 million books scanned and full-text indexed • 150-200 million USD • period: 10 years (source: dw-world.de, Apr 27, 2005) • university libraries • Michigan: 7 million • Stanford: 8 million • Harvard: 40,000 • Oxford: published before 1900 • New York Public Library • selected older titles

  16. … solution will come soon … The Bibliothèque Nationale de France (BNF) called for a European "counter-attack" against the project. President Jacques Chirac intends to recommend to the EU a project to digitize the works of the great European libraries. (source: heise.de/newsticker) "This action is directed against nobody, but it would be 'of fundamental importance' for a multicultural society," said Mr. Renaud Donnedieu de Vabres, minister of education.

  17. The European project will be an alternative to Google's online library (dw-world.de) • 1 million euros each year (source: ARTE-News, May 2, 2005) • necessary budget: "400 million euros in the next 3 years" (said Mr. Jean-Noël Jeanneney, president of the Bibliothèque Nationale de France, Paris, to the Frankfurter Rundschau, Sep 7, 2005)

  18. Self-made job • Target: run a search engine • with small resources for techniques and manpower • with techniques controlled by the people involved • "better than Google" in the mathematics domain • Environment • open domain • community driven • topic oriented • locally operated • in the long run: community based service

  19. How do search engines work? • Phase I: get all relevant objects (spider, crawler, gatherer) • Phase II: compile an index (summarizer, indexer) • Phase III: generate ("good") results (ranking)
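The three phases on slide 19 can be illustrated with a minimal Python sketch. It is not how harvest, swish-e, or nutch are implemented; the in-memory `PAGES` dictionary, the page contents, and all function names are hypothetical stand-ins chosen for illustration, and the ranking is plain term frequency.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical three-page site standing in for a real web server.
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a> math portal',
    "/a.html": "search engines index math pages; math again",
    "/b.html": '<a href="/index.html">home</a> preprints',
}


class LinkParser(HTMLParser):
    """Collects the href targets and the visible text of one page."""

    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

    def handle_data(self, data):
        self.text.append(data)


def gather(start):
    """Phase I: follow links breadth-first and collect each page's text."""
    seen, queue, docs = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(PAGES[url])
        docs[url] = " ".join(parser.text)
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Phase II: inverted index mapping word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, word):
    """Phase III: rank hits by raw term frequency (ties broken by URL)."""
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda u: (-hits[u], u))


docs = gather("/index.html")
print(search(build_index(docs), "math"))
```

Pages that no other page links to are never reached in phase I, which is one reason a crawler's view of a site can be incomplete.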

  20. Candidates and strategies • Complete systems (phases I, II, III) • harvest (gatherer, broker, glimpse) • swish-e (spider.pl and indexer) • nutch (lucene) • Partial systems • phase I: wget and w3mir • phase II: lucene • phase III: ??

  21. What else? • htdig • estraier • Perfect Search • PHPdig • TSEP • namazu • see test reports in different computer magazines (c't, ix, LinuxMagazin)

  22. Test scenario: local copies from two different sites (factor 10) • Site I: www.mathematik-21.de, 7371 files: 2293 HTML, 1160 Images, 140 Text, 81 PDF, 19 PS; rest: tmp, harvest • Site II: www.zib.de, 70126 files: 17981 HTML, 17147 Images, 2024 PDF, 991 PS, 140 Text; rest: test

  23. Completeness (phase I) Site I

  24. Explanations • There are • views from the inside (filesystem) • symbolic links • views from the outside (webserver) • People used nonconforming HTML • resulting link lists differ
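The difference between the inside view and the outside view can be made measurable. The following Python sketch assumes the crawler's result is available as a set of site-relative paths; the function names and the returned dictionary layout are hypothetical, not taken from any of the tested tools.

```python
import os


def filesystem_view(root):
    """Inside view: relative paths of all file entries below root.

    Symlinked directories are not descended into (followlinks=False).
    """
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return paths


def compare_views(fs_paths, crawled_paths):
    """Compare the inside view with the outside view (what a crawler reached)."""
    fs_paths, crawled_paths = set(fs_paths), set(crawled_paths)
    found = fs_paths & crawled_paths
    coverage = len(found) / len(fs_paths) if fs_paths else 1.0
    return {
        "coverage": coverage,                          # share of files the crawler reached
        "missed": sorted(fs_paths - crawled_paths),    # on disk, never crawled
        "extra": sorted(crawled_paths - fs_paths),     # crawled, e.g. via an alias or symlink
    }
```

Paths listed under `extra` are typical of content that the web server exposes under a second name, while `missed` lists files that no crawled page links to.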

  25. c't 9/2005 Apr 18, 2005 study: only 3.9 % of German Web sites conform to the standard "… 96.1 % of the checked Web pages included illegal code"

  26. Completeness (phase I) Site II

  27. Indexing (phase II) • harvest/glimpse • fast • has to be tuned (summarized) • spider.pl/swish-e • very fast • nutch/lucene • very fast • incremental index

  28. Ranking (phase III) ??? Is that what characterizes a „good“ search engine? Are there objective criteria?
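As an illustration of what a reproducible ranking criterion can look like, here is a minimal Python sketch of TF-IDF scoring, a widely used textbook heuristic. It is swapped in here purely as an example; the talk does not name it, and the function name and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def tf_idf_rank(docs, query):
    """Rank documents for a query by summed TF-IDF of the query terms.

    docs: {name: text}. Returns names with a positive score, best first.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for w in tokenized.values() if term in w)
            if df and words:
                tf = counts[term] / len(words)
                score += tf * math.log(n_docs / df)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=lambda n: (-scores[n], n))
```

A score like this is verifiable in the sense that anyone with the same documents gets the same order, but whether that order is "good" still depends on the users' judgment. Note that a term occurring in every document gets weight zero under this formula.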

  29. (first) Results • To run search engines implies that • you have time and resources • you control each task • this is nearly a full-time job • To run search engines for a community • is really a project • is not a one-man job • requires many resources

  30. Our proposals • harvest • is satisfactory if all tasks are controlled and corrected in detail • is our favorite for decentralized work • (wget) nutch/lucene, swish-e • are running without problems on smaller sites • but we have no experience with how they perform on really big sites (1 TByte of data)

  31. thanks for your attention

  32. URLs www.suma-ev.de
