
AlltheWeb

AlltheWeb. Torbjørn Kanestrøm, January 30th, 2003. Agenda: Who is FAST? What do we do? Libraries: relevant projects we have done. What is AlltheWeb? Under the Hood: Phrasing & Lemmatization. Take a tour of AlltheWeb: simple searches (Web, News, Multimedia, FTP), Advanced Web Search.



Presentation Transcript


  1. AlltheWeb Torbjørn Kanestrøm January 30th, 2003

  2. Agenda • Who is FAST? • What do we do? • Libraries: relevant projects we have done • What is AlltheWeb? • Under the Hood: Phrasing & Lemmatization • Take a tour of AlltheWeb • Simple searches (Web, News, Multimedia, FTP) • Advanced Web Search • Results Page • Q & A

  3. Who is FAST? • Fast Search & Transfer (FAST) • Founded 1997 • Public company (Oslo Stock Exchange – June 2001) • One of the fastest growing companies in Europe • Profitable • 200 employees • 40+ PhDs • 12 offices worldwide (map labels: Munich, Norway, London, Boston, San Francisco, Paris, Tokyo, Rome)

  4. What we do… • Understanding content • Understand the intention of a query

  5. FAST Solutions //TECHNOLOGY Common Technology Platform

  6. FAST Customers & Partners //BACKGROUND FAST is the creator of the real-time integrated search and filter technology solutions that are behind the scenes at some of the world's best-known companies with the world's most demanding search problems. • Enterprise • Portals • Partners

  7. A few selected projects we have done - Relevant to every librarian

  8. Questia

  9. Questia – the online library

  10. Nordic Web Archive • The Nordic Web Archive is a cooperation between the Nordic National Libraries (Finland, Sweden, Denmark, Norway, Iceland). • Project started in 2000, datacenter built deep inside a mountain in northern Norway • Collecting and archiving web documents of national interest and importance. • Everything published in the national domains (.NO, .DK, .FI etc.) • Everything written on the web in the respective languages • Everything referring to one of the countries (city, company, person, etc.) • Continuous project designed to scale indefinitely • Available to the research community, not a public site.

  11. Elsevier Engineering Information • Compendex® is the most comprehensive interdisciplinary engineering database in the world, with almost seven million records referencing 5,000 engineering journals and conference materials dating from 1970. The database is updated weekly.

  12. Scirus.com – the web’s Science search //BUSINESS CASES • Combining scientific classification of the “deep web” and proprietary publications (diagram labels: Web Server, XML) • 120M web pages • 17M Elsevier Science publications • Scientific classification • Grouping and identification of related articles • Leading science Index • Understanding content • Scientific navigation • “FAST’s core search technology has enabled us to provide the best scientific search results, period” - John Regazzi, Managing Director, Elsevier Science

  13. What is AlltheWeb?

  14. What is AlltheWeb? • Showcase for FAST technology • Test new search features with real live audience • Several million queries per day • 40% North America, 30% Europe, and 30% rest of World • Integrated interface for searching • 2.1+ billion web pages, PDF docs, MS Word docs, & Flash objects • Continuously refreshed news from 5000+ global/local news sources • 150 million images and videos • 130 million ftp files • 2 million mp3 files • Targeted at advanced searches

  15. What makes AlltheWeb different? • Versatility • Searching in 49 languages • Six separate catalogues (Web, News, Pictures, Videos, MP3, FTP) • Fully customizable front-end (only major search site that is XHTML/CSS compliant) • Solid Index • 2.5 billion web objects (pages, pictures, videos, mp3s, etc.) • One of the fastest refresh cycles (every 7 – 14 days) • Advanced search features • Boolean search • Embedded content selectors • Domain & IP filtering • File format and size filtering • Much more...

  16. Under the Hood - Phrasing & Lemmatization

  17. Under the Hood: Phrasing/Anti-Phrasing • Phrasing: Known phrases are matched as a phrase • New York → “New York” • Based on common phrases, names, movie names, geographic names, etc. • Can detect multiple phrases within same query • Anti-Phrasing: Remove words irrelevant to the query • Who is… • What is… • Combines to create a better query • Who is George Bush → “George Bush” • What is the age of the earth → “the age of the earth” • How do I get to train station in New York → “get to” “train station” in “New York”
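The rewriting described on this slide can be illustrated with a minimal Python sketch. It is not FAST's implementation: the phrase dictionary `KNOWN_PHRASES`, the prefix list `ANTI_PHRASES`, and all function names are assumptions made up for illustration. The sketch strips a leading anti-phrase, then greedily quotes the longest known phrase at each position.

```python
# Hypothetical data: the slides do not list the real phrase dictionary
# or the real anti-phrase patterns.
KNOWN_PHRASES = {
    ("new", "york"),
    ("train", "station"),
    ("george", "bush"),
    ("get", "to"),
}
ANTI_PHRASES = ["who is", "what is", "how do i"]


def strip_anti_phrase(query: str) -> str:
    """Remove a leading question prefix that carries no search intent."""
    lowered = query.lower()
    for prefix in ANTI_PHRASES:
        if lowered.startswith(prefix + " "):
            return query[len(prefix) + 1:]
    return query


def mark_phrases(query: str, max_len: int = 3) -> str:
    """Quote known multi-word phrases, preferring the longest match."""
    words = query.split()
    out = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in KNOWN_PHRASES:
                out.append('"' + " ".join(words[i:i + n]) + '"')
                i += n
                matched = True
                break
        if not matched:
            out.append(words[i])
            i += 1
    return " ".join(out)


def rewrite(query: str) -> str:
    """Anti-phrasing followed by phrasing."""
    return mark_phrases(strip_anti_phrase(query))


print(rewrite("How do I get to train station in New York"))
```

With this toy dictionary, the last call reproduces the slide's third example. A production system would need a far larger dictionary and would likely handle anti-phrases that are not at the start of the query.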

  18. Under the Hood: Lemmatization • Lemmatization improves recall • Literal matching only finds a fraction of candidates for a query • Ratio between base and full forms • English: 2 • German, French, Spanish: 5 – 10 • Russian, Polish: 40+ • Typical Cases: Singular/plural variation, case marking, etc. • Stemming vs. Lemmatization • Traditional stemming • Term is stemmed according to rules, e.g. walking → walk • Can easily result in “false” stemmings, e.g. Bobby Browning → Bobby Brown • Lemmatization • Rewriting of terms is controlled by language-sensitive dictionaries • Very comprehensive dictionaries; about 20 “man years”
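The contrast between rule-based stemming and dictionary-controlled lemmatization can be shown in a few lines of Python. This is a toy sketch under stated assumptions: the single suffix rule in `naive_stem` and the four-entry `LEMMAS` dictionary are invented for illustration and stand in for real stemming rules and the comprehensive dictionaries the slide mentions.

```python
def naive_stem(term: str) -> str:
    """Rule-based stemming: strip a trailing 'ing' if enough of the word remains."""
    if term.lower().endswith("ing") and len(term) > 5:
        return term[:-3]
    return term


# Hypothetical, tiny lemma dictionary; a real one is language-specific
# and vastly larger.
LEMMAS = {
    "walking": "walk",
    "walked": "walk",
    "walks": "walk",
    "mice": "mouse",
}


def lemmatize(term: str) -> str:
    """Dictionary lookup: rewrite only terms the dictionary knows about."""
    return LEMMAS.get(term.lower(), term)


for word in ["walking", "Browning", "mice"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", lemmatize(word))
```

The rule turns "Browning" into "Brown", which is the false stemming the slide warns about, while the dictionary lookup leaves the name untouched and still maps the irregular plural "mice" to "mouse".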

  19. Take a Tour

  20. AlltheWeb Home Page

  21. Simple Search (Web/News) • Web- and News Search • Picture-, Video- and MP3 Search • FTP Search • Callouts: Your query • Match exact phrase: similar to using quotes around your query • Language detection • “WebSearch University”

  22. Simple Search (Rich Media) • Web- and News Search • Picture-, Video- and MP3 Search • FTP Search

  23. Simple Search (FTP) • Web- and News Search • Picture-, Video- and MP3 Search • FTP Search • Callouts: Query/Expression • Select between 13 different matching algorithms

  24. Advanced Web Search • Language/Charset: 49 languages; most used character sets • Select Search Type: All the words (AND); Any of the words (OR); The exact phrase; Boolean expression • Where To Match: Body text; Page title; URL; Hostname; Links on the page • Term / Phrase: Only one phrase/word per filter. Add more filters if necessary. • Embedded Content: Exclude or include pages based on embedded content on these pages • Specific Date range and Document depth
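The first three search types on this slide differ only in how the query terms must occur in a document. The following Python sketch makes that difference concrete. It assumes plain whitespace tokenization and a hypothetical `matches` function with the labels `"all"`, `"any"`, and `"phrase"`; it does not cover the Boolean expression type and says nothing about how AlltheWeb evaluates queries internally.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def matches(doc: str, query: str, search_type: str) -> bool:
    """Check one document against a query under one of three search types."""
    doc_tokens = tokenize(doc)
    terms = tokenize(query)
    if search_type == "all":      # All the words (AND)
        return all(t in doc_tokens for t in terms)
    if search_type == "any":      # Any of the words (OR)
        return any(t in doc_tokens for t in terms)
    if search_type == "phrase":   # The exact phrase: terms adjacent and in order
        n = len(terms)
        return any(
            doc_tokens[i:i + n] == terms
            for i in range(len(doc_tokens) - n + 1)
        )
    raise ValueError(f"unknown search type: {search_type}")


doc = "the train station in new york"
print(matches(doc, "york new", "all"))     # word order is irrelevant for AND
print(matches(doc, "york new", "phrase"))  # word order matters for a phrase
```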

  25. Advanced Web Search (cont.) • Domain Filters: Only include and/or exclude results from a domain • IP-address filter: For the especially interested. Supports most common IP-address/-range syntaxes • Document Size: Specify size of document. Supports exact, less than or more than • Page Depth/Type: Filter based on depth of URL and whether ~ occurs in URL • Region Filter: Limit results to different regions • File Type: Limits results to PDF, MS Word, and Macromedia Flash files • Offensive Content: Whether or not to filter out/reduce results with offensive content • Presentation: How many search results to list per page • Save Settings: Keep knobs in the same position when you return
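Two of these filters, domain include/exclude and page depth/type, can be sketched as simple predicates over a result URL. The Python below is an illustrative assumption, not AlltheWeb's filter logic: the function names, the parameter names, and the definition of depth as the number of non-empty path segments are all made up for this example.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /a/b/page.html has depth 3."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def in_domain(url: str, domain: str) -> bool:
    """True if the URL's host equals the domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def passes_filters(url, include=(), exclude=(), max_depth=None, allow_tilde=True):
    """Apply domain include/exclude, URL depth, and '~' filters to one URL."""
    if include and not any(in_domain(url, d) for d in include):
        return False
    if any(in_domain(url, d) for d in exclude):
        return False
    if max_depth is not None and url_depth(url) > max_depth:
        return False
    if not allow_tilde and "~" in urlparse(url).path:
        return False
    return True


# Keep only shallow pages in the .no domain, skipping "~user" pages.
print(passes_filters("http://www.example.no/~user/index.html",
                     include=["no"], allow_tilde=False))
```

Treating a top-level domain such as "no" as just another domain suffix keeps the include and exclude filters uniform, at the cost of ignoring any special handling a real engine might apply to country domains.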

  26. The Result Page • Search Bar: Click tabs to send query to other catalogs • Query Rewriting: Did we rewrite your query? Gives you full control! • Related Queries • Paid content: Revenue funds new features at AlltheWeb • Web Results: Search results from Web Pages, PDF & MS Word files and Macromedia Flash files • Multimedia Results: Results from other catalogues • News Result: Flashed in results from real-time News Search catalog • Site Collapsing: Show the other hits from this site
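Site collapsing can be pictured as grouping a ranked result list by host and showing only the best hit per site, with the rest available on request. The Python sketch below assumes that grouping is done on the URL's hostname and that one hit per site is shown; both choices, and the function name `collapse_by_site`, are illustrative assumptions rather than a description of AlltheWeb's behavior.

```python
from urllib.parse import urlparse


def collapse_by_site(ranked_urls, per_site=1):
    """Split ranked results into shown hits and per-site hidden hits.

    Rank order is preserved: the first `per_site` hits from each host are
    shown, later hits from the same host are collected under that host.
    """
    shown = []
    hidden = {}
    seen = {}
    for url in ranked_urls:
        host = urlparse(url).hostname or ""
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= per_site:
            shown.append(url)
        else:
            hidden.setdefault(host, []).append(url)
    return shown, hidden


results = [
    "http://a.com/1",
    "http://b.com/1",
    "http://a.com/2",
    "http://a.com/3",
]
shown, hidden = collapse_by_site(results)
print(shown)
print(hidden)
```

The `hidden` mapping is what a "show the other hits from this site" link would expand.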

  27. www.AllTheWeb.com • Has all the advanced search features and functions that you can find on all other major web search engines – combined... • And we innovate at a faster pace and invest more in R&D than ever before.

  28. AlltheWeb Q&A
