Web Crawling Techniques: Strategies, Policies, and Challenges in Information Retrieval
This document examines web crawling as a key technique in information retrieval, detailing the processes employed by web crawlers, including depth-first and breadth-first strategies. It discusses essential crawling policies like selection, revisit, and politeness policies, addressing their significance in ensuring efficient and ethical web crawling practices. It also explores the challenges faced by crawlers, such as handling dynamic pages, minimizing server overload, and implementing strategies for parallelization and load balancing. The study concludes with considerations for an optimal crawler design to navigate the vast web effectively.
Presentation Transcript
Information retrieval 2019/2020
crawler • web crawler, Web spider, Web robot • Starts from one/several sources (url) • Stores documents cache / retrieved data • Looks for new urls within documents • Stores new url to the stack • Visits next url (recursively / from stack)
example • Hyperlinks are underlined • Depth-first: 1,3,2,4,5,6 • Breadth-first: 1,3,6,4,2,5
strategies • Breadth-first • Depth-first • Partial PageRank • Restrictions: • Max number of downloaded pages • Max depth • Max time • Document types • Selected domains • Restricted URLs (based on regexp) • Download only static documents
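A minimal sketch of the two traversal orders together with the max-pages and max-depth restrictions, run on a small in-memory link graph instead of the live web. The graph `LINKS`, the function `crawl`, and its parameters are hypothetical names chosen for illustration; the graph is not the numbered example from the slide.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for fetched HTML).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}

def crawl(seed, strategy="bfs", max_pages=100, max_depth=10):
    """Return pages in visiting order; 'bfs' uses a queue, 'dfs' a stack."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth == max_depth:
            continue  # depth restriction: do not follow links any further
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

print(crawl("A", "bfs"))  # ['A', 'B', 'C', 'D', 'E']
print(crawl("A", "dfs"))  # ['A', 'C', 'E', 'D', 'B']
```

The only difference between the two strategies is which end of the frontier the next URL is taken from; the restrictions are checked the same way in both.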
crawling policies • selection policy • Which page should be downloaded • re-visit policy • When to visit page again • politeness policy • Do not irritate your colleagues • parallelization policy • How to perform parallel crawl
selection policy • breadth-first • Most used? • Pages with high PageRank tend to be visited first • Can be improved by partial PageRank • backlink-count • Number of links pointing to the page • partial PageRank • Computed based on already collected urls • OPIC (On-line Page Importance Computation) • each page is given an initial sum of "cash" which is distributed equally among the pages it points to
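The OPIC idea can be sketched as follows, assuming a greedy variant that always visits the page currently holding the most cash. The function name `opic_order`, the toy graph, and the tie-breaking rule are assumptions made for this illustration, not details taken from the slide.

```python
def opic_order(links, steps):
    """Visit the page holding the most cash, bank it, and pass it on to its outlinks."""
    cash = {page: 1.0 / len(links) for page in links}
    history = {page: 0.0 for page in links}   # total cash each page has collected
    order = []
    for _ in range(steps):
        page = max(cash, key=cash.get)        # ties: first page in dict order
        order.append(page)
        amount, cash[page] = cash[page], 0.0
        history[page] += amount
        targets = [t for t in links[page] if t in cash]
        for target in targets:                # split equally among linked pages
            cash[target] += amount / len(targets)
    return order, history

# Hypothetical three-page graph: page -> pages it points to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
order, history = opic_order(graph, steps=4)
print(order)  # ['A', 'B', 'C', 'A']
```

The accumulated `history` serves as the importance estimate: pages that keep receiving cash from many others are revisited earlier. In this sketch the cash of a page without outlinks is simply dropped.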
deep web • Sometimes dynamic pages ?&… • Sometimes only “through search” available: • No links pointing to the site • Sitemaps • …
re-visit policy • uniform • we synchronize all elements at the same frequency, regardless of how often they change. • proportional • we synchronize element e with a frequency f that is proportional to its change frequency λ. • freshness of copy • freshness is the fraction of the local database that is up-to-date • “Best strategy” – based on the domain • (weighted) proportional + ignoring highly dynamic pages
re-visit policy • Junghoo Cho and Hector Garcia-Molina. 2003. Effective page refresh policies for Web crawlers. ACM Trans. Database Syst. 28, 4 (December 2003), 390-426. • “we prove that the uniform policy is better than the proportional policy under any distribution of λ values” • more than 20% of pages had changed whenever we visited them • more than 40% of pages in the com domain changed every day • pages in edu and gov domain are very static
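A small numeric check of the quoted result, assuming the Poisson change model: a page with change rate λ that is refreshed at a fixed frequency f then has time-averaged freshness (1 − e^(−λ/f)) / (λ/f). The two change rates and the download budget below are made-up values for illustration only.

```python
import math

def freshness(change_rate, sync_freq):
    """Time-averaged freshness of one page under a Poisson change model."""
    ratio = change_rate / sync_freq
    return (1 - math.exp(-ratio)) / ratio

def average_freshness(rates, freqs):
    return sum(freshness(lam, f) for lam, f in zip(rates, freqs)) / len(rates)

rates = [9.0, 1.0]   # hypothetical: changes per day for two pages
budget = 2.0         # hypothetical: total downloads per day

uniform = [budget / len(rates)] * len(rates)
proportional = [budget * lam / sum(rates) for lam in rates]

print(round(average_freshness(rates, uniform), 2))       # about 0.37
print(round(average_freshness(rates, proportional), 2))  # about 0.20
```

The proportional policy spends most of the budget on the fast-changing page, which goes stale again almost immediately, while the slowly changing page is left out of date for long periods.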
politeness policy • Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time. • Server overload, especially if the frequency of accesses to a given server is too high. • Poorly written crawlers, which can crash servers or routers. • Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
politeness policy • Time interval between requests to the same server • Identification – User-agent header in the HTTP request • Crawler trap • “is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash.” https://fleiner.com/bots/
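The time interval and identification points could look roughly like this. `PoliteScheduler`, the two-second delay, and the User-agent string are hypothetical; the sketch only tracks when each host was last contacted and does not perform any network requests itself.

```python
import time
from urllib.parse import urlsplit

class PoliteScheduler:
    """Enforces a minimum delay between two requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_hit = {}   # host -> time of the last request

    def wait_time(self, url):
        """Seconds the crawler should still wait before fetching `url`."""
        last = self.last_hit.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Call right after a request to `url` has been sent."""
        self.last_hit[urlsplit(url).hostname] = self.clock()

# Identification: send a descriptive User-agent with every request (made-up value).
HEADERS = {"User-Agent": "ExampleCrawler/0.1 (+http://example.com/bot-info)"}
```

A crawler loop would call `wait_time(url)`, sleep for that long or pick a URL from another host, send the request with `HEADERS`, and then call `record(url)`.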
crawler trap • http://example.com/bar/foo/bar/foo/bar/foo/bar/... • dynamic pages with infinite number of pages (e.g., calendar) • http://www.example.org/calendar/events?&page=1&mini=2015-09&mode=week&date=2021-12-04 • extremely long pages (a lot of text causing the lexical analyzer to crash) • …
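A few cheap URL checks catch the repeating-path kind of trap. The function `looks_like_trap` and its thresholds are assumptions chosen for this sketch; calendar-style traps produce perfectly normal-looking URLs and need other limits, such as a page budget per site.

```python
from collections import Counter
from urllib.parse import urlsplit

def looks_like_trap(url, max_length=200, max_depth=8, max_repeats=2):
    """Heuristic: very long URLs, very deep paths, or repeated path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(url) > max_length or len(segments) > max_depth:
        return True
    if segments and max(Counter(segments).values()) > max_repeats:
        return True
    return False

print(looks_like_trap("http://example.com/bar/foo/bar/foo/bar/foo/bar/"))  # True
print(looks_like_trap("http://example.com/docs/intro.html"))               # False
```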
parallelization policy • Dynamic assignment • A central server balances the load and assigns URLs • A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders. • A large crawler configuration, in which the DNS resolver and the queues are also distributed. • Static assignment • Nodes inform others which pages are downloaded • URLs are assigned to nodes by a hash of the web site
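Static assignment by hashing can be sketched in a few lines. The function `assign_node` and the choice of MD5 over the host name are assumptions; any stable hash works, as long as every node computes the same mapping.

```python
import hashlib
from urllib.parse import urlsplit

def assign_node(url, num_nodes):
    """Map a URL to a crawler node by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# All pages of one site land on the same node, which keeps politeness local.
same = assign_node("http://example.com/a", 4) == assign_node("http://example.com/b/c", 4)
print(same)  # True
```

Hashing the host instead of the full URL means one node owns the whole site, so the per-site request delay can be enforced without coordination between nodes.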
problem of similar sources • URL normalization, hash, page fingerprint • Identical content is rare • Crawler tries to detect site differences and makes decision
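A minimal sketch of URL normalization and a page fingerprint for spotting duplicates. The normalization rules shown (lower-case scheme and host, drop default port and fragment) are one common subset, and the helper names are hypothetical. The fingerprint only detects exact duplicates after whitespace and case cleanup, which fits the remark that identical content is rare; near-duplicates need a similarity measure instead.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lower-case scheme and host, drop the default port and the fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))

def fingerprint(text):
    """Hash of the text after collapsing whitespace and lower-casing."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

print(normalize_url("HTTP://Example.COM:80/a#top"))  # http://example.com/a
```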
crawler vs. scraper https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping
parsing complications • What format is it in? • pdf/word/excel/html? • What language is it in? • What character set encoding is in use? • Each of these is a classification problem, which we will study later in the course • But these tasks are often done heuristically: • The classification is predicted with simple rules • Example: "if there are many “the” then it is English".
parsing complications • Documents being indexed can include docs from many different languages • A single index may have to contain terms of several languages • Sometimes a document or its components can contain multiple languages/formats • French email with a German pdf attachment
segmentation • Header, • Footer, • Menu and navigation, • Main content. • Sentences, • Paragraphs, • Bullets, • Chapters with headline.
emails segmentation • Header, • Email text, • Replied or forwarded text, • Attachments, • Signature.
segmentation approaches • Statistical approaches • No. of words and links compared to other segments • Machine learning • Supervised learning • Feature engineering • Patterns • Regexp, trees, graphs.. • Visual approaches
segmentation approaches https://www.ics.uci.edu/~lopes/teaching/cs221W15/slides/WebCrawling.pdf
to text conversion • HTML: NekoHTML • http://nekohtml.sourceforge.net/ • DOC (MS Word): Apache POI • http://poi.apache.org/ • PDF: pdftotext (Linux), PDFBox (Java) • http://pdfbox.apache.org/ • Emails: eml format (mail server, Thunderbird; not MS Outlook), JavaMail library • http://www.oracle.com/technetwork/java/javamail/index.html • Apache Tika • Unified API
tokenization • (Garabík et al., 2004): A token is an arbitrary unit of text that extends the linguistic meaning of the notion of a word. In automatic text segmentation, a token is any string of characters between two whitespace characters, as well as individual punctuation marks, which need not be separated by whitespace from the preceding or following token. Formally, a text therefore consists of tokens and whitespace.
tokenization • Input: “Friends, Romans and Countrymen” • Output: Tokens • Friends • Romans • and • Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit?
tokenization • Issues in tokenization: • Finland’s capital → Finland? Finlands? Finland’s? • Hewlett-Packard → Hewlett and Packard as two tokens? • state-of-the-art: break up hyphenated sequence • co-education • lowercase, lower-case, lower case? • San Francisco: one token or two? • How do you decide it is one token?
general idea • If you consider 2 tokens (e.g. splitting words with hyphens) then queries containing only one of the two tokens will match • Ex1. Hewlett-Packard – a query for "packard" will retrieve documents about "Hewlett-Packard". OK? • Ex2. San Francisco – a query for "francisco" will match docs about "San Francisco". OK? • If you consider 1 token then a query containing only one of the two possible tokens will not match • Ex3. co-education – a query for "education" will not match docs about "co-education".
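The effect of the hyphen decision on matching can be tried out with a toy tokenizer. `tokenize` and `matches` are hypothetical helpers; lower-casing is included only so that queries and documents compare equal, and the regular expressions are a simplification that ignores apostrophes, numbers with punctuation, and multi-word names.

```python
import re

def tokenize(text, split_hyphens=True):
    """Split text into lower-cased tokens; optionally keep hyphenated words whole."""
    pattern = r"\w+" if split_hyphens else r"\w+(?:-\w+)*"
    return [t.lower() for t in re.findall(pattern, text)]

def matches(query, document, split_hyphens):
    """True if every query token occurs among the document tokens."""
    doc_tokens = set(tokenize(document, split_hyphens))
    return all(t in doc_tokens for t in tokenize(query, split_hyphens))

doc = "Hewlett-Packard printers"
print(matches("packard", doc, split_hyphens=True))   # True
print(matches("packard", doc, split_hyphens=False))  # False
```

Whatever rule is chosen has to be applied to both the indexed text and the query, otherwise a query for "Hewlett-Packard" would not find the document either.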
numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces (but we should not split the token) • Older IR systems may not index numbers • But often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately • Creation date, format, etc.
Lucene Analysis tokenization http://lucene.apache.org/ • XY&Z Corporation – xyz@example.com • WhitespaceAnalyzer • [XY&Z] [Corporation] [–] [xyz@example.com] • SimpleAnalyzer – kills numbers • [XY] [Z] [corporation] [xyz] [example] [com] • StopAnalyzer • [XY] [Z] [corporation] [xyz] [example] [com] • StandardAnalyzer • [XY&Z] [corporation] [xyz@example.com]
Elastic Analysis tokenization https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html • "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." • Standard analyzer • the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone • Simple analyzer • the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone • Stop analyzer • quick, brown, foxes, jumped, over, lazy, dog, s, bone • Pattern analyzer • regexp
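The behaviour of the simpler analyzers can be approximated in plain Python to see where the token lists above come from. These three functions are rough imitations written for this sketch, not the Lucene or Elasticsearch implementations, and `STOP_WORDS` is a small illustrative subset of an English stop word list.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative subset

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Lower-case and split on anything that is not a letter (drops numbers)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def stop_analyzer(text):
    """Simple analyzer followed by stop word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(simple_analyzer(sentence))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
print(stop_analyzer(sentence))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```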
lexical analysis • [cesta ~ WORD]; • [9 ~ NUMBER]; • [, ~ COMMA]; • [1.2.2005 ~ DATE]; • [www.fiit.stuba.sk ~ LINK] • Example (Slovak): CIT je ... pracovisko ... zriadené k 1.2.2005 ("CIT is ... a department ... established as of 1.2.2005") • [cit ~ WORD]; [je ~ WORD]; [pracovisko ~ WORD]; [zriadené ~ WORD]; [k ~ WORD]; [1.2.2005 ~ DATE]
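A regular-expression lexer is one way to produce such tagged tokens. The patterns in `TOKEN_SPEC` are assumptions that cover only the tag types shown in the example, and the generic `PUNCT` tag for punctuation is a simplification introduced here. More specific patterns are listed first so that a date is not split into numbers.

```python
import re

TOKEN_SPEC = [
    ("DATE",   r"\d{1,2}\.\d{1,2}\.\d{4}"),          # e.g. 1.2.2005
    ("LINK",   r"(?:https?://|www\.)[^\s,;]+"),      # e.g. www.fiit.stuba.sk
    ("NUMBER", r"\d+(?:[.,]\d+)?"),
    ("WORD",   r"[^\W\d_]+"),                        # letters, including accented ones
    ("PUNCT",  r"[^\w\s]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Return (lower-cased token, tag) pairs; whitespace is skipped."""
    return [(m.group().lower(), m.lastgroup) for m in LEXER.finditer(text)]

print(lex("zriadené k 1.2.2005"))
# [('zriadené', 'WORD'), ('k', 'WORD'), ('1.2.2005', 'DATE')]
```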
lexical tags to terms • compound words (one or several) • inserting words (notebook, laptop) • spell correction • not in documents • when users interact • necessary when queries are text • documents without punctuation (sms, chat, emails)
language issues • French • L'ensemble: one token or two? • L ? L’ ? Le ? • Want l’ensemble to match with un ensemble • Until at least 2003, it didn’t on Google • Internationalization!
language issues • German noun compounds are not segmented • Lebensversicherungsgesellschaftsangestellter • ‘life insurance company employee’ • German retrieval systems benefit greatly from a compound splitter module • Can give a 15% performance boost for German
language issues • Chinese and Japanese have no spaces between words: • 莎拉波娃现在居住在美国东南部的佛罗里达。 • Not always guaranteed a unique tokenization • Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji) • Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right • Words are separated, but letter forms within a word form complex ligatures • Example (reading direction switches ← → ← → ← within the line): “Algeria achieved its independence in 1962 after 132 years of French occupation.” • With Unicode, the surface presentation is complex, but the stored form is straightforward
language detection • statistical approaches • N-grams
Example input (Slovak news text): Na Slovensko mieri obrie lietadlo. Keď sme ho objednávali, panovala vo svete úplne iná situácia. „Predpokladáme, že budeme potrebovať premiestňovať jednotky na väčšie vzdialenosti. (...) Viete, že SR je aktívna v niekoľkých misiách, operáciách už aj v súčasnosti. Tieto potrebujeme neustále zásobovať, prepravovať ľudí, rotovať.“ Výrok zaznel z úst niekdajšieho ministra obrany Martina Fedora koncom mája 2006. Práve vtedy predvádzali zahraniční výrobcovia na vojenskom letisku Kuchyňa na Záhorí veľké dopravné lietadlá, z ktorých si malo Slovensko vybrať náhradu za dosluhujúce stroje Antonov. Z ponuky sme si napokon vybrali dve lietadlá Spartan C-27J. Prvé z nich by malo prísť nasledujúci mesiac – viac než 11 rokov od propagačnej akcie v Kuchyni. Medzičasom sa zmenila situácia vo vzdialenom Iraku a aj v ešte vzdialenejšom Afganistane. Využijeme ešte vôbec objednané lietadlá? Ministerstvo má jasnú odpoveď. Milióny a miliardy. Za prvé lietadlo Spartan talianskej firmy Alenia Aermacchi sme mali podľa dohody zaplatiť 34,5 milióna eur, ďalších 25 miliónov eur si mala vyžiadať podpora a výcvik. Keďže výrobca s dodávkou mešká, môžeme žiadať kompenzácie. Lietadlo by malo doraziť v čase, keď sa u nás diskutuje o omnoho väčších nákupných plánoch v armáde. Na obnovu vojenskej techniky by chcel rezort miliardy eur. Do akej miery sú plány reálne, by sa malo ukázať už onedlho pri predstavovaní verejného rozpočtu na nasledujúce roky. K dodávke Spartanu sa nedávno vyjadril náčelník generálneho štábu ozbrojných síl Milan Maxim. „Určite neostane bez využitia,“ ubezpečoval na stretnutí s novinármi.
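A minimal sketch of n-gram based language detection, assuming character trigrams and a simple overlap score. The two training snippets in `SAMPLES` are tiny, hypothetical stand-ins; usable profiles are built from much larger corpora and typically compare ranked n-gram frequencies instead of plain overlap.

```python
from collections import Counter

def trigrams(text):
    """Count character trigrams of the lower-cased, whitespace-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Tiny hypothetical training samples; real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the table",
    "sk": "keď sme ho objednávali panovala vo svete úplne iná situácia "
          "na slovensko mieri obrie lietadlo",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def detect(text):
    """Return the language whose profile shares the most trigrams with the text."""
    grams = trigrams(text)

    def score(lang):
        profile = PROFILES[lang]
        return sum(count for gram, count in grams.items() if gram in profile)

    return max(PROFILES, key=score)

print(detect("the lazy dog"))    # en
print(detect("obrie lietadlo"))  # sk
```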
normalization to terms • We need to “normalize” words in indexed text as well as query words into the same form • We want to match U.S.A. and USA • Result is a term: a term is a (normalized) word type, which is an entry in our IR system dictionary • We define equivalence classes of terms by, e.g., • deleting periods to form a term • U.S.A., USA ∈ [USA] • deleting hyphens to form a term • anti-discriminatory, antidiscriminatory ∈[antidiscriminatory]
other languages • Accents: e.g., French résumé vs. resume • Umlauts: e.g., German: Tuebingen vs. Tübingen • Should be equivalent • Most important criterion: • How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them • Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen ∈ [Tubingen]
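The equivalence classes for periods, hyphens, and accents can be sketched with the standard `unicodedata` module. `normalize_term` is a hypothetical helper; it removes combining marks after Unicode decomposition, so it maps Tübingen to Tubingen but does not handle the transliterated spelling Tuebingen, which would need a language-specific rule.

```python
import unicodedata

def normalize_term(token):
    """Delete periods and hyphens, then strip accents and umlauts."""
    token = token.replace(".", "").replace("-", "")
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_term("U.S.A."))               # USA
print(normalize_term("anti-discriminatory"))  # antidiscriminatory
print(normalize_term("Tübingen"))             # Tubingen
```

The same function has to be applied to indexed text and to query terms so that both end up in the same class.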
other languages • Tokenization and normalization may depend on the language and so are intertwined with language detection • Example: Morgen will ich in MIT … • Is this German “mit”? • Crucial: Need to “normalize” indexed text as well as query terms identically
case folding • Reduce all letters to lower case • exception: upper case in mid-sentence? • e.g., General Motors • Fed vs. fed • SAIL vs. sail • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… • Longstanding Google example: [fixed in 2011…] • Query C.A.T. • #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
normalization to terms • do we handle synonyms and homonyms? • E.g., by hand-constructed equivalence classes • car = automobile color = colour • We can rewrite to form equivalence-class terms • When the document contains automobile, index it under car-automobile (and vice-versa) • Or we can expand a query • When the query contains automobile, look under car as well • what about spelling mistakes? • One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
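A sketch of the classic American Soundex code, assuming the usual rule set: keep the first letter, encode the remaining consonants as digits, collapse neighbouring letters with the same digit (also across h and w), and pad to four characters. The function name and the handling of non-letters are choices made for this example.

```python
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(word):
    """Return the four-character Soundex code of `word`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(c, "")   # vowels get "" and reset `previous`
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Words with the same code fall into one equivalence class, so a misspelled query term can be expanded to the indexed terms that share its code.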