Web Noises Detection and Elimination

Web Noises Detection and Elimination. PengBo, Dec 3, 2010. What are Web Noises? Navigation (NavGuide). Topic. Advertisement (Adv). Call them Noises. Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction.


Presentation Transcript


  1. Web Noises Detection and Elimination. PengBo, Dec 3, 2010

  2. What are Web Noises?

  3. Navigation (NavGuide) • Topic • Advertisement (Adv)

  4. Call them Noises • Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction. • They hamper automated information gathering and Web data mining (“Template Detection via Data Mining and its Applications”)

  5. Non-Relevant Data on the Web • A fundamental problem on the Web: • “non-relevant” – not directly related to the main topic / functionality of the page • Local (intra-page) noise • Irrelevant items within a Web page. • E.g., banner ads, navigational guides Many pages contain lots of non-relevant data

  6. Duplicate data on the Web • Another problem on the Web: • Mirrors, copied news, etc. • Global noise • Redundant objects • Larger than an individual page • E.g., mirror sites, duplicated Web pages • There is much duplicate or near-duplicate data

  7. Why does it matter? • Hypertext IR Principles (principles of all link-based IR tools): • Relevant Linkage Principle • p links to q ⇒ q is relevant to p • Topical Unity Principle • q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other • Lexical Affinity Principle • The closer the links to q1 and q2 are, the stronger the relation between them.

  8. Violations of Relevant Linkage Principle • Navigational links • http://www.ibm.com/ • Download links • http://www.beethoven.com/ • Advertisement links • http://www.yahoo.com/ • Endorsement links • http://www.ebay.com/ • Spam links

  9. Violations of Topical Unity Principle • Violations of the Relevant Linkage Principle • Bookmark pages • http://bookmark.yinsha.com/ (Online Bookmarks) • General resource lists • http://sewm.pku.edu.cn/IR-Guide.txt (IR Guide) • Personal homepages • http://www.cse.iitb.ac.in/~soumen/ (Soumen’s Home Page)

  10. Violations of Lexical Affinity Principle • Alphabetical index lists • Computer and Communication Companies ("M" entries) • HTML representation • Adjacent cells in the same column are far from each other in the HTML text

  11. IR Tool Problems • Generalization • Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites • Topic drift • Search for “Finite Model Theory” and get SF 49’ers fan web sites • Irrelevance • Get “Yahoo” as a result regardless of the query • Bias • Search for “computing companies” and get Microspy highly ranked

  12. Hypertext Improvement Problem • remove violations of the Hypertext IR principles • process quickly millions of pages Main Goal • Develop hypertext processing techniques that: • automatically improve hypertext data • are efficient and scalable

  13. Hypertext Cleaner (diagram with components: Web, Crawler, Hypertext Cleaning, IR Tool)

  14. Template detection

  15. DOM Tree • Template

  16. Templates

  17. Template Detection • Semantic Definition: • A template is a master HTML shell page that is used as a basis for composing new pages • Content of new pages is plugged into the template shell • All pages share a common look & feel • Usually controlled by a central authority • Not necessarily confined to a single site • May include a variety of data • Navigational bars • Advertisements • Company info and policies

  18. Search pagelet Ad pagelet Navigation pagelet Services pagelet Company info pagelet

  19. Pagelets • Semantic Definition: • A pagelet is a maximal region of a page that has a single topic or functionality • Not too large • has only one topic / functionality • Not too small • any larger region that contains it has other topics / functionalities

  20. IR with Pagelets Main Idea 1 Use pagelets rather than pages as atomic units for information retrieval Main Idea 2 Eliminate pagelets belonging to templates

  21. Pagelets: Syntactic Definition • A pagelet is a node in the HTML parse tree of a page satisfying the following: • Its HTML tag is one of the following: • <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, … • None of its children contains more than k hyperlinks • None of its ancestors is a pagelet
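
The syntactic definition above can be read as a top-down walk over the parse tree: stop at the first node with an allowed tag whose children each hold at most k hyperlinks. The following is a minimal sketch of that reading, not the authors' implementation. It assumes the page is well-formed markup that `xml.etree.ElementTree` can parse (real HTML usually needs a tolerant parser), and the names `PAGELET_TAGS`, `link_count` and `find_pagelets` are hypothetical. The tag set stops where the slide's list is elided.

```python
import xml.etree.ElementTree as ET

# Tags named on the slide; the slide's list continues with "…", which is left open here.
PAGELET_TAGS = {"table", "ol", "ul", "area", "p", "dl"}

def link_count(node):
    """Count <a> elements in the subtree rooted at node (node itself included)."""
    return sum(1 for _ in node.iter("a"))

def find_pagelets(root, k):
    """Return pagelet nodes in document order.

    A node is accepted when its tag is allowed and no child subtree holds
    more than k hyperlinks. Accepted nodes are not descended into, so no
    pagelet ever has a pagelet ancestor.
    """
    pagelets = []
    stack = [root]
    while stack:
        node = stack.pop()
        allowed = node.tag.lower() in PAGELET_TAGS
        if allowed and all(link_count(child) <= k for child in node):
            pagelets.append(node)
        else:
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(list(node)))
    return pagelets

page = ET.fromstring(
    "<html><body>"
    "<table><tr><td><a/><a/><a/></td></tr></table>"
    "<p>text <a/></p>"
    "<div><ul><li><a/></li></ul></div>"
    "</body></html>"
)
print([n.tag for n in find_pagelets(page, k=2)])
```

With k=2 the table is rejected because its row holds three links, while the paragraph and the list are accepted; raising k to 3 makes the table a pagelet as well.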

  22. Templates: Syntactic Definition • A template is a collection T = (p1,…,pk) of pagelets satisfying: • Similarity: p1,…,pk are identical or almost identical • Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T. • Template Recognition Problem: given a set of pages S, find all the templates in S.

  23. Template Recognition in Large Sets • Calculate shingle(p) for each pagelet p ∈ S • Cluster pagelets in S according to shingle • Discard clusters of size 1 • For each remaining cluster C: • Construct graph G_C of pages that own pagelets in C • Find undirected connected components of G_C • Output components of size > 1
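
The recognition steps on this slide can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published algorithm: the `shingle` function here is a stand-in (a hash of whitespace-normalized, lowercased text) rather than a real shingle fingerprint, and the input shapes (`pagelets` as `(page_id, text)` pairs, `links` as a list of page pairs) are hypothetical.

```python
import hashlib
from collections import defaultdict

def shingle(text):
    """Stand-in fingerprint (assumption): hash of normalized pagelet text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def find_templates(pagelets, links):
    """pagelets: list of (page_id, pagelet_text); links: list of (page_a, page_b).

    Returns a list of (fingerprint, sorted page ids), one entry per template.
    """
    # Steps 1-2: fingerprint every pagelet and cluster by fingerprint.
    clusters = defaultdict(list)
    for page_id, text in pagelets:
        clusters[shingle(text)].append(page_id)

    templates = []
    for fingerprint, owners in clusters.items():
        if len(owners) < 2:          # Step 3: discard clusters of size 1.
            continue
        nodes = set(owners)
        # Step 4a: graph G_C over the pages owning pagelets in this cluster.
        adjacency = {n: set() for n in nodes}
        for a, b in links:
            if a in nodes and b in nodes and a != b:
                adjacency[a].add(b)
                adjacency[b].add(a)  # links are treated as undirected
        # Step 4b: undirected connected components by depth-first search.
        seen = set()
        for start in sorted(nodes):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                n = stack.pop()
                if n not in component:
                    component.add(n)
                    stack.extend(adjacency[n] - component)
            seen |= component
            if len(component) > 1:   # Step 4c: output components of size > 1.
                templates.append((fingerprint, sorted(component)))
    return templates

pagelets = [
    ("a", "Home | About"), ("b", "home |  about"), ("c", "Home | About"),
    ("d", "Home | About"), ("a", "text found only on page a"),
]
links = [("a", "b"), ("b", "c")]
print([pages for _, pages in find_templates(pagelets, links)])
```

Page d carries the same pagelet but is not linked to the others, so it falls into a component of size 1 and is not reported as part of the template.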

  24. Evaluation • Question: • How to evaluate the performance/effectiveness of this cleaning algorithm?

  25. Benefits of template detection

  26. Cleaning via feature weighting

  27. Cleaning via feature weighting • In a given Web site • Noisy blocks — Share common contents or presentation styles • Meaningful (or main) blocks — diverse in contents and presentation style • Weighting features makes cleaning automatic (nothing is eliminated) “Eliminating noisy information in Web pages for data mining”

  28. DOM trees • HTML: <BODY bgcolor=WHITE> <TABLE width=800 height=200> … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY> • Corresponding tree: root → BODY (bc=white) → TABLE (width=800, height=200), IMG (width=800), TABLE (bc=red)

  29. Build Site style tree (SST) common

  30. SST • Style Node S = (ELEMENTs, n) • ELEMENTs: a sequence of element nodes • n: number of pages that have this style • Element Node E = (Tag, Attr, STYLEs) • Tag: tag name, e.g., TABLE, IMG • Attr: display attributes of Tag, e.g., bgcolor=RED • STYLEs: style nodes below E

  31. Inner Node Leaf Node Quantify the importance

  32. Weighting policy • Inner Node Importance (1): $NodeImp(E) = -\sum_{i=1}^{l} p_i \log_m p_i$ • l = |E.STYLEs| • m = number of pages containing E, |E.parent.n| • p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style • Inner NodeImp(E): diversity of presentation styles

  33. $NodeImp(Body) = -1 \cdot \log_{100} 1 = 0$ • $NodeImp(Table) = -(0.35 \log_{100} 0.35 + 2 \cdot 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$
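
The two values on this slide can be checked numerically. The sketch below assumes that the inner-node importance is the entropy of the style distribution taken with logarithm base m, which is the form the worked numbers imply, and that p_i is the share of the m pages using style i. The function name and its input shape (a list of per-style page counts) are hypothetical, and the m = 1 case is not covered by the slides, so the sketch rejects it.

```python
import math

def inner_node_importance(style_page_counts):
    """Entropy (log base m) of the presentation styles below an element node.

    style_page_counts[i] is the number of pages using the i-th style;
    m is taken as their sum (assumption: each page uses exactly one style).
    """
    m = sum(style_page_counts)
    if m <= 1:
        raise ValueError("log base m is undefined for m <= 1; not covered here")
    total = 0.0
    for count in style_page_counts:
        if count > 0:                      # 0 * log(0) is treated as 0
            p = count / m
            total += p * math.log(p, m)
    return -total

# Body: all 100 pages share one style, so there is no diversity.
print(inner_node_importance([100]))
# Table: four styles used by 35, 25, 25 and 15 of the 100 pages.
print(round(inner_node_importance([35, 25, 25, 15]), 2))
```

The second call rounds to 0.29, matching the slide; a uniform split over m styles on m pages would give the maximum value of 1.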

  34. Weighting policy • Features (terms) of a leaf node • Importance of a leaf node's features (3): $H_E(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}$ • m = number of pages containing E, |E.parent.n| • p_ij: probability that a_i appears in E of page j • H_E(a_i): information entropy of a_i • The higher H_E(a_i), the less important a_i

  35. Weighting policy • Leaf Node Importance (2): $NodeImp(E) = \frac{1}{N} \sum_{i=1}^{N} (1 - H_E(a_i))$ • N: number of features in E • a_i: a feature of the content in E • (1 − H_E(a_i)): information contained in a_i • Leaf NodeImp(E): content diversity of E

  36. Worked example (figure: SST fragment with root, Ep, IMG, TABLE, and leaf node E, labeled 3) • Contents of E: t1: PCMag, samsung; t2: PCMag, epson; t3: PCMag, canon • m = 3 • N = |{PCMag, samsung, epson, canon}| = 4 • $H_E(PCMag) = -3 \cdot (\tfrac{1}{3} \log_3 \tfrac{1}{3}) = 1$ • $H_E(samsung) = H_E(epson) = H_E(canon) = -(0 + 0 + 1 \cdot \log_3 1) = 0$ • NodeImp(E) = ((1 − 1) + 3 · (1 − 0)) / 4 = 0.75
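
The PCMag example can be reproduced with a short script. This sketch assumes p_ij is the share of a feature's occurrences that fall on page j, which is what the example's values (1/3 each for PCMag, and 1 on a single page for the brand names) imply. The function names and the list-of-term-lists input are hypothetical.

```python
import math

def feature_entropy(counts):
    """H_E(a_i): entropy, log base m, of one feature's spread over m pages.

    counts[j] is how often the feature occurs in E on page j.
    """
    m = len(counts)
    if m <= 1:
        raise ValueError("log base m is undefined for m <= 1; not covered here")
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c > 0:                          # 0 * log(0) is treated as 0
            p = c / total
            entropy -= p * math.log(p, m)
    return entropy

def leaf_node_importance(pages):
    """NodeImp(E) for a leaf: mean of (1 - H_E(a_i)) over the N distinct features.

    pages[j] is the list of terms found in E on page j.
    """
    features = sorted({term for page in pages for term in page})
    information = [
        1.0 - feature_entropy([page.count(term) for page in pages])
        for term in features
    ]
    return sum(information) / len(features)

pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
print(round(leaf_node_importance(pages), 2))
```

PCMag appears on every page and contributes nothing, while each brand name appears on one page only and contributes fully, giving 3/4 as on the slide.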

  37. Transitive weighting policy • Composite Importance (figure: SST annotated with importance values 0, 0.29, 0 and 0.75)

  38. Page noise • Noisy element node: • For an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then we say element node E is noisy. • Maximal noisy element node • Meaningful element node: • If an element node E in the SST does not contain any noisy descendant, we say that E is meaningful. • Maximal meaningful element node
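
The noisy-node definition translates directly into a recursive check. The sketch below is a simplification under stated assumptions: it uses a plain tree of element nodes and skips the style nodes that sit between element levels in a real SST, it takes the composite importance of each node as given, and it reads "maximal noisy" as a noisy node with no noisy ancestor. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ElementNode:
    tag: str
    comp_imp: float                       # composite importance, assumed precomputed
    children: List["ElementNode"] = field(default_factory=list)

def is_noisy(node, t):
    """True if the node and all of its descendants fall below threshold t."""
    return node.comp_imp < t and all(is_noisy(c, t) for c in node.children)

def maximal_noisy(node, t):
    """Topmost noisy nodes: noisy nodes that have no noisy ancestor."""
    if is_noisy(node, t):
        return [node]
    found = []
    for child in node.children:
        found.extend(maximal_noisy(child, t))
    return found

tree = ElementNode("body", 0.5, [
    ElementNode("nav", 0.05, [ElementNode("a", 0.01)]),
    ElementNode("main", 0.8, [ElementNode("ad", 0.1)]),
])
print([n.tag for n in maximal_noisy(tree, t=0.2)])
```

Here the whole nav subtree is reported once at its root, and the low-importance ad is reported on its own because its parent is above the threshold.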

  39. Web page cleaning via block elimination • We can use SST (site style tree) to identify & eliminate noise content blocks in a page. • Build SST by sample pages crawled from a site. • Computing an importance value for each block, using a specified threshold t to decide noisy or not noisy • Matching to noisy blocks and not noisy blocks in the tree, given a new page.

  40. Noise Detection and Elimination (figure: DOM tree of an example page with node types root, Body, Table, Img, P, Tr, Text, A)

  41. After simplification (figure: remaining tree with nodes root, Body, Table, Img, Table, Table, Tr, Tr, Text)

  42. Summary of the technique • Evaluate commonality and diversity of content and styles • DOM trees → SST • Information Entropy Based Evaluation • Node Importance • Composite Importance • Noise detection and automatic matching

  43. Near duplicate detection

  44. Syntactic clustering of the web contents, WWW6, 1997

  45. Document Representation • How to represent a document? • Represent document content by a feature set, in preparation for computing resemblance or similarity. • For document D, extract its feature set as S(D)

  46. Defining similarity of documents • How to express the concept “roughly the same” precisely? • Quantitative definition: resemblance • The resemblance of two documents A and B is a number between 0 and 1.
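
One common way to obtain such a number is the Jaccard coefficient of the two feature sets, |S(A) ∩ S(B)| / |S(A) ∪ S(B)|, with S(D) taken as the set of w-word shingles of D. The slide itself does not state the formula, so the sketch below is an assumption about what is meant; the word-level tokenization, the default w = 4, and the function names are illustrative choices.

```python
def shingles(text, w=4):
    """S(D): the set of contiguous w-word sequences in the document."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short documents: treat the whole text as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, a number in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0                        # two empty documents are identical
    return len(sa & sb) / len(sa | sb)

doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower"
print(resemblance(doc_a, doc_a))
print(resemblance(doc_a, doc_b))
```

The first document has three distinct 4-word shingles and the second has two, with one in common, so the second call gives 1/4.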
