Wrapper Generation and HTML Reduction

Wrapper Generation and HTML Reduction Yu Li

Outline • The web page extraction problem • SGWrap System • Problems with HTML • HTML reduction • Basic idea • Problem definition and goals • Page model • Algorithm design • Future work

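The page extraction problem • The Web holds a large amount of data in the form of semi-structured HTML pages • Web data integration needs this semi-structured data converted into structured data • The task of page extraction: convert semi-structured Web data into structured data according to the user's requirements • A program that performs page extraction is usually called a wrapper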

The page extraction problem • [Figure labels: mapping, wrapper]

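The page extraction problem • Page extraction can be done by • Writing the wrapper by hand: the mapping is hard-coded into the wrapper program in a conventional programming language • Generating the wrapper with a tool: the wrapper program is produced with computer assistance • Issues: extraction rules, interaction style, maintenance • Fully automatic extraction • Issues: page structure segmentation, annotation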
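To make the hand-written option concrete, here is a minimal sketch of a wrapper whose mapping is hard-coded. It assumes a hypothetical page with one table whose rows hold a name cell and a platform cell; the class name, the field names, and the sample data are illustrative and do not come from SGWrap.

```python
from html.parser import HTMLParser


class RobotTableWrapper(HTMLParser):
    """Hand-coded wrapper: the page-to-record mapping is hard-wired here."""

    FIELDS = ("name", "platform")  # hypothetical target schema

    def __init__(self):
        super().__init__()
        self.records = []      # structured output
        self._cells = None     # cell texts of the row being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Only rows with exactly one cell per field become records.
            if len(self._cells) == len(self.FIELDS):
                values = (cell.strip() for cell in self._cells)
                self.records.append(dict(zip(self.FIELDS, values)))
            self._cells = None

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data


def extract(html):
    """Run the wrapper over one page and return its records."""
    wrapper = RobotTableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records
```

Any change in the page layout (an extra column, a nested table) means editing this program, which is the maintenance cost that tool-assisted generation tries to reduce.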

SGWrap System • SGWrap = Schema Guided Wrapper Generation • [Figure labels: SGWrap System, interact, generate, Wrapper Program, run, HTML page, data]

SGWrap System • SGWrap consists of three main parts. • SGWrap Runtime (Runtime, for short), which provides access to our algorithms for web page content extraction. It is the underlying functional layer of the whole system, so reusing or integrating a wrapper also requires reusing or integrating the Runtime. • SGWrap Compiler (Compiler, for short), which compiles SGWrap rules into a wrapper in both source code and bytecode form. It acts as a translator: the generated source code is human-readable and can be modified to meet special needs. The bytecode is produced with Java's compiler, javac.exe. • Visual SGWrap, a visual tool for generating rules. The user interacts with it through simple select-and-click operations, and it then computes the appropriate rules.

SGWrap System – basic usage

SGWrap System – basic usage • 3 steps • Design the rule with Visual SGWrap • Compile the rule into a program with SGWrapC • Test and apply the wrapper with SGWrap (Runtime) • A tutorial is available at http://idke.ruc.edu.cn/sgwrap/doc/A-10-Minutes-Tutorial.html (also included in the documentation of each installation)

Welcome to http://idke.ruc.edu.cn/sgwrap

SGWrap Rule Language • How can the mapping be described formally? • [Figure labels: mapping, wrapper]

SGWrap Rule Language • A formal language that describes the user's intent is important for web data extraction systems. It should be • Exact. This is the basic constraint. Because a wrapper program must produce exact results for automatic extraction, the language describing the wrapper's intent must also be exact. • Expressive. The language should be able to describe the typical intentions and considerations of a user. In our case, it should be able to express DOM tree navigation and the construction of structured results. • Compact. The language should be simple and powerful. It should describe the problem in a short script and offer facilities for common operations, such as string operations. • Understandable. A rule is written not only for the computer but also for people, so the language should be human-readable, since users may want to customize and adjust rules.

SGWrap Rule Language • SGWrap's Rule language is designed to be that kind of language. It is exact because it uses XPath as the basic way to describe the DOM tree. It is expressive because it adopts XQuery's FLWR expression for result construction. It is also compact and understandable. • A Rule consists of three parts: • (a) an assign clause, • (b) a variable name for the returned result, and • (c) a return clause, which can be a variable name, a function clause, or a Rule array containing other Rules.

SGWrap Rule Language - example
    {
        LET $Web_robots:=document($d)
        // document($d) is an expression reserved by SGWrap Rule; it
        // represents the "root" of a document.
        RETURN <Web_robots>
        {
            FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
            // What follows is an array of Rules, which means that the result
            // consists of a series of child nodes.
            RETURN <robot>
            {
                LET $name:=$robot/TD[0]/A
                RETURN <name>$name</name>
            }
            {
                LET $Platform:=$robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
                RETURN <Platform>$Platform</Platform>
            }
            </robot>
        }
        </Web_robots>
    }
Refer to http://idke.ruc.edu.cn/sgwrap/doc/Rule-Specification.html#Rule-Specification for the specification.
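As a rough illustration of what the example rule asks for, the sketch below performs the same navigation and result construction in Python with the standard library. It is not the code that the SGWrap Compiler generates. It assumes the input is well-formed markup with upper-case tag names, it copies only the text of the matched nodes into the result, and the function name `run_rule` is hypothetical.

```python
import xml.etree.ElementTree as ET


def run_rule(xhtml):
    """Build a <Web_robots> tree from a well-formed page, mirroring the rule."""
    root = ET.fromstring(xhtml)          # root is the HTML element
    result = ET.Element("Web_robots")
    # FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
    for row in root.findall("./BODY/TABLE/TBODY/TR"):
        cells = row.findall("TD")
        if len(cells) < 2:
            continue                      # row does not have the expected shape
        robot = ET.SubElement(result, "robot")
        # LET $name := $robot/TD[0]/A   (indices are zero-based, as in the rule)
        name = cells[0].find("A")
        if name is not None:
            ET.SubElement(robot, "name").text = "".join(name.itertext()).strip()
        # LET $Platform := $robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
        for detail in cells[1].findall("./TABLE/TBODY/TR"):
            header = detail.find("TH")
            value = detail.find("TD")
            if header is None or value is None:
                continue
            if "Platform:" in "".join(header.itertext()):
                text = "".join(value.itertext()).strip()
                ET.SubElement(robot, "Platform").text = text
    return result


# Made-up sample page used only to exercise the sketch.
SAMPLE = """<HTML><BODY><TABLE><TBODY>
<TR><TD><A>RobotA</A></TD><TD><TABLE><TBODY>
<TR><TH>Purpose:</TH><TD>indexing</TD></TR>
<TR><TH>Platform:</TH><TD>unix</TD></TR>
</TBODY></TABLE></TD></TR>
</TBODY></TABLE></BODY></HTML>"""
```

Calling `ET.tostring(run_rule(SAMPLE), encoding="unicode")` serializes the constructed result. Real HTML is rarely well-formed XML, so a production wrapper would need a tolerant HTML parser in front of this step.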

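SGWrap Rule Language • Applying the SGWrap Rule Language to HTML page extraction has exposed several problems • HTML pages themselves make extraction difficult • Rules have no conditional branch statement, so they cannot choose between alternatives • Rules are built on the W3C DOM model, but the W3C DOM standard does not match the de facto standard (the IE DOM)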

What is HTML? • “To publish information for global distribution, one needs a universally understood language, a kind of publishing mother tongue that all computers may potentially understand. The publishing language used by the World Wide Web is HTML (from HyperText Markup Language). ” • “HTML gives authors the means to: • Publish online documents with headings, text, tables, lists, photos, etc. • Retrieve online information via hypertext links, at the click of a button. • Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc. • Include spread-sheets, video clips, sound clips, and other applications directly in their documents. ”

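Problems with HTML • Different tag sequences can produce similar layout • Tags meant for page elements are combined to imitate the layout semantics of tags meant for document structure • Structures can be nested arbitrarily, and meaningless nesting is allowed • Structural division and decoration semantics are mixed together, so decorating text creates unnecessary structure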

[Figures: example pages Amazon1, Amazon2, Google]

Categories of HTML tags • HTML 4.01 defines 93 tags for writing web pages. • The 93 tags serve a variety of purposes, and the HTML standard does not classify them explicitly.

Statistic on HTML tags • Data set • Taken from http://www.data.binghamton.edu/vsewrapper.html • Contains thousands of result HTML pages obtained by querying different DBSE • Statistic 1 • How often do HTML pages use the various tags? • Total number of appearances of each tag, summed over all pages • Statistic 2 • In how many HTML pages does each tag appear? • Number of HTML pages that contain a given tag
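The two statistics can be computed along the following lines. This is a minimal sketch rather than the script used for the reported numbers: it assumes the pages are already loaded as strings, counts opening tags only, and uses the hypothetical names `TagCounter` and `tag_statistics`.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the opening tags seen in one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names; the slides list them in upper case.
        self.tags[tag.upper()] += 1


def tag_statistics(pages):
    """Return (occurrences, page_counts) for an iterable of HTML strings.

    occurrences[tag] -> total appearances over all pages   (Statistic 1)
    page_counts[tag] -> number of pages containing the tag (Statistic 2)
    """
    occurrences = Counter()
    page_counts = Counter()
    for html in pages:
        counter = TagCounter()
        counter.feed(html)
        counter.close()
        occurrences.update(counter.tags)          # add per-page counts
        page_counts.update(counter.tags.keys())   # add 1 per distinct tag
    return occurrences, page_counts
```

`occurrences.most_common(10)` and `page_counts.most_common(10)` then give rankings of the kind shown on the result slides.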

Statistic 1 result • Top tags • TD, A, TR, BR, FONT, IMG, B, SPAN, TABLE, INPUT, OPTION, P, I, DIV • Top tags for defining structure • SPAN(32314), TABLE(27591), P(13769), DIV, LI, BODY, HTML, DD, UL,

Statistic 2 result • Top tags • A, HEAD, BR, BODY, HTML, TITLE, IMG, TABLE, TR, TD, FORM, INPUT, B, FONT, META, P, LINK, DIV, SCRIPT • Top tags for defining structure • BODY(1765), HTML(1754), TABLE(1672), P(1269), DIV(1069), SPAN

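Statistic conclusion • HTML defines a large number of tags, but only a small part of them are used frequently • Reason: tags for semantic markup were added late in the standard, and page authors care about the layout effect rather than semantic markup • HTML is a "hodgepodge" language that mixes many interactive components into structured documents • Reason: the HTML standard grew through repeated incremental upgrades and drifted toward complexity • HTML pages need reduction, and reduction is feasible, to separate out the structural information • Fewer than half of the tags are used frequently • Tags used for structural division make up only about 1/4 of all HTML tags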

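Summary • An HTML document contains • A simple division into document structure • Various page elements • HTML's goal is to display pages; its structural information is implicit in tags and tag combinations • HTML documents contain unnecessary structure and redundancy, and decoration splits text into discontinuous structures • Authors tend to use a few simple tags to synthesize all kinds of semantics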

[Figure labels: W3C HTML; Document with content structure specified; Document built with structure and page elements; Extraction Requirements]

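HTML reduction • If structure information can be discovered in the element sequence, an HTML document can be converted into a document that contains only structure information and is suited to extraction • [Figure labels: Document built with structure and page elements; Program captures structure information in element sequence; Document with content structure specified]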

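Problem definition • Given an HTML document H, a program produces a corresponding document S carrying structure information, such that S • Has the same structure information as H, that is, the division of the text content is unchanged • Reflects the decoration information in H, that is, it records which text is decorated, without introducing excessive structure • Removes the redundant structure in H and contains no redundant structure itself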

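Problems to solve • A suitable document model for describing structural information • Expressive enough to describe most commonly used document structures • Preserves the continuity of text • Free of redundancy by construction • Exploring the kinds of structural information present in current HTML pages • Structural information already defined in HTML and expressed by specific tags • Structural information from traditional document typesetting that page authors imitate with tag combinations • Designing an algorithm that computes the structural information of an HTML document in a formal way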

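Page Model • What kind of Page Model should be designed? • Describes only structure information • Preserves text continuity: continuous text within a structure is not interrupted by decoration • Has no redundancy • No nested structures with the same semantics • No unnecessary structures • Page <!ELEMENT page ((text|figure|table)+)>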

Page Model
line
    <!ELEMENT line (#PCDATA, regions?)>
    <!ATTLIST line id CDATA #REQUIRED>
    <!ELEMENT regions (region+)>
    <!ELEMENT region #PCDATA>
    <!ATTLIST region begin CDATA #REQUIRED end CDATA #REQUIRED>
figure
    <!ELEMENT figure #PCDATA>
    <!ATTLIST figure id CDATA #REQUIRED>
[Figure labels: Line(id), "Page Model for HTML Reduction", Region, Region, Region]
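One way to read the line and region declarations is that decoration is recorded as offsets into an uninterrupted text instead of as nested elements. The sketch below illustrates that reading. The set of decoration tags, the choice of end-exclusive character offsets, and storing the tag name as the region's content are all assumptions made here, since the slides do not specify them.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: which HTML tags count as pure decoration.
DECORATION_TAGS = {"b", "i", "u", "em", "strong", "font", "a"}


@dataclass
class Region:
    begin: int        # offset of the first decorated character
    end: int          # offset just past the last decorated character
    decoration: str   # assumed content of <region>: the decorating tag


@dataclass
class Line:
    id: str
    text: str = ""
    regions: list = field(default_factory=list)


class _LineBuilder(HTMLParser):
    def __init__(self, line_id):
        super().__init__()
        self.line = Line(id=line_id)
        self._open = []   # stack of (tag, begin offset) for open decorations

    def handle_starttag(self, tag, attrs):
        if tag in DECORATION_TAGS:
            self._open.append((tag, len(self.line.text)))

    def handle_endtag(self, tag):
        # Close the innermost open decoration with this tag name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, begin = self._open.pop(i)
                self.line.regions.append(Region(begin, len(self.line.text), tag))
                break

    def handle_data(self, data):
        self.line.text += data   # text stays continuous across decorations


def to_line(fragment, line_id="l1"):
    """Convert an inline HTML fragment into one Line with Regions."""
    builder = _LineBuilder(line_id)
    builder.feed(fragment)
    builder.close()
    return builder.line
```

For a fragment such as `Price: <b>$10</b> only`, the text is kept as one string and the bold span becomes a single region, so an extraction rule can address the whole line without stepping through a `B` element.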

Page Model
item
    <!ELEMENT item (prefix?, content, line_list)>
    <!ATTLIST item id CDATA #REQUIRED>
    <!ELEMENT prefix #PCDATA>
    <!ELEMENT content (line|figure)+>
    <!ELEMENT line_list #PCDATA>
list
    <!ELEMENT list (item+, line_list)>
    <!ATTLIST list id CDATA #REQUIRED>
[Figure labels: item, prefix, content (line|figure)+, line_list]

Page Model
text
    <!ELEMENT text ((line|list)+)>
    <!ATTLIST text id CDATA #REQUIRED>
table
    <!ELEMENT table (row+)>
    <!ATTLIST table id CDATA #REQUIRED>
    <!ELEMENT row (col+, line_list)>
    <!ELEMENT col ((text|figure)+, line_list)>
[Figure labels: row, col, col, col, line_list; col, (text|figure)+, line_list]
form, frame, head, script & plugin: ignored in this version

Page Model - Misc • [Figure: page regions A, B, C, D shown alongside the document <page> <A> <B> <C> <D> </page>]

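Algorithm design • Two possible approaches • Method 1: start from HTML, analyze the patterns that HTML tag combinations can form, record these patterns, and perform the conversion while parsing the HTML file in one or more passes • Method 2: first convert the HTML into a document I in our Page Model, allowing redundant structures in I, then simplify I further by removing unnecessary and redundant structures to obtain the result document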

Method 1: starting from HTML • [Figure labels: Html fragment; table, caption, col, colgroup, thead, tfoot, tbody, tr, th, td] • This is a pattern • If the prefix is "tfoot", we get a "foot line" • If the prefix is "thead", we get "head information" • If the prefix is "tbody", each time we get a "line"
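The table pattern can be sketched as a small parser that labels each row by the section it sits in. This only illustrates the idea; it is not the planned algorithm. It ignores nested tables, and it treats rows outside any thead, tfoot, or tbody as body rows, which is an assumption. The names `TablePattern` and `match_table` are hypothetical.

```python
from html.parser import HTMLParser

# Section tag (the "prefix" of a row) -> kind of information it yields.
SECTION_KIND = {
    "thead": "head information",
    "tfoot": "foot line",
    "tbody": "line",
}


class TablePattern(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []             # list of (kind, [cell text, ...])
        self._section = "tbody"    # assumed default for rows outside a section
        self._cells = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_KIND:
            self._section = tag
        elif tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in SECTION_KIND:
            self._section = "tbody"
        elif tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            kind = SECTION_KIND[self._section]
            self.rows.append((kind, [cell.strip() for cell in self._cells]))
            self._cells = None

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data


def match_table(html):
    """Apply the table pattern to one flat (non-nested) table."""
    pattern = TablePattern()
    pattern.feed(html)
    pattern.close()
    return pattern.rows
```

Each further pattern (lists, headings, block elements) would need its own hand-written recognizer of this kind.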

[Figure labels: p, %inline;, %heading;, %list;, %block;, %preformatted;, DL, DIV, CENTER, BLOCKQUOTE] • The problem with this method is that there are so many possible tag combinations that we cannot find all the patterns by hand; this must be done by a program.

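Method 2: simplification on the Page Model • Considering only the structure information of the tags defined by HTML, an HTML document is easily converted into a document I in the Page Model. But document I will contain • Unnecessary structures, such as a text paragraph nested in a table with one row and one column • Redundant structures, such as multiply nested tables • For document I • Eliminate unnecessary structures, or convert them into an equivalent structure through semantic analysis • Eliminate redundant structures • Obtain the final result document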
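A minimal sketch of one simplification step on document I follows. It implements a single rule, replacing a one-row, one-column table with its content, and applies it bottom-up so that tables nested inside such wrappers collapse as well. The `Node` class is a simplified stand-in for the Page Model, not the model itself, and the rule set is an assumption; the slides leave the actual algorithm as future work.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                     # "page", "table", "row", "col", "text", ...
    children: list = field(default_factory=list)
    text: str = ""


def _single_cell_content(node):
    """Return the content of a 1x1 table, or None if node is not one."""
    if node.kind == "table" and len(node.children) == 1:
        row = node.children[0]
        if (row.kind == "row" and len(row.children) == 1
                and row.children[0].kind == "col"):
            return row.children[0].children
    return None


def simplify(node):
    """Return a copy of the tree with every 1x1 wrapper table removed."""
    new_children = []
    for child in node.children:
        child = simplify(child)                   # simplify bottom-up
        content = _single_cell_content(child)
        if content is not None:
            new_children.extend(content)          # splice content in place of the table
        else:
            new_children.append(child)
    return Node(node.kind, new_children, node.text)
```

Because the input tree is never modified, the result can be compared against document I when checking that the division of the text content is unchanged.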

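Future work • Refining the Page Model • Improve its descriptive power • Write a detailed specification • Algorithm design and implementation • Experiment design and execution • Design: how to evaluate the effect of reduction, and how to choose the experimental data • Application in the SGWrap System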

Q&A Thank You!
