Chapter 4 Query Language

Hsin-Hsi Chen, Department of Computer Science and Information Engineering, National Taiwan University

Introduction • Goals • Which queries can be formulated • How the formulation is related to underlying information retrieval models • Query languages

[Figure: types of queries covered. Labels: basic queries; keywords and context (phrases, proximity); Boolean queries; fuzzy Boolean; natural language; pattern matching (words, prefixes, suffixes, substrings, errors, regular expressions, extended patterns); structured queries]

Keyword-Based Querying • single-word queries • A query is formulated as a single word • A document is formulated as a long sequence of words • A word is a sequence of letters surrounded by separators • What are letters and separators? • e.g., ‘on-line’ • Chinese sentences are composed of characters without word boundaries • The division of the text into words is not arbitrary (this topic will be dealt with in a special talk on Chinese IR)
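What counts as a letter or a separator is a design decision, and the ‘on-line’ example shows why. The sketch below is one possible tokenizer, not a method prescribed by the slides: it assumes ASCII letters and digits are "letters", and the function name `tokenize` and the `keep_hyphen` switch are hypothetical.

```python
import re

def tokenize(text, keep_hyphen=True):
    """Split text into lowercase words; anything that is not a 'letter' separates words."""
    if keep_hyphen:
        # A hyphen between two letter runs is treated as part of the word.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    else:
        # The hyphen is treated as a separator like any other punctuation.
        pattern = r"[A-Za-z0-9]+"
    return [w.lower() for w in re.findall(pattern, text)]

print(tokenize("An on-line query."))                     # ['an', 'on-line', 'query']
print(tokenize("An on-line query.", keep_hyphen=False))  # ['an', 'on', 'line', 'query']
```

The same query word can therefore match or miss a document depending on how both were tokenized, so queries and documents should go through the same rule.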

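Word Segmentation Problem • problem • Chinese sentences have no explicit delimiters between words. • 這名記者會說國語。 (ambiguous: 記者 + 會, "the reporter can ...", vs. 記者會, "press conference") • definition of a word • A string that has an independent meaning and plays a specific grammatical role should be treated as one word. • segmentation standards • Mainland China: 信息處理用現代漢語分詞規範 (Contemporary Chinese Word Segmentation Specification for Information Processing) • formulated in 1989 • submitted as a national standard in 1993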

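Word Segmentation Problem (Continued) • Taiwan: 資訊處理用中文分詞標準草案 (Draft Standard for Chinese Word Segmentation for Information Processing) • drafted in 1996 by the Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會) • basic principles • A string whose meaning cannot be obtained by directly adding up the meanings of its components should be treated as one segmentation unit, e.g., 撞期 (to clash in schedule) vs. 撞山 (to crash into a mountain) • A string whose part of speech cannot be derived directly from its components should be merged into one segmentation unit, e.g., 好喝 (tasty, of a drink)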

Processing Model • A dictionary is an indispensable resource • List "all" possible words • 把他的確實行動作了分析 → 把, 他, 的, 確實, 實行, 行動, 動作, 了, 分析 • 電子計算機是會計算題目的機器 → 電子, 計算, 計算機, 電子計算機, 是, 會, 會計, 計算, 計算題, 題目, 目的, 的, 機器 • [Figure: word lattice for 電子計算機是會計算題目的機器]
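Listing every dictionary word that occurs in a sentence amounts to collecting the edges of a word lattice. The following is a minimal sketch of that step, assuming a small in-memory dictionary taken from the slide's first example; the function name `all_dictionary_words` is hypothetical.

```python
def all_dictionary_words(sentence, dictionary):
    """Return (start, end, word) for every dictionary word found in the sentence."""
    max_len = max(len(w) for w in dictionary)
    edges = []
    for start in range(len(sentence)):
        # Try every substring starting here, up to the longest dictionary entry.
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word in dictionary:
                edges.append((start, end, word))
    return edges

# Toy dictionary: only the words listed on the slide for this sentence.
dictionary = {"把", "他", "的", "確實", "實行", "行動", "動作", "了", "分析"}
edges = all_dictionary_words("把他的確實行動作了分析", dictionary)
print([word for _, _, word in edges])
```

Each edge records the character positions it spans, so overlapping candidates such as 確實, 實行, 行動 and 動作 all stay in the lattice until a disambiguation step chooses among them.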

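Processing Model (Continued) • disambiguation mechanism • select the best combination • strategies • rule-based • longest word first: 台灣大學是有名的學府; a long word can mask shorter words: 這名記者會說國語。 • remove word segments that break the path • heuristics: prefer three-character words, ... • parser • statistical • Markov models, relaxation, ... • performance: every system claims an accuracy above 95%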
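The longest-word-first rule can be sketched as a greedy left-to-right scan. This is an illustrative version only, assuming toy dictionaries and a fallback to single characters when nothing matches; the name `longest_match_segment` is hypothetical.

```python
def longest_match_segment(sentence, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # Fall back to a single character when no dictionary word starts here.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                i += size
                break
    return words

uni_dict = {"台灣", "大學", "台灣大學", "是", "有名", "的", "學府"}
print(longest_match_segment("台灣大學是有名的學府", uni_dict))
# ['台灣大學', '是', '有名', '的', '學府']

news_dict = {"這", "名", "記者", "記者會", "會", "說", "國語"}
print(longest_match_segment("這名記者會說國語", news_dict))
# ['這', '名', '記者會', '說', '國語']
```

The second call shows the masking problem: the greedy rule picks 記者會 (press conference) and never considers 記者 + 會, which is why the rule is usually combined with other heuristics or statistical models.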

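Processing Model (Continued) • where the problem lies • Does the dictionary contain all possible words? • e.g., new or colloquial words such as A-錢, 凍蒜 • strategies • morphological rules • (semi-)automatic construction of new dictionaries • unknown-word handling model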

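Morphological Rules • formation of numeral-classifier compounds • 一個個, 一條條 • dates and times • 八十五年十月四日 • prefixes or suffixes of nouns or verbs • 學生們 • special verb forms • 丟丟看, 吃吃看, 寫寫看 • 高高興興, 歡歡喜喜, 漂漂亮亮, 迷迷糊糊 • 打打球, 跑跑步, 寫寫字 • ...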

Context Queries • definition • Search for words in a given context, e.g., near other words • types • phrase • a sequence of single-word queries • e.g., enhance retrieval • proximity • a sequence of single words or phrases, together with a maximum allowed distance between them • e.g., within distance (enhance, retrieval, 4) will match ‘… enhance the power of retrieval …’
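Both context query types can be checked directly on a tokenized text. The sketch below assumes that distance is measured as the difference in word positions and that the second word must follow the first; both are assumptions, since systems differ on these points, and the function names are hypothetical.

```python
def phrase_match(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def within_distance(tokens, first, second, max_dist):
    """True if `second` occurs at most `max_dist` word positions after `first`."""
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= max_dist for i in firsts for j in seconds)

tokens = "we enhance the power of retrieval".split()
print(within_distance(tokens, "enhance", "retrieval", 4))  # True
print(within_distance(tokens, "enhance", "retrieval", 3))  # False
print(phrase_match(tokens, ["enhance", "retrieval"]))      # False
```

A production system would answer these queries from positional postings in an inverted index rather than by scanning the text.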

Boolean Queries • definition • A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands • e.g., the query translation AND (syntax OR syntactic) • [Figure: the query and its syntax tree, with AND at the root and the children translation and OR (syntax, syntactic)]

Boolean Queries (Continued) • operators • (e1 OR e2) • Select all documents which satisfy e1 or e2. Duplicates are eliminated. • (e1 AND e2) • Select all documents which satisfy both e1 and e2. • (e1 BUT e2) • Select all documents which satisfy e1 but not e2. • “fuzzy Boolean” • Retrieve documents appearing in some of the operands (the AND may require a document to appear in more operands than the OR)
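The three operators map naturally onto set operations over the document sets retrieved by each atom. The following is a minimal sketch, assuming a toy inverted index with made-up document ids; the helper names `docs`, `q_or`, `q_and` and `q_but` are hypothetical.

```python
# Toy inverted index: term -> set of ids of the documents containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def docs(term):
    """Atom: the set of documents that contain the term."""
    return index.get(term, set())

def q_or(e1, e2):
    return e1 | e2   # set union, so duplicates are eliminated automatically

def q_and(e1, e2):
    return e1 & e2   # documents satisfying both operands

def q_but(e1, e2):
    return e1 - e2   # documents satisfying e1 but not e2

result = q_and(docs("translation"), q_or(docs("syntax"), docs("syntactic")))
print(sorted(result))                                         # [2, 4]
print(sorted(q_but(docs("translation"), docs("syntax"))))     # [1, 4]
```

BUT is used instead of a unary NOT because the complement of a document set would require enumerating the whole collection.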

Natural Language • generalization of “fuzzy Boolean” • A query is an enumeration of words and context queries. • All the documents matching a portion of the user query are retrieved.
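One simple way to realize "documents matching a portion of the query" is to count, for each document, how many query words it contains and to rank by that count. This is only an illustrative sketch with a toy index and made-up document ids; real systems normally use weighted ranking models, and the name `rank_by_matches` is hypothetical.

```python
from collections import Counter

# Toy inverted index: term -> set of ids of the documents containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def rank_by_matches(query_terms, index):
    """Return (doc_id, matched_terms) pairs, best match first."""
    counts = Counter()
    for term in set(query_terms):
        for doc in index.get(term, ()):
            counts[doc] += 1
    # Sort by number of matched terms (descending), then by document id.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = rank_by_matches(["translation", "syntax", "syntactic"], index)
print(ranking)  # [(2, 2), (4, 2), (1, 1), (3, 1), (5, 1)]
```

Requiring a minimum count before a document is returned gives the fuzzy Boolean behaviour, with a higher threshold playing the role of AND and a threshold of one playing the role of OR.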

Pattern Matching • A pattern is a set of syntactic features that must occur in a text segment • types • words • prefixes, e.g., ‘comput’ → ‘computer’, ‘computation’, ‘computing’, etc. • suffixes, e.g., ‘ters’ → ‘computers’, ‘testers’, ‘painters’, etc. • substrings, e.g., ‘tal’ → ‘coastal’, ‘talk’, ‘metallic’, etc. • ranges (lexicographic order), e.g., between ‘held’ and ‘hold’ → ‘hoax’ and ‘hissing’
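The four simple pattern types can be tried on a small vocabulary with plain string operations. The sketch below assumes a toy word list built from the examples on the slide; a real system would answer such patterns from an index over the vocabulary rather than by a linear scan.

```python
vocabulary = ["coastal", "computation", "computer", "computers", "computing",
              "hissing", "hoax", "metallic", "painters", "talk", "testers"]

# Prefix pattern: words that start with the given string.
prefix_hits = [w for w in vocabulary if w.startswith("comput")]
# Suffix pattern: words that end with the given string.
suffix_hits = [w for w in vocabulary if w.endswith("ters")]
# Substring pattern: the string may occur anywhere inside the word.
substring_hits = [w for w in vocabulary if "tal" in w]
# Range pattern: words lying between two strings in lexicographic order.
range_hits = [w for w in vocabulary if "held" <= w <= "hold"]

print(prefix_hits)     # ['computation', 'computer', 'computers', 'computing']
print(suffix_hits)     # ['computers', 'painters', 'testers']
print(substring_hits)  # ['coastal', 'metallic', 'talk']
print(range_hits)      # ['hissing', 'hoax']
```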

Pattern Matching (Continued) • allowing errors • Retrieve all text words which are ‘similar’ to the given word • edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’ are at edit distance 1 • maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
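Edit distance as defined above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the helper `matches_with_errors` is a hypothetical name for the "maximum allowed edit distance" test.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                   # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or keep it
        prev = curr
    return prev[-1]

def matches_with_errors(word, pattern, max_errors):
    """True if `word` is within `max_errors` edits of `pattern`."""
    return edit_distance(word, pattern) <= max_errors

print(edit_distance("flower", "flo wer"))         # 1
print(matches_with_errors("flo wer", "flower", 1))  # True
```

Only two rows of the table are kept at a time, so memory grows with the length of one string while time grows with the product of both lengths.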

Pattern Matching (Continued) • regular expressions • union: if e1 and e2 are regular expressions, then (e1 | e2) matches what e1 or e2 matches • concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2 • repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e • e.g., ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ → ‘problem2’ and ‘proteins’, where ε is the empty string
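The example pattern can be tried with Python's `re` module, which supports union, concatenation and repetition directly. In this sketch the spaces of the slide's notation are dropped and the empty alternative is written as an empty branch; the test words other than ‘problem2’ and ‘proteins’ are made up for illustration.

```python
import re

# pro (blem | tein) (s | empty) (0 | 1 | 2)*
pattern = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

def matches(word):
    """True if the whole word is generated by the regular expression."""
    return pattern.fullmatch(word) is not None

for word in ["problem2", "proteins", "problems012", "protons", "problem3"]:
    print(word, matches(word))
```

`fullmatch` is used so that the pattern must cover the entire word; `search` would also accept words that merely contain a matching substring.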

Pattern Matching (Continued) • extended patterns • subsets of the regular expressions expressed with a simpler syntax • classes of characters • conditional expressions • wild characters which match any sequence in the text • combinations

Structural Queries • mixing contents and structure in queries • contents: words, phrases, or patterns • structural constraints: containment, proximity, or other restrictions on structural elements • issues • what structure a text may have • what queries can be made on which structures • three main structures • form-like fixed structure • hypertext structure • hierarchical structure

Form-like fixed structure • Document: a fixed set of fields • For example, a mail has a sender, a receiver, a date, a subject and a body field. • e.g., search for the mails sent to a given person with “football” in the Subject field • [Figure: a document as a set of fields, each holding text]
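The mail example can be sketched as a filter over records with a fixed set of fields. The records, names and dates below are invented for illustration, and `search_mails` is a hypothetical helper, not part of any mail system's API.

```python
# Each mail is a record with the same fixed set of fields.
mails = [
    {"sender": "alice", "receiver": "bob", "date": "2024-01-05",
     "subject": "Football on Sunday", "body": "Are you coming?"},
    {"sender": "carol", "receiver": "bob", "date": "2024-01-06",
     "subject": "Budget", "body": "No football this week."},
    {"sender": "alice", "receiver": "dave", "date": "2024-01-07",
     "subject": "football tickets", "body": "I have two."},
]

def search_mails(mails, receiver, subject_word):
    """Mails sent to `receiver` whose subject field contains `subject_word`."""
    word = subject_word.lower()
    return [m for m in mails
            if m["receiver"] == receiver
            and word in m["subject"].lower().split()]

hits = search_mails(mails, "bob", "football")
print([m["subject"] for m in hits])  # ['Football on Sunday']
```

The second mail is not returned even though its body mentions football, because the content condition is restricted to the subject field.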

Hypertext structure • A hypertext is a directed graph where the nodes hold some text (text contents) and the links represent connections between nodes or between positions inside nodes (structural connectivity) • WebGlimpse: combines browsing and searching on the Web

WebGlimpse (http://tucson.com/webglimpse/) • WebGlimpse is a fast, flexible search engine for finding information in a related web of pages. • The ability to index pages on remote sites provides a level of power one step above most search engine tools. • You can define your own sub-area of the web simply by making a page of links to all relevant sites. • WebGlimpse will search by following your links, to whatever 'depth' you specify.

Hierarchical Structure Recursive decomposition of the text

[Figure: a sample text (Chapter 4, with 4.1 Introduction, "We cover in this chapter the different kinds of …", and 4.4 Structural Queries), its hierarchical structure (chapter, sections, titles, figure), and a sample query: figure in section with title containing “structural”]

Issues • static or dynamic structure • static: there are one or more explicit hierarchies • dynamic: the required elements are built on the fly using text markup • restrictions on the structure • The text or the answers may have restrictions about nesting and/or overlapping

Issues (Continued) • integration with text • integration of queries on text content with queries on text structure • query language • features • selection of areas that contain (or not) other areas • selection of areas that are contained (or not) in other areas • selection of areas that follow (or are followed by) other areas • selection of areas that are close to other areas • set manipulation • standardization, expressiveness taxonomy or formal categorization
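The area-selection features listed above can be illustrated by modelling each structural element as a (start, end) pair of text positions. This is a simplified sketch, not one of the hierarchical models named in this chapter; the offsets and function names are hypothetical.

```python
def contains(outer, inner):
    """True if area `inner` lies entirely inside area `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def areas_containing(areas, others):
    """Areas that contain at least one area from `others`."""
    return [a for a in areas if any(contains(a, o) for o in others)]

def areas_contained_in(areas, others):
    """Areas that are contained in at least one area from `others`."""
    return [a for a in areas if any(contains(o, a) for o in others)]

def areas_followed_by(areas, others):
    """Areas that end before some area from `others` starts."""
    return [a for a in areas if any(a[1] < o[0] for o in others)]

# Hypothetical character offsets for two sections and one figure.
sections = [(0, 99), (100, 299)]
figures = [(150, 180)]

print(areas_containing(sections, figures))    # [(100, 299)]
print(areas_contained_in(figures, sections))  # [(150, 180)]
print(areas_followed_by(sections, figures))   # [(0, 99)]
```

Because every selection returns a list of areas, the results can be fed into further selections or combined with set operations, which is the set manipulation feature mentioned above.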

A Sample of Hierarchical Models • PAT Expressions • Overlapped Lists • Proximal Nodes • Tree Matching
