Finding Associations in Collections of Text

Finding Associations in Collections of Text 99419-511 김유환

Introduction • There is a need for tools that help users access and understand large quantities of multimodal information • Knowledge discovery: the nontrivial extraction of implicit, previously unknown, and potentially useful information from data • KDT (Knowledge Discovery from Text)

The FACT System Architecture • Three sources of information • Knowledge Sources • Background Knowledge • Unary and binary predicates over the keywords labeling the documents • Synonym dictionary • GUI • Text Collections • Must either already be labeled with a set of keywords • Or must be fed through a text categorization system that augments the documents with such keywords

Associations • FACT focuses on the task of finding associations in collections of text. • r = {t1, …, tn} : collection of documents • R = {I1, …, Im} : set of keywords • t(A) = 1 : A is one of the keywords labeling t • (X) : the set of all documents ti that are labeled (at least) with all the keywords in X • X is called a σ-covering if |(X)| >= σ, where σ is the support threshold • W=>B : association over r • Of all documents that are labeled with the keywords in W, at least a proportion γ of them are also labeled with the keywords in B, where γ is the confidence threshold
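The definitions on this slide can be made concrete with a small sketch. The toy collection and the names `covering`, `is_covering`, `confidence`, and `min_support` below are illustrative assumptions, not part of FACT; the code only mirrors the set-based definitions of a covering keyword set and of the proportion of documents that satisfy an association.

```python
# Hypothetical toy collection: document id -> set of keyword labels.
docs = {
    "t1": {"usa", "iran", "crude"},
    "t2": {"usa", "iran", "crude"},
    "t3": {"usa", "iran"},
    "t4": {"usa", "wheat"},
}

def covering(docs, keywords):
    """Ids of all documents labeled with (at least) every keyword in `keywords`."""
    wanted = set(keywords)
    return {doc_id for doc_id, labels in docs.items() if wanted <= labels}

def is_covering(docs, keywords, min_support):
    """True if at least `min_support` documents carry all the keywords."""
    return len(covering(docs, keywords)) >= min_support

def confidence(docs, lhs, rhs):
    """Proportion of documents labeled with `lhs` that are also labeled with `rhs`."""
    lhs_docs = covering(docs, lhs)
    if not lhs_docs:
        return 0.0
    both = covering(docs, set(lhs) | set(rhs))
    return len(both) / len(lhs_docs)
```

In this toy data, three documents carry both "usa" and "iran", and two of those three also carry "crude", so the association {usa, iran} => {crude} holds for two thirds of the matching documents.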

The Query Language • Association-discovery query • Which types of keywords are desired on the left-hand and right-hand sides of any found association • Constraints that any found association must satisfy • Unary predicates • Binary predicates: define relationships between keywords • Constraints on the size of the various components of the association • Specified by a BNF grammar

The Query Language (2) • Find : (5/0.5) c1:country, c2:country => t:topic • Where : c1 ∈ G7, c2 ∈ {Arab League}, t ∉ ExportCommodities(c1) • At least half of the time, whenever a G7 country and an Arab League country label a document, the document is also labeled by some topic that is not an export commodity of the G7 country, and this occurs at least 5 times in the collection
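To illustrate what the example query asks for, here is a brute-force sketch that checks every (G7 country, Arab League country, topic) triple against the support and confidence thresholds and the export-commodity constraint. It is not FACT's query engine: the function names, the tiny membership subsets, the export-commodity table, and the documents are all made-up toy values chosen only to make the example run.

```python
from itertools import product

# Toy background knowledge (assumptions for illustration only).
G7 = {"usa", "japan"}                     # small subset of G7 members
ARAB_LEAGUE = {"iraq", "egypt"}           # small subset of Arab League members
TOPICS = {"wheat", "crude", "cars"}
EXPORT_COMMODITIES = {"usa": {"wheat"}, "japan": {"cars"}}  # invented values

# Toy collection: each document is the set of keywords labeling it.
docs = (
    [{"usa", "iraq", "crude"}] * 5
    + [{"usa", "iraq", "wheat"}] * 3
    + [{"usa", "iraq"}] * 2
    + [{"japan", "egypt", "crude"}] * 4
)

def count(docs, keywords):
    """Number of documents labeled with every keyword in `keywords`."""
    return sum(1 for labels in docs if keywords <= labels)

def run_query(docs, min_support, min_confidence):
    """Find c1, c2 => t with c1 in G7, c2 in Arab League, t not exported by c1."""
    results = []
    for c1, c2, t in product(sorted(G7), sorted(ARAB_LEAGUE), sorted(TOPICS)):
        if t in EXPORT_COMMODITIES.get(c1, set()):
            continue  # background-knowledge constraint rules this topic out
        lhs_count = count(docs, {c1, c2})
        if lhs_count == 0:
            continue
        both_count = count(docs, {c1, c2, t})
        conf = both_count / lhs_count
        if both_count >= min_support and conf >= min_confidence:
            results.append((c1, c2, t, both_count, conf))
    return results

# Corresponds to "Find : (5/0.5)" in the example query.
associations = run_query(docs, min_support=5, min_confidence=0.5)
```

With this toy data only the triple (usa, iraq, crude) qualifies: it appears in 5 of the 10 documents labeled with both countries, while (japan, egypt, crude) falls short of the support threshold and wheat is excluded as an export commodity of the G7 country.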

Query Execution • Preliminary fact • Every subset of a σ-cover is itself a σ-cover. • The set of candidate σ-covers is built incrementally, starting from singleton σ-covers and adding elements to a set as long as the set stays a σ-cover • Finding associations in the presence of constraints
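The incremental construction described on this slide can be sketched as a level-wise search: start from the single keywords that meet the support threshold, then grow each surviving set by one keyword at a time and keep only the sets that still meet the threshold. This is a simplified sketch under assumed names (`frequent_keyword_sets`, `min_support`) and toy data; it ignores the query constraints and is not FACT's actual implementation.

```python
def frequent_keyword_sets(docs, min_support):
    """Return every keyword set carried by at least `min_support` documents."""
    def support(keywords):
        return sum(1 for labels in docs if keywords <= labels)

    all_keywords = set().union(*docs)
    # Level 1: singleton sets that meet the threshold.
    frequent_singles = {k for k in all_keywords
                        if support(frozenset([k])) >= min_support}
    current = {frozenset([k]) for k in frequent_singles}
    covers = set(current)

    # Grow sets one keyword at a time. A set can only qualify if all of its
    # subsets qualify, so extending only the surviving sets loses nothing.
    while current:
        candidates = {s | {k} for s in current
                      for k in frequent_singles if k not in s}
        current = {c for c in candidates if support(c) >= min_support}
        covers |= current
    return covers

# Hypothetical toy collection: each document is its set of keyword labels.
sample_docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
sample_covers = frequent_keyword_sets(sample_docs, min_support=2)
```

The search stops as soon as a level produces no surviving set, which is what keeps the candidate space small compared with enumerating every subset of the keyword vocabulary.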

Presentation of Associations • Provide a browsing tool that helps the user easily focus on the subset of results that are potentially relevant

Applying FACT to Newswire Data • Reuters data • Background knowledge: CIA World Factbook • Ran a series of queries using FACT and compared the CPU time and the number of associations found for each query • Results • Specifying background-knowledge constraints provides information that the discovery algorithm exploits, speeding up the association-discovery process

Final Remarks • Better than a database query • Presents the user with an easy-to-use graphical interface in which discovery tasks can be specified
