Data Mining on Symbolic Knowledge Extracted from the Web

Changho Choi
Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html
Carnegie Mellon University, J. Stefan Institute

Abstract
• This paper gives a case study of combining information
  • Unstructured information
    • an error-prone source of large amounts of potentially useful information
  • Structured information
    • less up-to-date, but reliable as facts
• Using information from two kinds of sources
  • improves the reliability of data-mined rules
Changho Choi, University at Buffalo

Introduction (#1/2)
• Challenge
  • not only gather and represent knowledge existing on the Web,
  • but also use that knowledge for planning, acting, and creating new knowledge

Introduction (#2/2)
• First stage
  • integrating three types of information gathering
    • Extracting propositional knowledge from highly structured, automatically generated web pages
    • Extracting propositional knowledge from free-form, unstructured data sources
    • Extracting relational knowledge existing on the Web through a combination of web pages and their hyperlink structure
• Aim
  • identify patterns of knowledge that were not explicitly represented as facts on the Web

Data sources and features
• Extracted features
  • come directly from crawling the company Web sites
  • e.g. performs-activity, links-to, officers, sector, location, ...
• Wrapper features from secondary sources
  • rely on a mostly regular format
  • e.g. hoovers-sector, hoovers-industry, hoovers-type, address, ...
• Abstracted features
  • describe relationships between companies
  • discretize our continuous features
  • e.g. same-state, same-city, share-officers, mentions-same, ...
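The abstracted features listed above can be illustrated with a minimal sketch. The `Company` record, its field names, and the bin boundaries are hypothetical and are not taken from the paper; the sketch only shows the general idea of deriving pairwise relations and discretizing a continuous value.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    # Hypothetical record; the field names are illustrative, not the paper's schema.
    name: str
    city: str
    state: str
    officers: frozenset = frozenset()


def same_state(a: Company, b: Company) -> bool:
    """Relational feature: both companies are located in the same state."""
    return a.state == b.state


def same_city(a: Company, b: Company) -> bool:
    """Relational feature: same city (the state is compared too, so that
    equally named cities in different states do not match)."""
    return a.city == b.city and a.state == b.state


def share_officers(a: Company, b: Company) -> bool:
    """Relational feature: at least one officer name appears for both companies."""
    return bool(a.officers & b.officers)


def discretize(value: float, boundaries: list) -> int:
    """Map a continuous value to a bin index, given sorted bin boundaries.
    A value equal to a boundary falls into the upper bin."""
    return bisect_right(boundaries, value)


# Made-up example data, for illustration only.
first = Company("first-co", "houston", "tx", frozenset({"j-smith", "a-lee"}))
second = Company("second-co", "dallas", "tx", frozenset({"a-lee"}))
print(same_state(first, second), same_city(first, second), share_officers(first, second))
print(discretize(42.0, [10, 100, 1000]))
```

Binning with fixed boundaries is only one possible choice; the slides do not say how the continuous features were discretized.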

Process of acquiring potentially interesting information about companies from the Web
[Figure: pipeline diagram. Labels: The Web (4312 web sites, 50 pages on each site, www.3com.com); Extracting from corp. Web sites; Company information from www.hoovers.com; Wrapping from corp. info.; KB; Abstracting features; Data Mining; New knowledge]

Extracted Features

Wrapped Features

Feature           Values  Description
hoovers-sector    28      Sector listed on the company’s Hoovers page.
hoovers-industry  298     Industry listed on the company’s Hoovers page.
hoovers-type      18      Public, private, school, etc.
address           -       Address as listed on Hoovers.
City, state       -       Extracted from address.
competitor        -       Companies that compete with this company.
subsidiary        -       Companies listed as subsidiaries of this company.
products          4648    Product categories extracted from the products page.
officers          -       Officers listed on the Hoovers page.
auditors          266     Company auditors.
revenue           -       Revenue data for up to the last 10 years.
Net-income        -       Net income data for up to the last 10 years.
Net-profit        -       Net profit data for up to the last 10 years.
employees         -       Number of employees each year for up to the last 10 years.

Abstracted Features

Data mining algorithms
• Discovering associations
  • by applying the Apriori algorithm
• Learning propositional rules
  • by using the C5.0 algorithm, which generates a decision tree for the given dataset
• Learning relational rules
  • by using Quinlan’s FOIL system, which can use patterns in the relationships between companies
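As a rough illustration of how a decision-tree learner picks the feature to split on, the sketch below computes plain information gain over a toy dataset. This is an assumption for illustration only: the slides do not describe C5.0's internal splitting criterion, and the rows, feature names, and function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target):
    """Reduction in entropy of `target` after partitioning `rows` by `feature`."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder


# Made-up rows: here "city" separates the sectors perfectly, "type" not at all.
rows = [
    {"city": "houston", "type": "public", "sector": "energy"},
    {"city": "houston", "type": "private", "sector": "energy"},
    {"city": "atlanta", "type": "public", "sector": "software"},
    {"city": "atlanta", "type": "private", "sector": "software"},
]
best = max(["city", "type"], key=lambda f: information_gain(rows, f, "sector"))
print(best)
```

A tree learner would apply this kind of selection recursively to each partition until the leaves are sufficiently pure.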

Experimental results
• Apriori experiments
  • discover associations in the data using association rules
• Decision trees
  • generate propositional rules using decision trees
• FOIL experiments
  • generate first-order rules using the first-order rule learning system

Result: Apriori Experiments (#1/2)
• Thresholds
  • minimal support: 10%, minimal confidence: 80%
• Some examples
  • Highest-confidence rules, which can be understood intuitively
    • performs-activity = sell :- locations = united-states, links-to = adobe-systems-incorporated (10.8%, 93.0%)
    • performs-activity = sell :- performs-activity = technical-assistance, links-to = adobe-systems-incorporated (11.8%, 91.1%)
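The two percentages after each rule are its support and confidence. The sketch below shows how such figures can be computed for a rule of the form `head :- body`, using the textbook definitions (support of body and head together; confidence as that support divided by the support of the body). The slides do not state which support convention the experiments used, so this is an assumption, and the transactions are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, body, head):
    """Among transactions matching `body`, the fraction that also match `head`."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0
    return support(transactions, set(body) | set(head)) / body_support


def passes_thresholds(transactions, body, head,
                      min_support=0.10, min_confidence=0.80):
    """Apply the thresholds from the slide (10% support, 80% confidence)."""
    both = set(body) | set(head)
    return (support(transactions, both) >= min_support
            and confidence(transactions, body, head) >= min_confidence)


# Made-up transactions; each is the set of "feature=value" facts for one company.
transactions = [
    frozenset({"locations=japan", "performs-activity=sell"}),
    frozenset({"locations=japan", "performs-activity=sell", "performs-activity=research"}),
    frozenset({"locations=united-states", "performs-activity=sell"}),
    frozenset({"locations=united-states"}),
    frozenset({"locations=japan"}),
]
body, head = {"locations=japan"}, {"performs-activity=sell"}
print(support(transactions, body | head), confidence(transactions, body, head))
print(passes_thresholds(transactions, body, head))
```

Apriori itself avoids scoring every candidate this way by only extending itemsets that are already frequent; the sketch covers just the scoring step.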

Result: Apriori Experiments (#2/2)
• Some examples
  • Typical rules
    • performs-activity = sell :- locations = japan (14.5%, 90.8%)
    • performs-activity = research :- locations = japan (14.5%, 90.8%)
  • Rules with lower support or confidence
    • performs-activity = research :- locations = united-states (26.9%, 72.5%)
    • hoovers-sector = food-beverage-&-tobacco :- competitor = conagra-inc (1.0%, 89.8%)
    • hoovers-sector = retail :- competitor = kmart-corporation (1.0%, 75.0%)
    • hoovers-sector = energy :- competitor = bp-amoco-p.l.c. (1.1%, 73.0%)
  • Are these rules meaningful?

Result: Decision Trees
• Example: predict the economic sector
  • city atlanta
    • revenue1996 <= 0.1 => Diversified Services (28, 0.179)
    • revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
  • city Houston
    • coarse-sector [basic-materials, capital-goods, transportation] => Manufacturing (10, 0.3)
    • coarse-sector [financial, healthcare, technology] => Computer Software & Services (21, 0.238)
    • coarse-sector [conglomerates, consumer-cyclical, consumer-non-cyclical, energy, services, utilities] => Energy (49, 0.49)
  • city Dallas
    • net_income1999 <= 19 => Health Products & Services (25, 0.2)
    • net_income1999 > 19 => Leisure (25, 0.2)
  • ...
• Notes
  • Based on naïve Bayes classification
  • For different cities, different features are used

Result: FOIL Experiments (First Order Inductive Learner)
• Example
  • computer-software-&-services(A) :- hq-city(A,B), B<>fremont, competitor(A,C), hq-city(C, Islandia), not(employees_binned(A,?,?)).
• Meaning
  • Companies headquartered somewhere other than Fremont that compete with “Computer Associates International” are in the computer software & services sector.
  • (“Computer Associates International” is the only company in our knowledge base headquartered in Islandia.)
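To make the rule above concrete, the sketch below checks its body against a toy knowledge base. All facts, company names other than Computer Associates International, and the shape of the `employees_binned` tuples (company, year, bin) are hypothetical; the slides do not define them. It only evaluates a learned rule and does not implement FOIL's rule search.

```python
# Toy knowledge base (made-up facts, for illustration only).
hq_city = {
    "company-a": "san-jose",
    "company-b": "fremont",
    "company-c": "austin",
    "company-d": "austin",
    "computer-associates-international": "islandia",
}
competitor = {
    ("company-a", "computer-associates-international"),
    ("company-b", "computer-associates-international"),
    ("company-d", "computer-associates-international"),
}
# Assumed tuple shape: (company, year, bin).
employees_binned = {("company-d", 1999, "high")}


def in_software_sector(a):
    """True if the body of the learned rule holds for company `a`."""
    b = hq_city.get(a)                      # hq-city(A, B)
    if b is None or b == "fremont":         # B <> fremont
        return False
    if any(fact[0] == a for fact in employees_binned):
        return False                        # not(employees_binned(A, ?, ?))
    # competitor(A, C), hq-city(C, islandia)
    return any(x == a and hq_city.get(c) == "islandia" for x, c in competitor)


for name in ("company-a", "company-b", "company-c", "company-d"):
    print(name, in_software_sector(name))
```

Each literal in the rule body becomes one test; variables shared between literals (here `C`) are handled by searching the facts for a binding that satisfies all of them.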

Discussion
• Difficulties
  • data cleaning
  • error-prone nature of our facts
  • feature selection
• Encouraging result
  • the interaction between the symbolic features and the statistically derived (naïve Bayes) features

Further Work
• This paper suggests
  • a number of research directions, impacting each of information extraction, machine learning, and data mining from text
• Further work
  • Extracting information from wrapped web sites as a source of training data
  • Automatic data cleaning of extracted features
  • Extending the information extraction

Reference (#1/2)
• FOIL
  • Three companions for first order data mining
  • http://www.cs.kuleuven.ac.be/~ml/Doc/Tutorial_Summer/tutorial_summer.html

Reference (#2/2)

Feature           Sample URL
hoovers-sector    http://www.hoovers.com/sector/
hoovers-industry  http://www.hoovers.com/industry/list/
hoovers-type      http://www.hoovers.com/company/dir/0,2116,15694,00.html
address           http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
City, state       same as above
competitor        same as above
subsidiary        http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
products          same as above
officers          same as above
auditors          same as above
revenue           http://www.hoovers.com/hoov/join/sample_historical.html
Net-income        same as above
Net-profit        same as above
employees         same as above
