Data Mining on Symbolic Knowledge Extracted from the Web
Changho Choi
Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html
Carnegie Mellon University, J. Stefan Institute
Abstract
• This paper gives a case study of combining information
  • Unstructured information
    • an errorful source of large amounts of potentially useful information
  • Structured information
    • less up-to-date, but reliable as facts
• Using information from two kinds of sources
  • improves the reliability of data-mined rules
Changho Choi, University at Buffalo
Introduction (#1/2)
• Challenge
  • not only gather and represent knowledge existing on the Web,
  • but also use that knowledge for planning, acting, and creating new knowledge
Introduction (#2/2)
• First stage
  • integrating three types of information gathering
    • Extracting propositional knowledge from highly structured, automatically generated web pages
    • Extracting propositional knowledge from free-form, unstructured data sources
    • Extracting relational knowledge existing on the Web through a combination of web pages and their hyperlink structure
• Aim
  • identify patterns of knowledge that were not explicitly represented as facts on the Web
Data sources and features
• Extracted features
  • come directly from crawling the company Web sites
  • e.g. performs-activity, links-to, officers, sector, location, ...
• Wrapper features from secondary sources
  • rely on a mostly regular format
  • e.g. hoovers-sector, hoovers-industry, hoovers-type, address, ...
• Abstracted features
  • describe relationships between companies
  • discretize our continuous features
  • e.g. same-state, same-city, share-officers, mentions-same, ...
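The abstracted features above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Company` record, the helper names `abstract_pair` and `bin_value`, the sample companies, and the bin edges are all hypothetical, and the reading of mentions-same as "both sites mention a common term" is an assumption.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Company:
    """Hypothetical record holding a few extracted and wrapped features."""
    name: str
    city: str
    state: str
    officers: set = field(default_factory=set)
    mentions: set = field(default_factory=set)


def abstract_pair(a, b):
    """Derive relational (pairwise) features from two company records."""
    return {
        "same-state": a.state == b.state,
        "same-city": a.state == b.state and a.city == b.city,
        "share-officers": bool(a.officers & b.officers),
        # Assumption: two companies "mention the same" if their sites
        # mention at least one common term.
        "mentions-same": bool(a.mentions & b.mentions),
    }


def bin_value(value, edges):
    """Discretize a continuous value: bin i holds values in (edges[i-1], edges[i]]."""
    return bisect_left(edges, value)


# Hypothetical sample data.
acme = Company("acme", "Austin", "TX", {"j. doe"}, {"linux"})
globex = Company("globex", "Dallas", "TX", {"j. doe", "a. roe"}, {"java"})
pair = abstract_pair(acme, globex)
employee_bin = bin_value(500, [100, 1000, 10000])
```

Pairwise features like these are what let a relational learner relate one company to another, while binning turns numeric fields such as revenue or employee counts into symbolic values that rule learners can test directly.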
Process of acquiring potentially interesting information about companies from the Web
[Figure labels: The Web; 4312 web sites, 50 pages on each site; www.3com.com; extracting from corporate Web sites; company information from www.hoovers.com; wrapping from corporate information; KB; abstracting features; data mining; new knowledge]
Extracted Features
Wrapped Features
Feature           Values  Description
hoovers-sector    28      Sector listed on the company's Hoovers page.
hoovers-industry  298     Industry listed on the company's Hoovers page.
hoovers-type      18      Public, private, school etc.
address                   Address as listed on Hoovers.
City, state               Extracted from address.
competitor                Companies that compete with this company.
subsidiary                Companies listed as subsidiaries of this company.
products          4648    Product categories extracted from the products page.
officers                  Officers listed on the Hoovers page.
auditors          266     Company auditors.
revenue                   Revenue data for up to the last 10 years.
Net-income                Net income data for up to the last 10 years.
Net-profit                Net profit data for up to the last 10 years.
employees                 Number of employees each year for up to the last 10 years.
Abstracted Features
Data mining algorithms
• Discovering associations
  • by applying the Apriori algorithm
• Learning propositional rules
  • by using the C5.0 algorithm, which generates a decision tree for the given dataset
• Learning relational rules
  • by using Quinlan's FOIL system, which can use patterns in the relationships between companies
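As a rough illustration of the first bullet, here is a textbook-style sketch of Apriori's level-wise search for frequent itemsets. It is a generic implementation, not the tool used in the paper; the `apriori` function name, the toy transactions, and the 60% threshold are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Keep candidates whose fraction of containing transactions is high enough.
        kept = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                kept[c] = s
        return kept

    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical transactions: each company is a set of "feature=value" items.
toy = [
    {"loc=us", "act=sell", "act=research"},
    {"loc=us", "act=sell"},
    {"loc=us", "act=research"},
    {"act=sell", "act=research"},
    {"loc=us", "act=sell", "act=research"},
]
itemsets = apriori(toy, min_support=0.6)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are, so most candidates are discarded without counting them.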
Experimental results
• Apriori experiments
  • discover associations in the data using association rules
• Decision trees
  • generate propositional rules using decision trees
• FOIL experiments
  • generate first-order rules using the first-order rule learning system
Result: Apriori Experiments (#1/2)
• Thresholds
  • minimal support: 10%, minimal confidence: 80%
• Some examples
  • Highest-confidence rules => can be understood intuitively
    performs-activity = sell :- locations = united-states, links-to = adobe-systems-incorporated (10.8%, 93.0%)
    performs-activity = sell :- performs-activity = technical-assistance, links-to = adobe-systems-incorporated (11.8%, 91.1%)
Result: Apriori Experiments (#2/2)
• Some examples
  • Normal rules
    performs-activity = sell :- locations = japan (14.5%, 90.8%)
    performs-activity = research :- locations = japan (14.5%, 90.8%)
  • Rules with lower support or confidence
    performs-activity = research :- locations = united-states (26.9%, 72.5%)
    hoovers-sector = food-beverage-&-tobacco :- competitor = conagra-inc (1.0%, 89.8%)
    hoovers-sector = retail :- competitor = kmart-corporation (1.0%, 75.0%)
    hoovers-sector = energy :- competitor = bp-amoco-p.l.c. (1.1%, 73.0%)
    Meaningful?
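The pair of numbers after each rule can be read as (support, confidence). The sketch below shows one common way to compute them and to apply the 10% / 80% thresholds from the previous slide. The slides do not state which definition of support they use, so the standard one (fraction of records matching both body and head) is an assumption, as are the function names and the toy records.

```python
def rule_stats(records, head, body):
    """Return (support, confidence) for the rule `head :- body`.

    Assumed definitions: support is the fraction of records matching body
    and head together; confidence is that count divided by the number of
    records matching the body.
    """
    body = set(body)
    n_body = sum(body <= r for r in records)
    n_both = sum(body <= r and head in r for r in records)
    support = n_both / len(records)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


def keep_rule(records, head, body, min_support=0.10, min_confidence=0.80):
    """Apply the minimal support and confidence thresholds to one rule."""
    support, confidence = rule_stats(records, head, body)
    return support >= min_support and confidence >= min_confidence


# Hypothetical knowledge base: each company is a set of (feature, value) pairs.
japan = ("locations", "japan")
us = ("locations", "united-states")
sell = ("performs-activity", "sell")
research = ("performs-activity", "research")
records = [{japan, sell}] * 4 + [{japan, research}] + [{us, sell}] * 5

sell_stats = rule_stats(records, sell, [japan])
sell_kept = keep_rule(records, sell, [japan])
research_kept = keep_rule(records, research, [japan])
```

Under these definitions, a rule such as the `competitor = conagra-inc` example would be rejected by the 10% support threshold despite its high confidence, which is one way to frame the "Meaningful?" question on the slide.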
Result: Decision Trees
• Example: predict the economic sector
  city atlanta
    revenue1996 <= 0.1 => Diversified Services (28, 0.179)
    revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
  city Houston
    coarse-sector [basic-materials, capital-goods, transportation] => Manufacturing (10, 0.3)
    coarse-sector [financial, healthcare, technology] => Computer Software & Services (21, 0.238)
    coarse-sector [conglomerates, consumer-cyclical, consumer-non-cyclical, energy, services, utilities] => Energy (49, 0.49)
  city Dallas
    net_income1999 <= 19 => Health Products & Services (25, 0.2)
    net_income1999 > 19 => Leisure (25, 0.2)
  ...
• Callouts on the slide: "Based on naïve Bayes classification"; "For cities, different features"
Result: FOIL Experiments (First Order Inductive Learner)
• Example
  computer-software-&-services(A) :- hq-city(A,B), B<>fremont, competitor(A,C), hq-city(C, Islandia), not(employees_binned(A,?,?)).
• It means that companies headquartered somewhere other than Fremont and competing with "Computer Associates International" are in the computer software & services sector. ("Computer Associates International" is the only company in our knowledge base headquartered in Islandia.)
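To make the FOIL rule above concrete, the following sketch evaluates its body against a tiny in-memory knowledge base. The company names other than Computer Associates International, the fact tables, and the meaning of the two unnamed `employees_binned` arguments are hypothetical; the slide leaves those arguments as `?`.

```python
# Hypothetical facts, stored as a dict and as sets of tuples.
hq_city = {
    "alpha-soft": "san-jose",
    "beta-soft": "fremont",
    "gamma-soft": "san-jose",
    "computer-associates-international": "islandia",
}
competitor = {
    ("alpha-soft", "computer-associates-international"),
    ("beta-soft", "computer-associates-international"),
    ("gamma-soft", "computer-associates-international"),
}
# Assumption: the two extra arguments are something like (year, bin label).
employees_binned = {("gamma-soft", 1999, "high")}


def rule_covers(a):
    """True if company `a` satisfies every literal in the rule body."""
    b = hq_city.get(a)
    # hq-city(A,B), B<>fremont
    if b is None or b == "fremont":
        return False
    # not(employees_binned(A,?,?)): no employees_binned fact exists for A
    if any(fact[0] == a for fact in employees_binned):
        return False
    # competitor(A,C), hq-city(C, Islandia)
    return any(x == a and hq_city.get(c) == "islandia" for x, c in competitor)


covered = [c for c in sorted(hq_city) if rule_covers(c)]
```

In this toy knowledge base only `alpha-soft` is covered: `beta-soft` is headquartered in Fremont, and `gamma-soft` has an `employees_binned` fact, which the negated literal excludes.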
Discussion
• Difficulties
  • data cleaning
  • errorful nature of our facts
  • feature selection
• Pleasing result
  • the interaction between the symbolic features and the statistically derived (naïve Bayes) features
Further Work
• This paper suggests
  • a number of research directions, impacting each of information extraction, machine learning, and data mining from text
• Further work
  • Extracting information from wrapped web sites as a source of training data
  • Automatic data cleaning of extracted features
  • Extending the information extraction
Reference (#1/2)
• FOIL
  • Three companions for first order data mining
  • http://www.cs.kuleuven.ac.be/~ml/Doc/Tutorial_Summer/tutorial_summer.html
Reference (#2/2)
Feature           Sample URL
hoovers-sector    http://www.hoovers.com/sector/
hoovers-industry  http://www.hoovers.com/industry/list/
hoovers-type      http://www.hoovers.com/company/dir/0,2116,15694,00.html
address           http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
City, state       same
competitor        same
subsidiary        http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
products          same
officers          same
auditors          same
revenue           http://www.hoovers.com/hoov/join/sample_historical.html
Net-income        same
Net-profit        same
employees         same