
Data Mining on Symbolic Knowledge Extracted from the Web



Presentation Transcript


  1. Data Mining on Symbolic Knowledge Extracted from the Web
     Changho Choi
     Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html
     Carnegie Mellon University, J. Stefan Institute

  2. Abstract
     • This paper gives a case study of combining information from two kinds of sources
       • Unstructured information
         • an error-prone source of large amounts of potentially useful information
       • Structured information
         • less up to date, but reliable as facts
     • Using information from both kinds of sources
       • improves the reliability of data-mined rules
     Changho Choi, University at Buffalo

  3. Introduction (#1/2)
     • Challenge
       • not only to gather and represent knowledge existing on the Web,
       • but also to use that knowledge for planning, acting, and creating new knowledge

  4. Introduction (#2/2)
     • First stage
       • integrating three types of information gathering
         • extracting propositional knowledge from highly structured, automatically generated web pages
         • extracting propositional knowledge from free-form, unstructured data sources
         • extracting relational knowledge existing on the Web through a combination of web pages and their hyperlink structure
     • Aim
       • identify patterns of knowledge that are not explicitly represented as facts on the Web

  5. Data sources and features
     • Extracted features
       • come directly from crawling the company Web sites
       • e.g. performs-activity, links-to, officers, sector, location, ...
     • Wrapper features from secondary sources
       • rely on a mostly regular format
       • e.g. hoovers-sector, hoovers-industry, hoovers-type, address, ...
     • Abstracted features
       • describe relationships between companies
       • discretize our continuous features
       • e.g. same-state, same-city, share-officers, mentions-same, ...
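
The abstracted features above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the company records, the officer names, and the revenue cut points in `REVENUE_EDGES` are invented, and the slides do not say how the features were actually computed. Only the feature names (same-state, same-city, share-officers) come from the slide.

```python
from itertools import combinations

# Hypothetical company records; the values are invented for illustration only.
companies = {
    "alpha": {"state": "CA", "city": "fremont", "officers": {"j-doe", "a-lee"}, "revenue": 0.05},
    "beta": {"state": "CA", "city": "san-jose", "officers": {"a-lee"}, "revenue": 12.0},
    "gamma": {"state": "NY", "city": "islandia", "officers": {"m-roe"}, "revenue": 450.0},
}

def pairwise_features(a, b):
    """Relational (abstracted) features for one pair of companies."""
    same_state = a["state"] == b["state"]
    return {
        "same-state": same_state,
        "same-city": same_state and a["city"] == b["city"],
        "share-officers": bool(a["officers"] & b["officers"]),
    }

def bin_value(x, edges):
    """Discretize x: index of the first edge that x does not exceed."""
    for i, edge in enumerate(edges):
        if x <= edge:
            return i
    return len(edges)

REVENUE_EDGES = [0.1, 10.0, 100.0]  # assumed cut points, not from the slides

# One feature dictionary per unordered pair of companies.
relations = {
    (n1, n2): pairwise_features(companies[n1], companies[n2])
    for n1, n2 in combinations(sorted(companies), 2)
}
# One discrete revenue bin per company.
revenue_bins = {name: bin_value(c["revenue"], REVENUE_EDGES) for name, c in companies.items()}
```

Pairwise features like these are what a relational learner can consume, while the binned values let propositional learners treat a continuous quantity as a symbolic one.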

  6. Process of acquiring potentially interesting information about companies from the Web
     [Diagram labels: The Web (4312 web sites, 50 pages on each site, e.g. www.3com.com); company information from www.hoovers.com; extracting from corp. Web sites; wrapping from corp. info.; KB; abstracting features; data mining; new knowledge]

  7. Extracted Features

  8. Wrapped Features
     Feature           Values  Description
     hoovers-sector    28      Sector listed on the company’s Hoovers page.
     hoovers-industry  298     Industry listed on the company’s Hoovers page.
     hoovers-type      18      Public, private, school, etc.
     address                   Address as listed on Hoovers.
     City, state               Extracted from address.
     competitor                Companies that compete with this company.
     subsidiary                Companies listed as subsidiaries of this company.
     products          4648    Product categories extracted from the products page.
     officers                  Officers listed on the Hoovers page.
     auditors          266     Company auditors.
     revenue                   Revenue data for up to the last 10 years.
     Net-income                Net income data for up to the last 10 years.
     Net-profit                Net profit data for up to the last 10 years.
     employees                 Number of employees each year for up to the last 10 years.

  9. Abstracted Features

  10. Data mining algorithms
      • Discovering associations
        • by applying the Apriori algorithm
      • Learning propositional rules
        • by using the C5.0 algorithm, which generates a decision tree for the given dataset
      • Learning relational rules
        • by using Quinlan’s FOIL system, which can use patterns in the relationships between companies

  11. Experimental results
      • Apriori experiments
        • discover associations in the data using association rules
      • Decision trees
        • generate propositional rules using decision trees
      • FOIL experiments
        • generate first-order rules using the first-order rule learning system

  12. Result: Apriori Experiments (#1/2)
      • Thresholds
        • minimal support: 10%, minimal confidence: 80%
      • Some examples, each followed by (support, confidence)
        • Highest-confidence rules => can be understood intuitively
          performs-activity = sell :- locations = united-states, links-to = adobe-systems-incorporated (10.8%, 93.0%)
          performs-activity = sell :- performs-activity = technical-assistance, links-to = adobe-systems-incorporated (11.8%, 91.1%)
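
To make the support and confidence thresholds concrete, here is a minimal sketch of level-wise Apriori over feature=value items. It is not the implementation used in the paper. The toy records are invented, and "support" is taken to be the fraction of records containing every item of the rule, which is one common convention; the slides do not state which definition they use. The thresholds in the example run (40% and 80%) are chosen to suit the tiny dataset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent at this level.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def rules(frequent, min_confidence):
    """List (head, body, support, confidence) for rules with a single-item head."""
    found = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for head in itemset:
            body = itemset - {head}
            conf = supp / frequent[body]  # body is frequent by downward closure
            if conf >= min_confidence:
                found.append((head, body, supp, conf))
    return found

# Invented records: one set of feature=value items per company.
transactions = [
    frozenset({"locations=japan", "performs-activity=sell", "performs-activity=research"}),
    frozenset({"locations=japan", "performs-activity=sell"}),
    frozenset({"locations=united-states", "performs-activity=sell"}),
    frozenset({"locations=united-states", "performs-activity=research"}),
    frozenset({"locations=united-states"}),
]

found = rules(apriori(transactions, min_support=0.4), min_confidence=0.8)
for head, body, supp, conf in found:
    print(f"{head} :- {', '.join(sorted(body))} ({supp:.1%}, {conf:.1%})")
```

The print format mirrors the `head :- body (support, confidence)` notation used on the slides.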

  13. Result: Apriori Experiments (#2/2)
      • Some examples
        • Normal rules
          performs-activity = sell :- locations = japan (14.5%, 90.8%)
          performs-activity = research :- locations = japan (14.5%, 90.8%)
        • Rules with lower support or confidence
          performs-activity = research :- locations = united-states (26.9%, 72.5%)
          hoovers-sector = food-beverage-&-tobacco :- competitor = conagra-inc (1.0%, 89.8%)
          hoovers-sector = retail :- competitor = kmart-corporation (1.0%, 75.0%)
          hoovers-sector = energy :- competitor = bp-amoco-p.l.c. (1.1%, 73.0%)
        • Meaningful?

  14. Result: Decision Trees
      • Example: predict the economic sector
        city atlanta
          revenue1996 <= 0.1 => Diversified Services (28, 0.179)
          revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
        city Houston
          coarse-sector [basic-materials, capital-goods, transportation] => Manufacturing (10, 0.3)
          coarse-sector [financial, healthcare, technology] => Computer Software & Services (21, 0.238)
          coarse-sector [conglomerates, consumer-cyclical, consumer-non-cyclical, energy, services, utilities] => Energy (49, 0.49)
        city Dallas
          net_income1999 <= 19 => Health Products & Services (25, 0.2)
          net_income1999 > 19 => Leisure (25, 0.2)
        ...
      • Based on naïve Bayes classification
      • Different features for different cities
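
The tree fragment above can be read as nested tests, sketched below as a plain function. This is only a transcription of the branches shown on the slide, not C5.0 output handling: the dictionary keys are hypothetical, the units of `revenue1996` and `net_income1999` are not given in the slides, the pair of numbers after each leaf is ignored because the slides do not explain it, and the spelling `technology` is assumed.

```python
# Houston branches: (set of coarse-sector values, predicted sector).
HOUSTON_BRANCHES = [
    ({"basic-materials", "capital-goods", "transportation"}, "Manufacturing"),
    ({"financial", "healthcare", "technology"},  # spelling assumed
     "Computer Software & Services"),
    ({"conglomerates", "consumer-cyclical", "consumer-non-cyclical",
      "energy", "services", "utilities"}, "Energy"),
]

def predict_sector(company):
    """Return the sector for the branches shown on the slide, else None."""
    city = company.get("city", "").lower()
    if city == "atlanta":
        return ("Diversified Services" if company["revenue1996"] <= 0.1
                else "Computer Software & Services")
    if city == "houston":
        coarse = company["coarse-sector"]
        for values, sector in HOUSTON_BRANCHES:
            if coarse in values:
                return sector
        return None
    if city == "dallas":
        return ("Health Products & Services" if company["net_income1999"] <= 19
                else "Leisure")
    return None  # branches elided on the slide ("...")

# Hypothetical inputs for illustration.
print(predict_sector({"city": "Atlanta", "revenue1996": 0.05}))
print(predict_sector({"city": "Houston", "coarse-sector": "energy"}))
```

Written this way, it is easy to see the slide's point that each city subtree splits on a different feature.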

  15. Result: FOIL Experiments (First Order Inductive Learner)
      • Example
        computer-software-&-services(A) :- hq-city(A,B), B<>fremont, competitor(A,C), hq-city(C, Islandia), not(employees_binned(A,?,?)).
      • It means that companies headquartered somewhere other than Fremont and competing with “Computer Associates International” are in the computer software & services sector. (“Computer Associates International” is the only company in our knowledge base headquartered in Islandia.)
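
As a rough illustration of how such a first-order rule is evaluated, the sketch below checks the rule body against a tiny set of ground facts. All facts and company names other than Computer Associates International are invented. The negated literal `not(employees_binned(A,?,?))` is read here as "no binned employee count is recorded for A", which is an assumption about the notation.

```python
# Hypothetical ground facts, stored as sets of tuples.
hq_city = {
    ("computer-associates-international", "islandia"),
    ("acme-soft", "san-jose"),
    ("fremont-soft", "fremont"),
    ("binned-soft", "austin"),
    ("oil-co", "houston"),
}
competitor = {
    ("acme-soft", "computer-associates-international"),
    ("fremont-soft", "computer-associates-international"),
    ("binned-soft", "computer-associates-international"),
}
# (company, year, bin); the year and bin label are invented.
employees_binned = {("binned-soft", 1999, "large")}

def in_software_sector(a):
    """True if the rule body from the slide is satisfied for company a."""
    # hq-city(A,B), B<>fremont
    hq_not_fremont = any(x == a and city != "fremont" for x, city in hq_city)
    # competitor(A,C), hq-city(C, Islandia)
    competes_with_islandia = any(
        x == a and (c, "islandia") in hq_city for x, c in competitor
    )
    # not(employees_binned(A,?,?))
    has_employee_bins = any(x == a for x, _, _ in employees_binned)
    return hq_not_fremont and competes_with_islandia and not has_employee_bins

covered = sorted(name for name, _ in hq_city if in_software_sector(name))
print(covered)
```

Unlike the propositional rules from the decision tree, this rule relates two companies through the `competitor` relation, which is the kind of pattern FOIL is used for here.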

  16. Discussion
      • Difficulties
        • data cleaning
        • error-prone nature of our facts
        • feature selection
      • Pleasing result
        • the interaction between the symbolic features and the statistically derived (naïve Bayes) features

  17. Further Work
      • This paper suggests
        • a number of research directions, impacting each of information extraction, machine learning, and data mining from text
      • Further work
        • extracting information from wrapped web sites as a source of training data
        • automatic data cleaning of extracted features
        • extending the information extraction

  18. Reference (#1/2)
      • FOIL
        • Three companions for first order data mining
        • http://www.cs.kuleuven.ac.be/~ml/Doc/Tutorial_Summer/tutorial_summer.html

  19. Reference (#2/2)
      Feature           Sample URL
      hoovers-sector    http://www.hoovers.com/sector/
      hoovers-industry  http://www.hoovers.com/industry/list/
      hoovers-type      http://www.hoovers.com/company/dir/0,2116,15694,00.html
      address           http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
      City, state       same
      competitor        same
      subsidiary        http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
      products          same
      officers          same
      auditors          same
      revenue           http://www.hoovers.com/hoov/join/sample_historical.html
      Net-income        same
      Net-profit        same
      employees         same
