
Practical Issues for Automated Categorization of Web Sites


Presentation Transcript


  1. Practical Issues for Automated Categorization of Web Sites
  John M. Pierre (jpierre@metacode.com)
  Metacode Technologies, Inc.
  139 Townsend Street, San Francisco, CA 94107
  (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)

  2. Outline
  • Project overview
  • Web content
    • Automated categorization
    • Feature selection
    • Metadata
  • Experimental setup
    • Data
    • Targeted spidering
    • System architecture
  • Results
  • Conclusions

  3. Project Overview
  • Specific:
    • Categorize large numbers of domain names by industry category
    • NAICS classification scheme
    • ~30,000 domain names for testing (.com)
    • Text categorization approach
  • General:
    • Domain-specific classification
    • Metadata
    • Targeted spidering
    • Feature selection
    • Classifier training

  4. Web Content: Automated Categorization
  • Challenges:
    • Vast (over 1 billion pages)
    • Heterogeneous (content, formats, not just HTML)
    • Dynamic (growing, changing)
  • Benefits:
    • Good source of information
    • Accessible!
    • Machine readable (vs. machine understandable)
    • Semi-structured
  • Tools:
    • Classification
    • Automated classification
    • Text categorization / machine learning
    • Intelligent agents
  • Related work:
    • Manual: Yahoo!, Open Directory Project, LookSmart
    • Automatic: Northern Light, Thunderstone/Texis, Inktomi
    • Other: EU Project DESIRE II, Pharos, Attardi, Sebastiani et al., L. Page et al., McCallum et al.

  5. Web Content: Feature Selection
  • Text features should be (D. Lewis):
    • Relatively few in number
    • Moderate in frequency of assignment
    • Low in redundancy
    • Low in noise
    • Related to the semantic scope of the classes to be assigned
    • Relatively unambiguous in meaning
  • Preliminary experiment:
    • 1125 web domains
    • SEC+NAICS training set
  • Lesson: use metadata if possible, use body text as a last resort! (sketched below)
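As a rough illustration of the "metadata first, body text last" heuristic, the sketch below pulls the page title and the meta description/keywords tags and falls back to body text only when no metadata is found. It assumes Python with BeautifulSoup; the helper name extract_features and the exact ordering rule are illustrative assumptions, not the authors' stated algorithm.

```python
# Sketch: prefer metadata over body text when extracting features.
# Assumes BeautifulSoup (pip install beautifulsoup4); the fallback
# order below is an illustration, not the paper's exact rule.
from bs4 import BeautifulSoup

def extract_features(html):
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    # Title and meta description/keywords are few, low-noise features.
    if soup.title and soup.title.string:
        parts.append(soup.title.string)
    for name in ("description", "keywords"):
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            parts.append(tag["content"])
    # Fall back to body text only if no metadata was found.
    if not parts and soup.body:
        parts.append(soup.body.get_text(" ", strip=True))
    return " ".join(parts)
```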

  6. Web Content: Metadata

  7. Experimental Setup: Targeted Spidering
  [Flowchart] Given a domain name, issue an HTTP GET; if the site is not live, retry with a "www." prefix. If the page uses frames, follow them; otherwise use the <body> text. Prefer metatag content when available, and follow <a href> links whose URLs contain prod, service, about, info, press, or news. The collected text is sent as the 'query'. (A code sketch follows.)
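A minimal sketch of the spidering flow above, assuming Python with the requests and BeautifulSoup libraries and reusing the hypothetical extract_features helper from the previous sketch; the keyword filter comes straight from the slide, while the retry and link-following details are simplifications of the flowchart (frame handling and metatag preference are folded into extract_features).

```python
# Sketch of the targeted-spidering flow (simplified).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Link keywords from the flowchart: only follow descriptive pages.
LINK_KEYWORDS = ("prod", "service", "about", "info", "press", "news")

def spider(domain):
    # HTTP GET; if the bare domain is not live, retry with a www. prefix.
    for host in (domain, "www." + domain):
        try:
            resp = requests.get("http://" + host, timeout=10)
        except requests.RequestException:
            continue
        if not resp.ok:
            continue
        text = extract_features(resp.text)
        # Follow links whose URLs suggest product/company information.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            if any(k in a["href"].lower() for k in LINK_KEYWORDS):
                try:
                    sub = requests.get(urljoin(resp.url, a["href"]), timeout=10)
                    if sub.ok:
                        text += " " + extract_features(sub.text)
                except requests.RequestException:
                    pass
        return text  # this text becomes the 'query' sent to the IR engine
    return None
```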

  8. Experimental Setup: Data
  Classification scheme: NAICS
  11 Agriculture, Forestry, Fishing and Hunting
  21 Mining
  23 Construction
  31-33 Manufacturing
  42 Wholesale Trade
  44-45 Retail Trade
  48-49 Transportation and Warehousing
  51 Information
  52 Finance and Insurance
  53 Real Estate and Rental and Leasing
  54 Professional, Scientific and Technical Services
  55 Management of Companies and Enterprises
  56 Administrative Support, Waste Management and Remediation Services
  61 Educational Services
  62 Health Care and Social Assistance
  71 Arts, Entertainment & Recreation
  72 Accommodation and Food Services
  81 Other Services (except 92)
  92 Public Administration
  99 Unclassified Establishments
  • Test data:
    • ~30,000 domain names (SIC)
    • ~13,500 pre-classified with content
  • Training data:
    • "SEC-NAICS": 1504 SEC 10-K filings (SIC); 426 NAICS labels/descriptions
    • "Web pages": 3618 pre-classified domains
  • Crosswalk: SIC <-> NAICS

  9. Experimental Setup: System Architecture
  [Diagram] Domain names drive a spider that fetches web pages from the Web; the extracted text is posed as a query to an IR engine built over the SEC-NAICS training collection; the matching documents yield the classification decision (e.g., Foo.com -> 11, 21, 23). (A sketch of the matching step follows.)
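The slide names an IR engine without specifying its internals, so the sketch below stands in a TF-IDF vector-space match (scikit-learn) for that component; train_texts and train_labels are hypothetical lists holding the SEC-NAICS training documents and their NAICS codes.

```python
# Sketch of the query-against-training-corpus step. TF-IDF with
# cosine similarity is an assumed stand-in for the paper's IR engine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def classify(site_text, train_texts, train_labels, k=3):
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(train_texts)
    query_vec = vectorizer.transform([site_text])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    # Decision: take the NAICS codes of the k best-matching documents.
    top = scores.argsort()[::-1][:k]
    return [train_labels[i] for i in top]
```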

  10. Results
  P = Precision = # correctly assigned / # assigned
  R = Recall = # correctly assigned / # total correct
  F1 = 2PR / (P + R)
  micro-averaged = counts pooled over all categories, then P, R, F1 computed
  macro-averaged = P, R, F1 computed per category, then averaged
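To make the micro- vs. macro-averaging distinction concrete, here is a small worked example in Python; the two categories and their counts are invented for illustration and are not results from the paper.

```python
# Worked illustration of micro- vs. macro-averaged F1
# (counts are invented for the example).
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# per category: (# correctly assigned, # assigned, # total correct)
counts = {"51": (80, 100, 120), "52": (10, 20, 50)}

# Micro: pool the counts over all categories, then compute P, R, F1.
c = sum(x[0] for x in counts.values())
a = sum(x[1] for x in counts.values())
t = sum(x[2] for x in counts.values())
micro_f1 = f1(c / a, c / t)

# Macro: compute F1 per category, then average.
macro_f1 = sum(f1(x[0] / x[1], x[0] / x[2]) for x in counts.values()) / len(counts)

print(f"micro-F1 = {micro_f1:.3f}, macro-F1 = {macro_f1:.3f}")
```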

  11. Conclusions
  • Domain-specific classification
    • Knowledge gathering: use of specialized knowledge
  • Targeted spidering
    • Efficient use of resources
    • Extract key features, metadata
  • Training
    • Prior knowledge
    • Bootstrapping
  • Classification
    • Robust, tolerant of noisy data
  • Benefits of the Semantic Web
    • Better metadata
    • Semantic linking & intelligent spidering
