1 / 21

Classification & Your Intranet: From Chaos to Control

Classification & Your Intranet: From Chaos to Control. Susan Stearns Inmagic, Inc. www.inmagic.com sstearns@inmagic.com E-Libraries E204 May, 2003. Look familiar?. ?#$%&*. How do we manage all the information we want to publish on the Intranet?. Why can’t I find anything

mdeans
Télécharger la présentation

Classification & Your Intranet: From Chaos to Control

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. www.inmagic.com sstearns@inmagic.com E-Libraries E204 May, 2003

  2. Look familiar? ?#$%&* How do we manage all the information we want to publish on the Intranet? Why can’t I find anything on the Intranet? User Content Manager

  3. The issue: • We know that we have documents in-house that contain hugely valuable information • The problem: • How do users find the right information at the right time? • The answer: • Automated spider and meta- classification software that allow an enterprise to automatically build and maintain a completely searchable database of critical content.

  4. Automated Spidering Software • A spider that “crawls” specified in-house servers and Web sites • Extracts content from most popular file types and formats – HTML, Text, MSOffice, PDF – even e-mail • Content can be loaded into a database

  5. Key features to look for in a spider • Document types: Microsoft Office, PDF, other formats (IFilter compatible) • Zip files and email folders can be crawled • Remote administration • Can be scheduled to run multiple times a day • Web crawling can be set up to “n” levels deep • Easy to create an XML transform to your database design • Integrates with automated classification software

  6. How a spider works Web crawl Native document cache The Spider Extracted text cache XML load files File system crawl Database Content Manager can add value (e.g. add additional meta-tags, etc.) Users can search and access “Gathered” content via a Web-browser

  7. We Love Search: We Hate Search • Search is ubiquitous but insufficient • Only one slice into content • Missing relationships across information • Few are skilled at searching

  8. The search engine paradox: “Regardless of the product or a user's ability to use it, effective searches require the user to know the terms they need to use before they type them into the search engine.” The Delphi Group

  9. The Solution: Meta-Classification • Enrich the content with meta-data • Leverage XML and integrate content from multiple sources • Extract other useful concepts • Give users browse-able directories in addition to a search box

  10. What is Meta-Classification? • Automated meta-data extraction • Meta-data includes “subject information” as well as names of people, company names, acronyms, key noun phrases • Auto-classification of documents using a predefined taxonomy • This meta-data can be mapped to a database along with the full-text of the document or a URL link

  11. Why Meta-Classification? • Creates structured information from unstructured data • Allows local terminology to be reflected in searching • Provides a browse-able directory • Greatly enhances search through controlled vocabulary

  12. How does it work? • Spider/crawl the documents to create a corpus • Automated software • Identifies key words and phrases • Maps them to known topics in taxonomy • Scores the topics and derives a central theme • Repeat for the sub-themes

  13. Step 1: Identify words and phrases in the text MicrosoftNASDAQ:MSFT, which won a round in its antitrustfight against thegovernment today,launched its Microsoft.Net initiative that could someday replace computer hardware with software. Via XML(extensiblemarkup language), Microsoft.net will enable use of much larger computers accessible on the Internet for storage of programs, word processing files and other data.

  14. Step 2:Map them to topics within the taxonomy Government: government Computer Science: Internet, XML Hardware: computers Storage: data, storage Application Files: programs, language Word Processing Files: word processing files Software Companies: Microsoft Microsoft: Microsoft.Net

  15. Step 3: Determine themes • End results of classification of this story are: • Central Theme: Microsoft • Sub-theme 1: Word Processing Files • Sub-theme 2: Software Companies • Microsoft is a good match for central theme • Microsoft.Net would have been the best classification • Original taxonomy didn’t know this topic • Will be added to taxonomy

  16. Customize the Taxonomy • 1 million node taxonomy often too large • Develop a custom taxonomy • A subset of the large taxonomy • Selected nodes to match business needs • A set of rules to aggregate from the low level topics in the large taxonomy to the custom taxonomy

  17. The Result • Very large corpus of content can be classified in automated fashion • Meta-data is used to create browse-able directories • Meta-data is used for searching • End user is given “clues” for finding the right information

  18. Other features to consider • Document summaries/abstracts • Including external content • Spidered from Web sites • Integrated from licensed content sources • User submissions • User ratings/reviews

  19. To Control User Interaction Search and Browse Corporate Intranet Meta Data: Captured centrally Context (entity extraction/auto-classification) Document Properties, Classification Web Content Content Collection (spider) Word Doc From Chaos

  20. Thank You. sstearns@inmagic.com

More Related