Integrating Text Into an Enterprise IT Environment February 25, 2003


Presentation Transcript


  1. Integrating Text Into an Enterprise IT Environment February 25, 2003 Curt Monash, Ph.D. President, Monash Information Services curtmonash@monash.com www.monash.com

  2. Agenda for this talk • How text indexing and search work – and what they assume • Fitting text into a traditional IT context • Sorting out your text application needs • Key considerations in text application architecture

  3. There are no miracles or magic bullets • “Search engines” aren’t the answer • “Content management” isn’t the answer • Clustering isn’t the answer • XML isn’t the answer No one technology solves all search problems

  4. Gresham’s Law of Coinage Bad (i.e., debased) coinage drives out good

  5. Monash’s Law of Jargon Bad uses of (recently coined) jargon drive out good ones • Example: “Content management” can mean almost anything

  6. Best practices for text apps are the same as for any other major IT challenge • Understand your application needs • Use safely proven technology where you can • Push the boundaries of technology where you must • Ask your users to make small changes in the way they do their jobs

  7. Key takeaways • The classical “technology stack” is evolving nicely to accommodate text • Standalone search-in-a-box doesn’t solve very many problems • Careful application analysis is crucial • It’s not just data design and workflow • Security needs to be designed in

  8. Part 1 How text indexing and search work – and what they assume

  9. Different application contexts • Different kinds of problems • Different available resources

  10. Recall vs. Precision • Recall = What percentage of the valid hits did you get? • Crucial if you actually need 100% • Precision = What percentage of the (top) hits returned really are valid? • Important for user satisfaction and efficiency • But how is “valid” measured???
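
A minimal Python sketch (illustrative, with made-up document IDs) of how recall and precision are scored against human relevance judgments; the hard part in practice is knowing the full set of "valid" documents at all:

```python
def precision_recall(returned_ids, relevant_ids):
    """Compute precision and recall for one query.

    returned_ids: documents the engine returned (the top hits)
    relevant_ids: documents a human judge marked as valid
    """
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    true_hits = returned & relevant
    precision = len(true_hits) / len(returned) if returned else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned hits are valid, but only 3 of the 10 valid docs were found
print(precision_recall([1, 2, 3, 99], set(range(1, 11))))  # (0.75, 0.3)
```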

  11. Three fundamentally different scenarios • Article search • Web search • OLTP application text search

  12. Article search • Very high recall may be needed • Metadata may be reliable • Document style and structure may be predictable This is the “traditional” information retrieval challenge

  13. Article search has succeeded only in clear-cut research markets • Legal – Lexis • Investments • Simple-minded apps • Stock symbols are the perfect keyword • Intelligence community? • Business “competitive intelligence” • Scientific/medical

  14. The “Daily Me” hasn’t arrived yet • How well does the user understand information retrieval? • Who has time to read anyway? • Failures include Newsedge, Northern Light, et al. • “Personalized” portals are wimpy, and nobody seems to care

  15. Web search • Precision is usually a bigger problem than recall (300,000 hits!) • Metadata is unreliable (no standards, deliberate deception) • Style and structure are enormously varied

  16. Users like Google – but how are they using it? • What they’re finding is good web sites • They still have to navigate to the specific page

  17. OLTP app text search • 100% precision is assumed for the overall app … • … so text search had better not be the only way to find documents • The relational record probably is the metadata • Hot future area • Usage is creeping up • Functionality is still primitive • App dev tools are improving dramatically, albeit from a dismal starting point

  18. Lessons from Amazon.com • Search-based navigation can work • The user needs a clear understanding of what s/he is looking for • If you make an imprecise query, you have to accept an imprecise result set

  19. It all starts with word search • Big, specialized inverted-list index • Huge but sparse • Analogous to bit-map or star schema • Digrams/trigrams/n-grams, offsets, stopwords • Fortunately, integration into RDBMS has been largely solved
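
A minimal sketch of the inverted-list idea, in illustrative Python: each term maps to the documents and word offsets where it occurs, with stopwords dropped. A real engine adds n-gram handling, index compression, and RDBMS integration on top of this:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to"}  # tiny illustrative stopword list

def build_inverted_index(docs):
    """Map each term to the (doc_id, offset) pairs where it occurs.

    docs: {doc_id: text}. Offsets support phrase and proximity search.
    The resulting structure is huge but sparse: most terms appear in few docs.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            if term not in STOPWORDS:
                index[term].append((doc_id, offset))
    return index

docs = {1: "the quick brown fox", 2: "a brown dog and a brown cat"}
idx = build_inverted_index(docs)
print(idx["brown"])   # [(1, 2), (2, 1), (2, 5)]
```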

  20. The ranking problem • What does 75% relevance mean? • How do you combine rankings from different subsystems? • The SAME query against the SAME data can give different results in different search engines • The SAME query against the SAME search engine can give different results if you add irrelevant data
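
One common workaround, sketched in illustrative Python with made-up engine names and scores: min-max normalize each subsystem's scores before combining them, since a raw "75%" from one engine is not comparable to a "75%" from another:

```python
def combine_rankings(scores_by_engine, weights=None):
    """Merge relevance scores from several subsystems into one ranking.

    scores_by_engine: {engine_name: {doc_id: raw_score}}
    Each engine's scores are min-max normalized to [0, 1] before a
    weighted sum, because the raw scales are not comparable.
    """
    weights = weights or {name: 1.0 for name in scores_by_engine}
    combined = {}
    for name, scores in scores_by_engine.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for doc_id, raw in scores.items():
            norm = (raw - lo) / span
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[name] * norm
    return sorted(combined, key=combined.get, reverse=True)

print(combine_rankings({
    "keyword":  {"docA": 12.0, "docB": 3.0},
    "metadata": {"docB": 0.9, "docC": 0.4},
}))
```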

  21. Major issues for (key)word search • Ambiguity • Vagueness • Information overload

  22. Major tools • Traditional linguistic techniques • “Automagic” clustering • Traditional metadata • Socioheuristics

  23. Traditional linguistic techniques • Synonyms and other semantic clues • Topic sentences and other syntactic clues • Standard document structure

  24. Query translation/expansion • Thesaurus • End-user extensible • Spelling correction • Traditional (e.g., drop the vowels) • Modern (e.g., compare to query logs)
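
A toy Python sketch of both techniques, using a hard-coded thesaurus and the standard-library difflib as a stand-in for query-log-based spelling correction:

```python
import difflib

# Illustrative thesaurus; a real one would be end-user extensible
THESAURUS = {"car": ["automobile", "vehicle"], "buy": ["purchase"]}
# Stand-in for frequent terms mined from query logs
QUERY_LOG_TERMS = ["automobile", "integration", "enterprise", "purchase"]

def expand_query(terms):
    """Expand a keyword query with synonyms and spelling corrections."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))          # thesaurus expansion
        # "modern" spelling correction: compare against frequently logged queries
        close = difflib.get_close_matches(term, QUERY_LOG_TERMS, n=1, cutoff=0.8)
        if close and close[0] != term:
            expanded.append(close[0])
    return expanded

print(expand_query(["car", "purchse"]))
# ['car', 'automobile', 'vehicle', 'purchse', 'purchase']
```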

  25. Automagic clustering and information discovery • Nice mathematical buzzwords • Bayesian statistics, etc. • It all boils down to “distance” measured in a very high-dimensional vector space • Nice social science buzzwords too • Semiotics, etc. • Same appeal as neural networks • The computer “discovers” what humans can’t
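
The "distance" idea, sketched in illustrative Python: each document becomes a term-frequency vector in a space with one dimension per distinct term, and cosine similarity measures how close two documents are; clustering algorithms then group documents that land near each other in that space:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Similarity of two documents treated as term-frequency vectors.

    Each document is a point in a very high-dimensional space (one
    dimension per distinct term); similarity is the cosine of the angle
    between the two vectors.
    """
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("text search engine", "enterprise search engine"))  # ~0.67
```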

  26. Clustering technology isn’t sufficiently advanced yet to be “magic” • Same weaknesses as neural networks too • Lack of reliability • Lack of transparency • Lack of predictability! • Legacy of failure • Search engines: Excite, Northern Light • “Employee Internet Management” (i.e., porn/gambling filter) companies

  27. Traditional metadata • Typically supplied by the author/editor, or by a librarian • Keywords, etc. • Who/What/Where/When

  28. Socioheuristics • Measures of page popularity • Guesses at author expertise

  29. Sorting through the metadata Since unaided “search” often works badly, metadata is crucial

  30. So what is metadata in a search context? • Standard definition of “metadata”: Data about data • Actually, relational metadata usually is data about data structures • But in the text world, metadata usually is data about the data itself

  31. Categories of text metadata • Library-like • Extracted from the document • Implicit in the corpus • OLTP-like

  32. Classical document metadata • Comes from the library tradition (i.e., card catalogs) … • … and/or from early online document stores used by librarians • Examples: • Title, author, date, etc. • Hand-selected classification/categorization • Hand-selected keywords • Can be created by author, editor, “librarian”

  33. Extracted metadata • In essence, precomputed text search • Examples: • Key words (or keywords) and concepts • Titles and metatags • Topic sentences, summaries • Author, etc.
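
A toy Python extractor (regex-based and purely illustrative; real products use proper parsers and linguistic analysis) showing the kinds of fields that get precomputed:

```python
import re

def extract_metadata(html):
    """Pull a few 'precomputed search' fields out of an HTML document."""
    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    keywords = re.search(r'<meta\s+name="keywords"\s+content="(.*?)"', html, re.I)
    body = re.search(r"<body>(.*?)</body>", html, re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", body.group(1)).strip() if body else ""
    summary = re.split(r"(?<=[.!?])\s+", text)[0] if text else None
    return {
        "title": title.group(1).strip() if title else None,
        "keywords": [k.strip() for k in keywords.group(1).split(",")] if keywords else [],
        "summary": summary,  # crude stand-in for a topic sentence
    }

html = ('<html><head><title>Damage Report</title>'
        '<meta name="keywords" content="forklift, warehouse"></head>'
        '<body>Pallet rack damaged by forklift. Repair scheduled.</body></html>')
print(extract_metadata(html))
# {'title': 'Damage Report', 'keywords': ['forklift', 'warehouse'],
#  'summary': 'Pallet rack damaged by forklift.'}
```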

  34. Implicit metadata – location, location, location • Where on the net is the document? • Judge a document by its neighbors • Major problem – unstable net topography • URL patterns can’t be relied on, unfortunately • Google’s original algorithms were based on behavior analysis on the public WWW

  35. Automatic metadata in “traditional” OLTP apps • Examples • Comment fields in apps such as • CRM/call report • Maintenance/damage report • Web feedback forms • Limited more by application imagination than by the data itself

  36. Part 2 Fitting text into a traditional IT context

  37. Benefits of storage in standard DBMS • System management (e.g., backup, failover) • Standard programming languages/APIs • Security!!

  38. Old objections to DBMS-based storage are invalid • Performance – proprietary systems can’t index email in real time either • Specialized functionality – the major DBMSs have long feature lists too

  39. All enterprise data architectures are supported • Central everything • Central index, distributed storage • Distributed/federated everything

  40. Application development technology and tools are just emerging • SQL/MM • Search controls, etc. • Emerging XML-centric technology • Customizable “content management” systems
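
As a concrete stand-in for such features, a hedged Python sketch using SQLite's FTS5 full-text module; the exact syntax of SQL/MM Full-Text and the vendor equivalents (CONTAINS and the like) differs, and FTS5 must be compiled into the SQLite build:

```python
import sqlite3

# Text search living inside the relational store, so the same SQL statement
# can mix full-text predicates with ordinary relational ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [("Call report", "Customer reported shipping damage to the pallet"),
     ("Maintenance log", "Replaced worn belt on conveyor line 3")],
)
for (title,) in conn.execute(
        "SELECT title FROM docs WHERE docs MATCH 'damage' ORDER BY rank"):
    print(title)   # Call report
```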

  41. Canned text apps are a mixed bag • Document management for regulatory filings • Information discovery • Generic search

  42. Part 3 Sorting out your application needs

  43. Different applications have very different profiles • Precision/recall of result • Quality of input • Security

  44. Basic application types, Group 1 – the fuzzies • Portal (e.g., self-service HR) • Best case for generic WWW-like search • Notes/Exchange/Email • It’s not clear what functionality is really needed • Active area of research/development • Information discovery

  45. Basic application types, Group 2 – OLTP • Heavy-duty transaction processing (ERP, supply chain, etc.) • Search is tangential • Direct touch CRM • Basic search is underutilized but gaining ground • Online sales/marketing (very different in different industries) • Search part of the app unlikely to be very demanding … • … except from a security standpoint

  46. Basic application types, Group 3 – Heavy-duty analytic aids • BI/CPM/Analytic apps • Great for taming the numerical part of the information tangle • Text search is largely irrelevant • Product lifecycle management (engineering-centric) • Text is an afterthought • Product lifecycle management (regulatory-centric) • Documentum et al. offer “compliance” solutions • Online maintenance manuals • This is a biggie for text!!

  47. Part 4 Key considerations in text application architecture

  48. Five big issues • Database integration • Realistic options for document metadata • Document stylistic consistency (local) • Quality-of-search application requirements • Security

  49. Text database integration vs. relational database integration • Remote indexing is an option • Data cleaning and consistency issues are different • Performance issues are different • Everything is a little more primitive

  50. Document metadata – consider the source • Author/editor – can’t be relied on • Implicit metadata – great if you trust your policies/procedures • Extracted metadata – same strengths/weaknesses as general text search • From a relational OLTP app – nice if you have it
