1 / 48

Sample Talks for Organizational Hints

Sample Talks for Organizational Hints. Krishnaprasad Thirunarayan Department of Computer Science and Engineering Wright State University Dayton, OH-45435. Overall R&D Agenda.

xerxes
Télécharger la présentation

Sample Talks for Organizational Hints

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sample Talks for Organizational Hints Krishnaprasad Thirunarayan Department of Computer Science and Engineering Wright State University Dayton, OH-45435

  2. Overall R&D Agenda Develop semi-automatic techniques for information extraction/retrieval to enable man and machine to complement each other in assimilation of semi-structured, heterogeneous documents =>Semantic Web Technologies.

  3. A Modular Approach to Document Indexing and Semantic Search

  4. Goal (What?) • Background and Motivation (Why?) • Implementation Details (How?) • Evaluation and Applications (Why?) • Conclusions

  5. Goal

  6. Develop a modular approach to improving effectiveness of searching documents for information • Reuse and integrate mature software components

  7. Background and Motivation

  8. Improve recall using information implicit in the English language • Improve precision and recall using domain-specific information implicit in the document collection • Assist manual content extraction by mapping document phrases to controlled vocabulary terms (domain library) • NSF-SBIR Phases I and II with Cohesia Corp.

  9. Enable extensions • Spell check input query • Organize search results through grouping • Improve precision thro sense-disambiguation • Enable experimentation • Investigate empirical relationship between significant eigenvalues in the Singular Value Decomposition (SVD) and the number of document clusters using benchmarks.

  10. Implementation Details (How?)

  11. Tools Used • Apache’s Lucene APIs • A high-performance, Java text search engine library with smart indexing strategies. • WordNet and Java WordNet Library • NIST and MathWork’s Java Matrix package (JAMA) for LSI • Domain-specific controlled vocabulary for Materials and Process Specs

  12. Jazzy, a Java Open Source Spell-Checker • MEDLINE dataset • 20-Newsgroups dataset • Reuters-215781 newswire stories datasets

  13. Document collection Inverted DLIndex Document Indexer DL Term Locator LSA Term Matrix Domain Library Inverted Document Index Configurer Highlighter Searcher Output WordNet Query Modifier User query Architecture of Content-based Indexing and Semantic Search Engine

  14. Evaluation and Application (Why?)

  15. Enhanced search illustrating wildcard pattern and synonym expansion

  16. Matching DL Items; DL Term and its location in the document

  17. Example illustrating skippable group

  18. LSI and Clustering • Exploring relationship between the number of significant eigenvalues and the number of document clusters • 20-Mini-Newsgroup dataset • 2000 postings, 20 groups • Reuters-215781 Newswire Stories dataset • Used 2000 stories at a time, 70 topics

  19. Conclusions

  20. Useful assistance for manual content extraction from materials and process specs, given the controlled vocabulary • In future, this framework / infrastructure can be used for experiments with expressive and context-aware search.

  21. Formalizing and Querying Heterogeneous Documents with Tables

  22. Goal (What?) • Background and Motivation (Why?) • Implementation Details (How?) • Evaluation and Applications (Why?) • Conclusions

  23. Goal

  24. Define, embed, and use metadata in semi-structured documents containing tables. • Content-oriented/domain-specific metadata of human sensible document • Makes explicit semantics of complex data • Enables augmentation of an interpretation in a modular fashion.

  25. Heterogeneous Document

  26. Background and Motivation

  27. Embedding metadata improves traceability, thereby facilitating • Content Extraction • Verification • Update

  28. Implementation Details (How?)

  29. XML Technology • Document-Centric View: XML is used to annotate documents for use by humans in the realm of document processing and content extraction. • Data-Centric View: XML is used as text-based format for information exchange / serialization in the context of Web Services.

  30. Basic idea behind our approach • Unify the two views by using XML-elements to materialize abstract syntax, and together with XML attributes and XML element definitions, formalize the content. • Key advantage: Minimizes maintenance of additional data structures to relate original document with its formalization.

  31. Two Concrete Implementations • Use Web Services language Water which amalgamates XML Technology with programming language concepts • Use XML/XSLT infrastructure

  32. Water-based approach • Each annotation reflects the semantics of the text fragment it encloses. • The annotated data can be interpreted by viewing it as a function/procedure call in Water. The correspondence between formal parameter and actual argument is position-based. • The semantics of annotation is defined in Water as a method definition in a class, separately.

  33. Example Table

  34. Example of Tagged Table Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi) table.<setHeading thickness strength.tensile strength.yield/> 0.50 and under 165 155 table.<addRow 0 0.50 165 155 /> 0.50 - 1.00 160 150 table.<addRow 0.50 1.00 160 150 /> 1.00 - 1.50 155 145 table.<addRow 1.00 1.50 155 145 /> ...

  35. Example of Processing Code <defclass table rows=required=vector heading=optional=vector> <defmethod setHeading t=required ts=required ys=required> <set heading=<vector t ts ys/>/> </> <defmethod addRow smin smax ts ys> <set rows= table.rows.<insert <vector smin smax ts ys/>/>/> </> <defmethod computeYieldStrength> … </> <defmethod computeTensileStrength> … </> … </>

  36. XML/XSLT-based approach • Each annotation reflects the semantics of the text fragment it encloses. • To make the annotated data XML compliant, dummy attributes such as one, two, three, … etc are introduced. The correspondence between formal attribute and the actual value is name-based. • The semantics is defined by interpreting XML-elements and its XML-attributes via XSLT, separately.

  37. Example of Tagged Table <table type="Tensile"> <dependency name="Yield Offset" value="0.2%"/> <tableSchema one="Thickness(min)" two="Thickness(max)" three="Tensile Strength“ four="Yield Strength"/> <tableUnits one="in" two="in" three="ksi" four="ksi" /> <tableData one="0" two="0.50" three="165" four="155" /> <tableData one="0.50" two="1.00" three="160" four="150" /> ... <\table>

  38. XSLT Stylesheets can be used to: • Query: to perform table look-ups. • Transform: to change units of measure such as from standard SI units to FPS units and vice versa. • Format: to display the table in HTML form. • Extract: to recover the original table. • Verify: to check static semantic constraints on table data values.

  39. Evaluation and Application (Why?)

  40. Advantage • Only tabular data in each document is annotated. The annotation definition is factored out as background knowledge. • Thus, the semantics of each table type is specified just once outside the document and is reused with different documents containing similar tables.

  41. Disadvantage • Both avenues require mature tool support for wide spread adoption. • For example, develop MS FrontPage like interface where the Master document is the annotated form, and the user explicitly interacts with/edits only a view of the annotated document, for readability reasons, and has support for export as XML to generate well-formed XML document.

  42. Prolog rendition strengthTableRow( 0, 0.50, 165, 155). strengthTableRow(0.50, 1.00, 160, 150). strengthTableRow(1.00, 1.50, 155, 145). ... strengthTable(Thickness, TensileStrength, YieldStrength) :- strengthTableRow(L, U, TensileStrength, YieldStrength), L =< Thickness, U > Thickness. thicknessToTensileStrength(Thickness, TensileStrength) :- strengthTable(Thickness, TensileStrength, _). thicknessToYieldStrength(Thickness, YieldStrength) :- strengthTable(Thickness, _, YieldStrength). ?- thicknessToYieldStrength(0.6,YS).

  43. Conclusion and Future Work

  44. Develop a catalog of predefined tables, specifying them using Semantic Web formalisms (such as RDF, OWL, etc) and mapping the tabular data into a set of pre-defined tables, possibly qualified. • Develop techniques for manual mapping of complex tables into simpler ones: • To provide semantics to data. • To improve traceability. • To facilitate automatic manipulation.

  45. Tailor and improve IE and IR techniques developed in the context of text processing to Semantic Web documents such as in XML, RDF, etc benefiting from additional support from ontologies such as in OWL, etc

  46. Our Related Publications

  47. K. Thirunarayan, A. Berkovich, and D. Sokol, An Information Extraction Approach to Reorganizing and Summarizing Specifications, In: Information and Software Technology Journal, Vol. 47, Issue 4, pp. 215-232, 2005. • K. Thirunarayan, On Embedding Machine-Processable Semantics into Documents, In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 7, pp. 1014-1018, July 2005.

  48. Holy Grail Ultimately develop principles, techniques and tools, to author and extract human-readable and machine-comprehensible parts of a document hand in hand, and keep them side by side.

More Related