Document Engineering of Complex Software Specifications
Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa, School of Information Technology and Engineering
June 4, 2007, MSc Thesis in Computer Science
Motivation and Goal

Problems triggering our motivation. Software specifications:
- are dense and intricate, with a large amount of material
- have complicated structures (many tables, figures, lists, code fragments, etc.)
- are difficult to browse and navigate
- are mostly available in PDF format or as a single large hypertext page

Major goals:
- Re-engineer PDF-based documents (specifications, conference proceedings, e-books, etc.)
- Illustrate how to produce a more usable version of such documents
Data Analyses

Hypothesis: headings and the document index carry the most important words in a document.

Case study on the UML Superstructure Specification (and other OMG specifications):
- Identified the most frequent words among headings
- Measured the frequency of those words in the entire document
- Identified the most frequent words in the document index

Method (see the sketch below):
- Sorted document tokens and heading tokens by frequency, in two separate lists
- Determined the positions of the heading tokens among the document tokens: P1, P2, …, PN
- MP: mean of [P1 … PN]; NDT: total number of document tokens
- Percentage = (MP * 100) / NDT

Result: the most frequent headings (number of occurrences > 2) are among the most frequent words in the entire document.
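As an illustration of this ranking comparison, here is a minimal Java sketch, not the original analysis code: the tokenization, word lists, and method names are assumptions, and NDT is taken here as the number of distinct document tokens in the frequency-sorted list.

```java
import java.util.*;

public class HeadingRankAnalysis {

    // Count token frequencies and return the distinct tokens sorted
    // from most to least frequent.
    static List<String> sortByFrequency(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t.toLowerCase(), 1, Integer::sum);
        }
        List<String> sorted = new ArrayList<>(freq.keySet());
        sorted.sort((a, b) -> freq.get(b) - freq.get(a));
        return sorted;
    }

    // Percentage = (MP * 100) / NDT, where MP is the mean position of the
    // heading tokens inside the frequency-sorted document token list and
    // NDT is the total number of (distinct) document tokens in that list.
    static double meanPositionPercentage(List<String> documentTokens,
                                         List<String> headingTokens) {
        List<String> sortedDoc = sortByFrequency(documentTokens);
        List<String> sortedHeadings = sortByFrequency(headingTokens);

        long positionSum = 0;
        int found = 0;
        for (String h : sortedHeadings) {
            int p = sortedDoc.indexOf(h);          // position P_i of heading token h
            if (p >= 0) {
                positionSum += p + 1;              // 1-based position
                found++;
            }
        }
        double mp = (double) positionSum / found;  // MP: mean of P_1 .. P_N
        int ndt = sortedDoc.size();                // NDT
        return (mp * 100) / ndt;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "class", "association", "the", "class", "state", "the", "class");
        List<String> headings = Arrays.asList("class", "association");
        System.out.printf("Mean-position percentage: %.1f%%%n",
                meanPositionPercentage(doc, headings));
    }
}
```

A low percentage means the heading tokens sit near the top of the document's frequency ranking, which is what the analysis found for the OMG specifications.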
Document Transformation
- Transforming the raw PDF input into a format more amenable to analysis (XML)
- Extracting and refining the structure

Conversion experiments. Tools:
- Adobe Acrobat Professional 7.8
- Microsoft Word 2003
- Stylus Studio XML Enterprise Suite
- ABBYY PDF Transformer 1.0

Evaluation criteria:
- Generality
- Low volume
- Clean and understandable output
- Similarity to XML
- Good structural clues
Logical Structure Extraction
Java parsers that (nesting sketch below):
- Solved the mis-tagging problems created during the previous phase
- Extracted all headings present in the document bookmarks
- Removed superfluous information and XML tags
- Formed the document's logical structure as a clean XML file
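A minimal sketch of the nesting step, assuming the headings arrive as numbered bookmark titles (e.g. "7.1 Overview"); the class name, the title attribute, and the input format are illustrative, not the thesis code.

```java
import java.util.*;

public class LogicalStructureBuilder {

    // Build nested <Section> elements from numbered bookmark headings, using the
    // depth of the section number ("7" -> level 1, "7.1" -> level 2, ...).
    static String buildXml(List<String> headings) {
        StringBuilder xml = new StringBuilder("<Document>\n");
        Deque<Integer> open = new ArrayDeque<>();   // levels of currently open sections

        for (String heading : headings) {
            String number = heading.split("\\s+", 2)[0];
            int level = number.split("\\.").length;

            // Close sections at the same or a deeper level before opening a new one.
            while (!open.isEmpty() && open.peek() >= level) {
                xml.append("</Section>\n");
                open.pop();
            }
            xml.append("<Section title=\"").append(heading).append("\">\n");
            open.push(level);
        }
        while (!open.isEmpty()) {                   // close whatever is still open
            xml.append("</Section>\n");
            open.pop();
        }
        return xml.append("</Document>\n").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildXml(Arrays.asList(
                "7 Classes", "7.1 Overview", "7.2 Abstract Syntax", "8 Components")));
    }
}
```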
Hypertext Pages & Text Extraction
- Produced a separate output for each chapter, section, subsection, etc. (1.html, 2.html, 2.1.html, …)
- Generated a table of contents from the headings (used as a navigation frame)
- Connected the hypertext outputs sequentially
- Used XPath expressions and a programming approach (rough sketch below) to form the major document elements:
  - Anchors in long pages
  - Figures and their captions
  - Simple and nested lists
  - Dynamic tables
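The thesis generated the multiple pages with XSLT; the following is only a rough Java/XPath equivalent of the "one small page per section" idea. The input file name and the number/title attributes on <Section> are assumptions.

```java
import java.io.File;
import java.nio.file.*;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class PagePerSection {
    public static void main(String[] args) throws Exception {
        // Parse the extracted logical-structure XML (file name is illustrative).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("uml-superstructure.xml"));

        // Select every Section element that carries a section number.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList sections = (NodeList) xpath.evaluate(
                "//Section[@number]", doc, XPathConstants.NODESET);

        // Emit one small hypertext page per chapter/section/subsection,
        // e.g. 1.html, 2.html, 2.1.html, ...
        for (int i = 0; i < sections.getLength(); i++) {
            Element section = (Element) sections.item(i);
            String number = section.getAttribute("number");
            String html = "<html><body><h1>" + section.getAttribute("title")
                    + "</h1><p>" + section.getTextContent() + "</p></body></html>";
            Files.write(Paths.get(number + ".html"), html.getBytes("UTF-8"));
        }
    }
}
```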
Concept Extraction
- UML Superstructure Specification: extraction of the UML class and package hierarchies
- Rule: if the first child of a <Section> element contains the string 'Class Descriptions', then UML classes and packages can be detected in the grandchildren of that <Section> element (sketch below)
- Other specifications handled the same way: Common Warehouse Metamodel (CWM), UML Infrastructure, Meta Object Facility (MOF)
- Open question: how can such logical relations among heading elements be detected automatically?
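The 'Class Descriptions' rule maps naturally onto a single XPath expression. This is a sketch under the same assumed <Section> markup as above, not the exact expression used in the thesis.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class ClassDescriptionFinder {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("uml-superstructure.xml"));  // illustrative file name

        // Sections whose first child mentions 'Class Descriptions' hold the
        // class/package hierarchy in their grandchildren.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList grandchildren = (NodeList) xpath.evaluate(
                "//Section[contains(*[1], 'Class Descriptions')]/*/*",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < grandchildren.getLength(); i++) {
            Node node = grandchildren.item(i);
            System.out.println(node.getNodeName() + ": "
                    + node.getTextContent().trim());
        }
    }
}
```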
Cross Referencing
- Developed an XSLT program to extract heading phrases and their corresponding hyperlinks
- Filtered out phrases with common substrings, such as Association vs. AssociationClass
- Removed phrases that map to many independent hypertext pages (different entries in the user interfaces)
- For the UML Superstructure Specification, also used package names as anchors in cross references
- Finally, developed a Java program to replace the phrases with hyperlinks in the generated HTML pages
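A minimal sketch of the replacement step, showing one way to handle the Association / AssociationClass substring problem by linking the longest phrases first. The page names and method names are illustrative, not the thesis implementation.

```java
import java.util.*;

public class CrossReferencer {

    // Replace heading phrases with hyperlinks, longest phrase first, so that
    // "AssociationClass" is linked before its substring "Association".
    static String linkPhrases(String html, Map<String, String> phraseToPage) {
        List<String> phrases = new ArrayList<>(phraseToPage.keySet());
        phrases.sort((a, b) -> b.length() - a.length());

        for (String phrase : phrases) {
            String anchor = "<a href=\"" + phraseToPage.get(phrase) + "\">" + phrase + "</a>";
            // Word-boundary match so partial words are not linked; the negative
            // lookahead skips text that is already inside an <a>...</a> element.
            html = html.replaceAll("\\b" + phrase + "\\b(?![^<]*</a>)", anchor);
        }
        return html;
    }

    public static void main(String[] args) {
        Map<String, String> links = new LinkedHashMap<>();
        links.put("AssociationClass", "7.3.4.html");  // page names are illustrative
        links.put("Association", "7.3.3.html");

        String page = "<p>An AssociationClass is both an Association and a Class.</p>";
        System.out.println(linkPhrases(page, links));
    }
}
```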
Usability of User Interfaces
Reasons for generating small hypertext pages:
- A better sense of location while navigating
- Less chance of getting lost while scrolling
- A less overwhelming experience while learning
- Statistical analysis of which topics attract interest
- Faster downloading (no need to fetch the entire document)
- Easier printing, cross referencing among diverse specifications, etc.

User interface demo
Contributions
- A generic approach to re-engineering complex documents
- A data analysis showing that words in headings provide a sufficient basis for document re-engineering
- Extraction of the document logical structure in XML format
- Various techniques for text and concept extraction using W3C technologies
- Major software components for an "Integrated Document Engineering Tool"
Engineering Lessons & Challenges
Engineering lessons:
- Generating a clean XML file from PDF images requires sophisticated features to recognize each document element correctly and to deal with mis-tagging, page boundaries, etc.
- Recent technologies play a remarkable role in engineering tasks: e.g. XPath 2.0 versus low-level parsing packages offers a high-level interaction closer to human language (see the contrast sketch below)
- Comprehensive data analysis can facilitate the document engineering process, build a better understanding, and yield robust rules for such processing

Low-level challenges:
- Generating multiple hypertext pages with Saxon
- Detecting errors in XSLT programming
- Creating complicated XPath expressions, etc.
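To make the XPath-versus-parsing lesson concrete, the sketch below finds the titles of sections that contain a figure both ways, under the same assumed <Section>/<Figure> markup as earlier; it is an illustration, not code from the thesis.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class XPathVersusDom {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("uml-superstructure.xml"));  // illustrative file name

        // One declarative XPath expression: titles of sections containing a Figure.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList byXPath = (NodeList) xpath.evaluate(
                "//Section[.//Figure]/@title", doc, XPathConstants.NODESET);
        System.out.println("XPath found " + byXPath.getLength() + " sections.");

        // The equivalent hand-written DOM traversal: more code, more places to go wrong.
        NodeList sections = doc.getElementsByTagName("Section");
        for (int i = 0; i < sections.getLength(); i++) {
            Element section = (Element) sections.item(i);
            if (section.getElementsByTagName("Figure").getLength() > 0) {
                System.out.println(section.getAttribute("title"));
            }
        }
    }
}
```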
Future Work
- Extracting the initial XML document independently of Adobe Acrobat
- Automating the concept extraction procedure, or adding HCI features for it
- Developing an automatic document analyzer for comprehensive data analyses
- Investigating the usability of the current user interfaces to discover users' demands
- Generating interaction features in the UIs: online query submission to the XML files
Publications
Refereed conference paper:
- M. Nojoumian and T. C. Lethbridge, "Extracting document structure to facilitate a KB creation for UML specifications", in Proceedings of the 4th IEEE International Conference on Information Technology: New Generations (ITNG), pp. 393-400, Las Vegas, USA, 2007.

Invited to publish in the Journal of Computers (JOC):
- M. Nojoumian and T. C. Lethbridge, "Document engineering of complex software specifications", Academy Publisher.
Thank you very much. Questions?