Semi-Automatic Content Extraction from Specifications Krishnaprasad Thirunarayan Department of Computer Science & Engineering Wright State University Aaron Berkovich and Dan Sokol Cohesia Corporation
Extraction: Summarize in a prescribed vocabulary [Figure labels: Spec: Text, Spec: SDR, Domain Library]
Participants • Sponsor: National Science Foundation • SBIR: Phase I and Phase II • Industry: Cohesia Corporation • Developer of (B2B) content and lower-level infrastructure • University: Wright State University • User-level tools: conceptualization and design • Others: Geometric Software Solutions, … • Tool/Product development and integration
Outline • Background and Goal (What?) • Motivation (Why?) • Details (How?) • Conclusions
Manual Content Extraction • Input: • Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials • Additional constraints imposed by customers and vendors • Appropriate Ontology and Domain Library defining standard vocabulary
Output: • An “equivalent” formalized description of specs in Specification Definition Representation (SDR) • Observation: • Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure. • An experienced extractor exploits the linguistic patterns found in specs to interpret them.
Assistance for Extraction [Diagram labels: Document (Paper), Document (Text), Mark-Up Editor (Wizard), Document (SDR), Document Proofer, original]
Semi-automatic Content Extraction • Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR. • Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text. • Spec: Human-sensible • Mark-up: Computer-sensible • Automate routine mechanical tasks.
Semantic Mark-up [Figure: spec text annotated with the labels Spec Name, Spec Title, Revision, Procedure, Revision Date, Qualifier, Characteristic, Values, Value]
Ontology • (Gruber) • An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.
SDL Ontology [Diagram nodes: Document, Domain Library, Revision, Reference, Procedure, Layer, Characteristic, Value; edges carry the multiplicities “1 or many” and “0, 1 or many”, some marked “Ref:”]
Extraction: Spec to SDR [Figure labels: Spec: Text, Spec: SDR]
Fundamental Obstacles • The relation between the spec and its SDR rendition is “not linear”. • Same spec information duplicated in SDR in different contexts. • Contiguous block of information in SDR spread out in spec. • Equivalence of phrases hard to formalize. • Tables and footnotes abbreviate information in irregular and complicated ways.
Linearizing through Abstraction: Introducing Specification Definition Language [Diagram labels: Original Spec, SDL, SDR; Manual (original); Manual (Ph-I); Compiled (Ph-I); Literal, Integrated, Semi-automatic (Ph-II)] • Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages. • Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.
Business Background (Supply Chain) [Diagram labels: Engine, Forger, Metal, Drawing, Spec]
Quality Issues • Transcription Errors • From spec to hand-written sheet to computer • Completeness • Info in spec but missing in SDR • Soundness • Info in SDR but not in spec • Uniformity of Form • Uniformity in Interpretation • Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)
Efficiency Issues • Minimize time/effort required. • Automate routine mechanizable tasks • Eliminate “cut-paste-modify” cycle • Minimize duplication of information. • Concise representation • Size of translation = O(Size of spec). • Update consistency • Flexible rendition into various external forms.
Essence of our Approach : Literal Translation • Conceptually, every piece of info in SDR owes its existence to phrases in spec. • Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec. • Requires compilation into SDL/SDR. • Cf. XML/XSL Technology
A semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec, and the approach is systematically extensible. • Note that current manual extractions into SDL are not literal, even though SDL enables this to an extent.
SDL Studio and its Extension • SDL Studio enables creation and editing of SDL documents. It has facilities to search the domain library and compile SDL into an equivalent SDR. • This can be further enriched using • Improved Domain Library Search • Extraction and composition of SDL fragments • Providing templates for commonly occurring “procedures” • Table processor • etc …
Domain Library • Currently, it contains technical phrases pertinent to materials and processing requirements • Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc. • Typical size: 10,000 phrases
Improving Domain Library Search • Goal: Mapping “equivalent” phrases to same Domain Library Term • Uses: • Techniques for prefix removal, stemming, and dealing with other variations for root recognition • Stop words elimination • Abbreviation expander and alias normalization
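The normalization steps listed above (stop-word elimination, abbreviation expansion, root recognition) can be sketched roughly as follows. This is a minimal illustration and not the tool's implementation: the tables STOP_WORDS, ABBREVIATIONS, and SUFFIXES and the function normalize_phrase are hypothetical, and prefix removal and alias normalization are left out.

```python
import re

# Hypothetical tables for illustration; the slides do not list the real entries.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "for"}
ABBREVIATIONS = {"temp": "temperature", "min": "minimum", "max": "maximum"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def normalize_phrase(phrase):
    """Return the list of normalized content words of a phrase."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    result = []
    for word in words:
        if word in STOP_WORDS:
            continue  # stop-word elimination
        word = ABBREVIATIONS.get(word, word)  # abbreviation expansion
        for suffix in SUFFIXES:
            # Strip one suffix, but keep at least three characters of the root.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        result.append(word)
    return result
```

Under these assumed tables, "Min. temp of the annealing" and "minimum temperature anneal" normalize to the same word list, which is the sense in which “equivalent” phrases map to the same Domain Library Term.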
Algorithm Sketch
List[Phrase] dl; Phrase in; Int mt;
List[Word] dlwm, inwm; % with back references
List[Phrase] dlts;
begin
  dl := readAndBuildDomainLibrary();
  dlwm := buildWordMapAndBackLinks(dl); % delete stop words, link words to DLTs
  (in, mt) := readInputPhraseAndMatchThreshold();
  inwm := buildWordMap(in);
  dlts := buildDLTsListContainingMatchedWords(dlwm, inwm);
  dlts := evaluateAndFilterDLTs(dlts, mt);
end;
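One way to read the sketch is as an inverted index from words back to the Domain Library Terms (DLTs) that contain them. The Python below is a simplified illustration under assumptions that are not in the slides: words are compared by exact equality rather than by wordMatch, the score is the percentage of a DLT's words found in the input, and the names words_of, build_word_map, and search are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "for"}  # assumed list


def words_of(phrase):
    """Lower-case a phrase, split on whitespace, and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_word_map(domain_library):
    """Map each word to the indexes of the DLTs containing it (back links)."""
    word_map = defaultdict(set)
    for index, term in enumerate(domain_library):
        for word in words_of(term):
            word_map[word].add(index)
    return word_map


def search(domain_library, word_map, input_phrase, threshold):
    """Return (DLT, score) pairs whose score reaches the threshold."""
    hits = defaultdict(int)
    for word in set(words_of(input_phrase)):
        for index in word_map.get(word, ()):
            hits[index] += 1  # one more word of this DLT found in the input
    results = []
    for index, count in hits.items():
        term_words = set(words_of(domain_library[index]))
        score = 100 * count // len(term_words)  # assumed scoring rule
        if score >= threshold:
            results.append((domain_library[index], score))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Because the lookup is driven by the words of the input phrase, one input can return several DLTs, and the words of a DLT need not be contiguous in the input.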
Matching words
Int wordMatch(w1, w2)
begin
  % normalized = vowels deleted, i.e., only consonants present
  if caseUniformAndCleanedMatch(w1, w2) return 100;
  if normalizedMatch(w1, w2) return 90;
  if orderedNormalizedMatch(w1, w2) return 70;
  % analyze for differences due to prefix and suffix
  if normalizedDifferenceInPrefixSuffixTables(w1, w2) return 90;
  return 0; % no match
end;
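A rough Python rendering of the first three tests in wordMatch is given below. It rests on assumptions: “cleaned” is taken to mean lower-cased with non-alphanumeric characters removed, and “ordered normalized match” is taken to mean that the consonant skeletons agree once their letters are sorted (which tolerates transposed letters). The prefix and suffix table check is omitted because the slides do not show the tables.

```python
import re

VOWELS = set("aeiou")


def clean(word):
    """Lower-case the word and drop non-alphanumeric characters (assumed)."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def consonants(word):
    """Normalized form: the cleaned word with its vowels deleted."""
    return "".join(c for c in clean(word) if c not in VOWELS)


def word_match(w1, w2):
    """Return a match score between 0 and 100 for two words."""
    if clean(w1) == clean(w2):
        return 100
    n1, n2 = consonants(w1), consonants(w2)
    if n1 and n1 == n2:
        return 90  # same consonants in the same order
    if n1 and sorted(n1) == sorted(n2):
        return 70  # same consonants, order differs (assumed reading)
    return 0
```

For instance, word_match("color", "colour") falls through the exact test and matches on the consonant skeleton "clr", which reflects the rationale that “correct” spellings may differ in vowels.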
Design Rationale • Input phrase may contain multiple DLTs. • DLT words may not appear contiguous in input. • Consonants are significant, and "correct" spellings may differ in vowels. • Robustness with respect to spelling errors such as transposition of letters or missing vowels. • Stemmers do not work satisfactorily for words appearing in DLTs. Instead, create tables customized to deal with prefixes and suffixes that arise in practice, and normalize dynamically. • Err on the side of recall rather than precision. • Number of words < Number of DLTs
Overall Approach • Preprocessing: Obtain spec in plain text form (from MSWord format). • This is a practical alternative to scanning and OCR-ing a paper-based spec. • Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags. • Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.
Two Possible Avenues (From Document to SDL) • Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology. • Generate various views of the document and SDL from this single XML Master. • Iteratively generate a sequence of progressively detailed SDL documents from spec text.
First Avenue : Via XML • Semi-automatic extraction is accomplished in two phases: • Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment. • Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL. • Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
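The initial automatic markup phase can be pictured with a small sketch that wraps recognized domain library terms in XML annotations. The element names spec and dlt, the attribute term, and the function tag_terms are hypothetical; the slides do not show the actual tag set, and the real tagger is built with flex rather than Python.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def tag_terms(text, domain_library):
    """Wrap each domain library term found in text in a <dlt> element."""
    if not domain_library:
        return "<spec>" + escape(text) + "</spec>"
    lookup = {term.lower(): term for term in domain_library}
    # Longest terms first, so "tensile strength" wins over "strength".
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(term) + r"\b" for term in terms),
        re.IGNORECASE,
    )
    pieces, last = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))  # untagged text
        canonical = lookup[match.group(0).lower()]
        pieces.append(
            "<dlt term=%s>%s</dlt>" % (quoteattr(canonical), escape(match.group(0)))
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "<spec>" + "".join(pieces) + "</spec>"
```

Keeping the original wording as element content and the canonical term as an attribute preserves the link between the spec text and the extraction, which is the point of the single XML Master.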
(cont’d) • Advantages: • Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions. • All the processing is orchestrated on this XML file. • Implements various views of the XML source using XSLFO and various transformations on the XML source using XSLT.
(cont’d) • Disadvantages: • There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.
Semantic-Markup Algorithm • Insert Structure Tags • Insert Ontology Tags • Infer Missing Char. • Group Char. & Values • Group C-Vs into Procedures
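As an illustration of one step, grouping characteristics with their values, the sketch below pairs each characteristic element with the value element that immediately follows it. Everything here is an assumption made for the example: the tag names char, value, and cv, the function group_char_values, and the adjacency rule itself, which is far simpler than what real spec text requires.

```python
import xml.etree.ElementTree as ET


def group_char_values(xml_text):
    """Wrap each <char> directly followed by a <value> in a <cv> element."""
    root = ET.fromstring(xml_text)
    children = list(root)
    grouped = ET.Element(root.tag, root.attrib)
    i = 0
    while i < len(children):
        node = children[i]
        following = children[i + 1] if i + 1 < len(children) else None
        if node.tag == "char" and following is not None and following.tag == "value":
            pair = ET.SubElement(grouped, "cv")
            pair.append(node)
            pair.append(following)
            i += 2  # both elements consumed
        else:
            grouped.append(node)  # leave unpaired elements untouched
            i += 1
    return ET.tostring(grouped, encoding="unicode")
```

A characteristic with no adjacent value is passed through unchanged, leaving it for the extractor to resolve by hand.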
Functional Components • Text file → Structure Tagger → XML file → DLT Tagger (with Domain Library) → XML file → Group Tagger → XML file → SDL Converter → SDL file
Tagging and Transforming • flex structTagger.flex • gcc lex.yy.c -lfl • a < GE.txt > GE.xml • java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl • … • java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl • java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt
Second Avenue: SDL all along • As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along. • Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor. • Disadvantage: This form does not retain correspondence with the original document explicitly.
Extraction Tool – Prototype Operation
Views: In the context of Spec • Plain text view • Text view with “requirement” phrases color coded and highlighted • View of domain library terms found in the spec
Views: In the context of SDL • Spec identity view + Large Note: Method D Extraction • Method C Extraction • Procedure view • Characteristic-value pair view