Open-Source Implementation of Document Structuring Algorithm for NLTK

  1. Open-Source Implementation of Document Structuring Algorithm for NLTK Nicholas FitzGerald

  2. Natural Language Generation
     - Generate coherent text outputs to express information
     - Express the right information
     - Express information in the right order

  3. NLG Tasks
     - Document Structuring: the most important and relevant information is selected from the knowledge base (Content Determination), then ordered and structured in such a way as to maximize coherence and informativeness (Text Planning)
     - Micro-Planning: specifics of word selection, referring expressions, and the finalization of ordering are determined
     - Realization: internal representations of the above decisions are realized in actual text output

  4. Document Structuring
     - Given a set of information to be expressed, determine the order and grouping of this information
     - Texts cannot be simply a random bag of sentences
     - Order of message presentation has a significant effect on meaning [Hovy 1993]. One way:
         1 - "Maria was diagnosed with cancer some months ago."
         2 - "Maria and Zurab had a fight last night."
         3 - "She was found dead this morning."
       Versus:
         1 - "Maria was diagnosed with cancer some months ago."
         2 - "She was found dead this morning."
         3 - "Maria and Zurab had a fight last night."

  5. Document Structuring
     - Ordering also affects coherence:
       "John was hungry. John went to the store. He bought some bread to make a sandwich."
       vs.
       "John bought some bread to make a sandwich. He went to the store. John was hungry."

  6. Discourse relations
     - A relation that holds between messages or groups of messages
     - Elaboration(m1, m2):
       "I love jazz music (m1). My favourite album is Oscar Peterson's “Night Train” (m2)."
     - Contrast(m1, m3):
       "I love jazz music (m1). However, my favourite album is The Beatles' “White Album” (m3)."
       Cue word: "However"

  7. Rhetorical Structure Theory
     - Mann and Thompson 1988
     - A text is coherent by virtue of relationships that hold between messages in the text
     - A small number of relations (~25) can explain the relationships between messages in a wide range of texts

  8. Project Proposal
     - Implement these general algorithms for inclusion in NLTK
     - Provide a sample data set and DR schema for testing and illustration, based on the hypothetical WeatherExplainer from [Reiter and Dale 2000]
     - Experiment with using these new tools as part of the Abstractive Summarization System for Evaluative Statement Summarization (ASSESS)

  9. Implementation 1: Schemas
     - Top-Down Approach
     - Output document structure is predictable and stereotyped
     - Schemas are patterns of expansion, similar to CFG rules, e.g. (expansion sketched in code below):
         CompareAndContrast → DescribeRelationship CompareProperties
         CompareProperties  → CompareProperty CompareProperties
         CompareProperties  → (empty)
       "John is much bigger than Kate (DR). He is five inches taller (CP) and weighs almost twice as much (CP)."
     - Specify rules for choosing if multiple expansions exist
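     A toy illustration of how the schema above unrolls into a sequence of message slots; the names mirror the slide, but the representation is an assumption for illustration, not the project's actual schema machinery.

        def expand_compare_properties(n_properties):
            # CompareProperties -> CompareProperty CompareProperties | (empty):
            # unroll the recursion once per property to be expressed.
            return ["CompareProperty"] * n_properties

        def expand_compare_and_contrast(n_properties):
            # CompareAndContrast -> DescribeRelationship CompareProperties
            return ["DescribeRelationship"] + expand_compare_properties(n_properties)

        # "John is much bigger than Kate (DR). He is five inches taller (CP)
        #  and weighs almost twice as much (CP)."
        print(expand_compare_and_contrast(2))
        # ['DescribeRelationship', 'CompareProperty', 'CompareProperty']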

  10. Top-Down Problems
      - Hypothesis-driven: content selection is done "on-line"
      - Not easily pipelined
      - Therefore, the Bottom-Up approach is used instead

  11. Implementation 2: Bottom-Up
      - Output document structure is not predictable
      - POOL = messages to be expressed
          while size(POOL) > 1:
              find all pairs of elements in POOL which can be joined by a DR
              assign a desirability score to each potential DR
              find the pair Ei and Ej with the highest score and combine them into Ek
              remove Ei and Ej from POOL, replace with Ek
          end while
        (A Python sketch of this loop follows below.)
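     A minimal Python sketch of the loop above. It assumes hypothetical Rule objects with a get_options(pool) method yielding (score, combined, left, right) tuples; this illustrates the algorithm, not the project's actual API.

        def bottom_up_plan_sketch(messages, rules):
            # Repeatedly combine the highest-scoring pair until one plan
            # (or no applicable relation) remains.
            pool = list(messages)
            while len(pool) > 1:
                candidates = [option for rule in rules for option in rule.get_options(pool)]
                if not candidates:
                    break  # no discourse relation applies to any remaining pair
                score, combined, left, right = max(candidates, key=lambda c: c[0])
                pool.remove(left)      # E_i and E_j leave the pool ...
                pool.remove(right)
                pool.append(combined)  # ... and are replaced by E_k
            return pool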

  12. Implementation
      - Used nltk.featstruct for Messages and DocPlans: a mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer) or a nested feature structure.
      - Example message (shown here in outline form; NLTK prints it in nested bracket notation):
          TotalRainfallMsg
            period:    year = 1996, month = 06
            attribute: type = 'RelativeVariation'
                       magnitude = [unit = 'inches', number = 4]
                       direction = '+'
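     For example, the message above can be built with nltk.FeatStruct, which accepts the bracketed string syntax directly (feature names are shown here without the *...* markers used in the project's display):

        import nltk

        total_rainfall = nltk.FeatStruct(
            "[msgType='TotalRainfallMsg', "
            "period=[year=1996, month=6], "
            "attribute=[type='RelativeVariation', "
            "magnitude=[unit='inches', number=4], direction='+']]"
        )
        print(total_rainfall)  # NLTK prints the nested bracket display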

  13. Implementation
      nltk.featstruct.FeatStruct.unify(other): unify self with other and return the resulting feature structure. The unified feature structure is the minimal feature structure that:
      - contains all feature value assignments from both self and other
      - preserves all reentrance properties of self and other
      If no such feature structure exists (because self and other specify incompatible values for some feature), unification fails and unify returns None.

  14. Unification
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
      +
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     direction = '+'
      =
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
                     direction = '+'
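     The same example in code, using the unify method described on the previous slide (feature names simplified as before):

        import nltk

        magnitude_view = nltk.FeatStruct(
            "[msgType='TotalRainfallMsg', period=[year=1996, month=6], "
            "attribute=[type='RelativeVariation', magnitude=[unit='inches', number=4]]]"
        )
        direction_view = nltk.FeatStruct(
            "[msgType='TotalRainfallMsg', period=[year=1996, month=6], "
            "attribute=[type='RelativeVariation', direction='+']]"
        )

        # Unification merges the two compatible views into the full message.
        print(magnitude_view.unify(direction_view))

        # Incompatible values (a different msgType) make unify() return None.
        print(magnitude_view.unify(nltk.FeatStruct("[msgType='MonthlyRainfallMsg']")))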

  15. Implementation
      nltk.featstruct.FeatStruct.subsumes(other): returns True if self subsumes other, i.e. if unifying self with other would result in a feature structure equal to other.

  16. Subsumes
      This structure:
        TotalRainfallMsg
          period: year = 1996, month = 06
      subsumes:
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
                     direction = '+'
      but this structure:
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
      does not subsume:
        TotalRainfallMsg
          period: year = 1996, month = 06
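     The two checks above, expressed with the subsumes method (again with simplified feature names):

        import nltk

        full_msg = nltk.FeatStruct(
            "[msgType='TotalRainfallMsg', period=[year=1996, month=6], "
            "attribute=[type='RelativeVariation', "
            "magnitude=[unit='inches', number=4], direction='+']]"
        )
        period_only = nltk.FeatStruct("[msgType='TotalRainfallMsg', period=[year=1996, month=6]]")
        no_direction = nltk.FeatStruct(
            "[msgType='TotalRainfallMsg', period=[year=1996, month=6], "
            "attribute=[type='RelativeVariation', magnitude=[unit='inches', number=4]]]"
        )

        print(period_only.subsumes(full_msg))      # True: full_msg has everything period_only specifies
        print(no_direction.subsumes(period_only))  # False: period_only lacks the attribute feature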

  17. Using Subsumes
      "Select from messages all DocPlans with a relType of 'Contrast' and a nucleus which is a message of msgType 'TotalRainfallMsg'":

        d = DocPlan(relType='Contrast', nucleus=Message('TotalRainfallMsg'))
        return filter(lambda msg: d.subsumes(msg), messages)

  18. Implementation: Input Formats
      Messages:
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
                     direction = '+'

  19. Input Formats
      Rules (each rule specifies its inputs, conditions, return value, and heuristic score):

        Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)   ← inputs
            (M1.attribute.direction == M2.attribute.direction)                          ← conditions
            : ConstituentSet('Elaboration', M1, M2)                                     ← return
            : 3                                                                         ← heuristic
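     One plausible in-memory form for such a rule after parsing; the field names and lambda are illustrative assumptions, not the project's actual rule representation.

        from collections import namedtuple

        # Illustrative only: a parsed rule as (name, input message types, condition, heuristic).
        Rule = namedtuple("Rule", ["name", "input_types", "condition", "heuristic"])

        elaboration = Rule(
            name="Elaboration",
            input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
            # condition from the rule file: both messages report the same direction
            condition=lambda m1, m2: m1["attribute"]["direction"] == m2["attribute"]["direction"],
            heuristic=3,
        )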

  20. Example Usage

        with open('msg_file', 'r') as f:
            msg_string = f.read()
        with open('rule_file', 'r') as f:
            rule_string = f.read()

        messages = read_messages(msg_string)
        rules = read_rules(rule_string)
        plan = bottom_up_plan(messages, rules)

  21. Data Set: WeatherExplainer
      - Simple example provided in [Reiter and Dale 2000]
      - Created 3 messages and 3 rules in the input format

  22. WeatherExplainer Messages
        TotalRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 4]
                     direction = '+'

        MonthlyRainfallMsg
          period:    year = 1996, month = 06
          attribute: type = 'RelativeVariation'
                     magnitude = [unit = 'inches', number = 2]
                     direction = '+'

        MonthlyTemperatureMsg
          period:      year = 1996, month = 06
          temperature: category = 'hot'

  23. WeatherExplainer Rules
        Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
            (M1.attribute.direction == M2.attribute.direction) : ConstituentSet('Elaboration', M1, M2) : 3

        Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
            (M1.attribute.direction != M2.attribute.direction) : ConstituentSet('Contrast', M1, M2) : 2

        Sequence(Message('MonthlyTemperatureMsg')|ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1,
                 Message('MonthlyRainfallMsg')|ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
            () : ConstituentSet(Sequence, M1, M2) : 1

  24. WeatherExplainer Result (resulting DocPlan, shown in outline form)
        DPDocument
          title: text = None, type = None
          children:
            ConstituentSet  (relType = 'Sequence')
              nucleus: MonthlyTemperatureMsg
                         period:      year = 1996, month = 6
                         temperature: category = 'hot'
              aux:     ConstituentSet  (relType = 'Elaboration')
                         nucleus: MonthlyRainfallMsg
                                    period:    year = 1996, month = 6
                                    attribute: type = 'RelativeVariation'
                                               magnitude = [unit = 'inches', number = 2]
                                               direction = '+'
                         aux:     TotalRainfallMsg
                                    period:    year = 1996, month = 6
                                    attribute: type = 'RelativeVariation'
                                               magnitude = [unit = 'inches', number = 4]
                                               direction = '+'

  25. WeatherExplainer Result Roughly: ”This has been a hot month. Average rainfall this month is greater than usual. So far, rainfall is four inches above average.”

  26. ASSESS: Summarization of Evaluative Opinions

  27. An Abstractive Summarization Pipeline
      Reviews (data input) → extract all information from the input corpus → determine the most relevant information and generate the summary → Summary

  28. ASSESS Testing
      - Input: review sentences tagged with crude-feature evaluations
      - Crude-Feature to User-Defined-Feature (UDF) mapping
      - Simple content selection (sketched below):
        - Group evaluations by UDF
        - Calculate the average evaluation
        - Also include information on the UDF's parent in the hierarchy and the number of evaluations
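     A small sketch of the grouping and averaging step, assuming the tagged input can be reduced to (udf, valence) pairs; the function and field names are illustrative, not ASSESS's actual code.

        from collections import defaultdict

        def average_opinions(evaluations):
            # evaluations: iterable of (udf, valence) pairs extracted from tagged reviews.
            by_udf = defaultdict(list)
            for udf, valence in evaluations:
                by_udf[udf].append(valence)
            # One AverageOpinionMessage-like record per UDF: mean valence + opinion count.
            return {
                udf: {"valence": sum(vals) / len(vals), "numOpinions": len(vals)}
                for udf, vals in by_udf.items()
            }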

  29. Example Message
        AverageOpinionMessage
          numOpinions = 17
          polarity    = '-'
          udf         = 'Universal Remote Control'
          udf_parent  = 'Extra Features'
          valence     = 1.1764705882352942

      12 messages were generated.

  30. Rules
        Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
            (M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity)
            : ConstituentSet(Conjunction, M1, M2) : (2, M1.numOpinions + M2.numOpinions)

        Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
            (M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity)
            : ConstituentSet(Contrast, M1, M2) : (3, M1.numOpinions + M2.numOpinions)

        Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
            (M1.udf == M2.udf_parent and M1.polarity == M2.polarity)
            : ConstituentSet(Explanation, M1, M2) : (5, 0)

        Explanation(Message('AverageOpinionMessage') M1,
                    ConstituentSet(relType='Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
            (M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity)
            : ConstituentSet(DExplanation, M1, M2) : (10, 0)

        Sequence(Message('AverageOpinionMessage')|ConstituentSet() M1,
                 Message('AverageOpinionMessage')|ConstituentSet() M2)
            () : ConstituentSet(Sequence, M1, M2) : (1, 0)

  31. ASSESS Result
      - It works!
      - Evaluation of the resulting DocPlan would say more about the Rules and Content Selection than about the Document Structuring algorithm
      - Was able to handle a larger number of messages and rules
      - 4 of the 5 rules were used
      - Still, only one message type was used

  32. Future Improvements
      - Investigate whether this simple framework can be used to develop more "intelligent" rules for more sophisticated domain models ([Carenini 2008], SEA); this may require changes to the implementation
      - Complete comprehensive documentation and a user manual
      - Submit to NLTK

  33. References
      Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc. Print and online.
      Carenini, G. and Moore, J.D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11): 925-952.
      Carenini, G., Ng, R., and Pauls, A. (2006). Multi-Document Summarization of Evaluative Text. Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.
      FitzGerald, N. (2009). A Complete Pipeline for Semantic Evaluation Summarization. Unpublished project report.
      Lester, J. and Porter, B. (1997). Developing and empirically testing robust explanation generators: the KNIGHT experiments. Computational Linguistics, 23(1): 65-101.
      Mann, W. and Thompson, S. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text, 3: 243-281.
      Marcu, D. (1997). From local to global coherence: A bottom-up approach to text planning. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-1997), 629-635.
      Pitler, E. et al. (2008). Easily Identifiable Discourse Relations. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-24.
      Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering, 3(1): 57-87.
      Reiter, E. and Dale, R. (2000). Building Natural Language Generation Systems (Studies in Natural Language Processing). New York: Cambridge University Press. Print.
      Young, R.M. and Moore, J.D. (1994). DPOCL: A principled approach to discourse planning. Proceedings of the 7th International Workshop on Natural Language Generation, Kennebunkport, ME, June 17–21, 1994, pp. 13-20.
