CONFUCIUS: an Intelligent MultiMedia storytelling interpretation & presentation system

Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems, Faculty of Informatics, University of Ulster, Magee

Objectives of CONFUCIUS
• To interpret natural language story and movie (drama) script input and to extract conceptual semantics from the natural language
• To generate 3D animation and virtual worlds automatically, with speech and non-speech audio
• To integrate the above components to form an intelligent multimedia storytelling system for presenting multimodal stories

CONFUCIUS’ context diagram
[Figure: context diagram. Storywriter/playwright side: story in natural language, movie/drama script, tailored menu for script input. User/story listener side: 3D animation, speech (dialogue), non-speech audio. CONFUCIUS sits between the two.]

Literature review

Previous systems
• Schank’s CD Theory (1972)
  • Primitives & scripts
  • SAM & PAM
• Automatic Text-to-Graphics Systems
  • WordsEye (Coyne & Sproat, 2001)
  • ‘Micons’ and CD-based language animation (Narayanan et al. 1995)
  • Spoken Image (Ó Nualláin & Smith, 1994) & its successor SONAS (Kelleher et al. 2000)

MultiModal interactive storytelling
• AesopWorld
• KidsRoom
• Larsen & Petersen’s Interactive Storytelling
• Oz
• Computer games
• Embodied intelligent agents
  • divergence on agents’ behavior production
  • BEAT (Cassell et al., 2000)
  • Gandalf
  • PPP persona

Architecture of CONFUCIUS
[Figure: architecture diagram. Input: natural language stories from the script writer. Modules: script parser, Natural Language Processing, Text To Speech, sound effects, animation generation, synchronizing & fusion. Knowledge base: language knowledge (lexicon, grammar, etc.), visual knowledge (3D graphic library of prefabricated objects built with 3D authoring tools), and the mapping between them. Intermediate data: semantic representations. Output: 3D world with audio in VRML.]

MultiModal semantic representation
[Figure: levels of multimodal semantics.]
• High-level multimodal semantic representation: XML/frame-based, media-independent
• Intermediate level: visual media-dependent representation and audio media-dependent representation
• Modalities: language modality, visual modality, non-speech audio modality

Knowledge base of CONFUCIUS
• Language knowledge
  • Semantic knowledge - lexicons (e.g. WordNet)
  • Syntactic knowledge - grammars
  • Statistical models of language
  • Associations between words
• Visual knowledge
  • Object model (nouns)
    • Functional information
    • Internal coordinate axes (for spatial reasoning)
    • Associations between objects
  • Event model (event verbs, describes the motion of objects)
• World knowledge
• Spatial & qualitative reasoning knowledge

Categories of events
• Atomic entities
  • Change physical location such as position and orientation, e.g. “bounce”, “turn”
  • Change intrinsic attributes such as shape, size, color, and texture, e.g. “bend”, and even visibility, e.g. “disappear”, “fade” (in/out)
• Non-atomic entities
  • Non-character events
    • Two or more individual objects fuse together, e.g. “melt” (in)
    • One object divides into two or more individual parts, e.g. “break” (into pieces)
    • Change sub-components (their position, size, color), e.g. “blossom”
    • Environment events (weather verbs), e.g. “snow”, “rain”
  • Character events
    • Action verbs
      • Intransitive verbs
      • Transitive verbs
    • Non-action verbs (stative, emotion, possession, mental activities, cognition & perception)
    • Idioms & metaphor verbs

Categories of action verbs
• Intransitive verbs
  • Biped kinematics, e.g. “walk”, “swim”, & other motion models like “fly”
  • Facial expressions, e.g. “laugh”, “anger”
  • Lip movement, e.g. “speak”, “say”
• Transitive verbs
  • single object, e.g. “throw”, “push”, “kick”
  • multiple objects
    • direct and indirect objects, e.g. “give”, “pass”, “show”
    • indirect object & the instrument, e.g. “cut”, “hammer”
(Slide annotation: involve speech modality)

Hierarchical structure of predicates
• 3rd level: touch()
• 2nd level: moveToward(), alignMiddle(), alignTouch(), alignMax(), alignMin(), faceTo()
• Atomic level: move(), moveTo(), rotate(), scale(), squash()

Basic predicate-arguments
1) move(obj, xInc, yInc, zInc)
2) moveTo(obj, loc)
3) moveToward(obj, loc, displacement)
4) rotate(obj, xAngle, yAngle, zAngle)
5) faceTo(obj1, obj2)
6) alignMiddle(obj1, obj2, axis)
7) alignMax(obj1, obj2, axis)
8) alignMin(obj1, obj2, axis)
9) alignTouch(obj1, obj2, axis)
10) touch(obj1, obj2, axis)
11) scale(obj, rate)
12) squash(obj, rate, axis)
13) group(x, [y|_], newObj)
14) ungroup(xyList, x, yList)
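The layering of predicates can be illustrated with a minimal Python sketch in which a 2nd-level predicate is expressed purely in terms of atomic ones. This is an illustration under assumptions, not the CONFUCIUS implementation: the `Obj` class is hypothetical, and the meaning given to `displacement` (a distance travelled along the straight line to `loc`, stopping at `loc`) is a guess, since the slide only lists the argument names.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical scene object: a name and an (x, y, z) position."""
    name: str
    pos: tuple = (0.0, 0.0, 0.0)


def move(obj, x_inc, y_inc, z_inc):
    """Atomic predicate move(obj, xInc, yInc, zInc): shift by increments."""
    x, y, z = obj.pos
    obj.pos = (x + x_inc, y + y_inc, z + z_inc)


def move_to(obj, loc):
    """Atomic predicate moveTo(obj, loc): place the object at loc."""
    obj.pos = tuple(float(c) for c in loc)


def move_toward(obj, loc, displacement):
    """2nd-level predicate moveToward(obj, loc, displacement).

    Assumed semantics: travel `displacement` units along the straight
    line from the current position to loc, never overshooting loc.
    Implemented only with the atomic predicates above.
    """
    delta = [target - current for target, current in zip(loc, obj.pos)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist == 0 or dist <= displacement:
        move_to(obj, loc)
        return
    move(obj, *(d / dist * displacement for d in delta))
```

For example, starting from the origin, `move_toward(ball, (0, 10, 0), 4)` leaves the ball at `(0.0, 4.0, 0.0)` under these assumed semantics.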

Visual definition & word sense
• word sense -- minimal complete unit of meaning in the language modality
• visual definition entry -- minimal complete unit of meaning in the visual modality
[Figure: mapping diagram linking verb, word sense and visual definition entry through polysemy, synonymy and mapping relations, labelled with one/many cardinalities.]

Example: “close” (a door)
• a normal door (rotation on y axis)
• a sliding door (moving on x axis)
• a rolling shutter door (a combination of rotation on x axis and moving on y axis)
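The “close” example suggests that one word sense can map to several visual definition entries, selected by the kind of object involved. A small Python sketch of such a lookup follows. The table layout, the `(primitive, axis)` tuples and the function names are assumptions made for illustration; only the three door types and their motions come from the slide.

```python
# Hypothetical lookup table: (word sense, object type) -> visual definition,
# where a visual definition is a list of (primitive, axis) steps.
VISUAL_DEFINITIONS = {
    ("close", "normal door"): [("rotate", "y")],
    ("close", "sliding door"): [("move", "x")],
    ("close", "rolling shutter door"): [("rotate", "x"), ("move", "y")],
}


def visual_definition(word_sense, object_type):
    """Return the visual definition entry for one word sense and object type."""
    try:
        return VISUAL_DEFINITIONS[(word_sense, object_type)]
    except KeyError:
        raise LookupError(
            f"no visual definition for {word_sense!r} applied to {object_type!r}"
        ) from None


def visual_definitions_for(word_sense):
    """Return every visual definition entry mapped to one word sense."""
    return {
        obj: steps
        for (sense, obj), steps in VISUAL_DEFINITIONS.items()
        if sense == word_sense
    }
```

Here `visual_definitions_for("close")` returns three entries, one per door type, which is the one-to-many side of the mapping.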

Implementation: semantics to VRML
Example: “A ball is bouncing”

(a) Visual definition of “bounce”

    bounce(ball):- [moveTo(ball, [0,0,0]), moveTo(ball,[0,20,0])]L.

(b) VRML code of a static ball

    DEF ball Transform {
      translation 0 0 0
      children [
        Shape {
          appearance Appearance { material Material {} }
          geometry Sphere { radius 5 }
        }
      ]
    }

(c) Output VRML code of a bouncing ball

    DEF ball Transform {
      translation 0 0 0
      children [
        DEF ball-TIMER TimeSensor { loop TRUE cycleInterval 0.5 },
        DEF ball-POS-INTERP PositionInterpolator {
          key [0, 0.5, 1]
          keyValue [0 0 0, 0 20 0, 0 0 0]
        },
        Shape {
          appearance Appearance { material Material {} }
          geometry Sphere { radius 5 }
        }
      ]
      ROUTE ball-TIMER.fraction_changed TO ball-POS-INTERP.set_fraction
      ROUTE ball-POS-INTERP.value_changed TO ball.set_translation
    }
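The step from the static ball in (b) to the bouncing ball in (c) amounts to adding a TimeSensor, a PositionInterpolator and two ROUTE statements around the existing Shape. The Python sketch below generates that wrapper from a list of positions. It is only an illustration: the slides state that CONFUCIUS uses Java, the function name `animate_position` is hypothetical, and evenly spaced keys are an assumption that happens to reproduce the `key [0, 0.5, 1]` of the example.

```python
BALL_SHAPE = (
    "Shape { appearance Appearance { material Material {} } "
    "geometry Sphere { radius 5 } }"
)


def animate_position(name, shape_vrml, positions, cycle_interval=0.5, loop=True):
    """Wrap a static VRML shape in a Transform animated through `positions`.

    Keys are spaced evenly over [0, 1] (an assumption), one per position.
    """
    if len(positions) < 2:
        raise ValueError("need at least two positions to animate")
    last = len(positions) - 1
    keys = ", ".join(format(i / last, "g") for i in range(len(positions)))
    values = ", ".join(
        " ".join(format(c, "g") for c in pos) for pos in positions
    )
    start = " ".join(format(c, "g") for c in positions[0])
    timer = name + "-TIMER"
    interp = name + "-POS-INTERP"
    lines = [
        "DEF " + name + " Transform {",
        "  translation " + start,
        "  children [",
        "    DEF " + timer + " TimeSensor { loop "
        + ("TRUE" if loop else "FALSE")
        + " cycleInterval " + format(cycle_interval, "g") + " },",
        "    DEF " + interp + " PositionInterpolator {",
        "      key [" + keys + "]",
        "      keyValue [" + values + "]",
        "    },",
        "    " + shape_vrml,
        "  ]",
        "  ROUTE " + timer + ".fraction_changed TO " + interp + ".set_fraction",
        "  ROUTE " + interp + ".value_changed TO " + name + ".set_translation",
        "}",
    ]
    return "\n".join(lines)


# The two moveTo targets of "bounce", plus a return to the start so the
# looped animation is continuous (as in the example's keyValue list).
bounce_positions = [(0, 0, 0), (0, 20, 0), (0, 0, 0)]
bouncing_ball = animate_position("ball", BALL_SHAPE, bounce_positions)
```

Printing `bouncing_ball` yields a Transform with the same keys, key values and routes as listing (c), differing only in whitespace.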

Comparison of intelligent multimedia systems

Software Analysis
• Java programming language
  • parsing intermediate representation
  • changing VRML code to create/modify animation
  • integrating modules
• Natural language processing tools
  • GATE (pre-processing)
  • PC-PARSE (morphological and syntactic analysis)
  • WordNet (lexicon, semantic inference)
• 3D graphic modelling
  • existing 3D models on the Internet
  • 3D Studio Max (props & stage)
  • VRML (Virtual Reality Modelling Language) 97, H-anim 2001 spec.
• The Actors – using embodied agents
  • Microsoft Agent (the narrator and minor actors)
  • Character Studio, Internet Character Animator (protagonists)

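Reuse NLP toolkits
[Figure: natural language processing modules and the toolkits they reuse.]
• Modules: pre-processing, part-of-speech tagger, coreference resolution, morphological parser, syntactic parser, semantic inference, temporal reasoning
• Resources: lexicon & morphological rules, features
• Toolkits: GATE 2.0, PC-PARSER, WordNet 1.6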

Contribution & prospective applications
Contribution
• multimodal semantic representation of natural language
• automatic animation generation
• multimodal fusion and coordination
Prospective practical applications
• Children’s education
• Multimedia presentation
• Movie/drama production
• Script writing
• Computer games
• Virtual Reality

Conclusion
• The objectives of CONFUCIUS address challenging problems in language visualisation:
  • formalizing the meaning of action verbs and states
  • mapping language primitives to visual primitives
  • building a reusable ‘common sense’ knowledge base for other systems
  • sophisticated spatial and temporal reasoning
• Representing stories by temporal multimedia requires significant coordination

Project schedule
