This paper discusses methodologies and technologies for the passive capture and structuring of multimedia lectures. It examines the Experience-on-Demand (EOD) system and the Classroom 2000 (C2K) project to highlight the differences between invasive and passive capture techniques. The synchronization of audio, video, and data from sources such as whiteboards and projectors is explored. It also addresses the challenges encountered in synchronization, including the handling of timed and untimed data, and introduces solutions for ensuring that multimedia presentations are cohesive and accessible.
Passive Capture and Structuring of Lectures Sugata Mukhopadhyay, Brian Smith Department of Computer Science Cornell University
Introduction • Multimedia presentations are typically authored manually, which is labor-intensive • Experience-on-Demand (EOD, CMU) • Captures & abstracts personal experiences (audio / video) • Synchronization of audio, video & position data
Introduction Contd. • Classroom 2000 (C2K, Georgia Tech) • Authors multimedia documents from live events • Data from whiteboards, cameras, etc. are combined to create multimedia documents for classroom activities • Similarities (EOD & C2K) • Automatic capture • Authoring of multimedia documents
Introduction Contd. • Dissimilarities: • C2K: Invasive capture (capture must be started explicitly), structured environment (specific setting) • EOD: Passive capture, unstructured environment
Motivation • Structured multimedia documents from seminars, talks, or classes • A speaker can walk in, press a button, and give a presentation using blackboards, whiteboards, 35mm slides, overheads, or computer projection • One hour later, the structured presentation is on the web
Overview • Cameras (encoded in MPEG format) • Overview camera (entire lecture) • Tracking camera (hardware-built tracker): tracks the speaker, captures head & shoulders • Speaker uploads slides to a server
Overview Contd. • Video region • RealVideo • Index • Title & duration of the current slide • Synchronized with the video • Prev / Next skip slides • Timeline • Boxes represent the duration of each slide [Figure: interface with video, slides, and timeline regions]
Problems Handled • Synchronization • Transitive (position of an event A on a timeline) • If A<->B is known, B can be added to the same timeline • Synchronization errors accumulate: with error E1 between (A,B) and E2 between (B,C), error(A,C) = E1 + E2 • Collected data • Timed (T-data, e.g. video) • Untimed (U-data, e.g. electronic slides)
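The transitive placement of streams on a common timeline, with errors summing along the chain, can be sketched as follows. This is a minimal illustration with hypothetical helper names and made-up numbers, not the authors' code.

```python
# Minimal sketch of transitive synchronization. Each pairwise link is
# (offset_seconds, error_seconds); chaining A<->B and B<->C places C on
# A's timeline with the worst-case error bounds summed.

def chain(link_ab, link_bc):
    off_ab, err_ab = link_ab
    off_bc, err_bc = link_bc
    return (off_ab + off_bc, err_ab + err_bc)

# Stream B starts 2.0 s after A (known to within 26 ms), and stream C
# starts 1.5 s after B (also within 26 ms).
offset_ac, error_ac = chain((2.0, 0.026), (1.5, 0.026))
# offset 3.5 s, worst-case error 52 ms, matching the E1 + E2 rule above
```

This is why the paper cares about keeping each pairwise error small: bounds grow with every hop in the chain.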
Problems Handled Contd. • Synchronization • Timed-timed synchronization (TTS) • Two video streams • Timed-untimed synchronization (TUS) • Slides with video • Untimed-untimed synchronization (UUS) • Slide titles: parsing the HTML produced by PowerPoint • Automatic Editing • Rule-based structuring of the synchronized data
Timed-Timed Synchronization • Temporal link between streams captured from independent cameras • To solve this, consider one or more synchronization points [Figure: streams V1(t) and V2(t) with offsets Δ1 and Δ2 aligned at a synchronization point]
Timed-Timed Synchronization Contd. • Sync tone: artificial creation of a synchronization point of duration 1 second • Recorded on one channel of each MPEG stream • A sound card is used for tone generation • Later, the positions of the tones in each stream are detected [Figure: the wireless mic receiver feeds one audio channel and the sound card's sync tone the other channel of each camera machine's MPEG audio]
Timed-Timed Synchronization Contd. • Detection of the synchronization tone • Brute-force approach (fully decoding the MPEG audio) • Proposed method • Scale factors indicate the overall volume of packets • Sum the scale factors to estimate volume • Tone detected when the sum exceeds a threshold • Assuming MPEG-2: worst error 26 ms (1152 samples × 22.5 µs) and max error 52 ms • Video at 30 FPS: e < 1/30 seconds
Timed-Timed Synchronization Contd. • Tighter bound (22.5 kHz sampling) • Error ≤ 1/22.5 kHz ≈ 44 µs per sample, far below the 26 ms frame-level bound • For video at 15 FPS, max error 66 ms • Using this on an MPEG Systems stream, a tone in a 70-second stream can be located in under 2 seconds
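The scale-factor idea above can be sketched without any real MPEG parsing: treat each audio frame's summed scale factors as a per-frame loudness value and scan for the first sustained run above a threshold. The numbers below are simulated stand-ins, not actual scale factors.

```python
# Sketch of locating the 1 s sync tone from per-frame loudness estimates.
# In the paper the estimate comes from summing MPEG audio scale factors,
# so the stream never needs full decoding; here the values are simulated.

FRAME_SEC = 1152 / 44100.0   # one 1152-sample MPEG audio frame ≈ 26 ms

def find_tone(volumes, threshold, min_frames):
    """Return the start time of the first run of min_frames loud frames."""
    run_start = None
    for i, v in enumerate(volumes):
        if v >= threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_frames:
                return run_start * FRAME_SEC   # tone start, ±1 frame
        else:
            run_start = None
    return None

# 200 quiet frames, then a ~1 s tone (39 loud frames), then quiet again.
volumes = [5] * 200 + [120] * 39 + [5] * 100
t = find_tone(volumes, threshold=100, min_frames=10)
```

Because only frame-level sums are inspected, the location is accurate to one audio frame (≈26 ms), matching the error bound stated on the previous slide.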
Timed-Untimed Synchronization • Synchronization of the slides with one of the video streams • A tolerance of 0.5 sec is used for the synchronization
Timed-Untimed Synchronization Contd. • Segmentation of slides from the video V(t) • Color histograms fail: slides share the same background, and the video is low-resolution • Feature-based algorithm • Frames are clipped, low-pass filtered, and adaptively thresholded • Let B1 and B2 be two consecutive processed frames
Timed-Untimed Synchronization Contd. • Assumption: slides contain a dark foreground on a light background • Applied to the I-frames of the MPEG video at 0.5 sec intervals • Matching • Matching against the original slides confirms a slide change • Similarity > 95%: a match is declared & the search terminates • Similarity > 90%: the candidate with the highest similarity is returned • Otherwise the frame is too noisy to match
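The candidate-change test between two binarized frames B1 and B2 can be sketched on toy data: count differing pixels and flag a change when the fraction is large. The 20% threshold here is illustrative, not the paper's value.

```python
# Toy sketch of the change test on two binarized frames (0 = light
# background, 1 = dark foreground). A candidate slide change is flagged
# when enough pixels differ between consecutive processed frames.

def changed(b1, b2, ratio=0.20):   # ratio is an illustrative threshold
    diff = sum(p1 != p2 for r1, r2 in zip(b1, b2) for p1, p2 in zip(r1, r2))
    total = len(b1) * len(b1[0])
    return diff / total > ratio

same  = [[0, 1, 0], [1, 1, 0], [0, 0, 0]]
other = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
print(changed(same, same))    # identical frames: no change
print(changed(same, other))   # every pixel flipped: change flagged
```

In the system a flagged change is then confirmed by matching against the uploaded slides, as the bullets above describe.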
Timed-Untimed Synchronization Contd. • Unwarping • The video contains a foreshortened version of each slide • Quadrilateral F -> rectangle (same size as the original slide) • Since the camera & projector are fixed, the corner points of F stay the same • Perspective transform -> rectangle • Bilinear interpolation fills the rectangle
Timed-Untimed Synchronization Contd. • Similarity • Based on the Hausdorff distance • Dilation (radius 3): every pixel within radius 3 of a black pixel of the original binary image is set to black, giving G • b = # of black pixels in the extracted image F • b' = # of black pixels of F that also lie in G • Forward match ratio = b' / b • The reverse match ratio is computed similarly by dilating F and keeping G undilated
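The forward match ratio can be sketched on tiny binary grids. The paper uses a dilation radius of 3; radius 1 is used here so the toy example stays readable, and the grids themselves are made up.

```python
# Sketch of the dilation-based forward match ratio: dilate the original
# slide's black pixels to get G, then count how many black pixels of the
# extracted frame F fall inside G.

def dilate(img, radius=1):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        if 0 <= y + dy < h and 0 <= x + dx < w:
                            out[y + dy][x + dx] = 1
    return out

def forward_match(f, g_orig, radius=1):
    g = dilate(g_orig, radius)
    b = sum(p for row in f for p in row)        # black pixels of F
    b_hit = sum(f[y][x] and g[y][x]             # those also inside G
                for y in range(len(f)) for x in range(len(f[0])))
    return b_hit / b if b else 0.0

original  = [[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
extracted = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
ratio = forward_match(extracted, original)   # one of two pixels lands in G
```

The reverse match ratio is the same computation with the roles swapped: F is dilated and G is left undilated, exactly as the bullet above states.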
Timed-Untimed Synchronization Contd. • Evaluation • 106 slides, 143 transitions • Accuracy 97.2% • Needs to be tuned for dark backgrounds with light foregrounds
Automatic Editing • Combines the captured videos into a single stream • Constraints • Footage from the overview camera must be shown 3 sec before and 5 sec after a slide change • 3 sec < any shot < 25 sec • A heuristic algorithm produces an Edit Decision List (EDL) • Each shot is taken from one video source • Consecutive shots come from different video sources • Shot: start time, duration, which video source • Concatenating the footage of the shots yields the final edited video
Automatic Editing Contd. • Overview-camera shots < 3 sec that are separated from tracking-camera footage are merged • Tracking-camera shots > 25 sec are broken into 5 sec shots
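One plausible reading of these two clean-up rules can be sketched on a shot list. The 3 s / 25 s / 5 s values follow the slides; the merge policy (absorbing a too-short overview shot into the preceding shot) and the shot tuples are illustrative assumptions, not the authors' EDL algorithm.

```python
# Sketch of the two EDL clean-up rules. Each shot is a tuple
# (source, start_sec, duration_sec).

def clean_edl(shots, min_len=3.0, max_len=25.0, piece=5.0):
    out = []
    for src, start, dur in shots:
        if src == "overview" and dur < min_len and out:
            # merge a too-short overview shot into the preceding shot
            psrc, pstart, pdur = out.pop()
            out.append((psrc, pstart, pdur + dur))
        elif src == "tracking" and dur > max_len:
            # break an overlong tracking shot into fixed-size pieces
            t, rem = start, dur
            while rem > 0:
                d = min(piece, rem)
                out.append((src, t, d))
                t, rem = t + d, rem - d
        else:
            out.append((src, start, dur))
    return out

shots = [("tracking", 0.0, 12.0), ("overview", 12.0, 2.0),
         ("tracking", 14.0, 27.0)]
edl = clean_edl(shots)
# the 2 s overview shot is absorbed; the 27 s tracking shot is split
```

Total footage duration is preserved; only shot boundaries move, which is all the concatenation step needs.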
Conclusion • Automatic synchronization and editing systems • Classification of the different kinds of synchronization • Slide change detection for dark foregrounds on light backgrounds (textual parts) • Slide identification confirms slide change detection • Rotation and translation can affect the matching
Future Work • Motion vector analysis and scene cut detection (to trigger a switch to the overview camera) • Automatic enhancement under poor lighting • Using the orientation and position of the speaker for editing • Shots from more cameras • Use of blackboards, whiteboards and transparencies
Looking at Projected Documents: Event Detection & Document Identification
Introduction • Documents play a major role in presentations, meetings, lectures, etc. • Captured as a video stream or as images • Goal: annotation & retrieval using the visible documents • Temporal segmentation of meetings based on (projected) document events: • Inter-document (slide change, etc.) • Intra-document (animation, scrolling, etc.) • Extra-document (pointing sticks, laser beams, etc.) • Identification of the extracted low-resolution document images
Motivation • Detection & identification from low-resolution devices • Extendable to documents on a table • Current focus on projected documents • Captured as a video stream (web-cam)
Slide Change Detection • Presentation slides captured as a video stream • Slides in a slideshow share the same layout, background, pattern, etc. • The web-cam auto-focuses (nearly 400 ms to reach a stable image) • Variations in lighting conditions
Slide Change Detection (Cont'd) [Figures: frames during the auto-focusing period; fading during auto-focusing; different slides with similar text layout]
Slide Change Detection (Cont'd) • Existing methods for scene cut detection • Histograms (color and gray) • Cornell method (Hausdorff distance) • Histogram methods fail due to: a) low resolution b) low contrast c) auto-focusing d) fading • Cornell: uses identification to validate the changes • Fribourg method: slide stability • Assumption: a slide is visible for at least 2 seconds; shorter appearances count as slide skipping
Proposed Slide Change Detection [Figure: frames x0 … xN-1 sampled every 0.5 s; a candidate change at xi triggers a stability check, and the change is confirmed once the slide remains stable for 2 s]
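The stability idea can be sketched as a small scan over sampled frames: a differing frame only counts as a slide change if the new content then stays stable for 2 s (4 samples at 0.5 s). The frame representation and the `differ` predicate are illustrative stand-ins for the real image comparison.

```python
# Sketch of stability-based slide change detection. Frames are sampled
# every 0.5 s; a difference is confirmed as a slide change only if the
# new frame stays stable for 2 s (4 consecutive samples), which filters
# out auto-focusing, fading, and skipped slides.

SAMPLE_SEC = 0.5
STABLE_SAMPLES = 4   # 2 s / 0.5 s

def detect_changes(frames, differ):
    """frames: list of frame stand-ins; differ(a, b) -> True if they differ."""
    changes = []
    i = 1
    while i < len(frames):
        if differ(frames[i - 1], frames[i]):
            stable_upto = i + STABLE_SAMPLES
            if stable_upto <= len(frames) and all(
                    not differ(frames[i], frames[j])
                    for j in range(i + 1, stable_upto)):
                changes.append(i * SAMPLE_SEC)   # confirmed change time
                i = stable_upto
                continue
        i += 1
    return changes

# Toy sequence: slide A, one blurry transition frame X, then stable slide B.
frames = ["A", "A", "A", "X", "B", "B", "B", "B", "B"]
times = detect_changes(frames, lambda a, b: a != b)
# the unstable frame X is rejected; the change is reported at slide B
```

The blurry frame X never passes the stability check, so only the genuine transition to slide B is reported.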
Ground-Truth Preparation • Based on SMIL • 300 slideshows collected from the web • Automatic generation of the SMIL file with a random duration for each slide • Contains slide id, start time, stop time and type (skip or normal)
Evaluation • Ground-Truth: SMIL XML (1) • Slideshow video → Slide Change Detection → XML (2) • Evaluation: compare (1) & (2) • Metrics used: Recall (R), Precision (P), F-measure (F)
An example Ground-Truth SMIL file:
<slide id="1" imagefile="Slide1.JPG" st="0000000" et="9.641000" type="normal" />
<slide id="2" imagefile="Slide2.JPG" st="9.641000" et="12.787199" type="normal" />
<slide id="3" imagefile="Slide15.JPG" st="12.787199" et="13.775500" type="skip" />
<slide id="4" imagefile="Slide11.JPG" st="13.775500" et="14.341699" type="skip" />
<slide id="5" imagefile="Slide25.JPG" st="14.341699" et="15.885400" type="skip" />
<slide id="6" imagefile="Slide20.JPG" st="15.885400" et="16.476199" type="skip" />
<slide id="7" imagefile="Slide9.JPG" st="16.476199" et="18.094100" type="skip" />
<slide id="8" imagefile="Slide3.JPG" st="18.094100" et="23.160102" type="normal" />
<slide id="9" imagefile="Slide4.JPG" st="23.160102" et="26.523102" type="normal" />
…
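The comparison step can be sketched directly: match detected change frames against ground-truth frames within a tolerance, then compute R, P and F. The frame numbers below are made up for illustration; only the metric definitions follow the slides.

```python
# Sketch of the evaluation: greedily match each detected change to an
# unused ground-truth change within a frame tolerance, then compute
# recall, precision, and F-measure.

def evaluate(truth, detected, tol=1):
    matched = set()
    tp = 0
    for d in detected:
        hit = next((t for t in truth
                    if abs(t - d) <= tol and t not in matched), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    recall = tp / len(truth) if truth else 0.0
    precision = tp / len(detected) if detected else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f

truth    = [10, 40, 90, 130]          # ground-truth change frames
detected = [11, 40, 70, 129]          # detector output (one false alarm)
r, p, f = evaluate(truth, detected, tol=1)
```

Widening `tol` (e.g. from 1 frame to 4 frames) can only increase the matches, which is why the 4-frame-tolerance scores on the next slides are uniformly higher.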
Results • 1-frame tolerance: R:0.80, P:0.83, F:0.81; 4-frame tolerance: R:0.92, P:0.96, F:0.93 • Comparison at 1-frame tolerance: Fribourg (R:0.84, P:0.82, F:0.83), Cornell (R:0.40, P:0.21, F:0.23), Color Hist (R:0.07, P:0.04, F:0.05), Gray Hist (R:0.18, P:0.12, F:0.13)
Results (Cont'd) • Comparison at 4-frame tolerance: Fribourg (R:0.93, P:0.91, F:0.92), Cornell (R:0.80, P:0.51, F:0.54), Color Hist (R:0.13, P:0.09, F:0.10), Gray Hist (R:0.27, P:0.17, F:0.19)
Low-resolution Document Identification • Difficulties in identification • Hard to use existing document analysis systems (DAS) at 50-100 dpi • OCR performance is very poor • Hard to extract the complete layout (physical, logical) • Rotation, translation and resolution affect global image matching • Captured images vary: lighting, flash, distance, auto-focusing, motion blur, occlusion, etc.
Proposed Document Identification • Based on a Visual Signature • Shallow layout analysis with zone labeling • Hierarchically structured using the features' priority • Identification: matching of signatures • Matching: simple heuristics that follow the hierarchy of the signature
Visual Signature Extraction • Images scaled to a common resolution; RLSA (Run-Length Smoothing Algorithm) • Zone labeling (text, image, solid bars, etc.) • Block separation via projection profiles • Text blocks (one line per block) • Bullet and vertical text line extraction
Visual Signature Extraction Contd. • Feature vector for images, bars (horizontal and vertical) and bullets • Feature vector for each text line and each bar with text (horizontal and vertical) [Figure: bounding boxes of the various features]
Structuring the Visual Signature • Hierarchy depends on the extraction process & real-world slideshows • Narrows the search path during matching
<VisualSign>
 <BoundingBox NoOfBb="10">
  <Text NoOfLine="7">
   <HasHorizontalText NoOfSentence="7">
    <S y="53" x="123" width="436" height="25" NoOfWords="4" PixelRatio="0.40" /> …
   </HasHorizontalText>
   <HasVerticalText NoOfSentence="0" />
  </Text>
  <HasImage NoOfImage="3">
   <Image y="1" x="16" width="57" height="533" PixelRatio="0.88" /> …
  </HasImage>
  <HasBullet NoOfBullets="2">
   <Bullet y="122" x="141" width="12" height="12" PixelRatio="1.0" /> …
  </HasBullet>
  <Line NoOfLine="0"><HasHLine NoOfLine="0" /><HasVLine NoOfLine="0" /></Line>
  <BarWithText NoOfBar="0">
   <HBarWithText NoOfBar="0" /><VBarWithText NoOfBar="0" />
  </BarWithText>
 </BoundingBox>
</VisualSign>
Structured Signature-based Matching • Search technique: • Takes advantage of the hierarchical structure of the visual signature • Higher-level features are compared first; lower-level features are matched only when the higher levels agree [Figure: tree representation of the features in the visual signature, with branches for bounding boxes, text (H-Text, V-Text), images, bullets, lines (H-Line, V-Line) and bars with text (HBarText, VBarText), and leaf feature vectors f1 … f8]
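The coarse-to-fine search can be sketched with a simplified signature: compare cheap top-level counts first and descend to per-line features only when the counts agree, so most repository slides are rejected early. The dictionary layout, field names, and the 5-pixel tolerance are illustrative stand-ins for the paper's XML signature and heuristics.

```python
# Sketch of hierarchical signature matching: reject on top-level counts
# before ever touching the per-text-line bounding boxes.

def match(query, candidate):
    # Level 1: top-level counts (number of text lines, images, bullets).
    for key in ("lines", "images", "bullets"):
        if query["counts"][key] != candidate["counts"][key]:
            return False          # pruned without reading fine features
    # Level 2: per-text-line bounding boxes (x, y, w, h), small tolerance.
    for qa, ca in zip(query["text_lines"], candidate["text_lines"]):
        if any(abs(q - c) > 5 for q, c in zip(qa, ca)):
            return False
    return True

slide_a = {"counts": {"lines": 2, "images": 1, "bullets": 2},
           "text_lines": [(123, 53, 436, 25), (123, 90, 400, 25)]}
query   = {"counts": {"lines": 2, "images": 1, "bullets": 2},
           "text_lines": [(121, 55, 434, 24), (125, 88, 402, 26)]}
other   = {"counts": {"lines": 7, "images": 0, "bullets": 0},
           "text_lines": []}
# query matches slide_a within tolerance; other is pruned at level 1
```

Pruning at the count level is what makes the approach fast relative to global image matching: a 300-slide repository is mostly filtered before any geometry is compared.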
Matching Performance Results • Evaluation based on Recall and Precision • ~200 slide images (web-cam) queried against a repository of 300 slides (R:0.94, P:0.80, F:0.86)
Conclusion • Proposed slide change detection • Automatic evaluation • Performance: best among the compared state-of-the-art methods • Lower time and computational complexity • Overcomes the auto-focusing and fading behavior of the web-cam • Accuracy improved over Cornell, especially at low tolerance • Could be used for meeting indexing: high precision
Conclusion • Proposed slide identification: • Based on the Visual Signature • No need for any classifier • Fast: only signature matching (no global image matching) • Works without OCR • Could be helpful for real-time applications (translation, mobile OCR, etc.) • Applicable to digital cameras and mobile phones • Finally: documents as a way of indexing & retrieval
Future Work • Evaluation of animations • Detection and identification of pointed-at and partially occluded documents • Identification with complex background structure • Evaluation on digital cameras and mobile phones • Adding background pattern and color information to the Visual Signature • Identification of documents on a table
Possible Projects • Deformation correction (Perspective, Projective, etc.) • Automatic detection of projected documents in the captured video • Detection of occluded objects • Background pattern recognition