Document Image Analysis Lecture 12b: Integrating other info

Presentation Transcript


  1. Document Image Analysis Lecture 12b: Integrating other info Richard J. Fateman Henry S. Baird University of California – Berkeley Xerox Palo Alto Research Center UC Berkeley CS294-9 Fall 2000

  2. Srihari/Hull/Choudhari (1982): Merge sources • Bottom-up refinement: transition probabilities at the character-sequence level • Top-down process based on searching in a lexicon • Standard (by now) presentation of the usual methods • Viterbi algorithm and variations (see the sketch below) • Trie representation of the dictionary
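To make the bottom-up step concrete, here is a minimal sketch of character-level Viterbi decoding in Python; it is illustrative, not the authors' implementation. The emission scores stand in for an OCR classifier's confidences, the bigram table for the character-sequence transition probabilities, and all names and numbers are invented.

    # Character-level Viterbi sketch (illustrative; not the paper's code).
    # obs_scores[i][c] ~ P(observed glyph at position i | true char c);
    # trans[(p, c)]    ~ P(next char c | previous char p).
    import math

    def viterbi(obs_scores, trans, alphabet):
        """Return the most probable character string under bigram statistics."""
        # best[c] = (log-probability of the best path ending in c, that path)
        best = {c: (math.log(obs_scores[0].get(c, 1e-9)), c) for c in alphabet}
        for scores in obs_scores[1:]:
            emit = {c: math.log(scores.get(c, 1e-9)) for c in alphabet}
            best = {c: max((best[p][0] + math.log(trans.get((p, c), 1e-9)) + emit[c],
                            best[p][1] + c)
                           for p in alphabet)
                    for c in alphabet}
        return max(best.values())[1]

    # The classifier alone slightly prefers '1' in the middle position, but
    # the transition statistics strongly favor 'l' after 'c', so 'clear' wins.
    alphabet = ['c', 'l', '1', 'e', 'a', 'r']
    obs = [{'c': 0.9}, {'l': 0.4, '1': 0.6}, {'e': 0.9}, {'a': 0.9}, {'r': 0.9}]
    trans = {('c', 'l'): 0.05, ('c', '1'): 1e-4, ('l', 'e'): 0.05,
             ('1', 'e'): 1e-4, ('e', 'a'): 0.03, ('a', 'r'): 0.04}
    print(viterbi(obs, trans, alphabet))   # -> clear

The trie mentioned on the slide serves the top-down side: walking dictionary prefixes during decoding prunes character paths that cannot extend to any lexicon word.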

  3. Tao Hong (1995)

  4. Verifying recognition!

  5. Lattice-based matchings…

  6. Word collocation: the idea • Given the choice [ripper, rover, river], look at +/- ten words on each side. • If you find “boat” there, choose “river” (see the sketch below). • Useful when recognition is poor (<60% correct), boosting it to >80% • Not very useful for improving highly reliable recognition (it may even degrade it)
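A small sketch of that window test, in Python; the sentence, candidate list, and collocation table below are invented for illustration and are not from the lecture.

    # +/-10-word collocation vote (illustrative data).
    def choose_by_collocation(words, i, candidates, collocates, window=10):
        """Pick the candidate collocating with the most words in the window
        around position i; on a tie, keep the first (top) choice."""
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        return max(candidates,
                   key=lambda cand: sum((cand, w) in collocates for w in context))

    words = "the boat drifted down the XXXXX at dawn".split()
    collocates = {("river", "boat"), ("river", "drifted")}   # hypothetical table
    print(choose_by_collocation(words, 5, ["ripper", "rover", "river"], collocates))
    # -> river ("boat" in the window tips the choice)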

  7. Basis for collocation data Word collocation is measured by mutual information: P(x,y) is the probability of x and y occurring within a given distance of each other in a corpus, and P(x) and P(y) are, respectively, the probabilities of x and of y occurring in the corpus (probabilities estimated by frequencies). Measure this over a reference corpus. In the target text, repeatedly re-rank the candidates based on the current top choices until no more changes occur.
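For reference, the association measure conventionally meant by “word collocation = mutual information” is pointwise mutual information, with each probability estimated from corpus frequencies:

    I(x, y) = \log_2 \frac{P(x, y)}{P(x) \, P(y)}

A large positive value says x and y co-occur within the window far more often than chance would predict, which is what lets “boat” vote for “river”.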

  8. Using Word Collocation via a Relaxation Algorithm The example sentence is “Please show me where Hong Kong is!” (see the sketch below)
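Here is a hedged sketch of that relaxation loop, assuming each position carries a ranked candidate list and that mi(a, b) returns the collocation score of a word pair; the structure is assumed, not taken from Hong's code.

    # Relaxation re-ranking over per-position candidate lists (assumed structure).
    def relax(candidate_lists, mi, window=10, max_passes=20):
        top = [cands[0] for cands in candidate_lists]  # start from the 1-best
        for _ in range(max_passes):
            changed = False
            for i, cands in enumerate(candidate_lists):
                neighbors = top[max(0, i - window):i] + top[i + 1:i + 1 + window]
                best = max(cands, key=lambda c: sum(mi(c, n) for n in neighbors))
                if best != top[i]:
                    top[i], changed = best, True
            if not changed:  # top choices stable: relaxation has converged
                break
        return top

On the slide's example, a strong “Hong”–“Kong” collocation pulls both positions toward the correct reading over successive passes.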

  9. Results on collocation

  10. Lattice Parsing

  11. Back to the flowchart…

  12. Not very encouraging

  13. Experimental results (Hong, 1995) • Word types from WordNet • Home-grown parser • Data from the Wall Street Journal and other sources • Perhaps 80% of sentences could be parsed, though not all of them correctly • The cost was substantial: minutes to parse a sentence, given the various choices of word identification.
