
Adaptive Computer Interfaces







  1. Adaptive Computer Interfaces: Capability Augmentation using language modelling

  2. Table of Contents • Entropy or Information content of text • Prediction by Partial Matching (PPM) • Dasher • Language Models • Dirichlet Model • Dealing with redundancies

  3. Entropy • F_N = −∑_{i,j} p(b_i, j) log₂ p_{b_i}(j) • F_N is the information content or entropy due to statistics extending over N adjacent letters of text, • b_i is a block of N−1 letters (an (N−1)-gram), • j is an arbitrary letter following b_i, • p(b_i, j) is the probability of the N-gram b_i j, and • p_{b_i}(j) = p(b_i, j) / p(b_i) is the conditional probability of letter j after the block b_i.
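As a rough illustration, F_N can be estimated directly from empirical N-gram counts of a text. A minimal sketch (the function name and sample string are illustrative, not from the slides):

```python
import math
from collections import Counter

def ngram_entropy(text, n):
    """Estimate F_N = -sum_{b,j} p(b, j) * log2 p_b(j) from empirical
    N-gram counts, where b ranges over (N-1)-letter contexts."""
    if n == 1:
        counts = Counter(text)
        total = len(text)
        return -sum(c / total * math.log2(c / total) for c in counts.values())
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(g[:-1] for g in ngrams.elements())
    total = sum(ngrams.values())
    return -sum(c / total * math.log2(c / contexts[g[:-1]])
                for g, c in ngrams.items())

sample = "the theme there and then the theory"
print(ngram_entropy(sample, 1))  # F_1: unigram entropy per character
print(ngram_entropy(sample, 2))  # F_2: digram (conditional) entropy per character
```

On real corpora the estimates fall as N grows, since longer contexts leave less uncertainty about the next letter.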

  4. Entropy of English Language • F_0 = log₂ 27 ≈ 4.75 bits per character, which is the entropy if all the characters in the string are taken to be independent and equiprobable (0-gram), i.e., not considering the syntax and semantics of the language. • If we consider the frequency of occurrence of each character, F_1 ≈ 4.03 bits per character. • And the di-gram approximation gives F_2 ≈ 3.32 bits per character. • For large N, the entropy of the N-gram model was estimated to be around 1 bit per character (Shannon, 1951).
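The 0-gram figure is just the logarithm of the alphabet size (26 letters plus space), which a one-line check confirms:

```python
import math

# 0-gram entropy: 27 equiprobable symbols (26 letters + space)
f0 = math.log2(27)
print(f"F_0 = {f0:.2f} bits per character")
```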

  5. Entropies in Different Language Models

  6. Text Prediction • A popular method for text prediction/compression is Prediction by Partial Matching (PPM). • Finite-context models use the preceding few symbols (the context) to estimate the probability of the next one. • The number of preceding symbols, i.e. the length of the context used, is the order of the model. An order-1 model corresponds to a bigram model. • An order-0 model uses the marginal frequencies of symbols. An order −1 model has a uniform distribution over symbols.

  7. Prediction by Partial Matching • Blending: p(φ) = ∑_{o=−1}^{m} w_o p_o(φ), • where p_o(φ) is the adaptive probability assigned to symbol φ by the finite-context model of order o, and • w_o is the weight assigned to that model. • The weights are given by w_o = (1 − e_o) ∏_{i=o+1}^{m} e_i, where the e_i are the escape probabilities. • When PPM encounters a novel character which has not been seen in the particular context, it “escapes” to the context model of order one less.
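A minimal sketch of the blending rule, assuming each order supplies a within-context distribution and an escape probability (the function, data structures, and numbers here are illustrative, not PPM's actual bookkeeping):

```python
def blended_prob(symbol, orders):
    """orders: (probs, escape) pairs from the highest order downwards.
    Implements p(s) = sum_o w_o * p_o(s), where w_o = (1 - e_o) times the
    product of the escape probabilities e_i of all higher orders."""
    weight = 1.0   # running product of higher-order escape probabilities
    total = 0.0
    for probs, escape in orders:
        total += weight * (1.0 - escape) * probs.get(symbol, 0.0)
        weight *= escape
    return total

alphabet = [chr(c) for c in range(ord('a'), ord('z') + 1)] + [' ']
orders = [
    ({'a': 0.75, 'b': 0.25}, 0.2),          # order 1: counts from the context
    ({s: 1 / 27 for s in alphabet}, 0.0),   # lowest order: uniform fallback
]
print(blended_prob('a', orders))  # 0.8 * 0.75 + 0.2 * (1/27)
```

Note that the blended probabilities still sum to one over the alphabet, because the escaped-away mass 0.2 is exactly redistributed by the lower-order model.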

  8. Escape Probabilities • One method proposed by Moffat is: e_o = q_o / (C_o + q_o) and p_o(φ) = c_o(φ) / (C_o + q_o), • where q_o is the number of different symbols that have occurred in some context of order o, • c_o(φ) is the number of times that the symbol φ occurs in the current context of order o, and • C_o is the total number of times the context has been seen.
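Moffat's estimator translates directly into counts; a small sketch (function name invented for illustration):

```python
from collections import Counter

def moffat_probs(context_symbols):
    """Given the symbols observed so far in one context, return
    (symbol probabilities, escape probability) using Moffat's method:
    p(escape) = q / (C + q), p(s) = c(s) / (C + q)."""
    counts = Counter(context_symbols)
    C = sum(counts.values())   # total times the context has been seen
    q = len(counts)            # number of distinct symbols seen
    return {s: c / (C + q) for s, c in counts.items()}, q / (C + q)

probs, escape = moffat_probs("abracadabra")
print(probs['a'], escape)  # 5/16 and 5/16: 5 'a's, 5 distinct symbols, 11 total
```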

  9. Data Entry • A conventional QWERTY keyboard input requires a selection from about 80 keys, which corresponds to an information content of log₂ 80 ≈ 6.3 bits per gesture. • We have seen that the entropy of the English language is about 1 bit per character. • Hence the conventional input method is inefficient and has the potential to be improved by incorporating language models. • PPM can compress most English text to around 2 bits per character, and hence can bring the information required per gesture close to this figure when used in dynamic input methods.
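The arithmetic behind the 6.3-bit figure:

```python
import math

bits_per_gesture = math.log2(80)  # one choice out of ~80 keys
entropy_per_char = 1.0            # large-N estimate for English (Shannon)
print(f"{bits_per_gesture:.1f} bits per gesture vs ~{entropy_per_char:.0f} "
      f"bit of information per character")
```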

  10. Dasher • Dasher is an input interface intended for writing short documents like e-mails and memos. • It incorporates language modelling and uses continuous two-dimensional gestures from devices such as a mouse, touch screen, or eye-tracker. • The interface consists of rectangles labelled with characters, with size proportional to the frequency of the characters in the English language. There are 27 of them, one extra for space. • The user writes a letter by making a gesture towards that letter's rectangle. • The point of view then zooms towards this rectangle. As the rectangles get larger, possible extensions of the written string appear within the rectangle we are moving towards.

  11. Dynamics of Dasher • For an alphabet {a_1, …, a_I}, the real line [0,1) is divided into I intervals of lengths equal to the probabilities P(x_1 = a_i). • Each interval is subdivided recursively in the same way, so the length of the interval for a string is the product of the conditional probabilities of its characters, e.g. the interval for a_i a_j has length P(x_1 = a_i) P(x_2 = a_j | x_1 = a_i). • These probabilities are estimated using PPM. • After analyzing its initial performance, horizontal and vertical display modifications were employed.
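This geometry is essentially arithmetic coding run in reverse: the interval occupied by a string shrinks as the product of the conditional probabilities of its characters. A toy sketch (two-letter alphabet and probabilities chosen purely for illustration):

```python
def string_interval(steps):
    """steps: list of (probs, symbol) pairs, where probs is the conditional
    distribution over the alphabet at that position and symbol is the
    character written.  Returns the [lo, hi) interval for the string; its
    length hi - lo equals the product of the conditional probabilities."""
    lo, hi = 0.0, 1.0
    for probs, symbol in steps:
        width = hi - lo
        offset = 0.0
        for s in sorted(probs):       # fixed alphabetical layout, as in Dasher
            if s == symbol:
                lo, hi = lo + offset * width, lo + (offset + probs[s]) * width
                break
            offset += probs[s]
    return lo, hi

# Hypothetical probabilities: P(a) = 0.6, then P(a | a) = 0.8
lo, hi = string_interval([({'a': 0.6, 'b': 0.4}, 'a'),
                          ({'a': 0.8, 'b': 0.2}, 'a')])
print(hi - lo)  # ≈ 0.48 (= 0.6 * 0.8)
```

Zooming into a rectangle in Dasher corresponds to magnifying the interval of the string written so far.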

  12. Demonstration DASHER

  13. Evaluation • Users were able to achieve typing rates of up to 34 words per minute, compared with QWERTY keyboard rates of 40-60 words per minute and 26 words per minute on a telephone keypad. • The majority of users have an error rate of less than 5% when using Dasher, while keyboard error rates vary from 2% to 16%. • After applying features like color contrast, an expert user was able to write at a rate of 228 characters per minute (6.5 bits per second), 1.3 times faster than performance under dictation at 170 characters per minute (4.8 bits per second).

  14. Language Modelling: Dirichlet Model • Notation and problem definition: • D = (w_1, w_2, …, w_T), i.e. a string of T words. • F_i = number of times word i occurs. • F_{j|i} = number of times word j immediately follows word i. • We also define a matrix of parameters Q with entries q_{j|i} = P(w_{t+1} = j | w_t = i). A single row of Q, the probability vector for transitions from state i, is denoted by q_i. The task is to infer these parameters from the data, D.
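The count statistics are straightforward to accumulate; a sketch using Python's Counter (names F and Fji mirror the notation above):

```python
from collections import Counter

def word_counts(words):
    """F[i]: occurrences of word i; Fji[(i, j)]: times j immediately follows i."""
    F = Counter(words)
    Fji = Counter(zip(words, words[1:]))
    return F, Fji

D = "the cat sat on the mat".split()
F, Fji = word_counts(D)
print(F['the'], Fji[('the', 'cat')])  # 2 1
```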

  15. We infer the parameters Q from the given data D by using Bayes' theorem: P(Q | D) = P(D | Q) P(Q) / P(D). We assume the prior probability distribution over each row q_i to be a Dirichlet distribution, P(q_i | αm) = (1/Z(αm)) ∏_j q_{j|i}^{α m_j − 1}, where m is a probability vector over the vocabulary and α a concentration parameter. The posterior distribution can hence be found to be P(q_i | D, αm) ∝ ∏_j q_{j|i}^{F_{j|i} + α m_j − 1}, again a Dirichlet distribution. To find the vector αm we maximize the probability of the data, P(D | αm). The function Z is the normalization factor of the Dirichlet distribution.
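Under this Dirichlet posterior, the posterior-mean transition probabilities have a simple closed form. A sketch assuming a uniform measure m (a symmetric prior), with illustrative names and data:

```python
from collections import Counter

def posterior_mean_transitions(words, vocab, alpha=1.0):
    """Posterior-mean estimate q_{j|i} = (F_{j|i} + alpha*m_j) / (F_i + alpha),
    taking m uniform (m_j = 1/|V|); F_i counts occurrences of i as a context."""
    V = len(vocab)
    Fji = Counter(zip(words, words[1:]))
    Fi = Counter(words[:-1])            # the last word never acts as a context
    return {(i, j): (Fji[(i, j)] + alpha / V) / (Fi[i] + alpha)
            for i in vocab for j in vocab}

D = "the cat sat on the mat".split()
vocab = sorted(set(D))
q = posterior_mean_transitions(D, vocab)
print(sum(q[('the', j)] for j in vocab))  # each row of Q sums to 1
```

The prior pseudo-counts α m_j keep every transition probability strictly positive, even for word pairs never seen in D.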

  16. Redundancy in Dirichlet Model The Dirichlet Model assumes the context of usage of each word to be unique, i.e., it fails to recognize words that are similar in usage. For example, we may have observed the bigrams ‘Tuesday morning’, ‘Tuesday afternoon’ and ‘Thursday morning’, but not ‘Thursday afternoon’ in the training set. In the context ‘Thursday’ we would not be able to predict ‘afternoon’. If we were to infer that the contexts ‘Tuesday’ and ‘Thursday’ are identical, then combining them would mean that ‘afternoon’ is more probable in the context ‘Thursday’. An alternative hypothesis is that there are clusters of identical contexts, each described by the same probability vector.
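The effect of merging contexts can be shown numerically; the counts below are invented for the Tuesday/Thursday example:

```python
from collections import Counter

# Invented bigram counts for the Tuesday/Thursday example
counts = Counter({('tuesday', 'morning'): 3, ('tuesday', 'afternoon'): 2,
                  ('thursday', 'morning'): 1})

def p_next(word, context, counts):
    """Maximum-likelihood probability of `word` following `context`."""
    row = {j: c for (i, j), c in counts.items() if i == context}
    total = sum(row.values())
    return row.get(word, 0) / total if total else 0.0

# Separate contexts: 'afternoon' has never been seen after 'thursday'
print(p_next('afternoon', 'thursday', counts))   # 0.0

# Merge the two contexts into one cluster and pool their counts
merged = Counter()
for (i, j), c in counts.items():
    merged[('weekday', j)] += c
print(p_next('afternoon', 'weekday', merged))    # 2/6: now predictable
```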

  17. Two Hypotheses

  18. We define the evidence for each hypothesis, P(D | H_1) and P(D | H_2), and a ratio of the probabilities of the two hypotheses given the evidence, R = P(H_2 | D) / P(H_1 | D) = [P(D | H_2) P(H_2)] / [P(D | H_1) P(H_1)], where P(H_1) and P(H_2) correspond to the priors. We also assume that both hypotheses are equally likely a priori, and hence we have the result R = P(D | H_2) / P(D | H_1).

  19. Results with the new model • A vocabulary of 5000 words was reduced to 1969 contexts; some examples of merged clusters are: • pass passing passed • february january july june march aged august september december october april • fourth twentieth third seventh • look looked looking stared • shall will might should may must would could • The new model slightly outperforms the original on English text, while requiring fewer parameters and hence less memory. • However, when implemented in Dasher, no worthwhile gains were observed.

  20. References • D. J. Ward, A. F. Blackwell, and D. J. C. MacKay. Dasher - A Data Entry Interface Using Continuous Gestures and Language Models. In Proceedings of UIST 2000, pages 129-137, 2000. • D. J. Ward. Adaptive Computer Interfaces. PhD thesis, University of Cambridge, 2001. • C. E. Shannon. Prediction and Entropy of Printed English. Bell System Technical Journal, 30:50-64, January 1951. • W. J. Teahan. Probability Estimation for PPM. In Proceedings of NZCSRSC'95, 1995.
