This exploration discusses strategies for storing and searching a dictionary efficiently in a computer using Finite State Automata (FSA) and Tries. We present the concept of representing words of fixed length using a regular expression and the advantages of each data structure for dictionary operations. We analyze memory consumption, including character storage estimates, and explore the implementation of searches involving inflectional morphology in natural languages. Insights are provided on optimizing search using suffix trees while considering linguistic paradigms.
Finite State Automata and Tries
Sambhav Jain
IIIT Hyderabad
Think!
• How would you store a dictionary in a computer?
• How would you search for an entry in that dictionary?
• Say every word is exactly 10 characters long and each character can be any letter from ‘a-z’
• E.g. aaaaaaaaaa, abcdefghij, … etc.
• Language = [a-z]{10} (written as a regular expression)
A Simple Way
A linear sorted list of entries:
• aaaaaaaaaa
• aaaaaaaaab
• aaaaaaaaac
• ….
• zzzzzzzzzz
A Simple Way: storage cost
• Entries to be stored = 26^10 ≈ 1.41167096 × 10^14
• Each entry holds 10 characters, so characters to be stored = 10 × 26^10 ≈ 1.41 × 10^15
• At 1 byte per character, that is about 1.4 PB
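The storage estimate for the sorted list can be checked with a few lines of arithmetic. This is a minimal sketch assuming one byte per character, as on the slide; the variable names are illustrative. It counts entries and characters separately, since each of the 26^10 entries holds 10 characters.

```python
ALPHABET_SIZE = 26   # letters a-z
WORD_LENGTH = 10     # every word is exactly 10 characters long

# Number of distinct words in the language [a-z]{10}.
entries = ALPHABET_SIZE ** WORD_LENGTH

# A flat list stores every character of every entry (1 byte each, by assumption).
total_bytes = entries * WORD_LENGTH

print(f"entries:     {entries:,}")
print(f"total bytes: {total_bytes:,}")
print(f"in PB (10^15 bytes): {total_bytes / 10**15:.2f}")
```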
Smart Way!
[Figure: ten layers of nodes, one layer per character position, each layer holding the letters a b c d … w x y z]
Smart Way!
• Total storage = 26 × 10 = 260 bytes
• Traverse 10 nodes to look up a word
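The layered structure can be sketched as ten sets of letters, one per character position. This is an illustrative model, not the slide's exact drawing: the names `LAYERS` and `accepts` are assumptions, and each layer is stored as a Python set rather than as linked nodes.

```python
from string import ascii_lowercase

WORD_LENGTH = 10

# One layer per character position; every layer holds the 26 letters a-z.
LAYERS = [set(ascii_lowercase) for _ in range(WORD_LENGTH)]

# Characters stored across all layers: 26 per layer times 10 layers.
stored_characters = sum(len(layer) for layer in LAYERS)


def accepts(word):
    """Return True if word belongs to [a-z]{10}, visiting one layer per character."""
    if len(word) != WORD_LENGTH:
        return False
    return all(ch in layer for ch, layer in zip(word, LAYERS))


print(stored_characters)
print(accepts("abcdefghij"), accepts("abc"))
```

A lookup touches exactly ten layers, whatever the word, instead of scanning or binary-searching a list of 26^10 entries.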
Does it work for natural language?
• Oxford Advanced English Learner 20th Edition
• A quarter of a million distinct English words, excluding inflections, and words from technical and regional vocabulary not covered by the OED
• What about after inflections?
• eat, eats, eaten, eating …
• What about after multiple inflections?
• beauty, beautiful, beautifully …
Example (Store & Search)
[Figure: trie storing eat, eats, eaten and eating on the shared path e-a-t; node labels: e a t s e i n g n]
Example
[Figure: the trie with a branch starting at b added; node labels: b e a t s e i n g n]
Example
[Figure: the trie with further branches starting at b and f; node labels: b f e a a s t s e i n g n]
Example
[Figure: the trie with a separate branch w-r-i-t-e added]
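The store-and-search examples can be sketched as a small trie. This is a minimal illustration, not the slides' own implementation: the class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `node_count`) are assumptions, and children are kept in a plain dictionary.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store word, reusing nodes for any prefix already present."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Return True only if word was inserted as a complete word."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def node_count(self):
        """Count character nodes (the empty root is not counted)."""
        count, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            count += len(node.children)
            stack.extend(node.children.values())
        return count


trie = Trie()
words = ["eat", "eats", "eaten", "eating"]
for w in words:
    trie.insert(w)

print(trie.contains("eaten"), trie.contains("ea"))
print(trie.node_count(), sum(len(w) for w in words))
```

The four forms share the path e-a-t, so the trie needs fewer character nodes than the total number of characters in the word list.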
Inflectional morphology
• Deals with the word forms of a root when there is no change in lexical category.
• Each word form carries different values of features such as gender, number, person, etc.
Paradigm
• For a given root, there are many word forms with different features.
• Ex. Forms of the Hindi root laDakA (boy)
Paradigm
- 'laDakoM' is plural with oblique case
  - given by the feature structure {num=pl, case=obl}
- 'laDake' stands for two feature structures
  + singular oblique case (Ex. laDake ne kahA ...)
    - where oblique means 'laDake' is followed by a postposition marker
  + plural direct case (Ex. laDake Aye)
Paradigm
o Paradigms
  - What operation is done on the root to obtain the word forms
  - Model using pairs: (delete string, add string), where 0 stands for the empty string

       | direct   oblique
    ---|------------------
    sg | (0,0)    (A,e)
    pl | (A,e)    (A,oM)

o List roots with the paradigms they follow:
  - ghoDA follows paradigm laDakA
  - charkhA follows paradigm laDakA
  - laDakA follows paradigm laDakA
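The (delete string, add string) model can be sketched as a lookup table plus one string operation. The table below copies the pairs from the paradigm for 'laDakA', with the empty string written as `""`; the names `LADAKA_PARADIGM`, `ROOT_TO_PARADIGM` and `inflect` are illustrative assumptions.

```python
# (number, case) -> (delete string, add string) for the paradigm 'laDakA'.
LADAKA_PARADIGM = {
    ("sg", "direct"): ("", ""),
    ("sg", "oblique"): ("A", "e"),
    ("pl", "direct"): ("A", "e"),
    ("pl", "oblique"): ("A", "oM"),
}

# Roots listed with the paradigm they follow.
ROOT_TO_PARADIGM = {
    "laDakA": LADAKA_PARADIGM,
    "ghoDA": LADAKA_PARADIGM,
    "charkhA": LADAKA_PARADIGM,
}


def inflect(root, number, case):
    """Build a word form by deleting one suffix from the root and adding another."""
    delete, add = ROOT_TO_PARADIGM[root][(number, case)]
    if not root.endswith(delete):
        raise ValueError(f"{root!r} does not end with {delete!r}")
    stem = root[:len(root) - len(delete)]
    return stem + add


for root in ROOT_TO_PARADIGM:
    forms = [inflect(root, n, c) for n in ("sg", "pl") for c in ("direct", "oblique")]
    print(root, forms)
```

Only the root list grows with the dictionary; the four suffix operations are stored once per paradigm.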
[Figure: trie storing the word forms character by character along the paths l-a-D-a-k and k-a-p-a-D, with the endings (A, e, oM, and I on one branch) spelled out again under each path]
Abstracting out suffixes
[Figure: the same trie with the paths k-a-p-a-D and l-a-D-a-k, where the repeated endings are replaced by a pointer #1 to a shared suffix trie]
#1: Corresponds to the paradigm for 'laDakA'
- Suffix trie (forward)

            #1
            |
      -------------
      |     |     |
      e     o     A
            |
            M
Can we further optimize our search?
- Use knowledge of paradigms
- Use a suffix tree
• Store the suffix tree in main memory
• Store the rest of the dictionary, categorized by paradigm, on hard disk
• Search backward from the end of the word through the suffix tree
• Identify the paradigm
• Search only in that paradigm's set
• E.g. if ‘-ing’ occurs, you skip words like home, cat, god … from the start
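The backward search can be sketched as follows. This is a minimal illustration under stated assumptions: the suffix trie is a small in-memory structure of nested dictionaries, the per-paradigm root sets are ordinary Python sets standing in for the data kept on disk, and all names (`SUFFIX_TABLE`, `ROOT_ENDING`, `ROOTS_BY_PARADIGM`, `analyze`) are hypothetical. Only the 'laDakA' paradigm is included.

```python
# Surface suffixes and the paradigm each one belongs to.
SUFFIX_TABLE = [("A", "laDakA"), ("e", "laDakA"), ("oM", "laDakA")]

# Ending to put back on the stem to recover the root, per paradigm.
ROOT_ENDING = {"laDakA": "A"}

# Stand-in for the root lists stored on disk, one set per paradigm.
ROOTS_BY_PARADIGM = {"laDakA": {"laDakA", "ghoDA", "charkhA"}}


def build_reverse_trie(suffix_table):
    """Index suffixes by their characters read from the end of the word."""
    root = {"children": {}, "entries": []}
    for suffix, paradigm in suffix_table:
        node = root
        for ch in reversed(suffix):
            node = node["children"].setdefault(ch, {"children": {}, "entries": []})
        node["entries"].append((suffix, paradigm))
    return root


def analyze(word, trie):
    """Walk the word backward, then search only the matching paradigm's roots."""
    results = []
    node = trie
    for ch in reversed(word):
        node = node["children"].get(ch)
        if node is None:
            break  # no known suffix continues with this character
        for suffix, paradigm in node["entries"]:
            stem = word[:len(word) - len(suffix)]
            root = stem + ROOT_ENDING[paradigm]
            if root in ROOTS_BY_PARADIGM[paradigm]:
                results.append((root, paradigm))
    return results


trie = build_reverse_trie(SUFFIX_TABLE)
print(analyze("ghoDoM", trie))
print(analyze("home", trie))
```

A word whose ending matches no stored suffix stops the walk after a character or two, so the large root sets are never consulted for it.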
Finite State Automata
• A trie is a data structure
• An FSA is the computational approach
• Slight difference in representation
• Characters are put on edges rather than nodes
[Figure: FSA with characters on the edges; the paths l-a-D-a-k and k-a-p-a-D lead into shared suffix edges labelled 0, e, A, and o followed by M; accepting states are drawn as (+)]
FSA
o A deterministic finite-state machine is formally defined by
  - Q: A finite set of states (Ex.: {q0,q1,q2})
  - SIGMA: A finite input alphabet (Ex.: {a,b,c})
  - Start state: A state in Q from which the machine starts (Ex.: q0)
  - F: A set of accepting states (Ex.: {q2})
  - DELTA(q,i): A transition function or transition matrix where
    - q MEMBER Q, i MEMBER SIGMA,
    - DELTA(q,i) MEMBER Q
    Thus, DELTA: Q x SIGMA --> Q
RECOGNITION Problem
• Until now we were handling only the RECOGNITION problem
• If the FSA reaches a final state at the end of the input string, then EXIST
• Else NOT
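Recognition with a deterministic finite-state machine can be sketched directly from the five components Q, SIGMA, start state, F and DELTA. The states and alphabet reuse the example values {q0,q1,q2} and {a,b,c}; the particular transitions are an assumption chosen for illustration (they accept an 'a', then any number of 'b', then a 'c'). DELTA is stored as a partial table, and a missing entry is treated as rejection.

```python
Q = {"q0", "q1", "q2"}   # finite set of states
SIGMA = {"a", "b", "c"}  # input alphabet
START = "q0"             # start state
F = {"q2"}               # accepting states

# Transition table DELTA(q, i) -> next state (hypothetical example machine).
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}


def recognize(string):
    """Return True if the machine ends in an accepting state after reading string."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # no transition defined: the string is not in the language
        state = DELTA[(state, symbol)]
    return state in F


for s in ["ac", "abbc", "ab", "abca"]:
    print(s, "EXIST" if recognize(s) else "NOT")
```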
But we seek analyzed output
• We want the machine to tell us the
• Root
• Gender
• Number
• Person
• Case
• Etc. …
Finite State Transducer
An FST is like the finite state automaton defined earlier, except that each arc is labelled by a pair of symbols i:o, where
  i: symbol in the input string
  o: symbol output by the FST when the arc is taken
+ Ex. arc in a finite state transducer corresponding to 'e' in 'laDake'

            e : ((+pl, +direct), (+sg, -direct))
    q1 +----------------->--------------------+ q2

  The pair of symbols i : o
  - i is: 'e'
  - o is: '((+pl, +direct), (+sg, -direct))'
+ Ex. Morph Analyzer: Match the input with i; if successful, go ahead and produce o in the output
o Formally: Finite state transducer
  - Q: Finite set of states q0, ..., qN
  - SIGMA_IN: Finite set of input symbols
  - SIGMA_OUT: Finite set of output symbols
  - q0: Start state (q0 IN Q)
  - F: Set of final accepting states (F SUBSET Q)
  - DELTA(q, i:o): For every state q, gives the set of states that can be reached from q with i in SIGMA_IN and o in SIGMA_OUT.
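A transducer for the forms of 'laDakA' can be sketched as a table of arcs, each carrying an input symbol and an output string. This is an illustrative sketch, not a definitive design: the state numbering, the `ARCS` layout and the output notation (root followed by feature tags) are assumptions. The features follow the paradigm described earlier, in which 'laDake' is either singular oblique or plural direct, so the arc for 'e' has two outputs.

```python
# ARCS[(state, input symbol)] -> list of (next state, output string).
ARCS = {}

# Stem arcs copy each input character to the output: states 0..5 spell l-a-D-a-k.
for i, ch in enumerate("laDak"):
    ARCS[(i, ch)] = [(i + 1, ch)]

# Suffix arcs restore the root ending 'A' and emit the feature tags.
ARCS[(5, "A")] = [(6, "A +sg +direct")]
ARCS[(5, "e")] = [(6, "A +sg -direct"), (6, "A +pl +direct")]
ARCS[(5, "o")] = [(7, "")]               # 'o' alone emits nothing yet
ARCS[(7, "M")] = [(6, "A +pl -direct")]  # 'oM' is plural oblique

START = 0
FINAL = {6}


def transduce(word):
    """Return every output the transducer can produce for word (empty list if none)."""
    configs = [(START, "")]
    for ch in word:
        next_configs = []
        for state, out in configs:
            for nxt, emitted in ARCS.get((state, ch), []):
                next_configs.append((nxt, out + emitted))
        configs = next_configs
    return [out for state, out in configs if state in FINAL]


for w in ["laDakA", "laDake", "laDakoM", "laDakI"]:
    print(w, transduce(w))
```

An empty result corresponds to the NOT answer of a recognizer; a non-empty result gives the root and features for each possible analysis.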
Example
• on board
Tools for FSA
• Lex
• OpenFST (www.openfst.org/)
• AT&T FSM Toolkit (http://www2.research.att.com/~fsmtools/fsm/)