Advanced Filtering and Flag Diacritics

Thursday PM • Kenneth R. Beesley • Xerox Research Centre Europe

Advanced Filtering & Flag Diacritics • When specifying morphotactics in lexc or regular expressions, it is often convenient and attractive to start with a grammar that overgenerates and overrecognizes.

Advanced Filtering & Flag Diacritics • An initial overgenerating network must subsequently be constrained by one of three methods • Composition of filters at compile time (known as “composing in” the restrictions; this can easily result in an explosion in the size of the network) • or simulation of composition at runtime (slow) • or Flag Diacritics (recognized and applied at runtime; this method can avoid size explosions and still run very fast)

Goal of this Presentation • Quick review of the linguistic problem of separated dependencies • Illustrate the traditional solutions and their problems • Acquaint you with Flag Diacritics

Continuation Classes and Concatenation • The “continuation classes” of lexc translate into concatenation:
    LEXICON Foo
    root1 Suff ;
    root2 Suffx ;
    LEXICON Suff
    ard # ;
    LEXICON Suffx
    xarc # ;
• So the co-occurrence restrictions between one morpheme and the very next morpheme(s) are usually easy to handle.

Constraining “separated dependencies” can be awkward in lexc or regular expressions:
    prefix1+prefix2+prefix3+stem+suffix1+suffix2+suffix3
• prefix1 may be incompatible with suffix3 • prefix1 may be optional, but if it is present, it might require the presence of suffix3 • suffix2 may be optional, but if it is present, it might require a previous prefix2 • etc.

Quick Review of Composition • Transducers have an upper-side language and a lower-side language • By Xerox convention, the upper-side language contains analysis strings, usually consisting of a root and tags, e.g. Root[Tag1][Tag2][Tag3] • The lower-side language usually consists of orthographical strings • If you compose a filter or rules on the top of the network, it must match the upper-side language • If you compose a filter or rules on the bottom of the network, it must match the lower-side language

Constraining Separated Dependencies within Words via Composition of Filters • E.g. assume that the presence of one morpheme, “foo”, precludes the co-occurrence of another morpheme, “fum”, anywhere later in the word. • Let “foo” be spelled “foo^X” on the upper side, and let “fum” be spelled “fum^Y” on the upper side, where ^X and ^Y are declared multicharacter symbols that we will use as features. • An overgenerating lexicon mylex.fst contains ungrammatical strings like … f o o^X … f u m^Y … • One solution: eliminate such ungrammatical strings by the compile-time composition of a suitable filter, then map the features to epsilon:
    0 <- [ %^X | %^Y ] .o. ~$[ %^X ?* %^Y ] .o. mylex.fst

A Second, Equivalent Solution • Again, assume that the presence of one morpheme, “foo”, precludes the co-occurrence of another morpheme, “fum”, anywhere later in the word. • Let “foo” be spelled “foo^X” on the upper side, and let “fum” be spelled “fum^Y” on the upper side, where ^X and ^Y are declared multicharacter symbols that we will use as features. • We can state the restriction as “^Y occurs only in words where it is not preceded by ^X”:
    0 <- [ %^X | %^Y ] .o. %^Y => .#. ~$[%^X] _ .o. mylex.fst
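The equivalence of the two formulations above can be checked by brute force outside xfst. The following Python sketch is only an illustration: the function names and the three-symbol toy alphabet are assumptions made for this example, and the functions test symbol sequences directly rather than building networks.

```python
from itertools import product

X, Y = "^X", "^Y"   # the feature symbols used on the slides

def passes_filter(symbols):
    """Mirror of ~$[ %^X ?* %^Y ]: reject any ^X followed later by ^Y."""
    seen_x = False
    for sym in symbols:
        if sym == X:
            seen_x = True
        elif sym == Y and seen_x:
            return False
    return True

def passes_restriction(symbols):
    """Mirror of %^Y => .#. ~$[%^X] _ : every ^Y has no ^X to its left."""
    return all(X not in symbols[:i]
               for i, sym in enumerate(symbols) if sym == Y)

def strip_features(symbols):
    """Mirror of 0 <- [ %^X | %^Y ]: drop the feature symbols."""
    return [s for s in symbols if s not in (X, Y)]

# Compare the two formulations on every sequence of up to 5 symbols.
alphabet = ["a", X, Y]
all_sequences = [list(p) for n in range(6) for p in product(alphabet, repeat=n)]
assert all(passes_filter(s) == passes_restriction(s) for s in all_sequences)

bad = ["f", "o", "o", X, "f", "u", "m", Y]
good = ["f", "u", "m", Y, "f", "o", "o", X]
print(passes_filter(bad), passes_filter(good))   # False True
print("".join(strip_features(good)))             # fumfoo
```

Agreement on short sequences is not a proof, but it is a cheap sanity check before compiling either filter into a large lexicon.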

More Separated Dependencies within Words • Assume that the presence of one morpheme, e.g. “fie”, requires the co-occurrence of another morpheme, e.g. “fee”, somewhere later in the same word. • Let “fie” be spelled “fie^X” on the upper side, and let “fee” be spelled “fee^Y” on the upper side. • The overgenerating lexicon mylex.fst contains ungrammatical strings like … f i e^X … where fee^Y does not occur after fie^X • One solution: eliminate such ungrammatical strings at compile time by the composition of a suitable filter, e.g.
    0 <- [ %^X | %^Y ] .o. ~[ ?* %^X ~$[ %^Y ] ] .o. mylex.fst

Another equivalent solution: • Again, assume that the presence of one morpheme, e.g. “fie”, requires the co-occurrence of another morpheme, e.g. “fee”, somewhere later in the same word. • Let “fie” be spelled “fie^X” on the upper side, and let “fee” be spelled “fee^Y” on the upper side. • If ^X appears, it must be followed by a ^Y:
    0 <- [ %^X | %^Y ] .o. %^X => _ $[%^Y] .o. mylex.fst

More Separated Dependencies within Words • Assume that the presence of one morpheme, e.g. “fee”, requires the co-occurrence of another morpheme, e.g. “fie”, somewhere earlier in the same word. So fie is usually optional, but it is required with a following fee. • Let “fee” be spelled “fee^Y” on the upper side, and let “fie” be spelled “fie^X” on the upper side. • The overgenerating lexicon mylex.fst contains ungrammatical strings like … f e e^Y … where fie^X does not occur before fee^Y • One solution: eliminate such ungrammatical strings at compile time by the composition of a suitable filter, e.g.
    0 <- [ %^X | %^Y ] .o. ~[ ~$[%^X] %^Y ?* ] .o. mylex.fst

Another equivalent solution • Again, assume that the presence of one morpheme, e.g. “fee”, requires the co-occurrence of another morpheme, e.g. “fie”, somewhere earlier in the same word. • Let “fee” be spelled “fee^Y” on the upper side, and let “fie” be spelled “fie^X” on the upper side. • ^Y, if it appears, must be preceded by ^X:
    0 <- [ %^X | %^Y ] .o. %^Y => $[%^X] _ .o. mylex.fst
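The two kinds of requirement above (fie^X needs a later fee^Y; fee^Y needs an earlier fie^X) are easy to confuse, so it may help to see them as plain predicates over symbol sequences. This is a minimal Python sketch; the function names are invented for illustration and no network is built.

```python
X, Y = "^X", "^Y"   # the feature symbols used on the slides

def x_requires_later_y(symbols):
    """Mirror of ~[ ?* %^X ~$[ %^Y ] ]: each ^X needs a ^Y somewhere after it."""
    return all(Y in symbols[i + 1:]
               for i, sym in enumerate(symbols) if sym == X)

def y_requires_earlier_x(symbols):
    """Mirror of ~[ ~$[ %^X ] %^Y ?* ]: each ^Y needs a ^X somewhere before it."""
    return all(X in symbols[:i]
               for i, sym in enumerate(symbols) if sym == Y)

fie_fee = ["f", "i", "e", X, "f", "e", "e", Y]
fie_only = ["f", "i", "e", X]
fee_only = ["f", "e", "e", Y]

for word in (fie_fee, fie_only, fee_only):
    print("".join(word), x_requires_later_y(word), y_requires_earlier_x(word))
```

Under the first constraint a word with fee^Y alone is fine and fie^X alone is rejected; under the second constraint it is the other way round.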

Problems with the Traditional “composing in” of Constraints • When you “compose in” such restrictions for separated dependencies, the overgeneration and overrecognition are eliminated, but the resulting transducer tends to get bigger, sometimes very big. • The general problem is that all the states and arcs between the two co-restricted morphemes need to be copied.

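Arabic Articles and Case Endings • A bare Arabic stem can generally take any one of six case endings:
    +Def +Nom : u       +Indef +Nom : uN
    +Def +Gen : i       +Indef +Gen : iN
    +Def +Acc : a       +Indef +Acc : aN
[Network diagram: the stem k a a t i b, followed by +Def:ε or +Indef:ε, followed by one of the three case endings.]
• Assume that the subnetwork represented here by “kaatib” contains all the noun stems and is very large.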

Arabic Articles and Case Endings • But an Arabic noun can also, optionally, take the al- prefix, which is an overt definite article, e.g. kaatibu or alkaatibu. Using lexc or xfst, we could easily make it an optional prefix.
[Network diagram: an optional al- prefix, tagged Art+ on the upper side, precedes the stem k a a t i b; the +Def endings (u, i, a) and +Indef endings (uN, iN, aN) follow as before.]
• Unfortunately, this straightforward solution overgenerates. The overt al- prefix can in fact co-occur only with +Def case endings. This is a classic “separated dependency”.

Arabic Articles and Case Endings • You can filter out the bad strings using compile-time composition:
    ~$[ Art%+ ?* %+Indef ]
    .o.
    [the overgenerating network: optional Art+ / al- prefix, stem k a a t i b, +Def or +Indef case endings]

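But this almost doubles the size of the network! To impose this constraint in pure finite-state terms, the whole noun-stem structure is duplicated in the course of composition.
[Network diagram: two copies of the stem subnetwork k a a t i b. The copy reached through the Art+ / al- prefix continues only to the +Def endings (u, i, a); the copy reached without the article continues to both the +Def endings and the +Indef endings (uN, iN, aN).]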
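The duplication can be reproduced on a toy scale. The Python sketch below is an illustration under stated assumptions, not how xfst composes networks: it models only the upper side as a plain acceptor, the state names and the three-stem lexicon are invented, and the filter ~$[ Art%+ ?* %+Indef ] is hand-coded as a two-state automaton that remembers whether Art+ has been seen.

```python
from collections import deque

def build_toy_network(stems):
    """Overgenerating acceptor over upper-side symbols.
    Returns {state: set of (label, next_state)}; label None is an epsilon."""
    arcs = {}
    def add(src, label, dst):
        arcs.setdefault(src, set()).add((label, dst))
        arcs.setdefault(dst, set())
    add("start", "Art+", "stem0")      # optional article ...
    add("start", None, "stem0")        # ... or no article
    for stem in stems:                 # letter trie shared by all stems
        state = "stem0"
        for i, ch in enumerate(stem):
            last = i == len(stem) - 1
            nxt = "stem_end" if last else "t:" + stem[:i + 1]
            add(state, ch, nxt)
            state = nxt
    for tag, node in (("+Def", "def"), ("+Indef", "indef")):
        add("stem_end", tag, node)
        for case in ("+Nom", "+Gen", "+Acc"):
            add(node, case, "final")
    return arcs

def filter_step(seen_art, label):
    """One step of a DFA for ~$[ Art%+ ?* %+Indef ]; None means blocked."""
    if label == "Art+":
        return True
    if label == "+Indef" and seen_art:
        return None
    return seen_art

def compose_filter(arcs, start="start"):
    """Reachable (network state, filter state) pairs of the product."""
    seen = {(start, False)}
    queue = deque(seen)
    while queue:
        state, seen_art = queue.popleft()
        for label, dst in arcs[state]:
            nxt = filter_step(seen_art, label)
            if nxt is not None and (dst, nxt) not in seen:
                seen.add((dst, nxt))
                queue.append((dst, nxt))
    return seen

network = build_toy_network(["kaatib", "kitaab", "walad"])
constrained = compose_filter(network)
print(len(network), "states before,", len(constrained), "states after")
```

Every stem state shows up twice in the product, once with "article seen" and once without, which is exactly the copying described above; the larger the stem subnetwork, the closer the growth gets to a factor of two.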

An Alternative Solution: Simulation of Composition at Runtime • Using Xerox utilities like ‘lookup’, you can specify “lookup strategies” that involve compositions that are simulated at runtime, e.g.
    0 <- [ %^X | %^Y ] .o. ~[ ?* %^X ~$[ %^Y ] ] .o. mylex.fst
• This keeps the transducers small and produces the same results as compile-time composition • But it usually runs more slowly

Flag Diacritics: A Practical Alternative • What are flag diacritics? Simple feature-like symbols for imposing constraints, especially useful for enforcing “separated dependencies” between morphemes • Motivations: keep networks smaller (prevent “blow-ups”) and keep networks maximally efficient at runtime • How to use them: syntax and semantics • When to use them

What Are Flag Diacritics? • As far as regular expressions, lexc and networks are concerned, Flag Diacritics are just multicharacter symbols, defined or declared like any other multicharacter symbols. • The linguist can add Flag Diacritic symbols to any strings in the network. They usually become part of the spelling of a morpheme. • Flag Diacritics have a distinctive spelling, delimited by @-signs, and with 2 or 3 fields delimited by periods (full stops), e.g. @U.feature.value@ @P.feature.value@ @D.feature@ • Flag Diacritics allow simple, efficient and highly valuable feature constraints at runtime.

The Semantics of Flag Diacritics • During the application of a network, Flag Diacritic symbols are not matched against the input strings or included in the output strings. In this sense, Flag Diacritics are treated like epsilons. • But unlike epsilons, a network path labeled with a Flag Diacritic is successfully traversed at runtime only if the operation indicated by the Flag Diacritic is successful. • The operations involve feature-setting and feature-unification. • The Flag Diacritics are “noticed”, and the feature-like operations are performed, by “flag-sensitive” runtime code, e.g. ‘apply up’ and ‘apply down’.

Arabic Articles and Case Endings with Flag Diacritics: Start with an overgenerating network
[Network diagram: as before, but the optional Art+ / al- prefix path now carries @U.ART.YES@ and the +Indef path carries @U.ART.NO@; the stem k a a t i b and the +Def path are unchanged.]
• Contains illegal paths like:
    upper: Art+ @U.ART.YES@ k a a t i b +Indef @U.ART.NO@ +Nom
    lower: a l @U.ART.YES@ k a a t i b @U.ART.NO@ uN

Arabic Articles and Case Endings • The network still contains illegal paths like:
    upper: Art+ @U.ART.YES@ k a a t i b +Indef @U.ART.NO@ +Nom
    lower: a l @U.ART.YES@ k a a t i b @U.ART.NO@ uN
• But while exploring this path, looking up the bad word *alkaatibuN, the flag-sensitive ‘apply up’ routine will (1) find @U.ART.YES@ on the lower side, treat it as an epsilon (it consumes no input), but remember the feature setting ART = YES; (2) eventually find @U.ART.NO@ on the lower side, treat it as an epsilon, but try to “unify” ART = NO with the stored value ART = YES, and FAIL • The illegal path is therefore blocked at runtime

Flag Diacritics • Each flag diacritic signals the runtime apply routine to perform a little feature-based operation. • The arc labeled with the flag diacritic is traversed only if the feature-based operation is successful; otherwise the algorithm abandons the path and backtracks for other solutions. • The application routines contain a very small amount of memory for storing feature values. • If used correctly, Flag Diacritics allow your network to contain illegal paths that are noticed and rejected at runtime. • The result is to get the restrictions you need, without the network blowing up in size, and with minimal loss of speed.

The basic @U.feature.value@ flags • All features start out with neutral/unset values. • The spelling is @U.feature.value@, where the feature and value strings are chosen by the linguist. They have no inherent meaning to the system. • If the application routine finds @U.X.Y@, and there is no stored value for X, then it simply sets feature X = Y in its little memory. • If the application routine finds @U.X.Y@ and there is a previously set value for X, then the routine will attempt to unify the new value with the old one. If successful, the arc is traversed; otherwise fail. • All you need in many practical applications are @U.feature.value@ flags.
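The behaviour of @U.feature.value@ described above can be mimicked in a few lines of Python. This is a minimal sketch under assumptions: the function name is invented, and the path is handed over as a ready-made list of lower-side symbols, whereas a real apply routine matches input while searching the network and backtracks on failure.

```python
import re

# Only the basic unification flags, e.g. @U.ART.YES@
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")

def apply_path(symbols):
    """Walk one path; return its surface string, or None if a flag blocks it."""
    features = {}            # the routine's small feature memory
    output = []
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            output.append(sym)           # ordinary symbol
            continue
        feature, value = m.groups()
        stored = features.get(feature)   # None means neutral/unset
        if stored is None:
            features[feature] = value    # first mention: just set it
        elif stored != value:
            return None                  # unification fails: abandon path
    return "".join(output)

legal = ["a", "l", "@U.ART.YES@", "k", "a", "a", "t", "i", "b", "u"]
illegal = ["a", "l", "@U.ART.YES@", "k", "a", "a", "t", "i", "b",
           "@U.ART.NO@", "uN"]
print(apply_path(legal))     # alkaatibu
print(apply_path(illegal))   # None
```

Note that the flags never reach the output string: they behave like epsilons whose only effect is to update or test the feature memory.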

You do not need to declare flag diacritics • As far as networks and sigmas are concerned, a Flag Diacritic is just a normal multicharacter symbol. • Flag Diacritics do have a distinctive spelling, surrounded by @-signs. • Flag Diacritics are “noticed” or “obeyed” only by application routines like ‘apply up’ and ‘apply down’ that have been rewritten to be “sensitive to flag diacritics.”

Other Feature-Diacritic Types
    @P.feature.value@   Positive Reset: (re)sets feature = value; always succeeds
    @N.feature.value@   Negative Reset: (re)sets feature ≠ value, i.e. to the negation of value; always succeeds

Other Feature-Diacritic Types
    @R.feature.value@   Require: succeeds iff currently feature = value
    @R.feature@         Require: succeeds iff feature is currently set to a non-neutral value
    @D.feature.value@   Disallow: succeeds iff feature is not currently set to value (it is neutral or set to something else)
    @D.feature@         Disallow: succeeds iff feature is neutral/unset
    @C.feature@         Clear: (re)sets feature back to the neutral/unset value; always succeeds
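The full set of operations can be summarised as one small function over a feature memory. This Python sketch is illustrative only: the function name, the (value, negated) representation and the feature name CASE are assumptions, as are two details the slides leave open, namely that @D.feature.value@ also succeeds on a neutral feature and that @U unifies with a negatively set value whenever the two values differ.

```python
import re

FLAG = re.compile(r"@([PNRDCU])\.([^.@]+)(?:\.([^.@]+))?@")

def apply_flag(flag, memory):
    """Apply one flag diacritic to `memory` in place; return True on success.
    memory maps feature -> (value, negated); a missing key means neutral."""
    op, feature, value = FLAG.fullmatch(flag).groups()
    current = memory.get(feature)
    if op == "P":                        # positive (re)set
        memory[feature] = (value, False)
        return True
    if op == "N":                        # negative (re)set: "anything but value"
        memory[feature] = (value, True)
        return True
    if op == "C":                        # clear back to neutral
        memory.pop(feature, None)
        return True
    if op == "R":                        # require
        if value is None:
            return current is not None
        return current == (value, False)
    if op == "D":                        # disallow
        if value is None:
            return current is None
        return current != (value, False)
    # op == "U": unify, setting the feature if it is neutral or compatible
    if current is None or (current[1] and current[0] != value):
        memory[feature] = (value, False)
        return True
    return current == (value, False)

memory = {}
steps = ["@D.CASE@", "@P.CASE.NOM@", "@R.CASE.NOM@", "@D.CASE.NOM@",
         "@C.CASE@", "@R.CASE@", "@N.CASE.NOM@", "@U.CASE.ACC@", "@U.CASE.NOM@"]
results = [apply_flag(f, memory) for f in steps]
print(results)
# [True, True, True, False, True, False, True, True, False]
```

In a real traversal a False result would make the routine abandon the path; the list comprehension keeps going only to show each operation in turn.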

A Trap for the Unwary • ‘apply up’ matches the input against the lower-side of a transducer and notices Flag Diacritics only on the lower-side • ‘apply down’ matches the input against the upper-side of a transducer and notices Flag Diacritics only on the upper-side • So typically you want to define your networks so that Flag Diacritics are visible on both sides of a transducer. • But experts may want to build systems with different restrictions on analysis and generation.
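The trap can be made concrete with a path whose flags sit on one side only. In this Python sketch a path is a list of (upper, lower) arc labels with "" standing for epsilon; the function names are invented, the arcs are labeled with multi-letter chunks for brevity, and only @U flags are handled.

```python
def blocked(symbols):
    """True if two @U.feature.value@ flags in this sequence fail to unify."""
    memory = {}
    for sym in symbols:
        if sym.startswith("@U.") and sym.endswith("@"):
            feature, value = sym[3:-1].split(".")
            # setdefault stores the first value and returns the stored one
            if memory.setdefault(feature, value) != value:
                return True
    return False

def apply_up_accepts(path):
    """Analysis: flags are noticed only on the lower side of each arc."""
    return not blocked([lower for _upper, lower in path])

def apply_down_accepts(path):
    """Generation: flags are noticed only on the upper side of each arc."""
    return not blocked([upper for upper, _lower in path])

# Illegal article + indefinite path; flags appear on the lower side only.
lower_only = [("Art+", "al"), ("", "@U.ART.YES@"), ("kaatib", "kaatib"),
              ("+Indef", ""), ("", "@U.ART.NO@"), ("+Nom", "uN")]
# The same path with each flag visible on both sides of its arc.
both_sides = [("Art+", "al"), ("@U.ART.YES@", "@U.ART.YES@"),
              ("kaatib", "kaatib"), ("+Indef", ""),
              ("@U.ART.NO@", "@U.ART.NO@"), ("+Nom", "uN")]

print(apply_up_accepts(lower_only), apply_down_accepts(lower_only))   # False True
print(apply_up_accepts(both_sides), apply_down_accepts(both_sides))   # False False
```

With flags on the lower side only, analysis rejects *alkaatibuN but generation would still produce it; putting the flags on both sides blocks the path in both directions.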

Some Success Stories • Hungarian Morphological Analyzer: was 35 Megabytes; now 5 Megabytes, after adding 5 Flag Diacritic attributes • French Morphological Analyzer: was 11 Megabytes; now 5 Megabytes, after adding 37 Flag Diacritic attributes • New German Morphological Analyzer: now just 171,934 arcs; explodes to 2,247,984 arcs after eliminating just two Flag Diacritics (among many)

When to Use Flag Diacritics • Use them when you need to keep networks smaller, especially when there are separated dependencies. • I also find them very useful to keep lexc descriptions simpler. Avoid the proliferation of continuation classes. • You can always remove Flag Diacritics from a network using ‘eliminate flag’:
    xfst[]: eliminate flag attrname
• The effect of ‘eliminate flag’ is the same as composing in the restrictions, usually resulting in an increase in size.

More Information on Flag Diacritics • “The Book” (Beesley & Karttunen, 2003) contains a whole chapter on Flag Diacritics. • Sonja Bosch and Laurette Pretorius have used them successfully to constrain the combination of class prefixes with Zulu roots. • Contact me if you need help.
