pTrees - Fast Horizontal Compressed Data Processing

pTree (predicate Tree) technologies provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures.

Applications:
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships.
PGP-D (Pretty Good Protection of Data) protects vertical pTree data. key=array(offset,pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
DOVE (DOmain VEctors) uses pTrees for database query processing.

[Figure: 0/1 relationship matrices labeled course, person, document, Text, Enroll and Buy.]

FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data. E.g., to cluster the IRIS dataset of 150 iris flower samples (50 setosa, 50 versicolor, 50 virginica) using 2-level 60% ipTrees, with each upper-level bit representing the predicate truth applied to 10 consecutive iris samples, level-1 is shown below. FAUST clusters perfectly using only this level (the bit vectors are an order of magnitude smaller, so processing is faster!).

level_2 = s150_s10_gt60_PPW,1: 1 (the level_2 bit strides 150 level_0 bits)
level_1 = s10gt60_PPW,1: 11111 11100 01011 (each of the 15 level_1 bits strides 10 raw bits)
level_0 (the 150 raw bits): 1111101110 1100100111 1010110111 1001011011 1111011111 1110100101 1111011111 1010011011 1100101000 0101000010 0101100110 0100011111 1001011100 1011110110 0111011011

[Table: level_1 bit slices s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j and s10gt60_PPW,j, one row per level-1 stride (5 setosa, 5 versicolor, 5 virginica); the values these bits encode are the level-1 values listed with this slide.]

Class means (mn) per attribute, sorted ascending, with the gap to the next mean in parentheses:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2
FAUST using impure pTrees (ipTrees)
All pTrees are defined by Row Set Predicates (T/F on any row set). E.g., on T(A,B,C), the "units bit slice pTree of T.A" using the predicate "> 60% 1-bits" is true iff >60% of the A-values are odd.

level-1 values (SL SW PL PW):
setosa 38 38 14 2
setosa 50 38 15 2
setosa 50 34 16 2
setosa 48 42 15 2
setosa 50 34 12 2
versicolor 1 24 45 15
versicolor 56 30 45 14
versicolor 57 28 32 14
versicolor 54 26 45 13
versicolor 57 30 42 12
virginica 73 29 58 17
virginica 64 26 51 22
virginica 72 28 49 16
virginica 77 30 48 22
virginica 67 26 50 19

Means (SL SW PL PW):
Level-1 mn 54.2 30.8 35.8 11.6
setosa 47.2 37.2 14.4 2
versicolor 45 27.6 41.8 13.6
virginica 70.6 27.8 51.2 19.2
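The roll-up from raw bits to upper-level bits can be sketched as follows. This is a minimal illustration, not the deck's implementation: `rough_level` is a hypothetical helper name, and the bit string is the level_0 PPW,1 vector from this slide. One observation from running it by hand: the slide's level_1 bits are reproduced when the predicate is read as "at least 60% 1-bits" (several strides hold exactly 6 of 10 one-bits), so the sketch assumes that reading.

```python
# Level_0 bits of P_PW,1 as printed on the slide (15 strides of 10 bits).
LEVEL0 = (
    "1111101110 1100100111 1010110111 1001011011 1111011111 "
    "1110100101 1111011111 1010011011 1100101000 0101000010 "
    "0101100110 0100011111 1001011100 1011110110 0111011011"
).replace(" ", "")


def rough_level(bits, stride, percent):
    """Return one bit per stride: '1' iff at least `percent`% of the strided bits are 1.

    Assumption: the predicate is evaluated as >= percent, which is the
    reading that reproduces the slide's level_1 bits.
    """
    out = []
    for start in range(0, len(bits), stride):
        segment = bits[start:start + stride]
        ones = segment.count("1")
        out.append("1" if 100 * ones >= percent * len(segment) else "0")
    return "".join(out)


level1 = rough_level(LEVEL0, stride=10, percent=60)   # 15 bits
level2 = rough_level(level1, stride=15, percent=60)   # the root bit

print(level1)
print(level2)
```

The 150 raw bits shrink to 15 level_1 bits, which is the "order of magnitude smaller" vector the slide refers to.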

FAUST using impure pTrees (ipTrees), page 2

FAUST (simplest version). For each attribute (column):
1. calculate the mean of each class;
2. sort those means asc;
3. calculate mean_gaps = differences_of_means;
4. choose the (relatively) largest mean_gap to cut, i.e., choose the best class and attribute for cutting. gapL is the gap on the low side of a mean; gapH is the gap on the high side.
Steps 1, 2 and 3 were done on the previous slide.

Class means per attribute, sorted ascending, with the gap to the next mean in parentheses:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2

First cut, on PW: cH = 2 + 11.6/2 = 7.8 (perfect on setosa!)
CLASS PW: setosa 2 2 2 2 2; versicolor 15 14 14 13 12; virginica 17 22 16 22 19

Then remove the record with max gapRELATIVE and recompute the gaps for the remaining classes:
SL: ve 45 (25.6), vi 70.6
SW: ve 27.6 (.2), vi 27.8
PL: ve 41.8 (9.4), vi 51.2
PW: ve 13.6 (5.6), vi 19.2

Second cut, on SL: cH = 45 + 25.6/2 = 57.8 (perfect classification of the rest!)
CLASS SL: versicolor 1 56 57 54 57; virginica 73 64 72 77 67
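The four steps of the simplest version can be sketched on the level-1 values from the previous slide. This is only an illustration: `best_cut` is a hypothetical name, and "relatively largest" is interpreted here as the gap divided by the lower of the two means, which is an assumption rather than something the slide defines.

```python
# Level-1 values per class as (SL, SW, PL, PW), copied from the previous slide.
LEVEL1 = {
    "setosa": [(38, 38, 14, 2), (50, 38, 15, 2), (50, 34, 16, 2),
               (48, 42, 15, 2), (50, 34, 12, 2)],
    "versicolor": [(1, 24, 45, 15), (56, 30, 45, 14), (57, 28, 32, 14),
                   (54, 26, 45, 13), (57, 30, 42, 12)],
    "virginica": [(73, 29, 58, 17), (64, 26, 51, 22), (72, 28, 49, 16),
                  (77, 30, 48, 22), (67, 26, 50, 19)],
}
ATTRS = ("SL", "SW", "PL", "PW")


def sorted_class_means(rows_by_class, col):
    """Steps 1 and 2: mean of each class for one column, sorted ascending."""
    return sorted(
        (sum(row[col] for row in rows) / len(rows), cls)
        for cls, rows in rows_by_class.items()
    )


def best_cut(rows_by_class):
    """Steps 3 and 4: pick the attribute and cut with the largest relative gap.

    Assumption: relative gap = gap / lower mean.
    Returns (attribute, class just below the cut, cut point).
    """
    best = None
    for col, attr in enumerate(ATTRS):
        means = sorted_class_means(rows_by_class, col)
        for (lo, lo_cls), (hi, _) in zip(means, means[1:]):
            gap = hi - lo
            rel = gap / lo
            if best is None or rel > best[0]:
                best = (rel, attr, lo_cls, lo + gap / 2)
    return best[1:]


attr, cls, cut = best_cut(LEVEL1)
col = ATTRS.index(attr)
below = {c for c, rows in LEVEL1.items() if all(r[col] < cut for r in rows)}
above = {c for c, rows in LEVEL1.items() if all(r[col] > cut for r in rows)}
print(attr, cls, round(cut, 1), below, above)
```

On this data the sketch selects the PW cut between setosa and versicolor, matching the slide's cH = 2 + 11.6/2.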

FAUST using impure pTrees (ipTrees), page 3

In the previous two FAUST slides, three-level 60% ipTrees were used (leaves are level=0, root is level=2), with each level=1 bit representing the predicate truth applied to 10 consecutive iris samples (leaf bits), i.e., level=1 stride=10. Below, instead of taking the entire 150 IRIS samples, 24 are selected from each class as training samples (every other one in the list of 50); the 60% is replaced by 50%; and level=1 stride=10 is replaced first with level=1 stride=12, then with level=1 stride=24. Note: the means (averages) are almost the same in all cases.

First, form 3-level gt50% ipTrees with level=1 stride=12 (each of the 2 level=1 bits per class strides 12 of the 24 samples). Level_1 bit slices for SL, SW, PL, PW, followed by the values they encode:
se 110011 100110 001111 000000 | 51 38 15 0
se 110010 100010 001110 000010 | 50 34 14 2
ve 0111001 011100 101101 001110 | 57 28 45 14
ve 0111111 011110 101000 001000 | 63 30 40 8
vi 1001000 011100 110001 010010 | 72 28 49 18
vi 1000101 011110 110000 010110 | 69 30 48 22

Second, form 3-level gt50% ipTrees with level=1 stride=24 (each level=1 bit strides 24 of 24, i.e., just a root above 3 leaf strides, 1 for each class). level_1 s24gt50_PSL,j s24gt50_PSW,j s24gt50_PPL,j s24gt50_PPW,j, followed by the values they encode:
se 110011 100010 001111 000010 | 51 34 15 2
ve 0111001 011110 101001 001110 | 57 30 41 14
vi 1001001 011110 110001 010110 | 73 30 49 22

Conclusion: for uncompressed 50% ipTrees (with root truth values), are the root values close to the means?

3. Rough pTrees
A pTree is defined by a Tuple Set Predicate (T/F on every set of tuples). E.g., for bit-slices, roughly pure1 might have the predicate "≥ x% 1-bits", 0<x<100. Pure1 pTrees have the predicate "100% 1-bits" and Pure0 pTrees the predicate "0% 1-bits". To be a little more complete: given a table T(A,B,C) and the units bit-slice on T.A (1 iff the A-value is odd), the rough predicate "≥ 75% 1-bits" on a set of tuples, S, is 1 (true) iff ≥ 75% of the A-values in S are odd.
pTree creation is a one-time cost. Storage is effectively unlimited (many pTrees are fine). In fact, our security shuffle will benefit from the added pTrees.
Research problem: combine multiple pTree levels and roughness. Multi-level pTree upper levels can be info sparse (mostly 0s?). The rougher the predicate, the more upper-level 1-bits.
Metadata of an inode: fanout, segment length it strides, roughness %.

FAUST means-seq, level_1 Rough pTrees (60%, 40%). Initially P_REMAINING = pure1 (all records yet to be processed).
1. For each attr, calculate the mean for each class and sort asc. Calculate all mean_gaps = diff_between_consec_means:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2
Choose the best class and attribute for cutting. gapL is the gap on the low side of a mean; gapH is the gap on the high side.
2. Remove the record with max gapRELATIVE.

First cut, on PW: cH = 2 + 11.6/2 = 7.8 (perfect on setosa!)
CLASS PW: setosa 2 2 2 2 2; versicolor 15 14 14 13 12; virginica 17 22 16 22 19

Remaining classes:
SL: ve 45 (25.6), vi 70.6
SW: ve 27.6 (.2), vi 27.8
PL: ve 41.8 (9.4), vi 51.2
PW: ve 13.6 (5.6), vi 19.2

Last step, on SL: cH = 45 + 25.6/2 = 57.8 (perfect classification of the rest!)
CLASS SL: versicolor 1 56 57 54 57; virginica 73 64 72 77 67

Alternatively for the last step (PW): cH = 13.6 + 5.6/2 = 16.4 (one mistake only!)
CLASS PW: versicolor 15 14 14 13 12; virginica 17 22 16 22 19

Another alternative for the last step (PL): cH = 41.8 + 9.4/2 = 46.5 (perfect!)
CLASS PL: versicolor 45 45 32 45 42; virginica 58 51 49 48 50

FAUST Oblique
Separate class R from class V using the midpoint-of-means method: calculate a.
P_R = P_(X o d < a), where D ≡ mR→mV is the oblique vector and d = D/|D|.
View mR, mV as vectors (mR ≡ the vector from the origin to the point mR). Then a = (mR + (mV-mR)/2) o d = ((mR+mV)/2) o d. (The very same formula works when D = mV→mR, i.e., points to the left.)
Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use:
1. the vector_of_medians, vom, to represent each class, rather than the mean: vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...);
2. project each class onto the d-line (e.g., the R class below); then calculate the std of these distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[Figure: scatter of r and v points over dim 1 and dim 2, showing mR, mV, vomR, vomV, the d-line and the cut point a.]

PX dot d>a = PdiXi>a AND 2 pTrees masks P(mrmv)/|mrmv|oX<a P(mvmr)oX>(mr+mv)/2od masks vectors that makes a shadow on mr side of the midpt b r r r v v r mr r v v v r r v mv v r b v v r b b v b mb b b b b b b r r r v v r mr r v v v r r v mv v r v v r v grb grb grb grb grb grb grb grb grb bgr bgr bgr bgr bgr bgr bgrbgr bgr bgr D g D = mrmv For classes r and b For classes r and v 4. FAUST Oblique:length, std, rkK for selecting best gap and multiple attrs. formula: P(X dot D)>aX any set of vectors. D=oblique vector (Note: if D=ei, PXi > a ). E.g.,? Let D=vector connecting class means and d= D/|D| To separate r from v: D = (mvmr), a = (mv+mr)/2 o d NOTE:!!! The picture on this page could be misleading. See next slide for a clearer picture FAUST-Oblique: Create tbl, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: If we just pick the one class which when paired with r, gives max gap, then we can use max gap or max_std_Int_pt instead of max_gap_midpt. Then need stdj (or variancej) in TBL. Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier? a P(mbmr)oX>(mr+m)|/2od "outermost = "furthest from means (their projs of D-line); best rankK points, best std points, etc. "medoid-to-mediod" close to optimal provided classes are convex. In higher dims same (If "convex" clustered classes, FAUST{div,oblique_gap} finds them. r

PX dot d>a = PdiXi>a 4. FAUST Oblique:midpt, std, rkK for selecting best gap and multiple attrs. formula:P(X dot D)>a X any set of vectors. D≡ mrmvis the oblique vector (Note: if D=ei, PXi>a ) and let d=D/|D| To separate r from v: Using means_midpoint, calculate a as follows: Viewing mr and mv as vectors ( e.g., mr≡originpoint_mr ), a = ( mr + (mv-mr)/2 ) o d = (mr+mv)/2o d r r r v v r mr r v v v r r v mv v r v v r v d a

3. Rough pTrees
pTrees are defined by Tuple Set Predicates (T/F on every set of tuples). E.g., for bit-slices, the predicate for roughly pure1 might be "≥ x% 1-bits", 0<x<100. We note that rough pTrees coincide with pure pTrees unless they are multi-level (compressed). The lowest level of a rough pTree is identical to that of the corresponding pure1 pTree (assuming x>0). Pure1 pTrees can be viewed in the same way, as pTrees with the predicate "100% 1-bits", and Pure0 pTrees as pTrees with the predicate "0% 1-bits".
Given a table T(A,B,C) and the units bit-slice on T.A (1 iff the A-value is odd), the rough predicate "≥ 75% 1-bits" on a set of tuples, S, is 1 (true) iff ≥ 75% of the A-values in S are odd.
pTree creation is a one-time cost. Storage is effectively unlimited (many pTrees are fine). Our security shuffle benefits from added pTrees.
Research problem: combine multiple pTree levels and roughness. Multi-level pTree upper levels are info sparse (mostly 0s?). The rougher the predicate, the more upper-level 1-bits.
Metadata of an inode: fanout, segment length it strides, roughness %.
[Table: level_1 bit slices s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j and s10gt60_PPW,j, one row per level-1 stride (5 setosa, 5 versicolor, 5 virginica); the values these bits encode are the level-1 values listed below.]

Class means (mn) per attribute, sorted ascending, with the gap to the next mean in parentheses:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2

We can't cluster (or classify image pixels) at level-k if level-k "pts" (the tuple sets of level-k segments) substantially span >= 2 image training classes. For each level-k point that substantially spans clusters 1 and 2 about equally, one would expect that the method applied at level-k would not make a clear choice. If it did, there would be something wrong, because the info is just not there.
Here's the point (regarding image classification). The IRIS results suggest: if 150 tuples were given for classification into 3 classes (50 training samples for each class: setosa, versicolor and virginica), then, knowing the classes in the training set, we can adjust our level_strides so that the upper-level pTrees see the same training classes (and just as clearly, which is what's startling and great!) as the full training set does.
We have done that (witness: setosa training samples are rows 1-50, versicolor are 51-100 and virginica are 101-150; and all strides fit those boundaries).

level-1 values (SL SW PL PW):
setosa 38 38 14 2
setosa 50 38 15 2
setosa 50 34 16 2
setosa 48 42 15 2
setosa 50 34 12 2
versicolor 1 24 45 15
versicolor 56 30 45 14
versicolor 57 28 32 14
versicolor 54 26 45 13
versicolor 57 30 42 12
virginica 73 29 58 17
virginica 64 26 51 22
virginica 72 28 49 16
virginica 77 30 48 22
virginica 67 26 50 19

Means (SL SW PL PW):
Level-1 mn 54.2 30.8 35.8 11.6
setosa 47.2 37.2 14.4 2
versicolor 45 27.6 41.8 13.6
virginica 70.6 27.8 51.2 19.2

3. (cont) FAUST means-seq, level_1 Rough pTrees (60%, 40%). Initially P_REMAINING = pure1 (all records yet to be processed).
1. For each attr, calculate the mean for each class and sort asc. Calculate all mean_gaps = diff_between_consec_means:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2
Choose the best class and attribute for cutting. gapL is the gap on the low side of a mean; gapH is the gap on the high side.
2. Remove the record with max gapRELATIVE.

First cut, on PW: cH = 2 + 11.6/2 = 7.8 (perfect on setosa!)
CLASS PW: setosa 2 2 2 2 2; versicolor 15 14 14 13 12; virginica 17 22 16 22 19

Remaining classes:
SL: ve 45 (25.6), vi 70.6
SW: ve 27.6 (.2), vi 27.8
PL: ve 41.8 (9.4), vi 51.2
PW: ve 13.6 (5.6), vi 19.2

Last step, on SL: cH = 45 + 25.6/2 = 57.8 (perfect classification of the rest!)
CLASS SL: versicolor 1 56 57 54 57; virginica 73 64 72 77 67

Alternatively for the last step (PW): cH = 13.6 + 5.6/2 = 16.4 (one mistake only!)
CLASS PW: versicolor 15 14 14 13 12; virginica 17 22 16 22 19

Another alternative for the last step (PL): cH = 41.8 + 9.4/2 = 46.5 (perfect!)
CLASS PL: versicolor 45 45 32 45 42; virginica 58 51 49 48 50

4. FAUST Oblique: using length, std or rankK to determine the best gap and/or using multiple attrs
We have a pTree ALGEBRA (pTree operators AND, OR, COMP, XOR, ... and their algebraic properties).
We have a pTree CALCULUS (functions that produce the pTree mask for just about any pTree-defining predicate).
Multi-attribute "FAUST-Oblique" mask pTree formula: P_(X o D > a), where X is any set of vectors and D is an oblique vector (if D = e_i = (0,...,1,...,0), then this is just the existing EIN formula for the ith dimension, P_(Xi > a)). P_(d o X > a) = P_(Σ d_i X_i > a).
FAUST-Oblique based heuristic: instead of finding the best D, take as D the vector connecting a given class mean to another class mean (and d = D/|D|). For classes r and v, D = mr→mv and the r mask is P_((mr→mv)/|mr→mv| o X < a),
where a can be calculated in any of the following ways (mr is a medoid for class r, i.e., the mean or vector_of_medians):
1. a = (d o mr + d o mv)/2.
2. Letting ar = max{d o r} and av = min{d o v} (when d o mr < d o mv, else reverse max and min), take a = av.
3. Using variance gap fits (or rankK gap fits), as detailed in the appendix slides.
Apply to other classes in a particular order (by quality of gap)?

FAUST-Oblique: for isolating a class
1. Create table TBL(class_i, class_j, medoid_vector_i, medoid_vector_j).
2. Apply the pTree mask formula above.
Notes:
1. If we take the fastest route and just pick the one class which, when paired with r, gives the max gap, then we can use max gap or maximum_std_Intersection_point instead of max_gap_midpoint. Then we need std_j (or variance_j) in TBL.

[Figure: scatter of r and v points with means mr and mv and the vector D = mr→mv.]
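The first two ways of choosing a can be compared side by side. The sketch below is illustrative only: `cut_candidates` is a hypothetical helper, the points are made up, and d is taken as e_1 = (1, 0) so that the projections are simply the first coordinates (the case the slide notes reduces to P_(Xi > a)).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cut_candidates(d, class_r, class_v, m_r, m_v):
    """Return (a_mid, a_r, a_v) for a unit vector d.

    a_mid is option 1: the midpoint of the projected medoids.
    a_r and a_v are the extreme projections of each class facing the other
    class (option 2 takes a = a_v).
    """
    a_mid = (dot(d, m_r) + dot(d, m_v)) / 2
    proj_r = [dot(d, x) for x in class_r]
    proj_v = [dot(d, x) for x in class_v]
    if dot(d, m_r) < dot(d, m_v):
        a_r, a_v = max(proj_r), min(proj_v)
    else:  # classes are in the opposite order along d: reverse max and min
        a_r, a_v = min(proj_r), max(proj_v)
    return a_mid, a_r, a_v


# Hypothetical classes and their means (illustration only).
R = [(1.0, 3.0), (1.5, 3.0), (1.2, 2.4), (0.6, 2.4)]
V = [(2.2, 2.1), (2.3, 3.0), (2.0, 2.4), (2.5, 2.4)]
M_R = (1.075, 2.7)
M_V = (2.25, 2.475)

a_mid, a_r, a_v = cut_candidates((1.0, 0.0), R, V, M_R, M_V)
print(a_mid, a_r, a_v)
```

When a_r < a_v the classes do not overlap along d, and any a in between separates them; the options differ only in where inside that gap the cut is placed.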

FAUST Oblique, F(x) = D1 o x: Scalar pTreeSet (column of reals), SP_F(X)
pTree calculated: mod(int(SP_F(X) / 2^exp), 2), with SP_(D1 o X) = SP_(Σ D1,i X_i).
D1 = (1, -½)

Points, F(x) = D1 o x, and F(x) - min:
a (1,3): F(a) = 1*1.0 - ½*3.0 = -.5; F - min = 0.1
b (1.5,3): F(b) = 1*1.5 - ½*3.0 = 0; F - min = 0.6
c (1.2,2.4): F(c) = 1*1.2 - ½*2.4 = 0; F - min = 0.6
d (.6,2.4): F(d) = 1*0.6 - ½*2.4 = -.6; F - min = 0
e (2.2,2.1): F(e) = 1*2.2 - ½*2.1 = 1.15; F - min = 1.75
f (2.3,3): F(f) = 1*2.3 - ½*3.0 = .8; F - min = 1.4
g (2,2.4): F(g) = 1*2.0 - ½*2.4 = .8; F - min = 1.4
h (2.5,2.4): F(h) = 1*2.5 - ½*2.4 = 1.3; F - min = 1.9

[Table: bit slices pD1,0 pD1,-1 pD1,-2 pD1,-3, pe1,0 pe1,-1 pe1,-2 pe1,-3 and pe2,0 pe2,-1 pe2,-2 pe2,-3 for the eight points.]

Hull intervals of the two clusters:
SP_(e1 o X): h1 [.6,1.5], h2 [2,2.5]
SP_(e2 o X): h1 [2.4,3], h2 [2.1,3]
SP_(D1 o X) - mn: h1 [0,.6], h2 [1.4,1.9]

Idea: incrementally build clusters one at a time using all F values. E.g., start with one pt, x. Recall F is distance dominated, which means actual distance ≥ F difference. If the hull is close to the convex hull, does max Fdiff approximate distance? Then does the 1st gap in maxFdiff isolate the x-cluster?

[Figure: the eight points plotted in the plane with the D1 line; projections appear along it in the order F(d), F(a), F(b)=F(c), F(f)=F(g), F(e), F(h).]
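The F column, its shift by the minimum, and the bit-slice formula mod(int(F/2^exp), 2) can be reproduced with a short sketch. It assumes the eight points and D1 = (1, -½) from this slide; the helper names, the rounding to two decimals, and the "split at the largest gap" step are illustrative choices, not something the slide prescribes.

```python
D1 = (1.0, -0.5)
POINTS = {
    "a": (1.0, 3.0), "b": (1.5, 3.0), "c": (1.2, 2.4), "d": (0.6, 2.4),
    "e": (2.2, 2.1), "f": (2.3, 3.0), "g": (2.0, 2.4), "h": (2.5, 2.4),
}


def functional(x, direction=D1):
    """F(x) = D1 o x."""
    return sum(di * xi for di, xi in zip(direction, x))


def bit_slice(value, exp):
    """Bit of `value` at position 2**exp: mod(int(value / 2**exp), 2)."""
    return int(value / 2 ** exp) % 2


f_vals = {name: functional(p) for name, p in POINTS.items()}
f_min = min(f_vals.values())
# Shift so the smallest value is 0; round to hide float noise (illustrative choice).
shifted = {name: round(v - f_min, 2) for name, v in f_vals.items()}

# Bit slices of the shifted column at positions 0, -1, -2, -3.
slices = {exp: [bit_slice(shifted[n], exp) for n in POINTS] for exp in (0, -1, -2, -3)}

# Split the points at the largest gap in the shifted F values.
vals = sorted(set(shifted.values()))
gap, lo, hi = max((b - a, a, b) for a, b in zip(vals, vals[1:]))
h1 = sorted(n for n, v in shifted.items() if v <= lo)
h2 = sorted(n for n, v in shifted.items() if v >= hi)
print(shifted, slices[0], slices[-1], h1, h2)
```

The two groups found this way have shifted F ranges [0, .6] and [1.4, 1.9], the same intervals the slide lists for h1 and h2.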

FAUST Oblique, F(x) = D1 o x: Scalar pTreeSet (column of reals), SP_F(X)
pTree calculated: mod(int(SP_F(X) / 2^exp), 2), with SP_(D1 o X) = SP_(Σ D1,i X_i).
D1 = (1, -½)

Points, F(x) = D1 o x, and F(x) - min:
a (1,3): F(a) = 1*1.0 - ½*3.0 = -.5; F - min = 0.1
b (1.5,3): F(b) = 1*1.5 - ½*3.0 = 0; F - min = 0.6
c (1.2,2.4): F(c) = 1*1.2 - ½*2.4 = 0; F - min = 0.6
d (.6,2.4): F(d) = 1*0.6 - ½*2.4 = -.6; F - min = 0
e (2.2,2.1): F(e) = 1*2.2 - ½*2.1 = 1.15; F - min = 1.75
f (2.3,3): F(f) = 1*2.3 - ½*3.0 = .8; F - min = 1.4
g (2,2.4): F(g) = 1*2.0 - ½*2.4 = .8; F - min = 1.4
h (2.5,2.4): F(h) = 1*2.5 - ½*2.4 = 1.3; F - min = 1.9

Hull intervals of the two clusters:
SP_(e1 o X): h1 [.6,1.5], h2 [2,2.5]
SP_(e2 o X): h1 [2.4,3], h2 [2.1,3]
SP_(D1 o X) - mn: h1 [0,.6], h2 [1.4,1.9]

Max F difference from each starting point (columns a b c d e f g h) and the resulting cluster:
mxFdf(a): 0 .5 .6 .6 1.65 1.3 1.3 1.8; {a,b,c,d} a-cluster. Gap=.7
mxFdf(b): .5 0 .6 .9 1.15 .8 .8 1.3; all in b-cluster.
mxFdf(c): .6 .6 0 .6 1.15 1.1 .8 1.3; all in c-cluster.
mxFdf(d): .6 .9 .6 0 1.75 1.7 1.4 1.9; {a,b,c,d} d-cluster. Gap=.5
mxFdf(e): 1.65 1.15 1.15 1.75 0 .9 .35 .3; {e,g,h} e-cluster. Gap=.55
mxFdf(f): 1.3 .8 1.1 1.7 .9 0 .6 .6; all in f-cluster.
mxFdf(g): 1.3 .8 .8 1.4 .35 .6 0 .5; {b,c,e,f,g,h} g-cluster. Gap=.5
mxFdf(h): 1.8 1.3 1.3 1.9 .3 .6 .5 0; {e,f,g,h} h-cluster. Gap=.7

Incrementally build clusters 1 at a time with F values. E.g., start with 1 pt, x. Recall F is distance dominated, which means actual separation ≥ F separation. If the hull is well developed (close to the convex hull), does max Fdiff approximate distance? Then does the 1st gap in maxFdiff isolate the x-cluster?

[Figure: the eight points plotted in the plane with the D1 line; projections appear along it in the order F(d), F(a), F(b)=F(c), F(f)=F(g), F(e), F(h).]
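A row of max F differences can be recomputed as the largest absolute difference over a set of functionals. The sketch below assumes that set is {e1, e2, D1}, which is inferred from the three scalar columns named on the slide (SP_(e1 o X), SP_(e2 o X), SP_(D1 o X)) rather than stated there; `mx_f_diff` is a hypothetical name and the rounding is only to suppress float noise.

```python
POINTS = {
    "a": (1.0, 3.0), "b": (1.5, 3.0), "c": (1.2, 2.4), "d": (0.6, 2.4),
    "e": (2.2, 2.1), "f": (2.3, 3.0), "g": (2.0, 2.4), "h": (2.5, 2.4),
}
# Assumed functional set: the two coordinate projections plus D1 = (1, -1/2).
FUNCTIONALS = {"e1": (1.0, 0.0), "e2": (0.0, 1.0), "D1": (1.0, -0.5)}


def functional(direction, x):
    return sum(di * xi for di, xi in zip(direction, x))


def mx_f_diff(start):
    """For each point y, the max over functionals F of |F(y) - F(start)|."""
    x = POINTS[start]
    return {
        name: round(
            max(abs(functional(d, y) - functional(d, x)) for d in FUNCTIONALS.values()),
            2,
        )
        for name, y in POINTS.items()
    }


row_h = mx_f_diff("h")
row_a = mx_f_diff("a")
print(row_h)
print(row_a)
```

Under that assumption the computed rows for h and a agree with the values printed on the slide.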

4. cont: Multi-attribute Oblique (FAUST-O) heuristic: instead of finding the best D, take the vector connecting class means as D (and d = D/|D|).
To separate r from v: D = (mv→mr) and a = ((mv+mr)/2) o d. P_((mv→mr) o X > ((mr+mv)/2) o d) masks the vectors that make a shadow on the mr side of the midpoint.
To separate r from b: D = (mb→mr) and a = ((mb+mr)/2) o d, giving P_((mb→mr) o X > ((mr+mb)/2) o d).
ANDing the two pTrees masks the region (which is r).
Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier? "Outermost" = "furthest from means" (their projections on the D-line). By "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rankK points, the best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex.
In higher dims the same holds: if the classes are "convex" clustered, FAUST{div,oblique_gap} can find them (consider greenish-reddish-blue and bluish-greenish-red).
Final note: I should say "linearly separable" instead of "convex" (a slightly weaker condition).

[Figure: scatter of r, v and b class points with means mr, mv, mb and vector D, shown for classes r and v and for classes r and b, plus a band of mixed "grb"/"bgr" points.]

IRIS (3,2,5,5)-leveled 60% rough pTrees for PPW,1 (the 1-bit counts of the level-0 bits under each node are given in parentheses):
level_4 = s3_s2_s5_s5_gt60_PPW,1 (each bit strides 3 bits at level_3): 0 (count 86)
level_3 = s2_s5_s5_gt60_PPW,1 (each bit strides 2 bits at level_2): 100 (counts 34 22 30)
level_2 = s5_s5_gt60_PPW,1 (each bit strides 5 bits at level_1): 11 10 01 (counts 17 17 11 11 13 17)
level_1 = s5_gt60_PPW,1 (each bit strides 5 bits at level_0): 11111 10111 10110 11000 10010 11111 (counts 53333 42443 31331 43121 32152 34334)
level_0 = PPW,1: 11111 01110 11001 00111 10101 10111 10010 11011 11110 10011 11100 00100 11100 10011 00100 11011 11001 01000 01010 00010 01011 00110 01000 11111 10010 11100 10111 10110 01110 11011
Node 2.2: 11 1's out of 25, not 15 (>=60%).

Consider Node 2.2, which is an s5_s5_gt60 node as described above. Note that there are only 11 1-bits out of 25 at the leaf level (level-0) of its subtree, which is well short of the 15 required for gt60%. Thus the node truth value is at least misleading: it does correctly indicate that gt60% of the next-level bits are 1-bits, but it incorrectly suggests that gt60% of the raw level-0 bits are 1-bits.
One way around this problem is to use pure1 above level-1. That way, a 2.2 1-bit would indicate that all 5 level-1 bits are 1's, and thus all level-0 5-bit strings have a majority of 1-bits (at least 3). Thus the level-0 stride of 2.2 has at least 15 1-bits, and the "true" at 2.2 correctly indicates that there is a majority of 1-bits strided by it at level-0 (as well as at level-1).
However, what do we do if (as is the case above) 2.2 strides a majority of level-1 1-bits but a minority of level-0 1-bits? The use of either a 0 or a 1 bit at 2.2 is misleading. I suggest: residualize all rough pTree bit-vectors (as done for the gt60 predicate above) and then, for each inode, residualize the level count arrays (for level-1 and up):
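The Node 2.2 discrepancy can be checked directly by rolling the leaf bits up through the (5,5,2,3) strides and counting leaf 1-bits under each node. This is a minimal sketch with hypothetical helper names; it reads the predicate as "at least 60%", matching the ">=60%" annotation on the slide, and it indexes nodes from 0 within a level.

```python
# Level-0 bits of P_PW,1 from the slide (30 groups of 5 bits).
LEAF = (
    "11111 01110 11001 00111 10101 10111 10010 11011 11110 10011 "
    "11100 00100 11100 10011 00100 11011 11001 01000 01010 00010 "
    "01011 00110 01000 11111 10010 11100 10111 10110 01110 11011"
).replace(" ", "")
STRIDES = (5, 5, 2, 3)  # level_1 strides 5 leaf bits, level_2 strides 5 level_1 bits, ...


def roll_up(bits, stride, percent=60):
    """One bit per stride: '1' iff at least `percent`% of the strided bits are 1."""
    return "".join(
        "1" if 100 * bits[i:i + stride].count("1") >= percent * stride else "0"
        for i in range(0, len(bits), stride)
    )


levels = [LEAF]
for stride in STRIDES:
    levels.append(roll_up(levels[-1], stride))


def leaf_ones(level, node):
    """Return (1-bit count, span) of the level-0 bits under `node` at `level`."""
    span = 1
    for stride in STRIDES[:level]:
        span *= stride
    return LEAF[node * span:(node + 1) * span].count("1"), span


ones, span = leaf_ones(level=2, node=2)
print(levels[1:], ones, span)
```

Node 2 at level 2 is a 1-bit because 3 of its 5 level_1 children are 1, yet only 11 of the 25 leaf bits beneath it are 1, which is the mismatch the slide describes.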

Rough pTrees are pTrees in which the predicate gives a definition of roughly or nearly pure. Recall that all pTrees are defined by a Tuple Set Predicate (TSP), which evaluates to True or False on every set of tuples (rows) of the horizontal table that is being represented vertically by those pTrees. E.g., for a bitslice (which is a 1-column table), roughly pure1 might be defined by the TSP "at least 75% of the bits are 1-bits". Rough pTrees can be raw (uncompressed) or Multi-Level (with any number of levels from 1 on up, as can any pTree), since they are bona fide pTrees, albeit with a different predicate: a "roughly pure" predicate.
These rough pTrees (in which we tune the choices of the "roughness" to the data characteristics or statistics??) would be created and residualized along with the pure pTrees (which are also rough pTrees at the extremes, 100% and 0% 1-bits). Creation is a one-time cost. The extra storage space is a non-issue in this age of effectively unlimited storage. And the addition of many, many more pTrees redundantly representing a data table means, among other things, that we can apply our "security shuffle" much more effectively (needing fewer, if any, bogus pTrees?). We can use multiple levels of roughness together in the same algorithm (e.g., FAUST).
A research problem: effectively combine multiple pTree levels and roughness. When we create multi-level pTrees, we often see the upper levels become "info sparse" or even "info free" (all zeros). The consequences include the fact that those levels may be of no data mining value, and sometimes only the leaf level is of value. Using a rougher pTree predicate instead populates any pTree with more upper-level 1-bits. For a given table or data area, how many levels and which definition(s) of roughness provide the most data mining advantage? The metadata of an inode would include (in general) its fanout, the segment length it strides, and its roughness percentage.
Another way to include that option is to require that any pTree be built using a constant global roughness. We could switch to pTrees of a different roughness (for the same bit slice) in our data mining as our algorithm reaches a given inode level.
It would be impossible to accurately cluster (or classify image pixels) at, say, level-k of a pTree if level-k "points" (the tuple sets making up level-k segments) substantially spanned two or more of the clusters (or image training classes). For each level-k point that substantially spans clusters 1 and 2 about equally, one would expect that the method applied at level-k would not make a clear choice. If it did, there would be something wrong, because the info is just not there. The intent of the previous slides is to demonstrate that we can get the same accuracy from level-3 as from level-0 (at least sometimes) if the level-k points do not span classes in this way. Another way to look at it is that we cannot mine information out of a set of upper-level pTrees if there is no information at that level (keeping in mind that the information may be there at lower levels).
So here's the point (regarding image classification). The IRIS results suggest that, if the 150 tuples were given to us as training for classification of other unclassified IRIS tuples into one of the three classes (50 training samples for each class: setosa, versicolor and virginica), then what we have shown (only suggested in general, but proved in this particular case) is that, knowing the classes in the training set, we can adjust our level_strides so that the upper-level pTrees see the same training classes (and just as clearly, which is what's startling and great!) as the full training set does. We have done that (witness: setosa training samples are rows 1-50, versicolor are 51-100 and virginica are 101-150; and all strides fit those boundaries).

FAUST: means-sequential

level_2 (root) = s15_s10_gt60_PPW,1: 1 (the level_2 bit strides 15 level_1 bits)
level_1 = s10gt60_PPW,1: 11111 11100 01011 (each of the 15 level_1 bits strides 10 level_0 bits)
level_0 (there are 150 level_0 bits): 1111101110 1100100111 1010110111 1001011011 1111011111 1110100101 1111011111 1010011011 1100101000 0101000010 0101100110 0100011111 1001011100 1011110110 0111011011

[Table: level_1 bit slices s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j and s10gt60_PPW,j, one row per level-1 stride (5 setosa, 5 versicolor, 5 virginica); the values these bits encode are the level-1 values listed below.]

Class means (mn) per attribute, sorted ascending, with the gap to the next mean in parentheses:
SL: ve 45 (2.2), se 47.2 (23.4), vi 70.6
SW: ve 27.6 (.2), vi 27.8 (9.4), se 37.2
PL: se 14.4 (27.4), ve 41.8 (9.4), vi 51.2
PW: se 2 (11.6), ve 13.6 (5.6), vi 19.2

Initially, let P_REMAINING be pure1 (all records still remain to be processed).
1. For each attribute, calculate the mean for each class and sort asc on mn. Calculate all mean_gaps = difference_between_consecutive_means. Create MT(attr, class, mean, gapL, gapH, gapREL) sorted on gapREL = (gapL+gapH)/mn (gapL is the gap on the lo side of the mean; gapH, on the hi side).
2. Choose and remove the MT record with max gapRELATIVE. Use cL = mean - gapL/2 and cH = mean + gapH/2 for P_L = P_(A>cL) and P_H = P'_(A>cH). Class mask P_CLASS = P_L & P_H & P_REM; update P_REM = P_REM & P'_CLASS.
3. Repeat 2 until all classes have a pTree mask.
4. Repeat 1, 2, 3 until ?.

level-1 values (SL SW PL PW):
setosa 38 38 14 2
setosa 50 38 15 2
setosa 50 34 16 2
setosa 48 42 15 2
setosa 50 34 12 2
versicolor 1 24 45 15
versicolor 56 30 45 14
versicolor 57 28 32 14
versicolor 54 26 45 13
versicolor 57 30 42 12
virginica 73 29 58 17
virginica 64 26 51 22
virginica 72 28 49 16
virginica 77 30 48 22
virginica 67 26 50 19

Means (SL SW PL PW):
Lev1 means 54.2 30.8 35.8 11.6
setosa 47.2 37.2 14.4 2
versicolor 45 27.6 41.8 13.6
virginica 70.6 27.8 51.2 19.2
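The means-sequential loop can be sketched with Python integers standing in for pTree masks (bit i represents level-1 row i, and `&` / `~` play the roles of AND and complement). This is a hedged illustration, not the deck's code: the function names are hypothetical, the MT table is recomputed over the not-yet-masked classes on each pass (one reading of steps 1 to 3), and real pTrees would evaluate P_(A>c) on bit slices rather than by scanning rows.

```python
from statistics import mean

ATTRS = ("SL", "SW", "PL", "PW")
ROWS = [  # level-1 values from the slide: (class, SL, SW, PL, PW)
    ("setosa", 38, 38, 14, 2), ("setosa", 50, 38, 15, 2), ("setosa", 50, 34, 16, 2),
    ("setosa", 48, 42, 15, 2), ("setosa", 50, 34, 12, 2),
    ("versicolor", 1, 24, 45, 15), ("versicolor", 56, 30, 45, 14),
    ("versicolor", 57, 28, 32, 14), ("versicolor", 54, 26, 45, 13),
    ("versicolor", 57, 30, 42, 12),
    ("virginica", 73, 29, 58, 17), ("virginica", 64, 26, 51, 22),
    ("virginica", 72, 28, 49, 16), ("virginica", 77, 30, 48, 22),
    ("virginica", 67, 26, 50, 19),
]
ALL = (1 << len(ROWS)) - 1  # stands in for a pure1 pTree


def gt_mask(col, cut):
    """Bit i is 1 iff row i has attribute value > cut (stand-in for P_(A>cut))."""
    return sum(1 << i for i, row in enumerate(ROWS) if row[col + 1] > cut)


def best_record(classes):
    """Build MT for the given classes; return the record with max gapREL."""
    records = []
    for col in range(len(ATTRS)):
        means = sorted((mean(r[col + 1] for r in ROWS if r[0] == c), c) for c in classes)
        for k, (mn, cls) in enumerate(means):
            gap_l = mn - means[k - 1][0] if k > 0 else None
            gap_h = means[k + 1][0] - mn if k + 1 < len(means) else None
            gap_rel = ((gap_l or 0) + (gap_h or 0)) / mn
            records.append((gap_rel, col, cls, mn, gap_l, gap_h))
    return max(records, key=lambda rec: rec[0])


def means_sequential():
    todo = {row[0] for row in ROWS}
    prem, masks, cuts = ALL, {}, []
    while len(todo) > 1:
        _, col, cls, mn, gap_l, gap_h = best_record(todo)
        mask = prem
        if gap_l is not None:
            mask &= gt_mask(col, mn - gap_l / 2)       # P_L = P_(A>cL)
        if gap_h is not None:
            c_h = mn + gap_h / 2
            mask &= ALL & ~gt_mask(col, c_h)           # P_H = P'_(A>cH)
            cuts.append((ATTRS[col], cls, round(c_h, 1)))
        masks[cls] = mask                              # P_CLASS = P_L & P_H & P_REM
        prem &= ~mask                                  # P_REM = P_REM & P'_CLASS
        todo.remove(cls)
    masks[todo.pop()] = prem                           # last class takes what remains
    return masks, cuts


masks, cuts = means_sequential()
print(cuts)
print({cls: format(m, "015b") for cls, m in masks.items()})
```

On the level-1 values this reproduces the two cuts worked out on the following slides, PW at 7.8 for setosa and then SL at 57.8 for versicolor, with each class mask covering exactly its five level-1 rows.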


IRIS (3,2,5,5)-leveled 60% rough pTrees for PPW,1

level_4 = s3_s2_s5_s5_gt60_PPW,1 (each level_4 bit strides 3 bits at level_3):  0
level_3 = s2_s5_s5_gt60_PPW,1 (each level_3 bit strides 2 bits at level_2):  100
level_2 = s5_s5_gt60_PPW,1 (each level_2 bit strides 5 bits at level_1):  11 10 01
level_1 = s5_gt60_PPW,1 (each level_1 bit strides 5 bits at level_0):  11111 10111 10110 11000 10010 11111
level_0 = PPW,1:
  11111 01110 11001 00111 10101   10111 10010 11011 11110 10011
  11100 00100 11100 10011 00100   11011 11001 01000 01010 00010
  01011 00110 01000 11111 10010   11100 10111 10110 01110 11011

Consider node 2.2 (at level_2, the third node from the left), which is an s5_s5_gt60 node as described above. Note that there are only 11 1-bits out of 25 at the leaf level (level_0) of its subtree, which is well short of the 15 required for >=60%. Thus the node truth value is at least misleading: it correctly indicates that >=60% of the next-level bits are 1-bits, but it incorrectly suggests that >=60% of the raw level_0 bits are 1-bits.

One way around this problem is to use pure1 above level_1. That way, a 1-bit at 2.2 would indicate that all 5 of its level_1 bits are 1's, and thus that every level_0 5-bit string under it has a majority of 1-bits, i.e., at least 3. The level_0 stride of 2.2 would then have at least 15 1-bits, so the "true" at 2.2 would correctly indicate that there is a majority of 1-bits strided by it at level_0 (as well as at level_1).

However, what do we do if (as is the case above) 2.2 strides a majority of level_1 1-bits but a minority of level_0 1-bits? The use of either a 0 or a 1 bit at 2.2 is misleading. I suggest: residualize all rough pTree bit-vectors (as done for the gt60 predicate above) and then, for each inode, residualize the level count arrays (for level_1 and up):
level_4:  86
level_3:  34 22 30
level_2:  17 17 11 11 13 17
level_1:  53333 42443 31331 43121 32152 34334
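The level bit vectors and count arrays on this slide can be recomputed from the 150 level_0 bits. This is a minimal sketch using plain Python lists, assuming the predicate is "at least 60% 1-bits" as the slide's ">=60%" remark indicates; `rough_levels` and `count_arrays` are hypothetical helper names.

```python
# The 150 level_0 bits of PPW,1 as shown on this slide.
LEVEL0 = (
    "11111 01110 11001 00111 10101 10111 10010 11011 11110 10011 "
    "11100 00100 11100 10011 00100 11011 11001 01000 01010 00010 "
    "01011 00110 01000 11111 10010 11100 10111 10110 01110 11011"
)
bits = [int(b) for b in LEVEL0.replace(" ", "")]
STRIDES = [5, 5, 2, 3]  # level_1, level_2, level_3, level_4 strides


def rough_levels(bits, strides, pct=60):
    """Build each level's bit vector from the level directly below it."""
    levels = [bits]
    for stride in strides:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), stride):
            chunk = prev[i:i + stride]
            # Integer arithmetic avoids floating-point threshold issues.
            nxt.append(1 if 100 * sum(chunk) >= pct * len(chunk) else 0)
        levels.append(nxt)
    return levels


def count_arrays(bits, strides):
    """For each level, count the level_0 1-bits strided by each node."""
    counts, span = [], 1
    for stride in strides:
        span *= stride
        counts.append([sum(bits[i:i + span]) for i in range(0, len(bits), span)])
    return counts


levels = rough_levels(bits, STRIDES)
counts = count_arrays(bits, STRIDES)
level_text = ["".join(map(str, lv)) for lv in levels]

# Node 2.2: its level_2 bit is 1, yet only 11 of its 25 level_0 bits are 1.
node_bit, node_count = levels[2][2], counts[1][2]
```

The last line reproduces the discrepancy discussed above: the rough bit says "true" while the count array shows the node is short of 15.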

From the discussion on the previous slide, it seems practical to have the same fanout throughout the tree. Otherwise it is very difficult even to identify inodes (e.g., what does 2.2 mean?). Possible choices: global_fanout = 4 for images, 8 for solids, 64 for sparse numeric non-spatial data columns(?), 1024 for very sparse numeric data columns and for high-cardinality bitmapped categorical columns(?). On the other hand, maybe a single database_global_fanout, so that the processing code is simpler(?).

global_fanout=5 (level count arrays for PPW,1):
  root:     86
  level_3:  69 17
  level_2:  17 17 11 11 13 17
  level_1:  53333 42443 31331 43121 32152 34334
  level_0:  11111 01110 11001 00111 10101 10111 10010 11011 11110 10011 11100 00100 11100 10011 00100 11011 11001 01000 01010 00010 01011 00110 01000 11111 10010 11100 10111 10110 01110 11011

global_fanout=4 (same bits):
  root:     86
  level_3:  41 31 14
  level_2:  11 11 10 09 08 06 07 10 10 04
  level_1:  4331 3233 2341 4113 1223 2211 1222 1423 1423 22
  level_0:  1111 1011 1011 0010 0111 1010 1101 1110 0101 1011 1111 0100 1111 1000 0100 1110 0100 1100 1001 1011 1100 1010 0001 0100 0010 0101 1001 1001 0001 1111 1001 0111 0010 1111 0110 0111 0110 11

As the table grows (global_fanout=4), only the rightmost entries of each count array change:
  level_0 now ends ... 0110 1100 1011:          level_1 ends 1423 223;     level_2 ends 10 10 07;     level_3 = 41 31 17;  root = 89
  level_0 now ends ... 0110 1100 1011 1111 01:  level_1 ends 1423 2234 1;  level_2 ends 10 10 11 01;  level_3 = 41 31 22;  root = 94
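With a single global fanout, a node address such as "2.2" can be decoded mechanically. The sketch below assumes the address means (level, zero-based node index); `level_counts` and `node_span` are hypothetical helpers over plain Python lists, and the bit string is the 150-bit PPW,1 column used in the preceding slides.

```python
LEVEL0 = (
    "11111 01110 11001 00111 10101 10111 10010 11011 11110 10011 "
    "11100 00100 11100 10011 00100 11011 11001 01000 01010 00010 "
    "01011 00110 01000 11111 10010 11100 10111 10110 01110 11011"
)
bits = [int(b) for b in LEVEL0.replace(" ", "")]


def level_counts(bits, fanout):
    """Count arrays for level_1 upward; the last array holds the root count."""
    counts, span = [], fanout
    while True:
        row = [sum(bits[i:i + span]) for i in range(0, len(bits), span)]
        counts.append(row)
        if len(row) == 1:
            return counts
        span *= fanout


def node_span(level, index, fanout, n_bits):
    """Half-open range of level_0 positions strided by node (level, index)."""
    width = fanout ** level
    return index * width, min((index + 1) * width, n_bits)


counts4 = level_counts(bits, 4)
counts5 = level_counts(bits, 5)
start, stop = node_span(2, 2, 5, len(bits))   # node "2.2" under fanout 5
ones_under_2_2 = sum(bits[start:stop])
```

Because every level uses the same fanout, `node_span` needs only the level, the index and the fanout; with mixed strides such as (3,2,5,5) it would also need the whole stride list.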

Sepal Length.Sepal WidthPedal Length.Pedal Wth vi 63 33 60 25 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 0 1 vi 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 1 0 0 1 1 vi 71 30 59 21 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 0 1 vi 63 29 56 18 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 vi 65 30 58 22 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 vi 76 30 66 21 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 vi 49 25 45 17 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 vi 73 29 63 18 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 0 vi 67 25 58 18 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 1 0 0 1 0 vi 72 36 61 25 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 vi 65 32 51 20 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 0 vi 64 27 53 19 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 1 1 0 0 1 1 vi 68 30 55 21 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 vi 57 25 50 20 0 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1 0 1 0 0 vi 58 28 51 24 0 1 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 vi 64 32 53 23 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1 vi 65 30 55 18 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 vi 77 38 67 22 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 0 1 1 0 vi 77 26 69 23 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 1 1 vi 60 22 50 15 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 vi 69 32 57 23 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 1 1 vi 56 28 49 20 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 vi 77 28 67 20 1 0 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 vi 63 27 49 18 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 vi 67 33 57 21 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 vi 72 32 60 18 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 vi 62 28 48 18 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 vi 61 30 49 18 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 vi 64 28 56 21 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 1 vi 72 30 58 16 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 1 0 0 0 0 vi 
74 28 61 19 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 vi 79 38 64 20 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 vi 64 28 56 22 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 vi 63 28 51 15 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 1 1 1 1 vi 61 26 56 14 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 vi 77 30 61 23 1 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 vi 63 34 56 24 0 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 0 vi 64 31 55 18 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 vi 60 30 18 18 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 vi 69 31 54 21 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 0 1 vi 67 31 56 24 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 vi 69 31 51 23 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 vi 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 1 0 0 1 1 vi 68 32 59 23 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 vi 67 33 57 25 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 1 vi 67 30 52 23 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 1 0 1 1 1 vi 63 25 50 19 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 vi 65 30 52 20 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 0 1 0 1 0 0 vi 62 34 54 23 0 1 1 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 vi 59 30 51 18 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 0 Sepal Length.Sepal WidthPedal Length.Pedal Wth se 51 35 14 2 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 1 0 se 49 30 14 2 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 se 47 32 13 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 se 46 31 15 2 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 se 5 36 14 2 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 se 54 39 17 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 se 46 34 14 3 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 se 50 34 15 2 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 se 44 29 14 2 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 1 0 se 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 se 54 37 15 2 0 
1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 se 48 34 16 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 se 48 30 14 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 se 43 30 11 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 se 58 40 12 2 0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 se 57 44 15 4 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 se 54 39 13 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 se 51 35 14 3 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 1 1 se 57 38 17 3 0 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 se 51 38 15 3 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 se 54 34 17 2 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 se 51 37 15 4 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 se 46 36 10 2 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 se 51 33 17 5 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1 se 48 34 19 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 se 50 30 16 2 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 se 50 34 16 4 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 se 52 35 15 2 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1 0 se 52 34 14 2 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 se 47 32 16 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 se 48 31 16 2 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 se 54 34 15 4 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 1 0 0 se 52 41 15 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 se 55 42 14 2 0 1 1 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 se 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 se 50 32 12 2 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 se 55 35 13 2 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 se 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 se 44 30 13 2 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 se 51 34 15 2 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 se 50 35 13 3 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 se 45 23 13 3 0 1 0 1 1 0 1 0 1 
0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 1 se 44 32 13 2 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 se 50 35 16 6 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 se 51 38 19 4 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 se 48 30 14 3 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 se 51 38 16 2 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 se 46 32 14 2 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 se 53 37 15 2 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 se 50 33 14 2 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 ve 70 32 47 14 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 0 ve 64 32 45 15 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 ve 69 31 49 15 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 0 1 0 1 1 1 1 ve 55 23 40 13 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 1 1 0 1 ve 65 28 46 15 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 ve 57 28 45 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 ve 63 33 47 16 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 ve 49 24 33 10 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 ve 66 29 46 13 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 1 0 1 ve 52 27 39 14 0 1 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 1 1 1 0 ve 50 20 35 10 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 1 0 ve 59 30 42 15 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 1 1 1 1 ve 60 22 40 10 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 ve 61 29 47 14 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 ve 56 29 36 13 0 1 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 1 ve 67 31 44 14 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 ve 56 30 45 15 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 1 ve 58 27 41 10 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 ve 62 22 45 15 0 1 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 1 1 1 1 ve 56 25 39 11 0 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 ve 59 32 48 18 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 ve 61 28 40 13 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 ve 63 25 49 15 0 1 1 1 1 
1 1 0 1 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 ve 61 28 47 12 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 ve 64 29 43 13 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 1 ve 66 30 44 14 1 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 0 1 1 1 0 ve 68 28 48 14 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 ve 67 30 50 17 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 0 0 0 1 ve 60 29 45 15 0 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 ve 57 26 35 10 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 ve 55 24 38 11 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 1 ve 55 24 37 10 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 ve 58 27 39 12 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 0 ve 60 27 51 16 0 1 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 ve 54 30 45 15 0 1 1 0 1 1 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 1 ve 60 34 45 16 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 ve 67 31 47 15 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 ve 63 23 44 13 0 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 0 1 ve 56 30 41 13 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 1 1 0 1 ve 55 25 40 13 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 1 1 0 1 ve 55 26 44 12 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 ve 61 30 46 14 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 ve 58 26 40 12 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 ve 50 23 33 10 0 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 ve 56 27 42 13 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 ve 57 30 42 12 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 1 1 0 0 ve 57 29 42 13 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 ve 62 29 43 13 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 1 ve 51 25 30 11 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 ve 57 28 41 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 Notational note: We develop IRIS 6_5_5 pTrees: level_3 (root) segment_stride=6, level_2 segment_stride=5; level_1 segment stride=5; for roughly_pure1 predicate: ">60% 1-bits". 
PPW,1, as a 3-level 60% rough pTree with segment strides of 6, 5, 5:
level_3 (root) = s6_s5_s5_gt60_PPW,1:  1
level_2 = s5_s5_gt60_PPW,1:  1 1 1 0 0 1
level_1 = s5gt60_PPW,1:  11111 10111 10110 11000 10010 11111
level_0 = PPW,1:
  11111 01110 11001 00111 10101   10111 10010 11011 11110 11111
  11101 00101 11110 11111 10100   11011 11001 01000 01010 00010
  01011 00110 01000 11111 10010   11100 10111 10110 01110 11011

2-level 60% rough pTree for PPW,1

level_2 (root) = s15_s10_gt60_PPW,1 (the level_2 bit strides 15 level_1 bits):  1
level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 level_0 bits):  11111 11100 01011
level_0 (there are 150 level_0 bits):
  1111101110 1100100111 1010110111 1001011011 1111011111
  1110100101 1111011111 1010011011 1100101000 0101000010
  0101100110 0100011111 1001011100 1011110110 0111011011

level_1 pTrees s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j (one row per level_1 position, high-order bit first) and the level-1 values they encode:
class         PSL,j    PSW,j   PPL,j    PPW,j     SL  SW  PL  PW
setosa        0100110  100110  0001110  00010     38  38  14   2
setosa        0110010  100110  0001111  00010     50  38  15   2
setosa        0110010  100010  0010000  00010     50  34  16   2
setosa        0110000  101010  0001111  00010     48  42  15   2
setosa        0110010  100010  0001100  00010     50  34  12   2
versicolor    0000001  011000  0101101  01111      1  24  45  15
versicolor    0111000  011110  0101101  01110     56  30  45  14
versicolor    0111001  011100  0100000  01110     57  28  32  14
versicolor    0110110  011010  0101101  01101     54  26  45  13
versicolor    0111001  011110  0101010  01100     57  30  42  12
virginica     1001001  011101  0111010  10001     73  29  58  17
virginica     1000000  011010  0110011  10110     64  26  51  22
virginica     1001000  011100  0110001  10000     72  28  49  16
virginica     1001101  011110  0110000  10110     77  30  48  22
virginica     1000011  011010  0110010  10011     67  26  50  19

Lev1 means    54.2  30.8  35.8  11.6
setosa        47.2  37.2  14.4   2
versicolor    45    27.6  41.8  13.6
virginica     70.6  27.8  51.2  19.2

Class means per attribute, sorted ascending, with the gap to the next mean:
SL  mn    gap      SW  mn    gap      PL  mn    gap      PW  mn    gap
ve  45     2.2     ve  27.6   0.2     se  14.4  27.4     se   2    11.6
se  47.2  23.4     vi  27.8   9.4     ve  41.8   9.4     ve  13.6   5.6
vi  70.6           se  37.2           vi  51.2           vi  19.2


IRIS 3 level rough pTrees

level_2 pTrees s5_s5_gt60_PSL,j (j = 6..0), s5_s5_gt60_PSW,j (j = 5..0), s5_s5_gt60_PPL,j (j = 6..0), s5_s5_gt60_PPW,j (j = 4..0). To the right of the pTrees are the corresponding numbers in decimal.
PSL,j    PSW,j   PPL,j    PPW,j     SL  SW  PL  PW
0110010  100110  0001111  00010     50  38  15   2
0110010  100010  0001101  00010     50  34  13   2
0111001  011100  0101101  01111     57  28  45  15
0111010  011110  0101000  01100     58  30  40  12
1001101  011100  0110011  10101     77  28  51  21
1001011  011110  0111010  10010     75  30  58  18

Can we do effective clustering (classification) at level_1? Yes. Not surprisingly, since we have already demonstrated that ability for IRIS s5gt60 (level_1 of the previous 2 level IRIS pTrees). But more importantly, can we go to level_2, s5_s5_gt60, and still get reasonable clustering? Interestingly, we can (a 25-fold data reduction). Note that in PW, the first two values (2s from setosa) separate from the next four (cut_point=7). Within the final four values, the first two (15, 12 from versicolor) separate from the final two (18, 21 from virginica) with cut_point 17.

level_1 pTrees s5gt60_PSL,j (j = 6..0), s5gt60_PSW,j (j = 5..0), s5gt60_PPL,j (j = 6..0), s5gt60_PPW,j (j = 4..0), followed by their decimal values:
s6_s5_s5_gt60(level_3) 0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 39 38 14 2 54 39 15 2 50 46 14 2 51 38 15 3 50 32 19 2 54 34 16 2 52 43 15 0 51 34 13 2 50 35 13 2 50 36 14 2 69 20 45 15 49 25 47 12 56 28 42 14 58 31 45 15 61 28 41 13 64 30 32 14 54 26 39 10 63 31 45 13 58 26 40 12 57 29 42 13 63 31 58 19 73 29 63 17 64 24 51 20 77 22 55 23 77 24 49 20 72 28 56 18 79 28 56 22 77 30 54 18 67 27 59 19 59 30 54 23 s5_s5_gt60(level_2) s5gt60 (level_1) 50 38 15 2 50 34 13 2 57 28 45 15 58 30 40 12 77 28 51 21 75 30 58 18 The same PW 
cutpoints, 7 and 17, separate the classes perfectly at level_1. Note: except for SW, the two other attributes also cluster the 3 classes perfectly (SL_cutpts=54,63; PL_cutpts=30,48), and in fact, for separating {se} from {ve,vi}, SW_cutpt=32 works! Can you explain the fact that level_2 clusters as well as or better than level_1?

s5gt60_PPL,j 6 5 4 3 2 1 0 IRIS 4 level rough pTrees s5gt60level_1 s5_s5_gt60level_2 s2_s5_s5_gt60level_3 s3_s2_s5_s5_gt60level_4 s5gt60_PSL,j j= 6 5 4 3 2 1 0 s5gt60_PSW,j 5 4 3 2 1 0 s5gt60_PPW,j 4 3 2 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 39 38 14 2 54 39 15 2 50 46 14 2 51 38 15 3 50 32 19 2 54 34 16 2 52 43 15 0 51 34 13 2 50 35 13 2 50 36 14 2 69 20 45 15 49 25 47 12 56 28 42 14 58 31 45 15 61 28 41 13 64 30 32 14 54 26 39 10 63 31 45 13 58 26 40 12 57 29 42 13 63 31 58 19 73 29 63 17 64 24 51 20 77 22 55 23 77 24 49 20 
72 28 56 18 79 28 56 22 77 30 54 18 67 27 59 19 59 30 54 23

PW cutpts, 7 and 16, separate the classes perfectly at levels 1, 2, and 3. PL cutpts, 27 and 48, also separate the classes perfectly at levels 1, 2, and 3. Note that level_4 can't separate, since all classes are entirely spanned by the one node at that level. However, its values are close to the global means:
              SL    SW    PL    PW
global means  58.1  30.5  37.3  11.9
The level_3 values are very good estimates of the class means:
setosa        49.1  34.1  14.6   2.44
versicolor    59.3  27.7  42.6  13.2
virginica     65.8  29.7  54.9  20.2

level_2 pTrees (s5_s5_gt60_PSL,j, s5_s5_gt60_PSW,j, s5_s5_gt60_PPL,j, s5_s5_gt60_PPW,j) and decimal values:
PSL,j    PSW,j   PPL,j    PPW,j     SL  SW  PL  PW
0110010  100110  0001111  00010     50  38  15   2
0110010  100010  0001101  00010     50  34  13   2
0111001  011100  0101101  01111     57  28  45  15
0111010  011110  0101000  01100     58  30  40  12
1001101  011100  0110011  10101     77  28  51  21
1001011  011110  0111010  10010     75  30  58  18

level_3 pTrees (s2_s5_s5_gt60) and decimal values:
0110010  100010  0001101  00010     50  34  13   2
0111000  011100  0101000  01100     56  28  40  12
1001101  011100  0110010  10000     77  30  50  16

level_4 pTrees (s3_s2_s5_s5_gt60) and decimal values:
0111000  011100  0101000  00000     56  28  40   0

PPW,1 at every level:
1 level_4 pTree bit = s3_s2_s5_s5_gt60_PPW,1 (each level_4 bit strides 3 bits at level_3):  0
3 level_3 pTree bits = s2_s5_s5_gt60_PPW,1 (each level_3 bit strides 2 bits at level_2):  100
6 level_2 pTree bits = s5_s5_gt60_PPW,1 (each level_2 bit strides 5 bits at level_1):  11 10 01
30 level_1 pTree bits = s5_gt60_PPW,1 (each level_1 bit strides 5 bits at level_0):  11111 10111 10110 11000 10010 11111
150 level_0 pTree bits = PPW,1:
  11111 01110 11001 00111 10101   10111 10010 11011 11110 11111
  11101 00101 11110 11111 10100   11011 11001 01000 01010 00010
  01011 00110 01000 11111 10010   11100 10111 10110 01110 11011
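The claim that the PW and PL cut points separate the classes at levels 1, 2, and 3 can be checked directly against the decimal values on this slide. This is a minimal sketch under two assumptions that the slide does not state explicitly: rows are ordered setosa, versicolor, virginica in equal thirds, and a value equal to the upper cut point goes to the higher class. The names `classify` and `mistakes` are hypothetical.

```python
# PL and PW decimal values per level, in slide order.
LEVELS = {
    1: {
        "PL": [14, 15, 14, 15, 19, 16, 15, 13, 13, 14,
               45, 47, 42, 45, 41, 32, 39, 45, 40, 42,
               58, 63, 51, 55, 49, 56, 56, 54, 59, 54],
        "PW": [2, 2, 2, 3, 2, 2, 0, 2, 2, 2,
               15, 12, 14, 15, 13, 14, 10, 13, 12, 13,
               19, 17, 20, 23, 20, 18, 22, 18, 19, 23],
    },
    2: {"PL": [15, 13, 45, 40, 51, 58], "PW": [2, 2, 15, 12, 21, 18]},
    3: {"PL": [13, 40, 50], "PW": [2, 12, 16]},
}
CUTS = {"PL": (27, 48), "PW": (7, 16)}
CLASSES = ("setosa", "versicolor", "virginica")


def classify(value, cuts):
    """Assign a class from two cut points (upper cut is inclusive: assumption)."""
    low, high = cuts
    if value < low:
        return CLASSES[0]
    if value < high:
        return CLASSES[1]
    return CLASSES[2]


def mistakes(level, attr):
    """Number of rows at this level that land in the wrong class."""
    values = LEVELS[level][attr]
    third = len(values) // 3
    expected = [cls for cls in CLASSES for _ in range(third)]
    predicted = [classify(v, CUTS[attr]) for v in values]
    return sum(p != e for p, e in zip(predicted, expected))


errors = {(lvl, attr): mistakes(lvl, attr) for lvl in LEVELS for attr in CUTS}
```

Every entry of `errors` is zero for these values, consistent with the slide's "separate classes perfectly" at all three levels.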

FAUST (2011_06_11): Using length, std or rankK to determine the best gap and/or using multiple attributes to improve accuracy.

We have a pTree ALGEBRA (pTree operators AND, OR, COMP, XOR, ... and their algebraic properties).
We have a pTree CALCULUS (functions that produce the pTree mask for just about any pTree-defining predicate).

Multi-attribute "FAUST-Oblique" mask pTree formula: $P_{(X \circ D) > a}$, where X is any set of vectors and D is an oblique vector. (If $D = e_i = (0,\dots,1,\dots,0)$, then this is just the existing EIN formula for the ith dimension, $P_{X_i > a}$.) In coordinates, $P_{d \circ X > a} = P_{\sum_i d_i X_i > a}$.

FAUST-Oblique based heuristic: Instead of finding the best D, take as D the vector connecting a given class mean to another class mean (and $d = D/|D|$). For classes r and v, $D = \overrightarrow{m_r m_v}$ and the mask is $P_{(\overrightarrow{m_r m_v}/|\overrightarrow{m_r m_v}|) \circ X < a}$.
[Figure: scatter plot of class r and class v points with their means m_r and m_v.]

Here $m_r$ is a medoid for class r (i.e., the mean or the vector_of_medians), and a can be calculated in one of these ways:
1. $a = (d \circ m_r + d \circ m_v)/2$
2. Letting $a_r = \max\{d \circ r\}$ and $a_v = \min\{d \circ v\}$ (when $d \circ m_r < d \circ m_v$; else reverse max and min), take $a = a_v$.
3. Using variance gap fits (or rankK gap fits), as detailed in the appendix slides.
Apply to other classes in a particular order (by quality of gap)?

FAUST-Oblique, for isolating a class:
1. Create a table, TBL(class_i, class_j, medoid_vector_i, medoid_vector_j).
2. Apply the pTree mask formula above.

Notes:
1. If we take the fastest route and just pick the one class which, when paired with r, gives the max gap, then we can use max gap or maximum_std_intersection_point instead of max_gap_midpoint. Then we need std_j (or variance_j) in TBL.
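The oblique mask with option 1 for a (the midpoint of the projected means) can be sketched as follows. This is an illustration on ordinary Python tuples, not a pTree computation; the sample points and the names `mean`, `dot` and `oblique_mask` are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def oblique_mask(X, m_r, m_v):
    """Bit mask of the points on the m_r side: d o x < a, with d from m_r to m_v."""
    D = tuple(b - a for a, b in zip(m_r, m_v))   # vector from m_r to m_v
    norm = math.sqrt(dot(D, D))
    d = tuple(c / norm for c in D)               # d = D/|D|
    a = (dot(d, m_r) + dot(d, m_v)) / 2          # option 1: midpoint of means
    return [1 if dot(d, x) < a else 0 for x in X], a


# Hypothetical training points for two classes, r and v.
R = [(1, 1), (2, 1), (1, 2), (2, 2)]
V = [(5, 5), (6, 5), (5, 6), (6, 6)]
mask, a = oblique_mask(R + V, mean(R), mean(V))
```

For this toy data the mask is 1 on every r point and 0 on every v point. Options 2 and 3 would change only how `a` is computed, not the form of the mask.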

Some topics to consider

With all the "break-ins" occurring (e.g., Citibank, etc.), how can data be protected? Can vertical data be protected more easily than horizontal data? Can the pTree representation be useful in protecting data? Some modification of standard pTrees?

Some preliminary ideas:
1. With pTrees you need to know the ordering to have any information.
2. You also need to know which pTree stands where (in which column and which bit slice) to have any information.
3. If all pTrees are made the same length (using the max file length over the database), then we can shuffle/scramble/alter the ordering of columns/slices, and even the ordering itself, to conceal information.

With pTree representations, there are no horizontal data records (as opposed to indexes, which are vertical structures that accompany the horizontal data files). pTrees ARE the data as well as the indexes.

My thoughts include: pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. Therefore we want to devise a mechanism, based on the above notions (or others?), in which the "scrambled" pTree data can be processed without unscrambling it. So I'm thinking that, for data mining purposes, the scrambled pTrees would reveal nothing of the raw data to anyone, but anyone qualified could issue a data mining request (a classification/ARM/clustering request) and get the answer even though the actual data would never be exposed. I suppose that's not much different, really, from encrypting the data, but encrypting massive data stores is never a good option, and decryption is usually necessary to mine information from the store.

APPENDIX: Using a quadratic hyper-surface? (instead of a hyper-plane)

Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue class training points and the 10 bluish-red class training points.
[Figure: scatter plot of the b and r training points above the D-line, marking the D-line mean for the b class, the D-line mean for the r class, the gap, and the Cut-HyperPlane (CHP).]

Take the r and the b points that project closest to the D-line as the "best" support pair; similarly for the "next best" (second best) support pair, and similarly for the "third best" pair.
Form the quadratic support curve for class-r from the three r-support points.
Form the quadratic support curve for class-b from the three b-support points
(or move each point in each pair 1/3 of the way toward the other and then do the above), or ????.

Fitting a parabolic hyper-surface

Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue class training points and the 10 bluish-red class training points.
[Figure: scatter plot of the b and r training points above the D-line, with the Cut-HyperPlane (CHP), $\overrightarrow{m_r m_b} \circ X = |m_r + m_b|/2$.]

Fit a parabola with focus p = b-mean and directrix = the line perpendicular to the D-line through the midpoint of the means. Letting $M = \overrightarrow{m_r m_b}$ and X be a point, we want the mask pTree $P_{M \circ X > d(p,X)}$.

For $M \circ X \ge 0$, $M \circ X > d(p,X) \iff (M \circ X)^2 > d^2(p,X)$, where

$$(M \circ X)^2 = (m_1 x_1 + m_2 x_2)^2 = m_1^2 x_1^2 + 2 m_1 m_2 x_1 x_2 + m_2^2 x_2^2$$

$$d^2(p,X) = (p_1 - x_1)^2 + (p_2 - x_2)^2 = x_1^2 - 2 p_1 x_1 + p_1^2 + x_2^2 - 2 p_2 x_2 + p_2^2$$

Collecting terms, the condition becomes

$$(m_1^2 - 1) x_1^2 + 2 m_1 m_2 x_1 x_2 + (m_2^2 - 1) x_2^2 + 2 p_1 x_1 + 2 p_2 x_2 > p_1^2 + p_2^2$$

so the mask pTree

$$P_{(m_1^2 - 1) x_1^2 + 2 m_1 m_2 x_1 x_2 + (m_2^2 - 1) x_2^2 + 2 p_1 x_1 + 2 p_2 x_2 \,>\, p_1^2 + p_2^2}$$

should do it.
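The algebra behind the quadratic mask can be sanity-checked numerically by comparing $(M \circ X)^2 - d^2(p,X)$ with its expanded polynomial form on a grid of points. This is a small sketch; the integer vectors chosen for M and p are hypothetical, and integers are used so that both sides compare exactly.

```python
def direct(M, p, x):
    """(M o X)^2 - d^2(p, X), computed straight from the definitions."""
    (m1, m2), (p1, p2), (x1, x2) = M, p, x
    return (m1 * x1 + m2 * x2) ** 2 - ((p1 - x1) ** 2 + (p2 - x2) ** 2)


def expanded(M, p, x):
    """The same quantity with the squares multiplied out and terms collected."""
    (m1, m2), (p1, p2), (x1, x2) = M, p, x
    return ((m1 * m1 - 1) * x1 * x1 + 2 * m1 * m2 * x1 * x2
            + (m2 * m2 - 1) * x2 * x2
            + 2 * p1 * x1 + 2 * p2 * x2
            - (p1 * p1 + p2 * p2))


M, p = (2, -1), (3, 1)   # hypothetical direction vector and focus
grid = [(x1, x2) for x1 in range(-5, 6) for x2 in range(-5, 6)]
agree = all(direct(M, p, x) == expanded(M, p, x) for x in grid)
mask = [1 if expanded(M, p, x) > 0 else 0 for x in grid]
```

`agree` being true on the whole grid confirms that the expanded inequality selects the same points as $(M \circ X)^2 > d^2(p,X)$.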

[Figure: coordinate boxes on the R, G, and B axes, with cut_points aR, bR, aG, bG, aB, bB.]

FAUST is a Near Neighbor Classifier. It is not a Voting NNC like pCkNN (where, for each unclassified sample, pCkNN builds around that sample a neighborhood of TrainingSet voters, who then classify the sample through a majority, plurality, or weighted (in PINE) vote). pCkNN classifies one unclassified sample at a time. FAUST is meant for speed and therefore attempts to classify all unclassified samples at one time.

FAUST builds a Big Box Neighborhood (BBN) for each class and then classifies all unclassified samples in the BBN into that class (constructing said class-BBNs with one EIN pTree calculation per class). The BBNs can overlap, so the classification needs to be done one class at a time, sequentially, in maximum gap, maximum number of std's in gap, or minimum rankK in gap order. The whole process can be iterated, as in k-means classification, using the predicted classes [or subsets of them] as the new training set. This can be continued until convergence.

A BBN can be a coordinate box: for coordinate R, cb(R,class,aR,bR) is all x such that aR < xR < bR. Either or both of the < can be ≤. aR and bR are what were called the cut_points of the class. Or BBNs can be multi-coordinate boxes, which are INTERSECTIONs of the best k (k ≤ n-1, assuming n classes) cb's for a given class ("best" can be with respect to any of the above maximizations). And instead of using a fixed number of coordinates, k, we could use only those coordinates in which the "quality" of the cb is higher than a threshold, where "quality" might be measured using the dimensions of the gaps (or in other ways?).

FAUST could be combined with pCkNN (probably in many ways), for example as follows. The FAUST multi-coordinate BBN could be used first to classify the "easy points" (those that fall in an intersection of high-quality BBNs and are therefore fairly certain to be correctly classified). Then the remaining "difficult points" could be classified with pCkNN, using the original training set (or the union of each original TrainingSet class with the new "easy points" of that same class) and the L∞ or Lp distance, p = 1 or 2.
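The sequential Big Box classification described above can be sketched as follows. This is an illustration on dictionaries rather than on EIN pTree masks; the boxes, the class order and the sample values are hypothetical, and `in_box` and `bbn_classify` are made-up helper names.

```python
INF = float("inf")


def in_box(sample, box):
    """True when the sample lies inside every coordinate box: a < x < b."""
    return all(lo < sample[attr] < hi for attr, (lo, hi) in box.items())


def bbn_classify(samples, boxes_in_order):
    """Classify one class at a time; earlier classes claim overlapping samples."""
    labels = [None] * len(samples)
    for cls, box in boxes_in_order:
        for i, sample in enumerate(samples):
            if labels[i] is None and in_box(sample, box):
                labels[i] = cls
    return labels


# Hypothetical multi-coordinate boxes (intersections of PL and PW boxes),
# listed in the order in which the classes are to be processed.
BOXES = [
    ("setosa", {"PL": (-INF, 27), "PW": (-INF, 7)}),
    ("versicolor", {"PL": (27, 48), "PW": (7, 16)}),
    ("virginica", {"PL": (48, INF), "PW": (16, INF)}),
]
SAMPLES = [
    {"PL": 15, "PW": 2}, {"PL": 13, "PW": 2},
    {"PL": 45, "PW": 15}, {"PL": 40, "PW": 12},
    {"PL": 51, "PW": 21}, {"PL": 58, "PW": 18},
    {"PL": 47, "PW": 18},   # falls in no intersection: a "difficult point"
]
labels = bbn_classify(SAMPLES, BOXES)
difficult = [s for s, lab in zip(SAMPLES, labels) if lab is None]
```

The samples left with `None` are the "difficult points" that a second stage such as pCkNN would handle.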

A Multi-attribute Oblique (FAUST-O) based heuristic: Instead of finding the best D, take the vector connecting one class mean to another class mean as D.

To separate r from v: D = (mv→mr) and a = |mv+mr|/2. The pTree P(mv→mr)oX > |mr+mv|/2 masks the vectors that make a shadow on the mr side of the midpoint.

For classes r and b, to separate r from b: D = (mb→mr) and a = |mb+mr|/2, giving the pTree P(mb→mr)oX > |mr+mb|/2. ANDing the two pTrees masks the region (which is r).

[Figure: scatter of the r, v and b class training points with the class means mr, mv and mb marked.]

Question: What is the best choice of cut point? Mean, vector_of_medians, outermost, outermost_non-outlier? By "outermost" I mean the points furthest from the means in each class (in terms of their projections on the D-line); by "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rankK points, the best std points, etc.

Comments on where to go from here (assuming we can do the above): I think the "medoid-to-medoid" method on this page is close to optimal provided the classes are convex. If they are not convex, then some sort of Support Vector Machine (SVM) would be the next step. In SVMs the space is translated to higher dimensions in such a way that the classes ARE convex. The inner product in that space is equivalent to a kernel function in the original space, so that one need not even do the translation to get inner-product-based results (the genius of the method). Final note: I should say "linearly separable" instead of "convex" (a slightly weaker condition).
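A small sketch of the mean-to-mean heuristic, under stated assumptions: the points are plain tuples rather than pTrees, D is left unnormalized, and the cut value is taken as the projection of the midpoint of the two means onto D (one reading of the slide's a = |mr+mv|/2). The function names and the toy points are hypothetical.

```python
def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def oblique_cut(class_r, class_v):
    """Return (D, a): D points from the v-mean to the r-mean, a = D . midpoint of the means."""
    m_r, m_v = mean(class_r), mean(class_v)
    D = [r - v for r, v in zip(m_r, m_v)]
    mid = [(r + v) / 2 for r, v in zip(m_r, m_v)]
    return D, dot(D, mid)

def mask_r_side(points, D, a):
    """One boolean per point: True when the point projects onto the r side of the cut."""
    return [dot(D, x) > a for x in points]

# Hypothetical toy classes.
class_r = [(1, 5), (2, 6), (3, 7)]
class_v = [(5, 1), (6, 2), (7, 3)]
D, a = oblique_cut(class_r, class_v)
r_mask = mask_r_side(class_r + class_v, D, a)
```

Because the test D o X > D o midpoint is unchanged when D is multiplied by a positive constant, normalizing D to a unit vector is not needed for the mask itself.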

Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue (rb) class training points and the 10 bluish-red (br) class training points.

[Figure: scatter plot of the rb and br training points, red on the horizontal axis and blue on the vertical axis. Marked on it: the direction D, the D-line mean for the rb class, the D-line mean for the br class, the gap, the consecutive-class-mean midpoint (= Cut_Point), and the Cut-HyperPlane, CHP (what we are after).]

Clearly we would want to find a ~45 degree unit vector, D, then calculate the means of the projections of the two training sets onto the D-line, then use the midpoint of the gap between those two means as the cut_point (erecting a perpendicular bisector "hyperplane" to D there, which separates the space into the two class big boxes on each side of the hyperplane). Can it be masked using one EIN formula?

[Figure: the same scatter plot with the diagonal cut drawn.]

The above "diagonal" cutting produces a perfect classification (of the training points). If we had considered only cut_points along coordinate axes, it would have been very imperfect!

[Figure: scatter plots (red horizontal, blue vertical) of the rb and br training points, showing candidate directions D, the resulting gaps and the Cut-HyperPlane, CHP.]

How do we search through all possible angles for the D that will maximize that gap? We would have to develop the formula (a pTree-only formula) for the class means for any D and then maximize the gap (the distance between consecutive D-projected means). Take a look at the formulas in the book, think about it, take a look at Mohammad's formulas, and see if you can come up with the mega formula above.

Let D = (D1, …, Dn) be a unit vector (our "cut_line direction vector"). D dot X = D1X1 + … + DnXn is the length of the perpendicular projection of X on D (the length of the high-noon shadow that X makes on the D-line, as if D were the earth). So we project every training point, Xc,i (class = c, i = 1..10), onto D (i.e., D dot Xc,i), calculate the D-line class means, (1/nc) Σi (D dot Xc,i), where nc is the number of training points in class c (10 here), and select the max consecutive mean gap along D (call it best_gap(D) = bg(D)). Maximize bg(D) over all possible D. Harder? Calculate it for a [polar] grid of D's! Maximize over that grid. Then use continuity and hill climbing to improve it, etc.

More likely the situation would be: rb's are more blue than red and br's are more red than blue.

[Figure: scatter plot of the 10 rb and 10 br training points for that situation, with the Cut_point and the D-line means for the rb class and the br class marked.]

What if the training points are shifted away from the origin? This should convince you that it still works.
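The "[polar] grid of D's" step can be sketched in two dimensions as below. This is a plain-Python illustration, not the pTree-only formula the slide asks for; the function names, the number of grid steps and the toy points are assumptions.

```python
import math

def projected_mean(points, D):
    """Mean of the projections D dot X over a list of 2-D points."""
    return sum(D[0] * x + D[1] * y for x, y in points) / len(points)

def best_direction(class_a, class_b, steps=180):
    """Scan unit vectors D at `steps` evenly spaced angles in [0, pi).

    Returns (gap, angle, D) for the direction with the largest gap between
    the two D-projected class means.
    """
    best = None
    for k in range(steps):
        theta = math.pi * k / steps
        D = (math.cos(theta), math.sin(theta))
        gap = abs(projected_mean(class_a, D) - projected_mean(class_b, D))
        if best is None or gap > best[0]:
            best = (gap, theta, D)
    return best

# Hypothetical toy classes: rb is more blue than red, br is more red than blue.
rb = [(1, 5), (2, 6), (3, 7)]
br = [(5, 1), (6, 2), (7, 3)]
gap, theta, D = best_direction(rb, br)
```

For exactly two classes the gap between projected means is |D dot (mean_a - mean_b)|, so the maximum lies along the line joining the two means and the grid simply recovers that direction. The scan is more informative with more than two classes, or when the gap is measured with something other than means.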

[Figure: 3-D scatter over the r, g and b axes of the grb and bgr class training points, with the direction D.]

In higher dimensions, nothing changes. If there are "convex" clustered classes, FAUST{div,oblique_gap} can find them (consider greenish-reddish-blue and bluish-greenish-red).

Before considering the pTree formulas for the above, we note again that any pair of classes (multi-classes, as in divisive) that are convex can be separated by this method. What if they are not convex? [Figure: a 2-D example.]

A couple of comments. FAUST resembles the SVM (Support Vector Machine) method in that it constructs a separating hyperplane in the "margin" between classes. The beauty of SVM (over FAUST and all other methods) is that it is provable that there is a transformation to a higher dimension that renders two classes that are not hyperplane separable into hyperplane separable ones (and you don't actually have to do the transformation, just determine the kernel that produces it). The problem with SVM is that it is computationally intensive. I think we want to keep FAUST simple (and fast!). If we can do this generalization, I think it will be a real winner!

How do we search over all possible oblique vectors, D, for the one that is "best"? Or, if we are to use multi-box neighborhoods, how do we do that? A heuristic method follows:

[Figure: vertical bit slices (pTrees) of the training table below.]

Training set (10 samples per class; attributes sLN, sWD, pLN, pWD). On the slide each row is followed by its binary encoding, using 7, 6, 7 and 5 bits respectively; e.g., the first row, 49 30 14 2, is 0110001 011110 0001110 00010.

cl   sLN  sWD  pLN  pWD
se   49   30   14   2
se   47   32   13   2
se   46   31   15   2
se   54   36   14   2
se   54   39   17   4
se   46   34   14   3
se   50   34   15   2
se   44   29   14   2
se   49   31   15   1
se   54   37   15   2
ve   64   32   45   15
ve   69   31   49   15
ve   55   23   40   13
ve   65   28   46   15
ve   57   28   45   13
ve   63   33   47   16
ve   49   24   33   10
ve   66   29   46   13
ve   52   27   39   14
ve   50   20   35   10
vi   58   27   51   19
vi   71   30   59   21
vi   63   29   56   18
vi   65   30   58   22
vi   76   30   66   21
vi   49   25   45   17
vi   73   29   63   18
vi   67   25   58   18
vi   72   36   61   25
vi   65   32   51   20

FAUST_pdq_std (using std's)
1.1 Create attribute tables with cl=class, mn, std, n=max_#_stds_in_gap, cp=cut_point (the value in the gap which allows the max # of stds, n, to fit forward from the mean (using its std) and backward from the next mean (using its std)).
n satisfies: mean + n*std = meanG - n*stdG, so n = (mnG - mn)/(std + stdG).

TA record with max n: TpLN (cl=se, mn=15, std=1.0, n=4.5, cp=19), with ve the next class. 19 = 0010011 in binary.

[Figure: the resulting pTree mask over the training samples.]

Note: since there is also a case with n=4.1 which results in the same partition (into {se} and {ve,vi}), we might use both for improved accuracy. Certainly we can do this with sequential!

Attribute tables (n and cp refer to the gap between a class and the next class in the table):

TsLN  cl  mn  std  n    cp
      se  49  3.5  0.9  53
      ve  59  6.9  0.5  62
      vi  66  7.6

TsWD  cl  mn  std  n    cp
      ve  28  3.9  0.3  29
      vi  29  3.1  1.3  33
      se  33  3.1

TpLN  cl  mn  std  n    cp
      se  15  1.0  4.5  19
      ve  43  5.1  1.3  49
      vi  57  6.0

TpWD  cl  mn  std  n    cp
      se  2   0.7  4.1  5
      ve  13  2.0  1.5  16
      vi  20  2.3

Per-class statistics (x_y_n and x_y_cp are n and cp between classes x and y):

            sLN    sWD    pLN    pWD
se_means    49.3   33.3   14.6   2.2
se_std      3.5    3.1    1.0    0.7
se_ve_n     0.9    -0.8   4.5    4.1
se_vi_n     1.5    -0.6   6.0    5.8
se_ve_cp    52.6   30.7   19.2   5.3
se_vi_cp    54.5   31.3   20.8   6.5
ve_means    59.0   27.5   42.5   13.4
ve_std      6.9    3.9    5.1    2.0
ve_vi_n     0.5    0.3    1.3    1.5
ve_se_n     -0.9   0.8    -4.5   -4.1
ve_vi_cp    62.3   28.5   49.1   16.4
ve_se_cp    52.6   30.7   19.2   5.3
vi_means    65.9   29.3   56.8   19.9
vi_std      7.6    3.1    6.0    2.3
vi_se_n     -1.5   0.6    -6.0   -5.8
vi_ve_n     -0.5   -0.3   -1.3   -1.5
vi_se_cp    54.5   31.3   20.8   6.5
vi_ve_cp    62.3   28.5   49.1   16.4

Remove se from RC (= {ve, vi} now) and from the TA's.
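As a check on the formula n = (mnG - mn)/(std + stdG), the sketch below recomputes n and cp for the pLN gap between se and ve from the ten se and ten ve training rows. It assumes the population standard deviation (pstdev), which is the choice that reproduces the rounded std values on the slide; the function name gap_in_stds is hypothetical.

```python
from statistics import mean, pstdev

def gap_in_stds(lower, upper):
    """Return (n, cp) for two classes whose means satisfy mean(lower) < mean(upper).

    n  solves  mean + n*std = meanG - n*stdG
    cp is the common value of both sides, i.e. the cut point inside the gap.
    """
    m, s = mean(lower), pstdev(lower)
    mG, sG = mean(upper), pstdev(upper)
    n = (mG - m) / (s + sG)
    cp = m + n * s
    return n, cp

# pLN column of the se and ve training samples.
se_pLN = [14, 13, 15, 14, 17, 14, 15, 14, 15, 15]
ve_pLN = [45, 49, 40, 46, 45, 47, 33, 46, 39, 35]

n, cp = gap_in_stds(se_pLN, ve_pLN)
print(round(n, 1), round(cp, 1))  # 4.5 19.2
```

The printed values agree with the se_ve_n and se_ve_cp entries for pLN (4.5 and 19.2).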

[Figure: the training table and its vertical bit slices, repeated from step 1.1.]

FAUST_pdq using std's
1.2 Use the 4 attribute tables with rv=mean, stds, max_#_stds_in_gap=n and cut value cp (cp = the value in the gap which allows the max # of stds, n, to fit forward from that mean (using its std) and backward from the next mean, meanG (using stdG)). n satisfies mean + n*std = meanG - n*stdG, so n = (meanG - mean)/(std + stdG).
TA record with max n: TpWD (cl=ve, mn=13, std=2.0, n=1.5, cp=16), with vi the next class. 16 = 10000 in binary.

P{vi} = PpWD>16

[Figure: the resulting pTree mask over the training samples.]

Note that we get perfect accuracy with one epoch using stds this way!!!

[The four attribute tables TsLN, TsWD, TpLN and TpWD are shown again on this slide, unchanged from step 1.1.]
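The mask P{vi} = PpWD>16 can be checked against the 20 remaining training rows. The sketch below uses a Python integer as an uncompressed bit vector (bit j stands for sample j); it is a stand-in for a pTree mask, not a pTree, and the function name is hypothetical.

```python
# pWD column of the ve and vi training samples.
ve_pWD = [15, 15, 13, 15, 13, 16, 10, 13, 14, 10]
vi_pWD = [19, 21, 18, 22, 21, 17, 18, 18, 25, 20]

def greater_than_mask(values, cut):
    """Bit j of the result is 1 exactly when values[j] > cut."""
    mask = 0
    for j, v in enumerate(values):
        if v > cut:
            mask |= 1 << j
    return mask

remaining = ve_pWD + vi_pWD               # samples 0..9 are ve, 10..19 are vi
p_vi = greater_than_mask(remaining, 16)   # predicted vi
vi_truth = ((1 << 10) - 1) << 10          # bits 10..19 set
misclassified = bin(p_vi ^ vi_truth).count("1")
```

One ve sample has pWD = 16 exactly, so the strict inequality matters here: with >= 16 that sample would be masked as vi.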

FAUST_pdq SUMMARY

We conclude that FAUST_pdq will be fast (no loops, one pTree mask per step, may converge with 1 [or just a few] epochs??) and is fairly accurate (completely accurate in this example using the std method!). FAUST_pdq is improved (accuracy-wise) by using standard_deviation-based gap measurements and choosing the maximum number of stds as the attribute relevancy choice. There may be many other such improvements one can think of, e.g., using an outlier identification method (see Dr. Dongmei Ren's thesis) to determine the set of non-outliers in each attribute and class. Within each attribute, order by means and define gaps to be between the maximum non-outlier value in one class and the minimum non-outlier value in the next (allowing these gap measurements to be negative if the max of one exceeds the minimum of the next). Also, there are many ways of defining representative values (means, medians, rank-points, ...).

In conclusion, FAUST_pdq is intended to be very fast (if raw speed is the need, as it might be for initial processing of the massive and numerous image datasets that the DoD has to categorize and store). It may be fairly accurate as well, depending upon the dataset, but since it uses only one attribute or feature for each division, it is not likely to be of maximal accuracy compared to other methods (such as the FAUST_pms coming up).

Next, look at FAUST_pms (pTree-based, m-attribute cut_points, sequential (1 class divided off at a time)), so we can explore the various choices for m (from 1 to the table width) and alternate distance measures.

Rank loop (rc = root count, i.e., the number of 1-bits; P' = complement):

For i=4..0 { c = rc(Pc & Patt,i); if (c >= ps) { rankK += 2^i; Pc = Pc & Patt,i } else { ps = ps - c; Pc = Pc & P'att,i } }     [for LO: rank(n-K+1) += 2^i]

[Figure: worksheets for this loop on the training set, one block per class (se, ve, vi), showing ps, rankK (RK) and the root counts (rc) at each step, for K=1 (HI) and K=10 (LO). Attributes are numbered sLN=1, sWD=2, pLN=3, pWD=4; labels such as 44 and 44' appear to denote a bit slice of attribute 4 and its complement. The bit columns and intermediate counts are not reproduced.]

pWD_vi_LO=16, while pWD_se_HI=0 and pWD_ve_HI=0. So the highest pWD_se_HI and pWD_ve_HI can get is 15, and the lowest pWD_vi_LO will ever be is 16. So cutting at 16 will separate all vi from {se,ve}. This is, of course, with reference to the training set only, and it may not carry over to the test set (a much bigger set?), especially since the gap may be small (=1). Here we will use the pWD cut point 16 to peel off vi! We need a theorem and proof here!!!
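The rank loop can be exercised on ordinary integers to see what it computes. The sketch below stores each bit slice as a Python integer (bit j of slice i is bit i of sample j) and uses only AND, complement and a 1-count, mirroring the operations in the loop. It assumes ps starts at K and that rankK is the K-th largest value; the names bit_slices and rank_k are hypothetical, and no pTree compression is involved.

```python
def bit_slices(values, nbits):
    """slices[i] is an int whose bit j equals bit i of values[j]."""
    slices = []
    for i in range(nbits):
        s = 0
        for j, v in enumerate(values):
            if (v >> i) & 1:
                s |= 1 << j
        slices.append(s)
    return slices

def rank_k(values, K, nbits):
    """K-th largest value, found one bit at a time from the high-order slice down."""
    slices = bit_slices(values, nbits)
    full = (1 << len(values)) - 1        # mask with one bit per sample
    pc, ps, result = full, K, 0
    for i in range(nbits - 1, -1, -1):
        with_bit = pc & slices[i]
        c = bin(with_bit).count("1")     # root count of Pc & Patt,i
        if c >= ps:                      # the K-th largest has bit i set
            result += 1 << i
            pc = with_bit
        else:                            # skip the c larger candidates
            ps -= c
            pc &= ~slices[i] & full
    return result

vi_pWD = [19, 21, 18, 22, 21, 17, 18, 18, 25, 20]
hi = rank_k(vi_pWD, 1, 5)    # K=1: the maximum
lo = rank_k(vi_pWD, 10, 5)   # K=10 of 10 samples: the minimum
```

After the first pass (i=4) the partial result for K=10 is already 16, which is the kind of early bound the worksheet reasoning relies on.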

[Figure: worksheets for the same rank loop, one block per class (se, ve, vi), showing ps, rankK (RK) and the root counts (rc) at each step, for K=1 (HI) and K=10 (LO). The bit columns and intermediate counts are not reproduced.]

pLN_ve_LO=32, while pLN_se_HI=0. So the highest pLN_se_HI can get is 31, and the lowest pLN_ve_LO will ever be is 32. So cutting at 32 will separate all ve from se! Greater accuracy can be gained by continuing the process for all i and for all K, then looking for the best gaps! (All gaps? All gaps weighted?)

FAUST{pdq,gap}: divisive, quiet (no noise), with gaps
0. For each attribute A, TA(class, rv, gap) is its attribute table ordered on rv asc (rv = class representative, gap = distance to next rv).
WHILE RC not empty, DO
1. Find the TA record with maximum gap.
2. Use PA>c (c = rv + gap/2) to divide RC at c into LT and GT (pTrees, PLT and PGT).
3. If LT or GT is a singleton {remove that class}.
END_DO

FAUST{pdq,mrk} (FAUST{pdq} with max rank_k)
rank_k(S) is smallest kth largest value in S.
0. For each attribute A, TA(class, md, k, cp) is its attribute table ordered on md asc, where k is the max k value s.t. set_rank_k of the class and set_rank_(1-k)' of the next class. (Note: rank_k for k=1/2 is the median, k=1 is the maximum and k=0 is the minimum.)
The same algorithm can clearly be used as pms: FAUST{pms,mrk}.

FAUST{pdq,std} (FAUST{pdq} using # of gap standard deviations)
0. For each attribute A, TA(class, mn, std, n, cp) is its attribute table ordered on n asc, where cp = the value in the gap allowing the max # of stds, n. n satisfies mean + n*std = meanG - n*stdG, so n = (mnG - mn)/(std + stdG).
WHILE RC not empty, DO
1. Find the TA record with maximum n.
2. Use PA>cp to divide RC at cp = cut point into LT and GT (pTree masks, PLT and PGT).
3. If LT or GT is a singleton {remove that class from RC and from all TA's}.
END_DO

FAUST{pms,gap} (FAUST{p}, m-attribute cut_pts, sequential class separation (1 class at a time), m=1)
0. For each A, TA(class, rv, gap, avgap), where avgap is the average of gap and previous_gap (if 1st, avgap = gap).
If x classes, DO x-1 times
1. Find the TA record with maximum avgap.
2. cL = rv - prev_gap/2, cG = rv + gap/2, masks Pclass = PA>cL & PA≤cG & PRC; PRC = P'class & PRC. (If 1st in TA (no prev_gap), Pclass = PA≤cG & PRC. If last, Pclass = PA>cL & PRC.)
3. Remove that class from RC and from all TA's.
END_DO

FAUST{pms,std} (FAUST{pms} using # of gap stds)
0. For each attribute A, TA(class, mn, std, n, avgn, cp) ordered on avgn asc, where cp = cut_point (the value in the gap which allows the max # of stds, n). n satisfies mn + n*std = mn_next - n*std_next, so n = (mn_next - mn)/(std + std_next).
DO x-1 times
1. Find the TA record with maximum avgn.
2. cL = rv - prev_gap/2, cG = rv + gap/2, and pTree masks Pclass = PA>cL & PA≤cG & PRC; PRC = P'class & PRC. (If the class is 1st in TA (has no prev_gap), then Pclass = PA≤cG & PRC. If last, Pclass = PA>cL & PRC.)
3. Remove that class from RC and from all TA's.
END_DO
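The FAUST{pdq,gap} steps can be sketched at the level of class representatives only. The version below takes a table of class means per attribute, cuts at the largest gap between consecutive means, and recurses into each side that still holds more than one class. Reading "divisive" as this recursion is an assumption, as are the function name and the use of the rounded means from the statistics table. No pTree masks are built here.

```python
def divisive_gap_cuts(means, classes=None):
    """Return the list of cuts (attribute, cut value, LT classes, GT classes)."""
    classes = list(means) if classes is None else classes
    if len(classes) < 2:
        return []
    best = None  # (gap, attribute, cut, LT, GT)
    for attr in means[classes[0]]:
        # attribute table TA: remaining classes ordered on rv ascending
        ordered = sorted(classes, key=lambda c: means[c][attr])
        for i in range(len(ordered) - 1):
            rv, nxt = means[ordered[i]][attr], means[ordered[i + 1]][attr]
            gap = nxt - rv
            if best is None or gap > best[0]:
                best = (gap, attr, rv + gap / 2, ordered[: i + 1], ordered[i + 1:])
    _, attr, cut, lt, gt = best
    return [(attr, cut, lt, gt)] + divisive_gap_cuts(means, lt) + divisive_gap_cuts(means, gt)

# Class means as listed in the per-class statistics.
means = {
    "se": {"sLN": 49.3, "sWD": 33.3, "pLN": 14.6, "pWD": 2.2},
    "ve": {"sLN": 59.0, "sWD": 27.5, "pLN": 42.5, "pWD": 13.4},
    "vi": {"sLN": 65.9, "sWD": 29.3, "pLN": 56.8, "pWD": 19.9},
}
cuts = divisive_gap_cuts(means)
```

With raw mean gaps both cuts land on pLN (se | ve,vi and then ve | vi), whereas the std-based tables choose pWD for the second cut, which illustrates how the choice of gap measure changes the attribute selected.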

Near Neighbor Classifiers and FAUST 2011_04_23

FAUST is really a Near Neighbor Classifier (NNC) in which, for each class, we construct a big box neighborhood (bbn) which we think, based on the training points, is most likely to contain that class and least likely to contain the other classes.

[Figure: coordinate boxes on the R and G axes with cut points aR, bR, aG, bG.]

In the current FAUST, each bbn is a coordinate box, i.e., for coordinate (band) R, coordinate_box cb(R,class,aR,bR) is the set of all points, x, such that aR < xR < bR (either of aR or bR can be infinite). Either or both of the < can be ≤. The values aR and bR are what we have called the cut_points for that class.

bbn's are constructed using the training set and applied to the full set of unclassified pixels. The bbn's are always applied sequentially, but can be constructed either sequentially or divisively. In case the construction is sequential, the application sequence is the same as the construction sequence (and the application for each class follows the construction for that class immediately, i.e., before the next bbn construction): all pixels in the first bbn are classified into that first class (the class of that bbn). All remaining pixels which are in the second bbn are classified into the second class (the class of that bbn), etc. Thus, iteratively, all remaining unclassified pixels which are in the next bbn are classified into its class. The reason cn's are applied sequentially is that they intersect. Thus, the first bbn should be the strongest in some sense, then the next strongest, then the next strongest, etc.

In each round, from the remaining classes, we construct FAUST cn's by choosing the attribute-class with the maximum gap_between_consecutive_mean_values, or the maximum_number_of_stds_between_consecutive_means, or the gap_between_consecutive_means allowing the minimum rank (i.e., the "best remaining gap"). Note that the mean can be replaced by the median or any representer.

We could take the bbn's to be "multi-coordinate_band" or mcb, of the form: the INTERSECTION of the "best" k (k ≤ n-1, assuming n classes) cb's for a given class (where "best" can be with respect to any of the above maximizations). And instead of using a fixed number of coordinates, k, we could use only those coordinates in which the "quality" of the cb is higher than a threshold, where "quality" might be measured in many ways involving the dimensions of the gaps (or other ways?). Many pixels may not get classified (this hypothesis needs testing!). It should be accurate though.
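Why the application order matters when boxes intersect can be seen in a few lines. In this sketch each pixel is a dict of band values and each box is a tuple (class, band, a, b); the class names, band thresholds and pixel values are hypothetical, and the loop is a plain per-pixel pass rather than one pTree mask per class.

```python
INF = float("inf")

def apply_boxes_sequentially(pixels, boxes):
    """Classify each pixel into the class of the first box that contains it.

    boxes is an ordered list (strongest box first); a pixel already classified
    by an earlier box is never reclassified by a later one.
    """
    labels = [None] * len(pixels)
    for cls, band, a, b in boxes:
        for j, x in enumerate(pixels):
            if labels[j] is None and a < x[band] < b:
                labels[j] = cls
    return labels

pixels = [
    {"R": 120, "G": 200},   # inside both boxes below
    {"R": 180, "G": 90},    # inside the second box only
    {"R": 50, "G": 60},     # inside neither box
]
boxes = [("grass", "G", 150, INF), ("soil", "R", 100, INF)]
labels = apply_boxes_sequentially(pixels, boxes)
reversed_labels = apply_boxes_sequentially(pixels, list(reversed(boxes)))
```

The first pixel lies in both boxes, so its class depends on which box is applied first. The third pixel stays unclassified, in line with the remark that many pixels may not get classified.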

Near Neighbor Classifiers and FAUST-2

[Figure: big boxes on the G, R and B axes.]

We note that mcb's are used for vegetation indexing: high green (aG high and bG = ∞, i.e., all x such that xG > aG) and low red (aR = -∞ and bR low, i.e., all x such that xR < bR) is the standard "vegetation index" and measures crop health well. So if, instead of predicting grass, we were predicting lush grass, we could use the vegetation index (vi), which involves mcb bbn's.

Similarly, mcb bbn's would be used for any color object which is not pure (in the bands provided). Therefore a "blue-red" car would ideally involve a bbn that is the intersection of a red cn and a blue cn. Most paint colors are not pure. Worse yet, what does pure mean? Pure only makes sense in the context of the camera taking the image in the first place. The definition of a pure color in a given image is a color entirely within one band (column) of that image dataset (with all other bands showing zero values only). So almost all actual objects would be multi-color objects and would require, or at least benefit from, a multi-cn bbn approach.
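A multi-coordinate box is just the intersection of one-band boxes, some of which may be open on one side. The sketch below builds the "high green and low red" box described above from two such pieces; the thresholds 150 and 80 and the pixel values are made up for illustration, and the helper names are hypothetical.

```python
import math

def coordinate_box(band, a=-math.inf, b=math.inf):
    """cb(band, a, b): predicate that is True when a < x[band] < b."""
    return lambda x: a < x[band] < b

def intersect(*boxes):
    """Multi-coordinate box: True only when every component box contains x."""
    return lambda x: all(box(x) for box in boxes)

high_green = coordinate_box("G", a=150)   # bG is infinite
low_red = coordinate_box("R", b=80)       # aR is minus infinity
lush = intersect(high_green, low_red)

pixels = [
    {"R": 40, "G": 200},    # high green, low red
    {"R": 120, "G": 200},   # high green, but red too high
    {"R": 40, "G": 100},    # low red, but green too low
]
lush_mask = [lush(x) for x in pixels]
```

The same intersect helper would cover the "blue-red" car case by combining a red box and a blue box.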

[Figure: the 30 training samples (10 setosa, 10 versicolor, 10 virginica) shown next to the vertical bit slices of their four attributes]

cl  sLN  sWD  pLN  pWD
se   49   30   14    2
se   47   32   13    2
se   46   31   15    2
se   54   36   14    2
se   54   39   17    4
se   46   34   14    3
se   50   34   15    2
se   44   29   14    2
se   49   31   15    1
se   54   37   15    2
ve   64   32   45   15
ve   69   31   49   15
ve   55   23   40   13
ve   65   28   46   15
ve   57   28   45   13
ve   63   33   47    1
ve   49   24   33   10
ve   66   29   46   13
ve   52   27   39   14
ve   50   20   35   10
vi   58   27   51   19
vi   71   30   59   21
vi   63   29   56   18
vi   65   30   58   22
vi   76   30   66   21
vi   49   25   45   17
vi   73   29   63   18
vi   67   25   58   18
vi   72   36   61   25
vi   65   32   51   20

Appendix

Note on problems:

Difficult separations problem: e.g., white cars from white roofs. Include as feature attributes the pixel coordinate value columns as well as the bands.
If the color is not sufficiently different to make the distinction (and no other non-visible band makes the distinction either) and if the classes are contiguous objects (as they are in Aurora), then because the white car training points are [likely to be] far from the white roof training points, FAUST may still work well, using x and y pixel coordinates as additional feature attributes (and attributes such as "shape", edge_sharpness, etc., if available). CkNN applied to nbrs taken from the training set should work also.

Noise Class Problem: In pixel classification, there may be a Default_Class or NOise class, N. (The Aurora classes are Red_Cars, White_Cars, Black_Cars, ASphalt, White_Roof, GRass, SHadow, and in the "Parking Lot Scene" case at least, there does not appear to be a NOise_class, i.e., every pixel is in one of the 7 classes above.) So, in some cases, we may have 8 classes {RC, WC, BC, AS, WR, GR, SH, NO}. Picking out NO may be a challenge for any algorithm if it contains pixels that match training pixels from several of the legitimate classes, i.e., if NO is composed of tuples with values similar to other classes. (Dr. Wettstein calls this the "red shirt" problem: if a person has a red shirt and is in the field of view, those pixels may be electromagnetically indistinguishable from Red_Car pixels. In that case, no correct algorithm will distinguish them electromagnetically, i.e., using only reflectance bands. Other attributes such as x and y position, size and shape (if available), etc. may provide a distinction.)

Using FAUST{seq}, we either
1. maximize the size of the gap between consecutive means, or
2. maximize the number of stds in the gap between consecutive means, or
3. minimize the K which produces no overlap (between the rankK set and the rank(n-K+1) set of the next class) in the gap between consecutive classes.
Instead of taking as cut_point the point produced by that optimization, we should back off from it and narrow the interval around that class mean by going only a fraction either way (some parameterized fraction), which would remove many of the NC points from that class prediction.

Inconsistent ordering of classes over the various attributes (columns) may be an indicator of something?
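The "back off" idea can be written down in a few lines. The sketch below is an assumption about one reasonable reading: the function name `narrowed_interval` and the parameter `fraction` are hypothetical, and the interval is shrunk linearly toward the class mean on both sides.

```python
def narrowed_interval(mean, low_cut, high_cut, fraction):
    """Shrink [low_cut, high_cut] toward the class mean.

    fraction = 1 keeps the original cut_points; smaller values go only
    that fraction of the way from the mean to each cut_point.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    low = mean - fraction * (mean - low_cut)
    high = mean + fraction * (high_cut - mean)
    return low, high


# Illustrative numbers only (not taken from the slides).
interval = narrowed_interval(mean=10.0, low_cut=4.0, high_cut=20.0, fraction=0.5)
```

With these illustrative numbers the box [4, 20] around a mean of 10 shrinks to [7, 15]; points between the old and new cut_points are no longer predicted into the class.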

[Figure: training values by attribute (column) and class; each list is an Attribute-Class-Set, e.g., ACS(sWD, vi)]

attributes or columns: sLN (1), sWD (2), pLN (3), pWD (4)

cl  attr  values
se  sLN   49 47 46 54 54 46 50 44 49 54
se  sWD   30 32 31 36 39 34 34 29 31 37
se  pLN   14 13 15 14 17 14 15 14 15 15
se  pWD   2 2 2 2 4 3 2 2 1 2
ve  sLN   64 69 55 65 57 63 49 66 52 50
ve  sWD   32 31 23 28 28 33 24 29 27 20
ve  pLN   45 49 40 46 45 47 33 46 39 35
ve  pWD   15 15 13 15 13 1 10 13 14 10
vi  sLN   58 71 63 65 76 49 73 67 72 65
vi  sWD   27 30 29 30 30 25 29 25 36 32
vi  pLN   51 59 56 58 66 45 63 58 61 51
vi  pWD   19 21 18 22 21 17 18 18 25 20

An old version of the basic alg. I took the first 40 of setosa, versicolor and virginica and put the other 30 tuples in a class called "noise".

1. Sort ACS's asc by median; gap=rankK(this class)-rank(n-K+1)(next class)
2. Do Until ( rankK(ACS) ≤ rank(n-K+1)(next higher ACS in same A) | K=n/2 )
3. Find gap, except Kth
4. K=K-1; END DO; return K for each Att, Class pair.

Build ACS tables (gap>0). cut_pt=rankK+S*(gap), S=1. Minimize K.

TpLN  cl  md  K   rnK  gap
      se  15  10  19   3
      no  42  17  41   2
      ve  44  5   48   1
      vi  56

TsLN  cl  md  K   rnK  gap
      se  50  12  52   1
      no  57  12  51   1
      ve  60  15  62   1
      vi  64

TsWD  cl  md  K   rnK  gap
      ve  28  20  29   230
      vi  30  10  28   1
      no  30  12  31   1
      se

TpWD  cl  md  K   rnK  gap
      se  2   7   4    2
      no  12  16  12   1
      ve  14  5   16   1
      vi  20

1st pass produces a tie for min K, in (pLN, vi) and (pWD, vi) (note: in both, vi doesn't have a higher gap since it's highest). Thus we can take both: either AND the conditions or OR the conditions. If we OR the conditions (PpLN,vi ≥ 48) | (PpWD,vi ≥ 16) we get perfect classification [and if AND, 5 mistakes]. Recompute:

TpLN  cl  md  K   rnK  gap
      se  15  10  19   3
      no  42  17  41   2
      ve  44

TsLN  cl  md  K   rnK  gap
      se  50  12  52   1
      no  57  12  51   1
      ve  60

TsWD  cl  md  K   rnK  gap
      ve  28  15  29   1
      no  30  12  31   1
      se

TpWD  cl  md  K   rnK  gap
      se  2   7   4    2
      no  12  16  12   1
      ve  14

min K in (pWD, vi). PpWD,vi5 get 9 mistakes.

TpLN  cl  md  K   rnK  gap
      no  42  17  41   2
      ve  44

TsLN  cl  md  K   rnK  gap
      no  57  12  51   1
      ve  60

TsWD  cl  md  K   rnK  gap
      ve  28  15  29   1
      no  30

TpWD  cl  md  K   rnK  gap
      no  12  16  12   1
      ve  14

min K in (sLN, no). PpWD,vi51 get 12 mistakes.

FAUST{seq,mrk} VPHD

The set of training values in 1 col and 1 class is called an Attribute-Class-Set, ACS. K(ACS)=|ACS| (all |ACS|=n=10 here). In the alg below, c=root_count and ps=position (there's a separate root_count and position for each ACS and each of K and n-K+1 for that ACS, so c=c( attr, class, K|(n-K+1) )). S=gap enlargement parameter (it can be adjusted to try to clip out the Noise Class, NC).

1. Sort ACS's asc by median; gap = rankK(this class) - rank(n-K+1)(next class)
2. Do Until ( rankK(ACS) ≤ rank(n-K+1)(next ACS) | K=0 )
3. Find rankK and rank(n-K+1) values of each ACS (except 1st and Kth)
4. K=K-1; END DO; return K for each Attribute, Class pair.
5. Cut_pts placed above/below that class (using values in attr):
   hi cut_pt = rankK + S*(higher_gap)
   low cut_pt = rank(n-K+1) - S*(lower_gap)
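The shrinking-K loop can be tried on horizontal data (the VPHD setting) with a short Python sketch. It rests on assumptions: the function name `shrink_until_separated` is hypothetical, rankK of the lower class is read as its K-th smallest value, rank(n-K+1) of the next class as its K-th largest value, and the default S=0.5 is chosen only so that the cut point is the average of the two rank values.

```python
def shrink_until_separated(lower, upper, s=0.5):
    """Return (K, cut_pt) for two consecutive classes in one attribute.

    K starts at n and is decreased until the K-th smallest value of the
    lower class is <= the K-th largest value of the upper class.
    Returns None if the loop reaches K=0 without finding a gap.
    """
    lo_sorted, up_sorted = sorted(lower), sorted(upper)
    n = len(lo_sorted)
    if n != len(up_sorted):
        raise ValueError("this sketch assumes equal-sized ACS's")
    for k in range(n, 0, -1):
        rank_k = lo_sorted[k - 1]      # K-th smallest of the lower class
        rank_next = up_sorted[n - k]   # K-th largest of the upper class
        gap = rank_next - rank_k
        if gap >= 0:
            return k, rank_k + s * gap
    return None


# sLN and sWD training values from the table above.
se_sLN = [49, 47, 46, 54, 54, 46, 50, 44, 49, 54]
ve_sLN = [64, 69, 55, 65, 57, 63, 49, 66, 52, 50]
vi_sLN = [58, 71, 63, 65, 76, 49, 73, 67, 72, 65]
se_sWD = [30, 32, 31, 36, 39, 34, 34, 29, 31, 37]
ve_sWD = [32, 31, 23, 28, 28, 33, 24, 29, 27, 20]
vi_sWD = [27, 30, 29, 30, 30, 25, 29, 25, 36, 32]

k_se_ve, cp_se_ve = shrink_until_separated(se_sLN, ve_sLN)
```

For se versus ve in sLN this stops at K=7 with a cut point of 52.5, the average of the rank values 50 and 55.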

2011_04_09 FAUST{pdq,mrk} algorithm, demonstrated with VPHD (Vertical Processing, Horizontal Data) first:

1. For every attr and every class, sort the values asc.
2. Find and order the medians asc in TA tables.
3. Find max k s.t. rank_k_set ∩ rank_(1-k)_set = ∅.
4. Proceed as in all FAUST algorithms: cut accordingly (pdq or pms or ???).

[Figure: animation stepping k through rank_1, rank_.9, rank_.8, rank_.7 (and rank_0, rank_.1, rank_.2, rank_.3) over the sorted values below]

cl  attr  sorted values
se  sLN   44 46 46 47 49 49 50 54 54 54
se  sWD   29 30 31 31 32 34 34 36 37 39
se  pLN   13 14 14 14 14 15 15 15 15 17
se  pWD   1 2 2 2 2 2 2 2 3 4
ve  sLN   49 50 52 55 57 63 64 65 66 69
ve  sWD   20 23 24 27 28 28 29 31 32 33
ve  pLN   33 35 39 40 45 45 46 46 47 49
ve  pWD   10 10 13 13 13 14 15 15 15 16
vi  sLN   49 58 63 65 65 67 71 72 73 76
vi  sWD   25 25 27 29 29 30 30 30 32 36
vi  pLN   45 51 51 56 58 58 59 61 63 66
vi  pWD   17 18 18 18 19 20 21 21 22 25

With VPHD, sort each class in each attr, find medians (needed?), find rank_k_sets (combine this with sorting?) ... so O(n). With HPVD, we can avoid the sorting, find rank_k_sets (median is rank_.5), and fill the TAs entirely with a pTree program, O(0).

HPVD_mrk could be made optimal since we could record exactly which k and cp gives min error (as we work toward empty rank_k_set intersection) and we could know the error set. We could use CkNN or ? on each errant sample. To see this, go through the first k/cp animation. In that looping procedure it's clear we could determine se<55 with 3 errors to be the best cp (se<54, 6 errors; se<52, 5; se<50, 5; se<49, 6). Note: mrk above is lazy. It takes cp to be the average of the rank values, in this case cp=53, which has 6 errors.

TsLN  cl  md  k    cp
      se  49  .7   53
      ve  60  .7   64
      vi  66

TsWD  cl  md  k    cp
      ve  28  .7   29
      vi  29  .8   30
      se  33

TpLN  cl  md  k    cp
      se  15  1.0  25
      ve  45  .9   49
      vi  58

TpWD  cl  md  k    cp
      se  2   1.0  7
      ve  13  .9   16
      vi  20

One can see from this animation that MaxGap is probably a pretty good method most of the time (provided there is at least one good gap each step) and that MaxGapStd is even better (same proviso). This method is intended to be optimal and to deal with, e.g., non-normal distributions.
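The error counts quoted for the se cut point in sLN can be reproduced with a small Python sketch. The helper names `errors_at` and `best_cut` are hypothetical, and errors are counted only between the two consecutive classes (se and ve), which is the reading that matches the numbers on the slide.

```python
def errors_at(cp, lower, upper):
    """Count misclassified training values for the rule 'lower class if value < cp'."""
    missed_lower = sum(1 for v in lower if v >= cp)   # lower-class values cut off
    missed_upper = sum(1 for v in upper if v < cp)    # upper-class values swallowed
    return missed_lower + missed_upper


def best_cut(lower, upper):
    """Try every integer cut point in range and keep the first with the fewest errors."""
    candidates = range(min(lower + upper), max(lower + upper) + 2)
    return min(candidates, key=lambda cp: errors_at(cp, lower, upper))


# Sorted sLN values for se and ve, from the sorted-value lists on the slide.
se_sLN = [44, 46, 46, 47, 49, 49, 50, 54, 54, 54]
ve_sLN = [49, 50, 52, 55, 57, 63, 64, 65, 66, 69]

counts = {cp: errors_at(cp, se_sLN, ve_sLN) for cp in (55, 54, 53, 52, 50, 49)}
best = best_cut(se_sLN, ve_sLN)
```

Recording the error count at every candidate cp, rather than averaging the rank values, is what lets the loop pick se<55 over the lazy cp=53.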

maximum

[Figure: the bit slices Ppw4 ... Ppw0 of pWD for the 30 samples, ANDed successively into Pc, with the root count (rc) shown at each step]

c=0; max=0; Pc=pure1;
For i=4..0 {
   c = rc(Pc & Patt,i)
   if (c>0) { Pc = Pc & Patt,i;  max = max + 2^i }
}
return max;

For pWD: max = 2^4 + 2^3 + 2^0 = 25.
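The maximum loop (AND the mask Pc with each bit slice from the high bit down, keeping the AND whenever its root count is positive) can be sketched in Python. As an assumption for illustration, plain 0/1 lists stand in for compressed pTrees; `bit_slices`, `root_count` and `ptree_max` are hypothetical names.

```python
def bit_slices(values, nbits):
    """slices[i] holds bit i (0 = least significant) of every value."""
    return [[(v >> i) & 1 for v in values] for i in range(nbits)]


def root_count(mask):
    """Number of 1-bits in a mask (the rc of the slides)."""
    return sum(mask)


def ptree_max(values, nbits):
    slices = bit_slices(values, nbits)
    pc = [1] * len(values)                 # pure1: every row is still a candidate
    result = 0
    for i in range(nbits - 1, -1, -1):     # i = nbits-1 .. 0
        candidate = [p & b for p, b in zip(pc, slices[i])]
        if root_count(candidate) > 0:      # some candidate has bit i set
            pc = candidate
            result += 1 << i
    return result


# pWD column of the 30 training samples (se, ve, vi), as listed in the slides.
pWD = [2, 2, 2, 2, 4, 3, 2, 2, 1, 2,
       15, 15, 13, 15, 13, 1, 10, 13, 14, 10,
       19, 21, 18, 22, 21, 17, 18, 18, 25, 20]
largest = ptree_max(pWD, nbits=5)
```

Only ANDs and root counts are used, so no value is ever read horizontally.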

minimum

[Figure: the complement bit slices P'pw4 ... P'pw0 of pWD for the 30 samples, ANDed successively into Pc, with the root count (rc) shown at each step]

c=0; min=0; Pc=pure1;
For i=4..0 {
   c = rc(Pc & P'att,i)
   if (c>0) { Pc = Pc & P'att,i }
   else { min = min + 2^i }
}
return min;

For pWD: min = 2^0 = 1.
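The minimum loop mirrors the maximum one but works with the complement slices P'. A minimal Python sketch, again assuming plain 0/1 lists in place of compressed pTrees and a hypothetical function name `ptree_min`:

```python
def ptree_min(values, nbits):
    pc = [1] * len(values)                 # pure1: every row is still a candidate
    result = 0
    for i in range(nbits - 1, -1, -1):     # i = nbits-1 .. 0
        # Pc AND the complement of bit slice i: candidates whose bit i is 0.
        zeros = [p & (1 - ((v >> i) & 1)) for p, v in zip(pc, values)]
        if sum(zeros) > 0:
            pc = zeros                     # the minimum has a 0 in bit i
        else:
            result += 1 << i               # every candidate has a 1 in bit i
    return result


# pWD column of the 30 training samples (se, ve, vi), as listed in the slides.
pWD = [2, 2, 2, 2, 4, 3, 2, 2, 1, 2,
       15, 15, 13, 15, 13, 1, 10, 13, 14, 10,
       19, 21, 18, 22, 21, 17, 18, 18, 25, 20]
smallest = ptree_min(pWD, nbits=5)
```

When no candidate has a 0 in bit i, the mask is left unchanged and 2^i is added, which is the else branch of the loop above.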

rank5 (5th largest)

[Figure: the bit slices Ppw4 ... Ppw0 of pWD and their complements for the 30 samples, ANDed successively into Pc, with the root count (rc) and remaining position (pos) shown at each step]

c=0; rank5=0; pos=5; Pc=pure1;
For i=4..0 {      //current_i = 4
   c = rc(Pc & Patt,i);
   if (c ≥ pos) { rankK = rankK + 2^i;  Pc = Pc & Patt,i; }
   else         { pos = pos - c;        Pc = Pc & P'att,i; }
}
return rankK;

For pWD: rankK = 0 + 2^4 + 2^2, so return rank5 = 20.
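The rankK loop generalizes maximum (K=1) and minimum (K=n). The Python sketch below follows the same steps, assuming plain 0/1 lists in place of compressed pTrees; `ptree_rank` is a hypothetical name.

```python
def ptree_rank(values, nbits, k):
    """Return the K-th largest value (k=1 is the maximum) using only ANDs and counts."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    pc = [1] * len(values)                 # pure1: every row is still a candidate
    pos = k                                # position still to be located within Pc
    result = 0
    for i in range(nbits - 1, -1, -1):     # i = nbits-1 .. 0
        ones = [p & ((v >> i) & 1) for p, v in zip(pc, values)]
        c = sum(ones)                      # root count of Pc & P_att,i
        if c >= pos:                       # the K-th largest has bit i set
            result += 1 << i
            pc = ones
        else:                              # skip the c larger candidates
            pos -= c
            pc = [p & (1 - ((v >> i) & 1)) for p, v in zip(pc, values)]
    return result


# pWD column of the 30 training samples (se, ve, vi), as listed in the slides.
pWD = [2, 2, 2, 2, 4, 3, 2, 2, 1, 2,
       15, 15, 13, 15, 13, 1, 10, 13, 14, 10,
       19, 21, 18, 22, 21, 17, 18, 18, 25, 20]
rank5 = ptree_rank(pWD, nbits=5, k=5)
rank25 = ptree_rank(pWD, nbits=5, k=25)
```

The median is the special case k = n/2, so the same loop can fill the md column of the TA tables without sorting.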

rank25 (25th largest)

[Figure: the bit slices Ppw4 ... Ppw0 of pWD and their complements for the 30 samples, ANDed successively into Pc, with the root count (rc) and remaining position (pos) shown at each step]

c=0; rank25=0; pos=25; Pc=pure1;
For i=4..0 {      //current_i = 4
   c = rc(Pc & Patt,i);
   if (c ≥ pos) { rankK = rankK + 2^i;  Pc = Pc & Patt,i; }
   else         { pos = pos - c;        Pc = Pc & P'att,i; }
}
return rankK;

For pWD: rankK = 0 + 2^1, so rank25 = 2.

[Figure: per-class bit slices P4,4 ... P4,0 (and complements) of attribute 4 for se, ve and vi, ANDed successively into Pc]

Check HI and LO values in each class (over each attr., in general) for a LO ≥ all other HIs or a HI ≤ all other LOs:

LOvi=17 ≥ HIse=4
LOvi=17 ≥ HIve=15

So attr4=petal_Width cutpoint at 16 separates vi and {se,ve}. Note: This cutpt appears early in the loop (i=4). Can a gap be concluded at i=4?

Do concurrently over all attributes for each K until the 1st gap is found. This finds the 1st hi or low gap, but there may be none. It could find any gap pair separating 1 class from the rest (change the or to and), but there may be none either. Then take the best neg gap. Can be divisive.

n=10, K=1..10: find rankK and rank(n-K+1) for each att/cl; exit when a class in att has the same gap (hi/lo) with all other classes in att. Peel off the class. Repeat. (Shown for K=1, n-K+1=10: HI uses pos=1 and LO uses pos=10 in each of se, ve, vi.)

For i=4..0 {
   c = rc(Pc & Patt,i);
   if (c ≥ pos) { rankK += 2^i;  Pc = Pc & Patt,i }      [rank(n-K+1) += 2^i;]
   else         { pos = pos - c; Pc = Pc & P'att,i }
}
return
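The HI/LO check for K=1 can be sketched directly on the per-class values. This is an illustration under assumptions: `find_separating_class` is a hypothetical name, HI and LO are taken as plain max and min (in the pTree setting they would come from the rank loop), and the cut point is placed at the midpoint of the gap.

```python
def find_separating_class(values_by_class):
    """Return (class, 'LO' or 'HI', cut_pt) for the first class that a single
    cut point separates from all other classes, or None if there is none."""
    for cls, vals in values_by_class.items():
        lo, hi = min(vals), max(vals)
        others = [v for c, v in values_by_class.items() if c != cls]
        highest_other = max(max(v) for v in others)
        lowest_other = min(min(v) for v in others)
        if lo >= highest_other:            # LO >= all other HIs
            return cls, "LO", (lo + highest_other) / 2
        if hi <= lowest_other:             # HI <= all other LOs
            return cls, "HI", (hi + lowest_other) / 2
    return None


# pWD (attr 4) training values per class, as listed in the slides.
pWD_by_class = {
    "se": [2, 2, 2, 2, 4, 3, 2, 2, 1, 2],
    "ve": [15, 15, 13, 15, 13, 1, 10, 13, 14, 10],
    "vi": [19, 21, 18, 22, 21, 17, 18, 18, 25, 20],
}
separation = find_separating_class(pWD_by_class)
```

Here vi is found with LO=17 against a highest other HI of 15, giving the cut point 16. Once a class is peeled off this way, the same check can be repeated on the remaining classes.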
