This research paper presents an efficient method for classifying spatial data with P-trees, combining a proximal Support Vector Machine (SVM) with geometric techniques. The method aims to stay accurate and efficient on large-scale datasets.
Proximal Support Vector Machine for Spatial Data Using P-trees¹ • Fei Pan, Baoying Wang, Dongmei Ren, Xin Hu, William Perrizo • ¹ Patents are pending on P-tree technology by North Dakota State University
OUTLINE • Introduction • Brief Review of SVM • Review of P-tree and EIN-ring • Proximal Support Vector Machine • Performance Analysis • Conclusion
Introduction • In this research paper, we develop an efficient proximal support vector machine (SVM) for spatial data using P-trees. • The central idea is to fit a binary class boundary using piecewise linear segments.
Brief Review of SVM • In very simple terms, an SVM corresponds to a linear method (perceptron) in a very high dimensional feature space that is nonlinearly related to the input space. • By using kernels, a nonlinear class boundary is transformed into a linear boundary in a high dimensional feature space, where linear methods apply. The resulting classification in the original feature space is thereby exposed.
More About SVM • The goal of a support vector machine classifier is to find the particular hyperplane in high dimensions for which the separation margin between the two classes is maximized.
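For reference, the maximum-margin goal is usually written as the following optimization problem for linearly separable training data. The symbols w, w_0, x_i, and y_i (class labels in {+1, -1}) follow common textbook notation and are an assumption here; the slides do not state the problem explicitly.

```latex
\min_{w,\,w_0}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + w_0) \ge 1, \qquad i = 1, \dots, n
```

The margin between the two classes is 2/‖w‖, so minimizing ‖w‖ maximizes the separation.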
More About SVM • Recently there has been an explosion of interest in SVMs, which have empirically been shown to give good classification performance on a wide variety of problems. • However, training SVMs is extremely slow on large-scale data sets.
Our Approach • Our approach is a geometric method with well-tuned accuracy and efficiency, using P-trees and EIN-rings (Equal Interval Neighborhood rings). • Outliers in the training data are first identified and eliminated. • The method is local (proximal), i.e., no training phase is required. • Preliminary tests show that the method has promise for both speed and accuracy.
Current practice: sets of horizontal records (horizontal structures, scanned vertically). Ptrees: • Vertically project each attribute: R(A1 A2 A3 A4) → R[A1] R[A2] R[A3] R[A4]. • Vertically project each bit position of each attribute: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43. • Compress each bit slice into a basic Ptree: P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43. • Horizontally AND basic Ptrees.
Example relation R(A1 A2 A3 A4):
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100
The 1-dimensional Ptree, P11, of R11 = 0 0 0 0 1 0 1 1 is built by recording the truth of the predicate "pure 1" recursively on halves, until purity is reached: 1. Whole is pure1? false → 0 2. 1st half pure1? false → 0 (but it is pure (pure0), so this branch ends) 3. 2nd half pure1? false → 0 4. 1st half of 2nd half? false → 0 5. 2nd half of 2nd half? true → 1 6. 1st half of 1st of 2nd? true → 1 7. 2nd half of 1st of 2nd? false → 0
E.g., to count the occurrences of 111 000 001 100, use "pure111000001100": P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43. The result has a single pure1 node at the 2^1 level, so the count is 2.
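The recursive "pure 1 on halves" construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-tuple representation and the function names `build_ptree` and `root_count` are assumptions made for the example.

```python
def build_ptree(bits):
    """Return 1 for a pure-1 run, 0 for a pure-0 run, else a (left, right) pair.

    Hypothetical representation: a mixed node is a tuple of its two halves.
    """
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0  # pure0: the branch ends here, as in step 2 of the slide
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def root_count(tree, size):
    """Count the 1-bits represented by a tree that covers `size` positions."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    left, right = tree
    return root_count(left, size // 2) + root_count(right, size - size // 2)


# Bit slice R11 from the slide (first bit of attribute A1 in each tuple).
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
print(p11)                 # (0, ((1, 0), 1))
print(root_count(p11, 8))  # 3
```

The printed tree mirrors the seven questions on the slide: the first half is pure0, the second half is mixed, and its own halves resolve to (1, 0) and a pure1 pair.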
When is Horizontal Processing of Vertical Structures a good idea? • They are NOT for record-based workloads (e.g., SQL), where the result is a set of records: changing horizontal records to vertical trees and then having to reconstruct horizontal result records may mean excessive post-processing. • They are for data mining workloads, where the result is often a bit (Yes/No, T/F), so no reconstructive post-processing is needed.
2-Dimensional Pure1-trees • A node is 1 iff that quadrant is purely 1-bits. • E.g., a bit-file (e.g., the high-order bit of the RED band of a 2-D image): 1111110011111000111111001111111011110000111100001111000001110000 • Which, in spatial raster order, looks like:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
• Run-length compress it into a quadrant tree using Peano order.
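As a rough sketch of the quadrant compression, the following Python builds a pure1 quadrant tree for the 64-bit file on this slide and counts its 1-bits. The list-based node representation and the names `build_quadtree` and `count_ones` are illustrative assumptions, and the child order (upper-left, upper-right, lower-left, lower-right) is one common reading of Peano (Z) order.

```python
BITS = "1111110011111000111111001111111011110000111100001111000001110000"
N = 8
image = [[int(BITS[r * N + c]) for c in range(N)] for r in range(N)]


def build_quadtree(img, row, col, size):
    """Return 1 (pure-1 quadrant), 0 (pure-0 quadrant), or a list of 4 children.

    Children are listed in Z order: upper-left, upper-right,
    lower-left, lower-right (an assumed convention).
    """
    cells = [img[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    if all(cells):
        return 1
    if not any(cells):
        return 0
    h = size // 2
    return [build_quadtree(img, row, col, h),
            build_quadtree(img, row, col + h, h),
            build_quadtree(img, row + h, col, h),
            build_quadtree(img, row + h, col + h, h)]


def count_ones(tree, size):
    """Count the 1-bits in a quadrant of side length `size`."""
    if tree == 1:
        return size * size
    if tree == 0:
        return 0
    return sum(count_ones(child, size // 2) for child in tree)


tree = build_quadtree(image, 0, 0, N)
print(count_ones(tree, N))                          # 39
print([count_ones(child, N // 2) for child in tree])  # [16, 8, 15, 0]
```

The upper-left quadrant is pure1 and the lower-right is pure0, so neither needs any further nodes; only the two mixed quadrants are expanded.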
Counts are needed in DM. Predicate-trees are very compressed and can produce counts quickly. However, Count-trees are an alternative (each inode counts the 1s in that quadrant). [Figure: count-tree of an 8×8 bit image with levels 3 (root) down to 0; the root count is 55, and a pure quadrant at level 3 would count 4^3 = 64. A pixel is addressed by its quadrant path, e.g., pixel (7, 1) = (111, 001) has path 2.2.3 = 10.10.11.]
Logical Operations on Ptrees (used to get counts of any pattern) • The Ptree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0; for any pure1 operand node, copy the subtree of the other operand to the result. • E.g., only load quadrant 2 to AND Ptree1, Ptree2, etc. • The more operands there are in the AND, the greater the benefit due to this shortcut (more pure0 nodes). [Figure: Ptree 1, Ptree 2, AND result, OR result]
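The two shortcuts can be illustrated on 1-dimensional trees. This is a sketch under assumptions: trees are nested tuples (0, 1, or a pair of halves), both operands cover the same number of positions, and the names `build_ptree` and `ptree_and` are hypothetical.

```python
def build_ptree(bits):
    """0 or 1 for a pure run, else a (left, right) pair of sub-trees."""
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def ptree_and(a, b):
    """AND two trees that cover the same positions."""
    if a == 0 or b == 0:
        return 0      # shortcut 1: a pure0 operand makes the result pure0
    if a == 1:
        return b      # shortcut 2: a pure1 operand copies the other subtree
    if b == 1:
        return a
    left = ptree_and(a[0], b[0])
    right = ptree_and(a[1], b[1])
    if left == right and left in (0, 1):
        return left   # re-compress a node whose halves became pure and equal
    return (left, right)


# R11 and R12: first and second bit of attribute A1 in the example relation.
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
r12 = [1, 1, 1, 1, 0, 1, 1, 1]
result = ptree_and(build_ptree(r11), build_ptree(r12))
print(result)  # (0, (0, 1))
```

In this example the whole first half is resolved by shortcut 1 without visiting any of its descendants, which is where the speedup over a bit-by-bit AND comes from.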
Hilbert Ordering? • In 2 dimensions, Peano ordering is 2×2-recursive z-ordering (raster ordering). • Hilbert ordering is 4×4-recursive tuning fork ordering (H-trees have fanout=16). • Hilbert ordering has somewhat better continuity characteristics, but a much less usable coordinate-to-quadrant translator.
Generalizing Peano compression to any table with numeric attributes. • Raster sorting: attributes 1st, bit position 2nd. • Peano sorting: bit position 1st, attributes 2nd. [Figure: an unsorted relation, and the same relation showing values in binary]
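The difference between the two sort orders comes down to the sort key. The sketch below assumes non-negative integer attributes of a fixed bit width; the function names and the sample rows are made up for illustration.

```python
def raster_key(row):
    """Attributes first, bit position second: plain lexicographic order."""
    return tuple(row)


def peano_key(row, nbits):
    """Bit position first, attributes second: interleave bits, high bit first."""
    return tuple((value >> bit) & 1
                 for bit in range(nbits - 1, -1, -1)
                 for value in row)


# Hypothetical relation with two 2-bit attributes.
rows = [(3, 0), (0, 3), (1, 1), (2, 2)]

raster_sorted = sorted(rows, key=raster_key)
peano_sorted = sorted(rows, key=lambda row: peano_key(row, 2))

print(raster_sorted)  # [(0, 3), (1, 1), (2, 2), (3, 0)]
print(peano_sorted)   # [(1, 1), (0, 3), (3, 0), (2, 2)]
```

Peano sorting groups rows that agree on the high-order bits of every attribute, which tends to produce longer pure runs in each bit slice and therefore smaller trees.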
Generalized Peano sorting can make a big difference in classification speed. [Figure: classification speed improvement (sample-based classifier) on 5 UCI Machine Learning Repository data sets (crop, adult, spam, function, mushroom), comparing Unsorted, Generalized Raster, and Generalized Peano; y-axis: time in seconds, 0 to 120]
Range predicate tree: Px>v • For identifying/counting tuples satisfying a given range predicate. • v = bm…bi…b0 • Px>v = Pm opm … Pi opi Pi-1 … opk+1 Pk, evaluated from the right (innermost at Pk), where 1) opi is ∧ if bi=1, opi is ∨ otherwise 2) k is the rightmost bit position of v with value "0" • For example, for v = 101 we have k = 1, so Px>101 = (P2 ∧ P1)
Px≤v • v = bm…bi…b0 • Px≤v = P'm opm … P'i opi P'i-1 … opk+1 P'k, evaluated from the right (innermost at P'k), where 1) opi is ∧ if bi=0, opi is ∨ otherwise 2) k is the rightmost bit position of v with value "0" • For example: Px≤101 = (P'2 ∨ P'1)
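Both range predicates can be checked with uncompressed bit slices standing in for Ptrees. In this sketch each slice is a Python integer used as a bitmask over tuple positions; the function names are hypothetical, v is assumed to fit in the number of slices, and the all-ones case for v (no "0" bit) is handled separately because the rule for k does not cover it. For x > v the operator at bit i is AND when b_i = 1 and OR otherwise; for x ≤ v the slices are complemented and the operators swap.

```python
def bit_slices(values, nbits):
    """slices[i] is a bitmask whose j-th bit is bit i of values[j]."""
    slices = [0] * nbits
    for j, v in enumerate(values):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= 1 << j
    return slices


def greater_than(slices, v):
    """Bitmask of tuples whose value is > v."""
    nbits = len(slices)
    zeros = [i for i in range(nbits) if not (v >> i) & 1]
    if not zeros:
        return 0                      # v is all ones: nothing is larger
    k = zeros[0]                      # rightmost bit position of v holding 0
    result = slices[k]
    for i in range(k + 1, nbits):     # nest outward from bit k to bit m
        if (v >> i) & 1:
            result = slices[i] & result
        else:
            result = slices[i] | result
    return result


def less_equal(slices, v, n):
    """Bitmask of tuples whose value is <= v, for a relation of n tuples."""
    full = (1 << n) - 1
    nbits = len(slices)
    zeros = [i for i in range(nbits) if not (v >> i) & 1]
    if not zeros:
        return full                   # v is all ones: everything qualifies
    k = zeros[0]
    result = full & ~slices[k]
    for i in range(k + 1, nbits):
        comp = full & ~slices[i]      # complemented slice P'i
        if (v >> i) & 1:
            result = comp | result
        else:
            result = comp & result
    return result


values = list(range(8))               # one tuple per 3-bit value
slices = bit_slices(values, 3)
print(bin(greater_than(slices, 0b101)))   # 0b11000000 (values 6 and 7)
print(bin(less_equal(slices, 0b101, 8)))  # 0b111111 (values 0 through 5)
```

For v = 101 the loop starts at k = 1 and touches only P1 and P2, matching the two-operand examples on the slides.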
Equal Interval Neighborhood Rings (EIN-rings) (using L∞ distance) [Diagram of EIN-rings: the 1st, 2nd, and 3rd EIN-rings nested around a center C]
EIN-ring Based Neighborhood Search Using Range Predicate Tree • P, the neighborhood of x with radius r+ε: X1 = (x1−r−ε, x1+r+ε], X2 = (x2−r−ε, x2+r+ε] • P', the complement of the neighborhood of x with radius r: X1 = (x1−r, x1+r], X2 = (x2−r, x2+r] • P ^ P': the EIN-ring of x between radius r and r+ε
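The ring search amounts to "inside the outer box AND NOT inside the inner box". The sketch below uses plain bitmasks over a small list of 2-D points in place of range predicate trees; the names `box_mask` and `ein_ring_mask`, the ring width parameter `eps`, and the sample points are all assumptions for illustration. The intervals are half-open, (c − r, c + r], as on the slide.

```python
def box_mask(points, center, radius):
    """Bitmask of points with c - radius < p <= c + radius in every dimension."""
    mask = 0
    for j, p in enumerate(points):
        if all(c - radius < x <= c + radius for x, c in zip(p, center)):
            mask |= 1 << j
    return mask


def ein_ring_mask(points, center, r, eps):
    """Points in the outer box (radius r + eps) but not the inner box (radius r)."""
    outer = box_mask(points, center, r + eps)
    inner = box_mask(points, center, r)
    return outer & ~inner


# Hypothetical 2-D training points and a center.
points = [(5, 5), (6, 5), (7, 6), (8, 8), (3, 5), (5, 2)]
center = (5, 5)

# First, second, and third rings of width 1 around the center.
rings = [ein_ring_mask(points, center, r, 1) for r in range(3)]
print([bin(m) for m in rings])       # ['0b11', '0b100', '0b11000']
print([bin(m).count("1") for m in rings])  # [2, 1, 2]
```

Because each interval is open on the left, a point lying exactly on the lower edge of a box, such as (3, 5) at radius 2, falls into the next ring out. With real P-trees each box would be an AND of range predicate trees per dimension rather than a scan over the points.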
Proximal Support Vector Machine (P-SVM) 1) Find region components (proximities) using EIN-rings. 2) Calculate EIN-ring membership and find support vector pairs. 3) If the training space has d feature dimensions, calculate d-nearest boundary sentries in the training data, to determine a local boundary hyperplane segment. The class label is then determined by the unclassified sample’s location relative to the boundary hyperplane.
Step 1: Find region components using EIN-rings (defined above) • Assume outliers are eliminated during a data cleaning process. [Figure: region components labeled A and B]
Step 2: Finding support vector pairs. Step 3: Fit boundary hyperplane • EIN-ring membership: $M_{x \in c} = \frac{1}{N_c} \sum_{r=1}^{m} w_r \cdot NBR_{x \in c,\, r}$, where c: component, r: radius • Support vector pair • Boundary sentry • Boundary hyperplane: $H(x) = w x + w_0$ [Figure: "+" and "−" training samples on either side of the boundary hyperplane, with the support vector pair and boundary sentry marked]
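The two formulas on this slide can be evaluated directly. The sketch below is only an illustration: it assumes NBR for ring r is the number of neighbors of x from component c in that ring and N_c is the size of the component, and the ring weights, counts, hyperplane coefficients, and the choice of which side maps to which label are hypothetical values picked for the example.

```python
def ein_ring_membership(ring_counts, weights, component_size):
    """M = (1 / N_c) * sum over rings r of w_r * NBR_r."""
    weighted = sum(w * n for w, n in zip(weights, ring_counts))
    return weighted / component_size


def classify(x, w, w0):
    """Label a sample by the side of the hyperplane H(x) = w.x + w0 it falls on."""
    h = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if h >= 0 else -1


# Hypothetical inputs: neighbor counts in rings 1..3, with decaying weights.
membership = ein_ring_membership(ring_counts=[2, 1, 2],
                                 weights=[1.0, 0.5, 0.25],
                                 component_size=6)
print(membership)                           # 0.5

# Hypothetical local hyperplane x1 - x2 = 0 in a 2-D feature space.
print(classify((2, 1), w=(1, -1), w0=0))    # 1
print(classify((0, 3), w=(1, -1), w0=0))    # -1
```

Weights that decrease with the ring index make closer neighbors count more toward the membership value; the slides do not specify the weighting scheme, so this choice is an assumption.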
PRELIMINARY EVALUATION Aerial (TIFF) Image and Yield Map from Oaks, North Dakota
PRELIMINARY EVALUATION • In each experiment run, we randomly select 10% of the data set as test data and use the rest as training data. • P-SVM correctness appears to be comparable to standard SVM.
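A random 10% holdout of the kind described here can be drawn as follows. This is a generic sketch, not the authors' experiment code; the function name, the fixed seed, and the stand-in samples are assumptions.

```python
import random


def holdout_split(samples, test_fraction=0.1, seed=0):
    """Randomly split samples into (train, test) with the given test fraction."""
    rng = random.Random(seed)            # seeded so a run can be repeated
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


samples = list(range(100))               # stand-in for labeled pixels
train, test = holdout_split(samples)
print(len(train), len(test))             # 90 10
```

Repeating the split with different seeds gives the separate experiment runs.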
PRELIMINARY EVALUATION • P-SVM speed appears to be superior to standard SVM.
CONCLUSION • In this paper, we propose an efficient P-tree based proximal Support Vector Machine (P-SVM), which appears to improve speed without sacrificing accuracy. • In the future, more extensive experiments and the combination of P-SVM with KNN will be explored.