
Proximal Support Vector Machine for Spatial Data Using P-trees

This presentation describes an efficient proximal Support Vector Machine (P-SVM) for spatial data classification using P-trees, combining the SVM idea with geometric techniques. Preliminary results suggest the method handles large-scale datasets efficiently without sacrificing accuracy.


Presentation Transcript


  1. Proximal Support Vector Machine for Spatial Data Using P-trees¹ Fei Pan, Baoying Wang, Dongmei Ren, Xin Hu, William Perrizo. ¹Patents are pending on P-tree technology by North Dakota State University.

  2. OUTLINE • Introduction • Brief Review of SVM • Review of P-tree and EIN-ring • Proximal Support Vector Machine • Performance Analysis • Conclusion

  3. Introduction • In this research paper, we develop an efficient proximal support vector machine (SVM) for spatial data using P-trees. • The central idea is to fit a binary class boundary using piecewise linear segments.

  4. Brief Review of SVM • In very simple terms, an SVM corresponds to a linear method (perceptron) in a very high dimensional feature space that is nonlinearly related to the input space. • By using kernels, a nonlinear class boundary is transformed into a linear boundary in a high dimensional feature space, where linear methods apply. The resulting linear classifier in feature space corresponds to a nonlinear classifier in the original input space.

  5. More About SVM • The goal of a support vector machine classifier is to find the particular hyperplane in high dimensions for which the separation margin between two classes is maximized.

  6. More About SVM • Recently there has been an explosion of interest in SVMs, which have empirically been shown to give good classification performance on a wide variety of problems. • However, the training of SVMs is extremely slow for large-scale data sets.

  7. Our Approach • Our approach is a geometric method with well tuned accuracy and efficiency by using P-trees and EIN-rings (Equal Interval Neighborhood rings). • Outliers in the training data are first identified and eliminated. • The method is local (proximal), i.e., no separate global training phase is required. • Preliminary tests show that the method has promise for both speed and accuracy.

  8. Current practice: sets of horizontal records. P-trees: vertically project each attribute, then vertically project each bit position of each attribute; compress each bit slice into a basic P-tree; AND basic P-trees horizontally. Example relation R(A1 A2 A3 A4) with eight tuples (shown in binary): 010 111 110 001, 011 111 110 000, 010 110 101 001, 010 111 101 111, 101 010 001 100, 010 010 001 101, 111 000 001 100, 111 000 001 100. Vertical projection yields twelve bit columns R11…R43 (attribute, bit position). The 1-dimensional P-tree P11 of column R11 = 00001011 is built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached: 1) whole pure1? false (0); 2) 1st half pure1? false (0) — but it is pure (pure0), so that branch ends; 3) 2nd half pure1? false (0); 4) 1st half of the 2nd half? false (0); 5) 2nd half of the 2nd half? true (1); 6) 1st half of the quarter from step 4? true (1); 7) 2nd half of that quarter? false (0). E.g., to count occurrences of 111 000 001 100, AND the basic P-trees, complementing those where the pattern bit is 0: P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43; here the count is 2.
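To make the construction concrete, here is a minimal Python sketch of the vertical projection and AND-based counting just described. Plain integer bitmasks stand in for the compressed P-trees (compression changes the storage, not the counts), and slice_mask / count_pattern are hypothetical helper names, not part of the NDSU implementation.

```python
# Minimal sketch of vertical bit slices and AND-based pattern counting.
# Integer bitmasks stand in for (compressed) P-trees; illustration only.

R = [  # the 8 tuples of R(A1 A2 A3 A4) from the slide, 12 bits each
    "010111110001", "011111110000", "010110101001", "010111101111",
    "101010001100", "010010001101", "111000001100", "111000001100",
]

def slice_mask(i):
    """Bit column i as an integer mask: bit j is tuple j's i-th bit."""
    return sum(1 << j for j, row in enumerate(R) if row[i] == "1")

def count_pattern(pattern):
    """AND the 12 basic slices, complementing those where the pattern
    bit is 0 (the P' trees), then count the surviving 1s."""
    full = (1 << len(R)) - 1
    acc = full
    for i, b in enumerate(pattern):
        m = slice_mask(i)
        acc &= m if b == "1" else full & ~m
    return bin(acc).count("1")

print(count_pattern("111000001100"))  # -> 2 (the last two tuples match)
```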

  9. When is horizontal processing of vertical structures a good idea? • They're NOT for record-based workloads (e.g., SQL), where the result is a set of records: changing horizontal records to vertical trees and then having to reconstruct horizontal result records may mean excessive post-processing. • They ARE for data mining workloads, where the result is often a single bit (Yes/No, T/F), so there is no reconstructive post-processing.

  10. 2-Dimensional Pure1-trees: a node is 1 iff its quadrant is purely 1-bits. E.g., a bit file (the high-order bit of the RED band of a 2-D image): 1111110011111000111111001111111011110000111100001111000001110000. Laid out in spatial raster order as an 8×8 grid, it is run-length compressed into a quadrant tree using Peano order.
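A minimal sketch of the quadrant-tree construction over the 8×8 grid above; nested Python lists stand in for the compressed tree, and pure1_tree is a hypothetical name for illustration.

```python
# Sketch of a 2-D Pure1-tree: each node answers "is this quadrant
# purely 1-bits?", recursing on 4 sub-quadrants in Peano (Z) order.

BITS = "1111110011111000111111001111111011110000111100001111000001110000"
GRID = [[int(BITS[r * 8 + c]) for c in range(8)] for r in range(8)]

def pure1_tree(r0, c0, size):
    cells = [GRID[r][c] for r in range(r0, r0 + size)
                        for c in range(c0, c0 + size)]
    if all(cells):
        return 1        # pure-1 quadrant: branch ends
    if not any(cells):
        return 0        # pure-0 quadrant: branch ends
    h = size // 2       # children in Peano order: NW, NE, SW, SE
    return [pure1_tree(r0, c0, h),     pure1_tree(r0, c0 + h, h),
            pure1_tree(r0 + h, c0, h), pure1_tree(r0 + h, c0 + h, h)]

tree = pure1_tree(0, 0, 8)
print(tree)
```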

  11. Count-trees. Counts are needed in data mining. Predicate-trees are very compressed and can produce counts quickly; Count-trees are an alternative in which each interior node holds the count of 1s in its quadrant. [Figure: an 8×8 bit grid with its Count-tree — root count 55 at level 3 (a pure level-3 quadrant would count 4³ = 64), with quadrant counts at levels 2, 1, and 0 below it. The cell at (7, 1) = (111, 001) has quadrant path 2.2.3 = 10.10.11.]

  12. Logical operations on P-trees (used to get counts of any pattern). The AND operation is faster than bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0 (no need to descend), and any pure1 operand node means the other operand's subtree is copied to the result (e.g., only quadrant 2 needs to be loaded to AND Ptree 1 and Ptree 2). The more operands there are in the AND, the greater the benefit from these shortcuts (more pure0 nodes). [Figure: Ptree 1, Ptree 2, their AND result and OR result.]
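A sketch of those AND shortcuts on the nested-list trees from the previous sketch; ptree_and is a hypothetical helper, and the early returns implement the pure0/pure1 shortcuts described above.

```python
# AND of two Pure1-trees (0 = pure-0, 1 = pure-1, list = 4 children).

def ptree_and(a, b):
    if a == 0 or b == 0:
        return 0            # shortcut: any pure-0 operand -> pure-0 result
    if a == 1:
        return b            # shortcut: pure-1 operand -> copy other subtree
    if b == 1:
        return a
    kids = [ptree_and(x, y) for x, y in zip(a, b)]
    return 0 if kids == [0, 0, 0, 0] else kids  # re-collapse to pure-0

a = [1, 0, [1, 1, 0, 0], 1]
b = [1, 1, 0, [0, 1, 1, 1]]
print(ptree_and(a, b))  # -> [1, 0, 0, [0, 1, 1, 1]]
```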

  13. Hilbert Ordering? • Hilbert ordering is a 4×4-recursive tuning-fork ordering (H-trees have fanout = 16). • In 2 dimensions, Peano ordering is a 2×2-recursive Z-ordering (raster ordering). Hilbert ordering has somewhat better continuity characteristics, but a much less usable coordinate-to-quadrant translator.
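The Peano coordinate-to-quadrant translator is essentially bit interleaving, which is what makes it so usable; here is a small sketch (peano_path is a hypothetical name), reproducing the (7, 1) → 2.2.3 path from slide 11.

```python
# Peano / Z-order translator: interleave coordinate bits, two bits per
# level (row bit, then column bit) picking one of the four children.

def peano_path(row, col, levels):
    return [((row >> i & 1) << 1) | (col >> i & 1)
            for i in reversed(range(levels))]

print(peano_path(7, 1, 3))  # (111, 001) -> [2, 2, 3], as on slide 11
```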

  14. 3-Dimensional Ptrees

  15. Generalizing Peano compression to any table with numeric attributes. • Raster sorting: attributes 1st, bit position 2nd. • Peano sorting: bit position 1st, attributes 2nd. [Figure: an unsorted relation, and the same relation showing values in binary under each sort order.]
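A toy sketch contrasting the two sort keys on 3-bit attributes; raster_key / peano_key and the fixed WIDTH are assumptions made for illustration.

```python
# Raster sorting compares attribute by attribute; generalized Peano
# sorting compares the top bit of every attribute before the next bit
# of any attribute (i.e., it interleaves the bits).

WIDTH = 3  # bits per attribute (assumed for this toy example)

def raster_key(t):
    """Attributes 1st, bit position 2nd: plain lexicographic order."""
    return t

def peano_key(t):
    """Bit position 1st, attributes 2nd: interleave the bits."""
    return tuple((v >> i) & 1 for i in reversed(range(WIDTH)) for v in t)

rows = [(2, 7, 6, 1), (5, 2, 1, 4), (7, 0, 1, 4), (2, 2, 1, 5)]
print(sorted(rows, key=raster_key))
print(sorted(rows, key=peano_key))
```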

  16. Generalized Peano sorting can make a big difference in classification speed. [Chart: classification speed improvement (sample-based classifier), time in seconds (0–120), on 5 UCI Machine Learning Repository data sets — crop, adult, spam, function, mushroom — comparing unsorted, generalized raster, and generalized Peano orderings.]

  17. Range predicate tree Px>v, for identifying/counting tuples satisfying a given range predicate. • v = bm…bi…b0 • Px>v = Pm op_m (… Pi op_i (Pi−1 … op_{k+1} Pk) …), right-binding, where 1) op_i is ∧ if bi = 1, op_i is ∨ otherwise, and 2) k is the rightmost bit position of v with value 0. • For example: Px>101 = P2 ∧ P1 (here k = 1).

  18. Pxv • v=bm…bi…b0 • Pxv = P’m opm… P’i opi P’i-1… opk+1P’k 1) opi is  if bi=0, opi is  otherwise 2) k is rightmost bit position of v with “0” • For example: Px  101 = (P’2 P’1)

  19. Equal Interval Neighborhood Rings (EIN-rings), using the L∞ distance. [Diagram of EIN-rings: concentric rings around a center point C — 1st, 2nd, and 3rd EIN-ring.]

  20. EIN-ring based neighborhood search using range predicate trees. For the ring of inner radius r and width δ around a point x = (x1, x2): • P (outer box): X1 ∈ (x1 − r − δ, x1 + r + δ] and X2 ∈ (x2 − r − δ, x2 + r + δ] • P' (inner box): X1 ∈ (x1 − r, x1 + r] and X2 ∈ (x2 − r, x2 + r] • Ring membership: AND the outer-box tree with the complement of the inner-box tree — points in the outer box but not the inner box.
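A sketch of the ring search, with a direct point-in-interval test standing in for the P-tree range predicates above; box_mask and ein_ring_mask are hypothetical names.

```python
# EIN-ring search: each box is the intersection of per-dimension
# half-open interval predicates (c - r, c + r]; the ring is the outer
# box AND NOT the inner box. Integer bitmasks stand in for P-trees.

def box_mask(points, center, radius):
    """Mask of points whose every coordinate lies in (c - r, c + r]."""
    mask = 0
    for j, p in enumerate(points):
        if all(c - radius < v <= c + radius for v, c in zip(p, center)):
            mask |= 1 << j
    return mask

def ein_ring_mask(points, center, r, delta):
    full = (1 << len(points)) - 1
    outer = box_mask(points, center, r + delta)
    inner = box_mask(points, center, r)
    return outer & (full & ~inner)   # in outer box, not in inner box

pts = [(1, 1), (2, 3), (4, 4), (7, 2)]
print(bin(ein_ring_mask(pts, (3, 3), 1, 2)))  # -> 0b11: points 0 and 1
```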

  21. Proximal Support Vector Machine (P-SVM) 1) Find region components (proximities) using EIN-rings. 2) Calculate EIN-ring membership and find support vector pairs. 3) If the training space has d feature dimensions, find the d nearest boundary sentries in the training data to determine a local boundary hyperplane segment. The class label is then determined by the unclassified sample's location relative to that boundary hyperplane, as sketched below.
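A sketch of step 3 under one concrete reading: the hyperplane through the d sentries is found by solving w·s = 1 for each sentry (so this sketch assumes the hyperplane misses the origin), and the label is the sample's side of it. local_hyperplane and classify are hypothetical names; numpy is used for the small linear solve.

```python
# Step 3: d boundary sentries in d dimensions determine a local
# hyperplane segment H(x) = w . x + w0 = 0; classify by the sign of H.
import numpy as np

def local_hyperplane(sentries):
    S = np.asarray(sentries, dtype=float)     # d sentries x d features
    w = np.linalg.solve(S, np.ones(len(S)))   # w . s = 1 for each sentry
    return w, -1.0                            # normalization w0 = -1

def classify(x, w, w0, positive_side):
    """positive_side is +1 or -1, depending on where the '+' class sits."""
    return positive_side * np.sign(np.dot(w, x) + w0)

w, w0 = local_hyperplane([(2.0, 0.0), (0.0, 2.0)])  # line x1 + x2 = 2
print(classify((3.0, 3.0), w, w0, +1))              # -> 1.0
```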

  22. Step 1: Find region components using EIN-rings (defined above). Assume outliers are eliminated during a data cleaning process. [Figure: training points grouped into region components labeled A, B, A.]

  23. Step 2: find support vector pairs. Step 3: fit the boundary hyperplane. • EIN-ring membership: M(x, c) = (1/N_c) · Σ_{r=1}^{m} w_r · |NBR(x, r) ∩ c|, where c is a region component, r is the ring radius, N_c is the number of points in c, and w_r is the weight of ring r. • Support vector pairs straddle the gap between the + and − components; boundary sentries (*) lie between them. • Boundary hyperplane: H(x) = w·x + w0. [Figure: + and − training points with support vector pairs, boundary sentries (*), and the fitted boundary segment (#).]
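A sketch of that membership computation; neighbors(x, r) is assumed to return the index set of training points in the r-th EIN-ring around x (e.g., built from the ring masks sketched earlier), and the ring weights w_r are supplied by the caller.

```python
# Ring-weighted membership of sample x in component c (a set of
# training-point indices), following the formula reconstructed above.

def membership(x, component, rings, weights, neighbors):
    """M(x, c) = (1/N_c) * sum over r of w_r * |NBR(x, r) & c|."""
    return sum(w * len(neighbors(x, r) & component)
               for r, w in zip(rings, weights)) / len(component)

# toy usage: one ring with unit weight
nbr = lambda x, r: {0, 1, 4}
print(membership((3, 3), {0, 1, 2, 3}, [1], [1.0], nbr))  # -> 0.5
```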

  24. PRELIMINARY EVALUATION Aerial (TIFF) Image and Yield Map from Oaks, North Dakota

  25. PRELIMINARY EVALUATION In each experiment run, we randomly select 10% of the data set as test data and the rest as training data. P-SVM correctness appears to be comparable to standard SVM.

  26. PRELIMINARY EVALUATION P-SVM speed appears to be superior to standard SVM. [Chart: running times of P-SVM vs. standard SVM; one standard-SVM entry is labeled "Inf.".]

  27. CONCLUSION • In this paper, we propose an efficient P-tree based proximal Support Vector Machine (P-SVM), which appears to improve speed without sacrificing accuracy. • In the future, more extensive experiments and the combination of P-SVM with KNN will be explored.

  28. THANKS
