Spatial Proximity of Structural Data Attributes

Maria Canton, William Perrizo. Dept. of CS, North Dakota State University. CATA 2007 – Honolulu, Hawaii

Data analysis can be broken down into two parts: Querying and Data Mining. Data Mining can be broken down into two parts: Machine Learning and Association Rule Mining. Machine Learning can be broken down into two parts: Clustering and Classification. Clustering can be broken down into two parts: Isotropic (round clusters) and Density-based.

So Machine Learning begins by identifying Near Neighbor Set(s), NNS. In Isotropic Clustering, round sets are identified (disk shaped Near Neighbor Sets about a center). In Density Clustering, cores are identified (dense NNSs) then pieced together by overlap. Classification is always based on continuity which is necessarily Near Neighbor Set based.

Classification: We classify a sample based on its NNS class histogram (AKA k Nearest Neighbor or kNN classification), or we identify isotropic NNSs of centroids (AKA k-means), or we build decision trees whose leaves are disjoint training subsets whose histograms classify samples falling to that leaf, or we find class boundaries (e.g., SVM) which distinguish NNSs in one class from the rest.
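As a minimal sketch of the first option (classifying a sample from the class histogram of its near neighbor set), the following assumes Euclidean distance and a majority vote; the function name `knn_classify` and the toy training set are illustrative and not taken from the presentation.

```python
from collections import Counter
import math


def knn_classify(sample, training, k=3):
    """Classify `sample` from the class histogram of its k nearest neighbors.

    `training` is a list of (vector, label) pairs. Euclidean distance is an
    assumption made for this sketch; any distance could be substituted.
    """
    # The k training rows closest to the sample form its near neighbor set.
    neighbors = sorted(training, key=lambda row: math.dist(sample, row[0]))[:k]
    # Class histogram of that near neighbor set.
    histogram = Counter(label for _, label in neighbors)
    # The most frequent class in the histogram is the predicted class.
    return histogram.most_common(1)[0][0], histogram


training = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
]
label, histogram = knn_classify((0.0, 0.1), training, k=3)
```

Here the three nearest neighbors of the sample all carry class "a", so the histogram has a single bin and the vote is unanimous.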

Continuity: Recalling the definition of continuity: $\forall \varepsilon > 0\ \exists \delta > 0 : d(x,a) < \delta \Rightarrow d(f(x),f(a)) < \varepsilon$, or, said using Near Neighbor Sets, $\forall$ NNS about $f(a)$ $\exists$ NNS about $a$ that maps inside it. In a database, class values are discrete (finite), and thus Nearest Neighbor Sets (Proximity Sets) are fundamental to Machine Learning.

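Near Neighbor Sets of a set: Given a similarity $s: R \times R \to \text{Reals}$ (e.g., $s(x,y) = s(y,x)$ and $s(x,x) \ge s(x,y)\ \forall x, y \in R$), an extension of $s$ to disjoint subsets of $R$ (e.g., single link / complete link / average link...), and $C \subseteq R$, a k-disk of $C$ (a k Nearest Neighbor Set of $C$) is a set $\mathrm{disk}(C,k) \supseteq C$ such that $|\mathrm{disk}(C,k) \cap C'| = k$ and $s(x,C) \ge s(y,C)\ \forall x \in \mathrm{disk}(C,k),\ y \notin \mathrm{disk}(C,k)$, where $C'$ is the complement of $C$ in $R$.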
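The k-disk definition can be sketched in a few lines, assuming a single-link extension of the similarity to sets and arbitrary tie-breaking (the definition allows more than one k-disk when similarities tie). The names `set_similarity` and `k_disk` and the one-dimensional example are hypothetical.

```python
def set_similarity(x, C, s):
    """Single-link extension (an assumption): similarity of x to its most similar member of C."""
    return max(s(x, c) for c in C)


def k_disk(R, C, k, s):
    """Return C together with the k points outside C that are most similar to C."""
    outside = [x for x in R if x not in C]
    # Rank points outside C by decreasing similarity to C; ties are broken arbitrarily.
    ranked = sorted(outside, key=lambda x: set_similarity(x, C, s), reverse=True)
    return set(C) | set(ranked[:k])


R = [0, 1, 2, 4, 7, 11]
C = {1, 2}
s = lambda x, y: -abs(x - y)  # a simple symmetric similarity on numbers

disk2 = k_disk(R, C, 2, s)
```

With these values, the two points outside C that are most similar to C are 0 and 4, so `disk2` contains C plus exactly k = 2 further points.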

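[Figure: disks of radii r1 and r2 about a set C, and about a single point, C = {a}.]

$\mathrm{skin}(C,k) \equiv \mathrm{disk}(C,k) - C$. Here skin stands for "s k immediate neighbors", and it is also a kNNS of $C$. $\mathrm{cskin}(C,k) \equiv \bigcup$ of all $\mathrm{skin}(C,k)$s (the closed skin), and $\mathrm{ring}(C,k) = \mathrm{cskin}(C,k) - \mathrm{cskin}(C,k-1)$.

$\mathrm{disk}(C,r_1) \equiv \{x \in R \mid s(x,C) \ge r_1\}$, $\mathrm{skin}(C,r_1) \equiv \mathrm{disk}(C,r_1) - C$, and $\mathrm{ring}(C,r_2,r_1) \equiv \mathrm{disk}(C,r_2) - \mathrm{disk}(C,r_1) \equiv \mathrm{skin}(C,r_2) - \mathrm{skin}(C,r_1)$.

Given a [pseudo] distance, d, rather than a similarity, just reverse all inequalities.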
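For the distance form of these definitions (inequalities reversed), a minimal sketch follows. It assumes a single-link distance from a point to a set and radii with r2 > r1; the function names are illustrative only.

```python
def set_distance(x, C, d):
    """Single-link distance (an assumption): distance from x to its nearest member of C."""
    return min(d(x, c) for c in C)


def disk(R, C, r, d):
    """All points of R within distance r of C (C itself is included)."""
    return {x for x in R if set_distance(x, C, d) <= r}


def skin(R, C, r, d):
    """The disk with C removed."""
    return disk(R, C, r, d) - set(C)


def ring(R, C, r2, r1, d):
    """Points within r2 of C but not within r1 of C (assumes r2 > r1)."""
    return disk(R, C, r2, d) - disk(R, C, r1, d)


R = {0, 1, 2, 4, 7, 11}
C = {1, 2}
d = lambda x, y: abs(x - y)

inner = disk(R, C, 1, d)
outer_ring = ring(R, C, 5, 1, d)
```

On this example the ring also equals `skin(R, C, 5, d) - skin(R, C, 1, d)`, matching the identity stated above.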

xyshad xyshad  xoyy = xoyy = xoyy |y| |y| yoy |y|2 y  x A useful non-isotopic vector space proximity structure Theshadow vector made by a vector x on another vector y, denoted xyshadow or just xyshad is the dot product of x with a unit vector in the y direction times that unit vector.

xyshad y xyperp  x The perp vector Theperpendicular vector made by a vector x on another vector y, denoted xyperpendicular or justxyperp = difference of x and its yshadow. xyperp x - xyshad |xyperp|2 = |x|2 - |xyshad|2 xyshad (xyperp) are linear in x

xyshad y xyperp  x Proximity Structures based on shad and perp In collaborative filtering, e.g., predicting the rating, um, of a movie, m, by a user, u, from ratings given by users, v, let's consider users as spatial vectors of ratings over movie dimensions ( Netflix prize) The other users, v, provide signals for predicting um. Note that a user, v, whose ratings are: vn= un+1 for all movies, n, that u has already rated, is just as strong a prediction signal as one with exactly matching ratings, vn= unnSupp(u) In standard collaborative filtering, such vs (I will call them +1 signals) are filtered out as not being proximal to u.

xyshad xyshad vm- (1/n)SignedLength(v-u)shad = xoyy yoy y(1..1)=1|y|2=nxshad= xo11=kxk1=x1xperp=x-x1 y=1 n n xyperp  x=v-u Pure Signals in Collaborative Filters proximity structures Filter out all collaborators except exact match signals, +1 signals and -1 signals (collectively called pure signals), as non-proximal? For this we use y=(1,1,...,1) RatingPrediction-v = SignedLength(v-u)shad= |v-u|cos = (v-u)o(1/n ) = vo1/n -uo1/n = vk/n -uk/n = (n) (v-u) xyperp x - xyshad

Remotely Sensed Imagery domains • Spatial domain functionals, used in analyzing remotely sensed imagery, take into account pixels’ structural attributes as well as neighborhood conditions. • Using the programming utility, TM-Mine, we find the following.

Execution Times for Band Functionals of Different Complexities on a Full TM Scene of 210,000,000 Pixels (P4 3.0 GHz, dataset size of 2.10 × 10^8)

Functional | Formula | Execution time
VI (vegetation index) | NIR / R | 142.5 seconds
NDVI (normalized difference) | (NIR – R) / (NIR + R) | 307.5 seconds
TVI (transformed veg index) | {[(G-B)/(G+B)+0.5]^0.5}*100 | 442.0 seconds
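The three band functionals can be written per pixel as below. This is only a sketch of the formulas as printed above (including the G and B bands in the TVI formula); it is not TM-Mine code, the function names are hypothetical, and zero denominators or negative square-root arguments are not handled.

```python
import math


def vi(nir, r):
    """Vegetation index: NIR / R."""
    return nir / r


def ndvi(nir, r):
    """Normalized difference vegetation index: (NIR - R) / (NIR + R)."""
    return (nir - r) / (nir + r)


def tvi(g, b):
    """Transformed vegetation index as printed: sqrt((G - B) / (G + B) + 0.5) * 100."""
    return math.sqrt((g - b) / (g + b) + 0.5) * 100


# One made-up pixel with band values chosen to keep the arithmetic simple.
pixel = {"NIR": 120, "R": 40, "G": 60, "B": 20}
vi_value = vi(pixel["NIR"], pixel["R"])
ndvi_value = ndvi(pixel["NIR"], pixel["R"])
tvi_value = tvi(pixel["G"], pixel["B"])
```

The formulas grow in arithmetic complexity from VI to NDVI to TVI, which matches the ordering of the execution times reported above.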


Execution Times of Pixel-Matching 1 to 6 Bands

Thank you
