This study surveys data analysis, focusing on machine learning, clustering, classification, and continuity, and introduces proximity sets and non-isotropic vector space proximity structures. It applies these structures to predicting user ratings in collaborative filtering and to analyzing remotely sensed imagery.
Spatial Proximity of Structural Data Attributes
Maria Canton, William Perrizo
Dept. of CS, North Dakota State University
CATA 2007 – Honolulu, Hawaii
Data analysis can be broken down into two parts: Querying and Data Mining. Data Mining can be broken down into two parts: Machine Learning and Association Rule Mining. Machine Learning can be broken down into two parts: Clustering and Classification. Clustering can be broken down into two parts: Isotropic (round clusters) and Density-based.
So Machine Learning begins by identifying Near Neighbor Set(s), NNS. In Isotropic Clustering, round sets are identified (disk shaped Near Neighbor Sets about a center). In Density Clustering, cores are identified (dense NNSs) then pieced together by overlap. Classification is always based on continuity which is necessarily Near Neighbor Set based.
Classification
We classify a sample based on its NNS class histogram (AKA k Nearest Neighbor or kNN classification), or we identify isotropic NNSs of centroids (AKA k-means), or we build decision trees whose leaves are disjoint Training Subsets whose histograms classify samples falling to that leaf, or we find class boundaries (e.g., SVM) which distinguish NNSs in one class from the rest.
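The first option, classifying a sample by the class histogram of its Near Neighbor Set, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_classify`, the Euclidean distance, and the toy training data are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_classify(sample, training, k):
    """Classify `sample` by the class histogram of its k nearest training points.

    `training` is a list of (vector, label) pairs. Euclidean distance is an
    assumption of this sketch; any similarity or distance could be substituted.
    """
    # The k closest training points form the Near Neighbor Set of the sample.
    neighbors = sorted(training, key=lambda pair: math.dist(sample, pair[0]))[:k]
    # Build the class histogram of that set and return the most frequent class.
    histogram = Counter(label for _, label in neighbors)
    return histogram.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes in the plane.
training = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
            ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
label = knn_classify((0.15, 0.15), training, k=3)
```

Here the three nearest neighbors of the sample all carry class "a", so the histogram vote is unanimous.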
Continuity
Recalling the definition of continuity: $\forall \varepsilon > 0\ \exists \delta > 0 : d(x,a) < \delta \Rightarrow d(f(x),f(a)) < \varepsilon$, or, said using Near Neighbor Sets: for every NNS about $f(a)$ there exists an NNS about $a$ that maps inside it. In a database, class values are discrete (finite), and thus Nearest Neighbor Sets (Proximity Sets) are fundamental to Machine Learning.
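Near Neighbor Sets of a set
Given a similarity, $s: R \times R \to \mathbb{R}$ (e.g., $s(x,y) = s(y,x)$ and $s(x,x) \ge s(x,y)\ \forall x, y \in R$), an extension of $s$ to disjoint subsets of $R$ (e.g., single link / complete link / average link...), and $C \subseteq R$, a k-disk of C (a k Nearest Neighbor Set of C) is a set $disk(C,k) \supseteq C$ such that $|disk(C,k) \cap C'| = k$ and $s(x,C) \ge s(y,C)\ \forall x \in disk(C,k),\ y \notin disk(C,k)$, where $C'$ denotes the complement of $C$ in $R$.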
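[Figure: disk, skin, and ring about a set C, and about a single point C = {a}, with radii r1 and r2]
$skin(C,k) \equiv disk(C,k) - C$. Skin stands for "s k immediate neighbors" and is also a kNNS of C. $cskin(C,k) \equiv \bigcup$ of all $skin(C,k)$s is the closed skin, and $ring(C,k) = cskin(C,k) - cskin(C,k-1)$.
For a similarity threshold rather than a count: $disk(C,r_1) \equiv \{x \in R \mid s(x,C) \ge r_1\}$, $skin(C,r_1) \equiv disk(C,r_1) - C$, and $ring(C,r_2,r_1) \equiv disk(C,r_2) - disk(C,r_1) \equiv skin(C,r_2) - skin(C,r_1)$.
Given a [pseudo] distance, d, rather than a similarity, just reverse all inequalities.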
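The disk, skin, and ring constructions can be sketched in code for the distance case (all inequalities reversed relative to a similarity). This is an illustrative sketch only: the function names, the Euclidean metric, the single-link extension of the distance to sets, and the sample points are assumptions, and C is assumed to be a subset of R.

```python
import math


def dist_to_set(x, C):
    """Single-link extension: distance from point x to the nearest member of C."""
    return min(math.dist(x, c) for c in C)


def disk(R, C, r):
    """All points of R within distance r of C (members of C are at distance 0)."""
    return {x for x in R if dist_to_set(x, C) <= r}


def skin(R, C, r):
    """The disk with C itself removed."""
    return disk(R, C, r) - set(C)


def ring(R, C, r2, r1):
    """Points in the larger disk (radius r2) but not in the smaller one (radius r1)."""
    return disk(R, C, r2) - disk(R, C, r1)


def k_disk(R, C, k):
    """C plus its k nearest outside points; ties at the cutoff are broken arbitrarily."""
    outside = sorted((x for x in R if x not in C), key=lambda x: dist_to_set(x, C))
    return set(C) | set(outside[:k])


# Hypothetical points in the plane, with C = {a} for a single point a.
R = {(0, 0), (1, 0), (0, 1), (2, 0), (3, 0)}
C = {(0, 0)}
inner_skin = skin(R, C, 1)      # the two points at distance 1
outer_ring = ring(R, C, 2, 1)   # the point at distance 2
two_disk = k_disk(R, C, 2)      # C plus its 2 nearest neighbors
```

Because a k-disk is defined only up to ties, `k_disk` returns one valid k-disk; the closed skin in the text is the union over all such choices.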
A useful non-isotropic vector space proximity structure
The shadow vector made by a vector x on another vector y, denoted $x_{y\,\mathrm{shadow}}$ or just $x_{y\,\mathrm{shad}}$, is the dot product of x with a unit vector in the y direction, times that unit vector:
$x_{y\,\mathrm{shad}} = \left(x \circ \frac{y}{|y|}\right)\frac{y}{|y|} = \frac{x \circ y}{|y|^2}\,y = \frac{x \circ y}{y \circ y}\,y$
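The perp vector
The perpendicular vector made by a vector x on another vector y, denoted $x_{y\,\mathrm{perpendicular}}$ or just $x_{y\,\mathrm{perp}}$, is the difference of x and its y-shadow: $x_{y\,\mathrm{perp}} \equiv x - x_{y\,\mathrm{shad}}$. It satisfies $|x_{y\,\mathrm{perp}}|^2 = |x|^2 - |x_{y\,\mathrm{shad}}|^2$. Both $x_{y\,\mathrm{shad}}$ and $x_{y\,\mathrm{perp}}$ are linear in x.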
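The shadow and perp vectors can be computed directly from their definitions. The sketch below is an illustration under assumed names (`dot`, `shad`, `perp`) and uses plain Python lists as vectors; it also assumes y is not the zero vector.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def shad(x, y):
    """Shadow of x on y: ((x . y) / (y . y)) * y. Assumes y is nonzero."""
    scale = dot(x, y) / dot(y, y)
    return [scale * b for b in y]


def perp(x, y):
    """Perp of x on y: x minus its shadow on y."""
    return [a - s for a, s in zip(x, shad(x, y))]


x = [3.0, 4.0]
y = [2.0, 0.0]
x_shad = shad(x, y)   # [3.0, 0.0]
x_perp = perp(x, y)   # [0.0, 4.0]
```

In this example |x|^2 = 25, |shad|^2 = 9, and |perp|^2 = 16, which matches the Pythagorean identity stated in the text.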
Proximity Structures based on shad and perp
In collaborative filtering (e.g., predicting the rating, $u_m$, of a movie, m, by a user, u, from ratings given by other users, v), let's consider users as spatial vectors of ratings over movie dimensions (as in the Netflix Prize). The other users, v, provide signals for predicting $u_m$. Note that a user, v, whose ratings are $v_n = u_n + 1$ for all movies, n, that u has already rated is just as strong a prediction signal as one with exactly matching ratings, $v_n = u_n\ \forall n \in Supp(u)$. In standard collaborative filtering, such v's (I will call them +1 signals) are filtered out as not being proximal to u.
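Pure Signals in Collaborative Filters proximity structures
Filter out all collaborators except exact match signals, +1 signals and -1 signals (collectively called pure signals), as non-proximal? For this we use $y = \mathbf{1} = (1,1,\ldots,1)$, for which $y \circ y = |y|^2 = n$, so that $x_{\mathrm{shad}} = \frac{x \circ \mathbf{1}}{n}\,\mathbf{1} = \frac{\sum_k x_k}{n}\,\mathbf{1} = \bar{x}\,\mathbf{1}$ and $x_{\mathrm{perp}} = x - \bar{x}\,\mathbf{1}$.
Taking $x = v - u$, with $\theta$ the angle between $v - u$ and $\mathbf{1}$:
$\mathrm{SignedLength}((v-u)_{\mathrm{shad}}) = |v-u|\cos\theta = (v-u) \circ \frac{\mathbf{1}}{\sqrt{n}} = \frac{v \circ \mathbf{1}}{\sqrt{n}} - \frac{u \circ \mathbf{1}}{\sqrt{n}} = \frac{\sum_k v_k}{\sqrt{n}} - \frac{\sum_k u_k}{\sqrt{n}} = \sqrt{n}\,\overline{(v-u)}$
$\mathrm{RatingPrediction}_v = v_m - \frac{1}{\sqrt{n}}\,\mathrm{SignedLength}((v-u)_{\mathrm{shad}}) = v_m - \overline{(v-u)}$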
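The prediction above reduces to subtracting the mean rating offset between v and u from v's rating of movie m, and a pure signal is a collaborator whose perp on (1,...,1) vanishes with an offset of -1, 0, or +1. The sketch below illustrates this under stated assumptions: users are dictionaries mapping movie to rating, the offset is taken over movies both users have rated, and the names `mean_offset`, `is_pure_signal`, and `predict_rating` are hypothetical.

```python
def mean_offset(u, v):
    """Mean of (v - u) over co-rated movies: SignedLength of the shadow / sqrt(n)."""
    common = [n for n in u if n in v]
    if not common:
        raise ValueError("u and v share no rated movies")
    return sum(v[n] - u[n] for n in common) / len(common)


def is_pure_signal(u, v, tol=1e-9):
    """True if v - u is constant over co-rated movies (zero perp on the all-ones
    vector) and that constant is -1, 0, or +1."""
    common = [n for n in u if n in v]
    offset = mean_offset(u, v)
    perp_is_zero = all(abs((v[n] - u[n]) - offset) <= tol for n in common)
    return perp_is_zero and any(abs(offset - c) <= tol for c in (-1, 0, 1))


def predict_rating(u, v, m):
    """Predict u's rating of movie m from collaborator v: v_m minus the mean offset."""
    return v[m] - mean_offset(u, v)


# Hypothetical ratings: plus_one rates every co-rated movie one point above u.
u = {"A": 3, "B": 4, "C": 2}
plus_one = {"A": 4, "B": 5, "C": 3, "M": 5}
noisy = {"A": 5, "B": 1, "C": 2, "M": 3}
prediction = predict_rating(u, plus_one, "M")   # 5 - 1 = 4.0
```

The +1 signal predicts a rating of 4 for movie M, exactly as an exact-match collaborator who rated M at 4 would.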
Remotely Sensed Imagery domains
• Spatial domain functionals, used in analyzing remotely sensed imagery, take into account pixels' structural attributes as well as neighborhood conditions.
• Using the programming utility, TM-Mine, we find the following.
Execution Times for Band Functionals of Different Complexities on a Full TM Scene of 210,000,000 Pixels
Platform: P4 3.0 GHz; dataset size 2.10 × 10^8 pixels

Functional | Formula | Execution time
VI (vegetation index) | NIR / R | 142.5 seconds
NDVI (normalized difference) | (NIR – R) / (NIR + R) | 307.5 seconds
TVI (transformed veg index) | {[(G-B)/(G+B)+0.5]^0.5}*100 | 442.0 seconds
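For reference, the three band functionals can be written per pixel as below. This is a plain illustration of the formulas as they appear on the slide (including TVI in terms of the G and B bands), not the TM-Mine implementation; the function names and sample band values are assumptions, and the sketch assumes nonzero denominators and a non-negative argument to the square root.

```python
import math


def vi(nir, r):
    """Vegetation index: NIR / R."""
    return nir / r


def ndvi(nir, r):
    """Normalized difference vegetation index: (NIR - R) / (NIR + R)."""
    return (nir - r) / (nir + r)


def tvi(g, b):
    """Transformed vegetation index as given on the slide:
    sqrt((G - B) / (G + B) + 0.5) * 100."""
    return math.sqrt((g - b) / (g + b) + 0.5) * 100


# Hypothetical band values for a single pixel.
pixel_vi = vi(80.0, 20.0)       # 4.0
pixel_ndvi = ndvi(80.0, 20.0)   # 0.6
pixel_tvi = tvi(60.0, 40.0)     # sqrt(0.7) * 100
```

The growing operation count from VI to NDVI to TVI (one division; a subtraction, addition, and division; then an added offset, square root, and scaling) is in line with the increasing execution times reported above.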