Topological characterization of an image dataset with Betti numbers and a generative model
Maxime MAILLOT (Exalead), Michaël AUPETIT (CEA LIST), Gérard GOVAERT (UTC-CNRS)
DataSense | 08-07-2014
Context
• Multivariate data exploration
  - Signals, images, ...
• Classical ML techniques
  - Clustering: K-Means, Gaussian Mixture Models -> convex clusters
  - Dimension reduction: Self-Organizing Maps, MDS, PCA -> dimension-reduction artefacts imposed by the representation space
• Topological information (from the underlying structure):
  - Number of connected components
  - Intrinsic dimension
  - Topological invariants (Betti numbers)
Why topological information?
• Cognition and topology
  - The neuronal encoding of topological information survived Darwinian natural selection, which shows how important this information is to our cognitive processes
[Figure: retinotopic map of a mouse [Hübener 2003]]
Why topological information?
• Topology and visual perception
  - Gestalt psychological theory [1920]
  - The whole is more than the sum of its parts
  - Laws of continuity, proximity, similarity
• Topological view: underlying structure
• Statistical view: underlying density
• Geometrical view: point locations or underlying shapes
• Descriptive model: the sample is enough, no hypothesis about the population underlying the data
• Predictive model: our visual system instantly provides a topological model of the population
Why topological information?
• Mental map and topology
  - Topological invariants as an objective representation
  - However radically different the perception process and experience of each person are, a topological invariant still exists that is common to both persons' mental models and to the real building's map: they share the same connectedness
[Figure: objective map M of a building B; subjective map M1 of B; subjective map M2 of B]
Why topological information?
• Patterns reliability and topology
  - A large family of transformations: isometries ⊂ similarities ⊂ homeomorphisms ⊂ homotopies
• Reliability
  - The processing pipeline from data to decision is more likely to be a homotopy
  - So topological information is more likely to survive the distortions of the pipeline
  - Hence topological information is a more reliable basis for decisions under uncertainty
[Figure: nested families of transformations of the initial space, annotated with geometry, probability density functions, intrinsic dimension, and Betti numbers]
Some hints about topology
• Topology in a nutshell
  - What is the difference between a mug and a doughnut?
  - Taste is significantly different!
Some hints about topology
• Topological invariants
  - Two spaces have the same topology iff they are homeomorphic to each other, i.e. they are linked through a continuous function H whose inverse H^-1 is also continuous
  - Topology classifies spaces based on their topological invariants, like the Betti numbers
• Betti numbers (b0, b1, b2) = (1, 2, 1) in the example
  - b0: # of connected components
  - b1: # of independent 1-cycles (tunnels)
  - b2: # of independent 2-cycles (cavities)
  - The blue and brown 1-cycles cannot collapse to each other; they form a homology group, the rank of which is 2 (b1 = 2)
[Figure: 1-cycles which cannot contract to a point; 1-cycle which can contract to a point]
[Figure: topological inference from measures of Sensors 1, 2, 3: sample of a robot's trajectory in sensor space; image of walls 1 and 2 in the robot-to-sensors distance space]
From sets of points to Betti numbers
• Simplex family: 0-simplex, 1-simplex, 2-simplex, 3-simplex
• Simplex assembly: SIMPLICIAL COMPLEX
From sets of points to Betti numbers
• For any (smooth) manifold V there exists a simplicial complex C which is homeomorphic to V (C(V) is a triangulation of V)
• Two triangulations may have the same Betti numbers while their manifolds are not homeomorphic
• Simplicial complex -> computational topology -> Betti numbers
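The step from a simplicial complex to its Betti numbers can be sketched with boundary matrices. This is a minimal illustration, not the authors' implementation: it assumes rational coefficients, uses the identity b_k = n_k - rank(d_k) - rank(d_{k+1}) (n_k is the number of k-simplices, d_k the boundary operator), and the function names `closure`, `boundary_matrix` and `betti_numbers` are hypothetical.

```python
from itertools import combinations

import numpy as np


def closure(top_simplices):
    """Return all faces of the given simplices, grouped by dimension."""
    faces = set()
    for simplex in top_simplices:
        simplex = tuple(sorted(simplex))
        for size in range(1, len(simplex) + 1):
            faces.update(combinations(simplex, size))
    max_dim = max(len(f) for f in faces) - 1
    return [sorted(f for f in faces if len(f) == d + 1) for d in range(max_dim + 1)]


def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of the boundary operator from k-chains to (k-1)-chains."""
    row = {s: i for i, s in enumerate(km1_simplices)}
    mat = np.zeros((len(km1_simplices), len(k_simplices)))
    for col, simplex in enumerate(k_simplices):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop the i-th vertex
            mat[row[face], col] = (-1) ** i
    return mat


def betti_numbers(top_simplices):
    """Betti numbers (b0, b1, ...) of the complex generated by top_simplices."""
    by_dim = closure(top_simplices)
    ranks = [0]  # the boundary of a vertex is empty
    for d in range(1, len(by_dim)):
        mat = boundary_matrix(by_dim[d], by_dim[d - 1])
        ranks.append(int(np.linalg.matrix_rank(mat)))
    ranks.append(0)  # nothing above the top dimension
    return [len(by_dim[d]) - ranks[d] - ranks[d + 1] for d in range(len(by_dim))]


# Hollow tetrahedron: a triangulation of the sphere.
sphere = list(combinations(range(4), 3))
print(betti_numbers(sphere))
# Two triangle boundaries glued at vertex 0: a figure-8.
figure_eight = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 4), (0, 4)]
print(betti_numbers(figure_eight))
```

Floating-point rank is adequate for small complexes like these; larger ones would normally call for exact arithmetic or a dedicated computational topology library.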
From sets of points to Betti numbers
• Vietoris-Rips complex and Betti numbers
  - (b0, b1, b2) shown at several scales R: (N,0,0), (37,6,0), (1,2,0), (1,0,0)
• Topological persistence and multiscale analytics = persistence of topological structure through scale [Chazal]
[Figure: Vietoris-Rips complexes on points a, b, c, d for R = 9 and R = 11]
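How the Betti numbers of a Vietoris-Rips complex change with the scale R can be sketched as follows. This is an illustration under stated assumptions: two points are linked when their distance is at most R (some texts use 2R), only b0 and b1 are computed, and `rips_betti` is a hypothetical helper, not a library function.

```python
from itertools import combinations

import numpy as np


def rips_betti(points, radius):
    """Return (b0, b1) of the Vietoris-Rips complex of `points` at scale `radius`."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i, j] <= radius]
    edge_index = {e: k for k, e in enumerate(edges)}
    # A triangle belongs to the complex when its three edges do.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_index and (i, k) in edge_index and (j, k) in edge_index]

    # b0: number of connected components, via union-find on the edges.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    b0 = len({find(i) for i in range(n)})

    # b1 = dim ker(d1) - rank(d2), with dim ker(d1) = #edges - (#vertices - b0).
    rank_d2 = 0
    if triangles:
        d2 = np.zeros((len(edges), len(triangles)))
        for col, (i, j, k) in enumerate(triangles):
            d2[edge_index[(j, k)], col] = 1
            d2[edge_index[(i, k)], col] = -1
            d2[edge_index[(i, j)], col] = 1
        rank_d2 = int(np.linalg.matrix_rank(d2))
    b1 = len(edges) - n + b0 - rank_d2
    return b0, b1


# Twelve points evenly spaced on the unit circle.
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
for radius in (0.1, 0.6, 2.5):
    print(radius, rips_betti(circle, radius))
```

At a small scale every point is its own component, at an intermediate scale the single cycle of the circle appears, and at a large scale the complex fills in and the cycle disappears. Topological persistence tracks exactly this kind of birth and death across scales.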
Restricted Delaunay complex
• From manifold to triangulation [Edelsbrunner, Shah 1997]
[Figure: manifolds M1 and M2]
Restricted Delaunay complex
• Alpha-shapes [Edelsbrunner 1994]
  - Molecules topology
  - Manifold = union of spheres centered on the atoms' cores (alpha sets the spheres' radius)
Topology Representing Networks
• Topology Representing Network [Martinetz, Schulten 1994]
  - Connect the 1st and 2nd nearest-neighbor prototypes of each datum: Competitive Hebbian Learning (CHL)
  - ROI = order-2 Voronoi cells
[Figure: a datum with its 1st and 2nd nearest prototypes]
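The Competitive Hebbian Learning rule stated above can be sketched in a few lines. This is a minimal batch version that assumes Euclidean distance and fixed prototypes; the name `competitive_hebbian_learning` and the toy data are illustrative only.

```python
import numpy as np


def competitive_hebbian_learning(data, prototypes):
    """Return the set of edges linking, for each datum, its two nearest prototypes."""
    dist = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=-1)
    nearest_two = np.argsort(dist, axis=1)[:, :2]  # indices of the 1st and 2nd winners
    return {tuple(sorted((int(a), int(b)))) for a, b in nearest_two}


# Four prototypes on a line, three data points lying between consecutive prototypes.
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
data = np.array([[0.4, 0.0], [1.4, 0.0], [2.6, 0.0]])
edges = competitive_hebbian_learning(data, prototypes)
print(sorted(edges))
```

An edge is created only when some datum falls in the order-2 Voronoi cell of a prototype pair, so the resulting graph links only prototypes whose cells are supported by the data.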
Topology Representing Networks
[Figure: order-2 Voronoi cells; sample with no noise vs. sample with Gaussian noise]
A Generative model approach
• When a Statistician meets a Topologist...
  - What is the probability of a HEAD if you flip a coin cut in a Moebius strip?
  - HEAD or TAIL? P(HEAD) = ?
  - P(HEADACHE) = 1
[Figure: Moebius strip]
Generative Graph [Gaillard 2010]
• Statistical generative model: where do the data come from?
  - Unknown generative manifolds, with possibly different topologies, different labels, and possibly overlapping...
  - ...from which samples are drawn with unknown probability density...
  - ...corrupted with unknown noise...
  - ...leading to the actual data observations
• Topological inference: from the sample to the population
Generative Graph [Gaillard 2010]
• Statistical generative model: general hypotheses
  - Unknown generative manifolds, with possibly different topologies, different labels, and possibly overlapping...
  - ...from which samples are drawn with unknown probability density...
  - ...corrupted with unknown noise...
  - ...leading to the actual data observations
Generative Graph [Gaillard 2010]
• Statistical generative model: simplified hypotheses
  - Unknown generative manifolds...
  - ...from which samples are drawn with unknown probability density...
  - ...corrupted with unknown noise...
Generative Graph [Gaillard 2010]
• Generative Gaussian Graph (GGG): simplified hypotheses
  - Unknown generative manifolds... -> Delaunay graph of some prototypes, with class label probability (p, 1-p)
  - ...from which samples are drawn with unknown probability density... -> uniform density over each topological component (vertices and edges)
  - ...corrupted with unknown noise... -> Gaussian noise with identity covariance
Generative Graph [Gaillard 2010]
• GGG: from data to topological synthesis
  - Multivariate data -> GMM -> Delaunay -> likelihood maximization (EM) -> topological summary
  - Model selection (# vertices): Bayesian Information Criterion
Generative simplicial complex [Maillot 2012]
• Generative simplices family: A g0 … (pseudo-Monte Carlo estimation)
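The exact estimator behind "pseudo-Monte Carlo estimation" is not recoverable from the slide text. One plausible reading, sketched here as an assumption, takes the density of a generative simplex to be a uniform density over the simplex convolved with isotropic Gaussian noise, and approximates it by averaging Gaussian kernels centered on points drawn uniformly in the simplex. Plain random draws stand in for whatever sequence the authors used, and `gaussian_simplex_density` with its parameters is a hypothetical name.

```python
import numpy as np


def gaussian_simplex_density(x, vertices, sigma, n_mc=2000, seed=0):
    """Monte Carlo estimate of a 'uniform on simplex + Gaussian noise' density at x."""
    rng = np.random.default_rng(seed)
    vertices = np.asarray(vertices, dtype=float)      # shape (d + 1, dim)
    x = np.atleast_2d(np.asarray(x, dtype=float))     # shape (n_x, dim)
    dim = vertices.shape[1]
    # Normalized exponential draws give barycentric weights uniform on the simplex.
    expo = rng.exponential(size=(n_mc, len(vertices)))
    support = (expo / expo.sum(axis=1, keepdims=True)) @ vertices
    sq_dist = ((x[:, None, :] - support[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * sigma ** 2) ** (-dim / 2.0)
    # Average of isotropic Gaussian kernels centered on the sampled support points.
    return norm * np.exp(-sq_dist / (2.0 * sigma ** 2)).mean(axis=1)


# A 0-simplex (single vertex) reduces to an ordinary isotropic Gaussian.
vertex = [[0.0, 0.0]]
print(gaussian_simplex_density([0.0, 0.0], vertex, sigma=1.0))
# A 1-simplex (edge): the density is higher near the edge than far from it.
edge = [[-1.0, 0.0], [1.0, 0.0]]
print(gaussian_simplex_density([[0.0, 0.0], [0.0, 3.0]], edge, sigma=0.5))
```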
Data sampled from a generative Gaussian simplex
[Figure: panels for d = 0, d = 1, d = 2 and σ = 0.1, σ = 0.2, σ = 0.5]
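Data like those in the figure can be drawn with a short sketch. It assumes that a generative Gaussian simplex means "a point uniform on the simplex plus isotropic Gaussian noise of standard deviation σ"; the function name `sample_gaussian_simplex` and the vertex coordinates are illustrative.

```python
import numpy as np


def sample_gaussian_simplex(vertices, sigma, n_samples, rng):
    """Draw points uniformly on a simplex, then add isotropic Gaussian noise."""
    vertices = np.asarray(vertices, dtype=float)  # shape (d + 1, dim)
    # Normalized exponential draws give barycentric weights uniform on the simplex.
    expo = rng.exponential(size=(n_samples, len(vertices)))
    weights = expo / expo.sum(axis=1, keepdims=True)
    on_simplex = weights @ vertices
    return on_simplex + rng.normal(scale=sigma, size=on_simplex.shape)


rng = np.random.default_rng(0)
simplices = {
    0: [[0.0, 0.0]],                          # d = 0: a vertex
    1: [[0.0, 0.0], [1.0, 0.0]],              # d = 1: an edge
    2: [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # d = 2: a triangle
}
for d, verts in simplices.items():
    for sigma in (0.1, 0.2, 0.5):
        cloud = sample_gaussian_simplex(verts, sigma, 500, rng)
        print(d, sigma, cloud.shape)
```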
Generative simplicial complex
• Expectation-Maximization
  - Components sorted by proportion: π1 < π2 < π3 < ... < πi < ... < πn
  - BIC max
From data to Generative simplicial complex
• Prototype locations initialized with GMM
From data to Generative simplicial complex
• Delaunay complex built on top of the prototypes
  - First the edges...
  - Then the surfaces...
From data to Generative simplicial complex
• Likelihood maximization for dimension-1 components
  - The proportion p of each edge is estimated with EM
  - Edges whose proportion is too low do not contribute significantly to the model (w.r.t. the Bayesian Information Criterion); they are pruned from the model
From data to Generative simplicial complex
• Likelihood maximization for dimension-2 components
  - Proportions of both surfaces and remaining edges are estimated with EM, then pruned w.r.t. BIC
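The "estimate proportions with EM, then prune w.r.t. BIC" loop can be sketched as follows. This is a simplified illustration, not the authors' algorithm: component densities are held fixed and passed in as a matrix, only the mixing proportions are re-estimated, the BIC is taken in the "larger is better" convention log-likelihood - (k - 1)/2 · log(n) with only the proportions counted as free parameters, and `em_proportions` and `prune_by_bic` are hypothetical names.

```python
import numpy as np


def em_proportions(dens, n_iter=200):
    """EM for the mixing proportions only; dens[i, j] is g_j(x_i), held fixed."""
    n_points, n_comp = dens.shape
    pi = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        weighted = dens * pi
        resp = weighted / weighted.sum(axis=1, keepdims=True)  # E-step
        pi = resp.mean(axis=0)                                 # M-step
    loglik = float(np.log((dens * pi).sum(axis=1)).sum())
    return pi, loglik


def prune_by_bic(dens):
    """Drop the lowest-proportion component one at a time; keep the best-BIC subset."""
    n_points = dens.shape[0]
    keep = list(range(dens.shape[1]))
    best = (-np.inf, None, None)
    while keep:
        pi, loglik = em_proportions(dens[:, keep])
        bic = loglik - 0.5 * (len(keep) - 1) * np.log(n_points)
        if bic > best[0]:
            best = (bic, list(keep), pi)
        keep.pop(int(np.argmin(pi)))  # prune the weakest component and refit
    return best[1], best[2], best[0]


# Toy 1-D data from two clusters; the component centered at 5 is spurious.
x = np.concatenate([np.linspace(-1.0, 1.0, 50), np.linspace(9.0, 11.0, 50)])
centers = np.array([0.0, 5.0, 10.0])
dens = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
kept, proportions, bic = prune_by_bic(dens)
print(kept, proportions)
```

In the toy run the spurious middle component receives almost no responsibility, so removing it costs no likelihood and improves the criterion, which is the behavior the slides describe for low-proportion edges and surfaces.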
From data to Generative simplicial complex
• Topological cleaning
  - If a simplex survived, all its facets are pruned
Results (1/3)
• SPHERE: (1,0,1,0…)
• TORUS: (1,2,1,0…)
• KLEIN BOTTLE: (1,1,0…)
Results (2/3)
• Image data COIL-100:
  - 100 rotating objects, each represented by 72 images (one every 5°) of 64x64 pixels (projected by PCA onto the first 71 principal components)
  - 0 2D simplices
  - Delaunay complex only computed for 1D then 2D elements in the 71D space
  - We recover a cycle structure
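The PCA preprocessing step can be sketched with an SVD. The array below is random stand-in data with the shapes quoted on the slide (72 images of 64x64 pixels), not COIL-100 itself, and `pca_project` is a hypothetical helper. After centering, 72 points span at most 71 dimensions, so keeping 71 components loses no pairwise-distance information.

```python
import numpy as np


def pca_project(images, n_components):
    """Flatten images, center them, and project onto the leading principal axes."""
    flat = images.reshape(len(images), -1).astype(float)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
views = rng.random((72, 64, 64))  # stand-in for the 72 views of one object
projected = pca_project(views, 71)
print(projected.shape)
```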
Results (3/3)
• Image data COIL-100:
  - Expected Betti numbers: (1,1,0 …)
  - (1,2,0 …) corresponds to a figure-8 shape
  - (1,n,0 …) shows that many faces of the object look similar
  - (1,0,0,…) shows a rotationally invariant object
[Figure: example for (1,2,0,…) (like an 8)]
Conclusions
• GSC: first generative model to extract Betti numbers from a data set
• No meta-parameter to tune (EM + BIC)
Perspectives
• Topological analysis for each connected component separately
• Algorithmic improvements (pseudo-Monte Carlo, pruning, ...)
• Link between the BIC optimum and Betti numbers
• Deep Networks: how could topological invariants be explicitly encoded within each layer?
Thank you for your attention
• Aupetit M. Learning Topology with the Generative Gaussian Graph and the EM algorithm. NIPS 2005 conference proceedings, pp. 83-90, 2006.
• Gaillard P., Aupetit M., Govaert G. Learning topology of a labeled data set with the supervised generative Gaussian graph. Neurocomputing, 71(7-9): 1283-1299, Elsevier, March 2008.
• Maillot M., Aupetit M., Govaert G. Extraction of Betti numbers based on a generative model. ESANN 2012.
• Maillot M., Aupetit M., Govaert G. The Generative Simplicial Complex to extract Betti numbers from unlabeled data. Workshop at NIPS 2012.
• Questions?
QUESTIONS
• Why an isotropic noise model? -> So that the complexity of the model is captured by the simplicial complex and the Betti numbers
• Why Betti numbers? Connectedness seems to be enough for the applications? Shape taken by the states of a dynamical system (epilepsy / normal-alert-catastrophe cases...): no real case yet, but development of a model / measurement system.
Suggestions:
• Comparison of N-D topology vs 2-D topology to evaluate projection distortions
• Dynamical system that changes shape and whose shape indicates its state (good, alert, bad)
• Topological analysis/characterization of data
• Monitoring of passage through the alert zone (dynamical system whose noisy state is observed): we want to check that one cannot go directly from a good state to a bad state without passing through the alert state. Extension of the SGGG to the case of simplicial complexes: holes in the structure = possible leak. TO BE CLARIFIED
• Case of speaker analysis on letters (NSI2000 triangle): use one speaker as a vertex of the GSC and position the others relative to it, detect the shape of the pronounced letters. NO, DOES NOT WORK: the shape is similar up to a homothety