Interoperable Framework for Space Science Data Mining (F-MASS)

An Interoperable Framework for Mining and Analysis of Space Science Data (F-MASS)
PI: Sara J. Graves (sgraves@itsc.uah.edu)
Project Lead: Rahul Ramachandran (rramachandran@itsc.uah.edu)
Information Technology and Systems Center, University of Alabama in Huntsville
http://www.itsc.uah.edu

Others Involved in the Project
• Wladislaw Lyatsky and Arjun Tan (Co-PI), Department of Physics, Alabama A&M University
• Glynn Germany, Center for Space Plasma, Aeronomy, and Astrophysics Research, University of Alabama in Huntsville
• Xiang Li, Matt He, John Rushing and Amy Lin, ITSC, University of Alabama in Huntsville

Project Objectives
• Extend the existing scientific data mining framework by providing additional data mining algorithms and customized user interfaces appropriate for the space science research domain
• Provide a framework for mining to allow better data exploitation and use
• Utilize specific space science research scenarios as use case drivers for identifying additional techniques to be incorporated into the framework
• Enable scientific discovery and analysis

Presentation Outline
• Overview of the “New” Mining Framework
• Case Study: Comparing Different Thresholding Algorithms for Segmenting Auroras

ADaM 4.0: Algorithm Development and Mining

Mining System: Design Objectives
• Ease of Use!
• Reusable Components
• Simple Internal Data Model
• Allow both loose and tight coupling with other applications/systems
• Flexible to allow ease of use in both batch and interactive mode

Mining System Design
• Each component is provided with a C++ application programming interface (API) and an executable that supports scripting tools (e.g. Perl, Python, Tcl, Shell)
• ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
• ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
• ADaM also includes Python wrappers
• ADaM toolkit is available to all
[Figure: toolkit operations (data mining toolkit, image processing toolkit) and a virtual repository of operations; use operations as stand-alone executables; build generic applications; build customized applications; provide mining operations as web services]

ADaM 4.0 Components

ADaM Example: Classification Process
• Identify potential features which may characterize the phenomenon of interest
• Generate a set of training instances where each instance consists of a set of feature values and the corresponding class label
• Describe the instances using the ARFF file format
• Preprocess the data as necessary (normalize, sample, etc.)
• Split the data into training / test set(s) as appropriate
• Train the classifier using the training set
• Evaluate classifier performance using the test set

Sample Data Set – ARFF Format
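As a reminder of the layout, an ARFF file has a header (an @relation line followed by @attribute lines) and then an @data section with one comma-separated instance per line. The sketch below writes such a file from Python; the relation name, attribute names and values are made up for illustration and are not taken from the data set shown on the slide.

```python
def to_arff(relation, numeric_attrs, class_values, instances):
    """Return ARFF text for instances given as (feature_tuple, class_label)."""
    lines = ["@relation " + relation, ""]
    for name in numeric_attrs:
        lines.append("@attribute " + name + " numeric")
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for features, label in instances:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-feature data set with made-up values.
example = to_arff(
    "aurora_pixels",
    ["intensity", "latitude"],
    ["aurora", "background"],
    [((182, 68.5), "aurora"), ((23, 52.0), "background")],
)
```

Writing `example` to a file gives something that can be passed to the command-line tools via their -i argument, assuming they accept standard ARFF.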

Utilities for Splitting the Samples
• ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiers
• The simplest is ITSC_Sample, which splits the source data set into two disjoint subsets
• Example: split the data set into two groups, one with roughly 2/3 of the patterns and the other with the remaining 1/3:
    ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66
• The -i argument specifies the input file name
• The -o and -t arguments specify the names of the two output files (-o = output one, -t = output two)
• The -p argument specifies the portion of the data that goes into output one (trn.arff); the remainder goes to output two (tst.arff)
• The -c argument tells the sample program which attribute is the class attribute
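The idea of a disjoint split can be illustrated in a few lines of Python. This is only a sketch of the concept, not of ITSC_Sample itself: how the tool selects patterns (for example, whether it balances the split per class) is not stated here, so the sketch assumes a plain seeded random shuffle, and `split_instances` is a hypothetical helper name.

```python
import random


def split_instances(instances, portion, seed=0):
    """Split instances into two disjoint lists; `portion` goes to the first."""
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_instances(range(9), 0.66)
```

With nine patterns and a portion of 0.66, six go to the first subset and three to the second.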

Training the Classifier
• ADaM has several different types of classifiers
• Each classifier has a training method and an application method
• Example: Naïve Bayes classifier
    ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt
• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file

Applying the Classifier
• Once trained, the Naïve Bayes classifier can be used to classify unknown instances
• For this demo, the classifier is run as follows:
    ITSC_NaiveBayesApply -c class -i tst.arff -b bayes.txt -o res_tst.arff
• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file
• The -o argument specifies the name of the result file

Evaluating Classifier Performance
• By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual output
• ITSC_Accuracy is run as follows:
    ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt
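The comparison of expected and actual output reduces to counting matches. The sketch below shows that calculation on plain Python lists; it is an illustration of the idea only, since the exact statistics that ITSC_Accuracy writes to its output file are not described here, and the function name and labels are hypothetical.

```python
def accuracy(expected, predicted):
    """Fraction of instances whose predicted label equals the expected label."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


# Made-up labels: three of the four predictions match.
score = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```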

Python Script for Classification
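The four steps above (sample, train, apply, evaluate) can be chained from a script. The following is a minimal sketch, assuming the ADaM executables are on the PATH and accept exactly the arguments shown on the preceding slides. It calls them through `subprocess` rather than through the ADaM Python wrappers, whose API is not described here; `build_commands` and `run_pipeline` are illustrative names.

```python
import subprocess


def build_commands(data_file, class_attr="class", portion=0.66):
    """Return the four ADaM command lines used in the classification demo."""
    return [
        ["ITSC_Sample", "-c", class_attr, "-i", data_file,
         "-o", "trn.arff", "-t", "tst.arff", "-p", str(portion)],
        ["ITSC_NaiveBayesTrain", "-c", class_attr, "-i", "trn.arff",
         "-b", "bayes.txt"],
        ["ITSC_NaiveBayesApply", "-c", class_attr, "-i", "tst.arff",
         "-b", "bayes.txt", "-o", "res_tst.arff"],
        ["ITSC_Accuracy", "-c", class_attr, "-t", "res_tst.arff",
         "-v", "tst.arff", "-o", "acc_tst.txt"],
    ]


def run_pipeline(data_file):
    """Run each step in order, stopping at the first step that fails."""
    for command in build_commands(data_file):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("bcw.arff")` would then leave the accuracy report in acc_tst.txt, provided the executables are installed.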

Additional Information
• http://datamining.itsc.uah.edu/adam/
    • Additional information
    • Documentation
    • Download ADaM 4.0 executables (Windows and Linux)

Comparing Different Thresholding Algorithms for Segmenting Auroras
Case study using 130 images from UVI observations on September 14, 1997, covering the time period from 8:30 UT to 11:27 UT

Segmenting Auroral Events
• Spacecraft UV images observing auroral events contain two regions, an auroral oval and the background
• Under ideal circumstances, the histogram of these images has two distinct modes and a threshold value can be determined to separate the two regions
• Different factors such as the date, time of day, and satellite position all affect the luminosity gradient of the UV image, making the two regions overlap and thereby making threshold selection a nontrivial problem
• Objective of this study: compare different thresholding techniques and algorithms for segmenting auroral events in Polar UV images

Thresholding
• Global Thresholding
    • uses a fixed threshold for all pixels in the image
    • works only if the intensity histogram of the input image contains neatly separated peaks corresponding to the desired subject(s) and background(s)
    • cannot deal with images containing, for example, a strong illumination gradient
• Adaptive/Local Thresholding
    • selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood
    • allows for thresholding of an image whose global intensity histogram doesn't contain distinctive peaks
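The difference between the two approaches can be seen on a toy image with an illumination gradient. The sketch below is illustrative only: it assumes the local threshold is the midpoint between the minimum and maximum intensity in a square window, which is one simple way to use "the range of intensity values" in a neighbourhood, and all pixel values are made up.

```python
def global_threshold(image, t):
    """Label a pixel 1 if it is brighter than the single fixed threshold t."""
    return [[1 if v > t else 0 for v in row] for row in image]


def adaptive_threshold(image, radius=1):
    """Label a pixel 1 if it exceeds the midrange of its local window."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            local_t = (min(window) + max(window)) / 2
            out_row.append(1 if image[r][c] > local_t else 0)
        result.append(out_row)
    return result


# Background brightens from left to right; objects sit at columns 1 and 5.
row = [10, 60, 10, 40, 80, 130, 80]
image = [row, row, row]
fixed = global_threshold(image, 70)
local = adaptive_threshold(image, radius=1)
```

Here the fixed threshold of 70 misses the dim object at column 1 and labels bright background as object, while the local threshold isolates columns 1 and 5. A midrange rule like this one still misbehaves in perfectly uniform windows, which is one reason practical schemes add context information.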

Methodology Used in this Study
• Test global thresholding technique using the following algorithms
    • Mixture Modeling
    • Fuzzy Set
    • Entropy
• Develop adaptive thresholding technique using context information and the following algorithms
    • Modified Mixture Modeling
    • Fuzzy Set
    • Entropy
    • Edge Based Detection
• Test and evaluate

Mixture Modeling
• The mixture modeling thresholding algorithm assumes that the object and the background are normally distributed; the threshold is calculated by fitting two Gaussian distributions to the histogram and minimizing the fitting error
• Each Gaussian distribution is described by its mean and standard deviation
• The least-squares minimization function is the sum, over all gray levels, of the squared difference between the two-Gaussian model and the histogram
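One way to carry out such a fit without a general-purpose optimizer is to try every candidate split of the histogram, estimate a weighted Gaussian for each side from its mean and variance, and keep the split whose two-Gaussian model has the smallest squared error against the normalized histogram. The sketch below takes that route; it is a simplification chosen for illustration, and the exact fitting procedure and the weighting of the two Gaussians used in the study are assumptions here.

```python
import math


def gaussian(x, mean, var):
    """Normal probability density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_threshold(hist):
    """Return the split level whose two-Gaussian model best fits hist."""
    total = float(sum(hist))
    p = [h / total for h in hist]
    n = len(p)
    best_t, best_err = None, float("inf")
    for t in range(n - 1):
        w0, w1 = sum(p[: t + 1]), sum(p[t + 1:])
        if w0 <= 0.0 or w1 <= 0.0:
            continue
        m0 = sum(j * p[j] for j in range(t + 1)) / w0
        m1 = sum(j * p[j] for j in range(t + 1, n)) / w1
        v0 = sum(p[j] * (j - m0) ** 2 for j in range(t + 1)) / w0
        v1 = sum(p[j] * (j - m1) ** 2 for j in range(t + 1, n)) / w1
        if v0 <= 1e-12 or v1 <= 1e-12:
            continue  # a class with a single gray level cannot be fitted
        # Least-squares error between the mixture model and the histogram.
        err = sum(
            (w0 * gaussian(j, m0, v0) + w1 * gaussian(j, m1, v1) - p[j]) ** 2
            for j in range(n)
        )
        if err < best_err:
            best_t, best_err = t, err
    return best_t


# Synthetic bimodal histogram with modes at gray levels 50 and 150.
demo_hist = [
    1000 * (math.exp(-(j - 50) ** 2 / 200.0) + math.exp(-(j - 150) ** 2 / 200.0))
    for j in range(256)
]
demo_t = mixture_threshold(demo_hist)
```

For this clean synthetic histogram the selected level falls in the valley between the two modes; real auroral histograms with overlapping modes are where the methods start to disagree.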

Fuzzy Sets
• The fuzzy sets algorithm uses a membership function to define a fuzzy object region
• In ordinary set theory, a set has no elements in common with its complement
• In a fuzzy set, each element may belong both to a set and to its complement, each to a certain degree
• Yager (1979) defined a measure of fuzziness as the degree to which a set and its complement are indistinguishable
• The gray level value that minimizes the fuzziness is the threshold value for the image
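Yager's measure is commonly written as 1 - D_p / n^(1/p), where D_p = (sum over elements of |mu(x) - (1 - mu(x))|^p)^(1/p) is the distance between a set and its complement and n is the number of elements. The sketch below applies it to a histogram with p = 2. The membership function is an assumption, since it is not specified here: a pixel's membership falls from 1 toward 0.5 as its gray level moves away from the mean of its class, scaled by the occupied gray-level range.

```python
import math


def fuzzy_threshold(hist, p=2):
    """Return the gray level that minimizes Yager's measure of fuzziness."""
    occupied = [g for g, h in enumerate(hist) if h > 0]
    g_min, g_max = occupied[0], occupied[-1]
    span = float(g_max - g_min)  # keeps memberships within [0.5, 1]
    n = float(sum(hist))
    best_t, best_f = None, float("inf")
    for t in range(g_min, g_max):
        n0, n1 = sum(hist[: t + 1]), sum(hist[t + 1:])
        if n0 <= 0 or n1 <= 0:
            continue
        m0 = sum(g * hist[g] for g in range(t + 1)) / n0
        m1 = sum(g * hist[g] for g in range(t + 1, len(hist))) / n1
        dist = 0.0
        for g, h in enumerate(hist):
            if h <= 0:
                continue
            mean = m0 if g <= t else m1
            mu = 1.0 / (1.0 + abs(g - mean) / span)  # assumed membership
            # |mu - (1 - mu)| is the gap between set and complement.
            dist += h * abs(2.0 * mu - 1.0) ** p
        fuzziness = 1.0 - dist ** (1.0 / p) / n ** (1.0 / p)
        if fuzziness < best_f:
            best_t, best_f = t, fuzziness
    return best_t


# Synthetic bimodal histogram with modes at gray levels 50 and 150.
demo_hist = [
    1000 * (math.exp(-(j - 50) ** 2 / 200.0) + math.exp(-(j - 150) ** 2 / 200.0))
    for j in range(256)
]
demo_t = fuzzy_threshold(demo_hist)
```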

Entropy
• Entropy is a measure commonly used in information theory and characterizes the impurity of a data set
• The entropies H0 and Hw of the two regions (object and background) for a given threshold t are calculated from pj, the histogram value at gray level j
• Thus, finding the threshold for the image becomes an optimization problem
• The gray level value that maximizes the sum of H0 and Hw is used as the threshold
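A sketch of the search is given below. It assumes the usual maximum-entropy formulation in which each region's histogram values are first normalized by the total probability of that region, so that each region has its own distribution; that normalization, and the assignment of the lower gray levels to one region and the upper levels to the other, are assumptions rather than details taken from the slide.

```python
import math


def region_entropy(probs):
    """Shannon entropy of one region after normalizing it to sum to 1."""
    mass = sum(probs)
    h = 0.0
    for pj in probs:
        q = pj / mass
        if q > 0.0:
            h -= q * math.log(q)
    return h


def entropy_threshold(hist):
    """Return the gray level t that maximizes the sum of both entropies."""
    total = float(sum(hist))
    p = [h / total for h in hist]
    best_t, best_sum = None, float("-inf")
    for t in range(len(p) - 1):
        low, high = p[: t + 1], p[t + 1:]
        if sum(low) <= 0.0 or sum(high) <= 0.0:
            continue
        entropy_sum = region_entropy(low) + region_entropy(high)
        if entropy_sum > best_sum:
            best_t, best_sum = t, entropy_sum
    return best_t


# Synthetic bimodal histogram with modes at gray levels 50 and 150.
demo_hist = [
    1000 * (math.exp(-(j - 50) ** 2 / 200.0) + math.exp(-(j - 150) ** 2 / 200.0))
    for j in range(256)
]
demo_t = entropy_threshold(demo_hist)
```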

Edge Based Detection
• The edge based method identifies the aurora region by detecting the transition zone between aurora and background
• This is an adaptive technique that uses context information to subdivide the image into subzones
• Using the domain knowledge that auroras are centered on the magnetic pole, radial slices at a given Magnetic Local Time (MLT), starting from the magnetic pole, are used to divide the image into subzones
• The rate of change of the intensity along the magnetic latitude is calculated for each slice
• The transition zones show a sharp change in intensity between the aurora and background
• Therefore, by detecting the locations of the maximum and minimum gradient, the poleward and equatorward boundaries of the auroral oval are identified for this MLT
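For a single radial slice, the boundary search can be sketched as follows. The profile is assumed to be sampled at equal steps of magnetic latitude and ordered from the magnetic pole outward, so intensity rises at the poleward edge and falls at the equatorward edge. The central-difference gradient and the sample values are assumptions made for illustration.

```python
def auroral_boundaries(profile, spacing=1.0):
    """Return (poleward_index, equatorward_index) for one radial slice."""
    if len(profile) < 3:
        raise ValueError("need at least three samples for a central difference")
    # Central-difference estimate of the intensity gradient at interior samples.
    gradient = [
        (profile[i + 1] - profile[i - 1]) / (2.0 * spacing)
        for i in range(1, len(profile) - 1)
    ]
    positions = range(len(gradient))
    # Add 1 because gradient[0] belongs to profile index 1.
    poleward = 1 + max(positions, key=gradient.__getitem__)
    equatorward = 1 + min(positions, key=gradient.__getitem__)
    return poleward, equatorward


# Made-up intensities along one MLT slice, pole first.
slice_profile = [5, 5, 6, 5, 40, 80, 82, 81, 79, 35, 6, 5, 5]
edges = auroral_boundaries(slice_profile)
```

In this example the steepest rise is at sample 4 and the steepest fall at sample 9, so the oval spans roughly samples 4 to 9 along the slice.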

Global Thresholding Result: Sept. 14, 1997 image, 08:41:53 UTC
[Figure panels: original image; image histogram; mixture modeling (64); fuzzy sets (132); entropy (122)]

Global Thresholding Result: Sept. 14, 1997 image, 10:31:40 UTC
[Figure panels: original image; image histogram; mixture modeling (43); fuzzy sets (138); entropy (130)]

Adaptive Thresholding Process
[Figure panels: MLT (18): mixture modeling (91), fuzzy sets (124), entropy (117); MLT (20): mixture modeling (52), fuzzy sets (123), entropy (162)]

Adaptive Thresholding Process: Edge Based Detection
[Figure panels: MLT (18); MLT (20)]

Adaptive Thresholding Results: Sept. 14, 1997 image, 09:05:48 UTC
[Figure panels: A. Original Image; B. Mixture Modeling; C. Entropy; D. Fuzzy Sets; E. Gradient]

Adaptive Thresholding Results: Sept. 14, 1997 image, 08:31:27 UTC
[Figure panels: A. Original Image; B. Mixture Modeling; C. Entropy; D. Fuzzy Sets; E. Gradient]

Analysis
• Global Thresholding Techniques:
    • As expected, global thresholding techniques do not perform well in aurora detection
    • Fuzzy sets and entropy based thresholding algorithms generally tend to overestimate the threshold and are unable to detect the entire aurora
    • The mixture modeling algorithm, on the other hand, underestimates the threshold and may label some portion of the background as aurora
• Adaptive Thresholding Techniques:
    • Adaptive thresholding techniques show better results than global thresholding for the given set of images
    • The fuzzy set, entropy, and edge based detection algorithms do not perform as well as mixture modeling

Future Work
• Investigate the behavior of these thresholding algorithms for oval boundary detection in the presence of dayglow and other types of noise
• Investigate the use of the Chow and Kaneko (1972) technique for adaptive thresholding
    • The current adaptive thresholding scheme divides an image into subzones based on MLT
    • Disadvantages:
        • It requires auxiliary MLT information
        • The number of pixels in each subzone is not uniform, and at times an improper threshold is obtained
    • The Chow and Kaneko method partitions the image into non-overlapping blocks of equal area, and the threshold for each block is computed independently
    • The thresholds for the blocks are then interpolated over the entire image
• The best thresholding algorithms will be incorporated into ADaM 4.0
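The block-and-interpolate idea can be sketched in a few lines. The code below follows the description above rather than the original 1972 formulation: it splits the image into non-overlapping square blocks, computes one threshold per block with a caller-supplied function (the block mean is used here purely as a stand-in for a real thresholding algorithm), and bilinearly interpolates the block thresholds, anchored at block centres, to get a threshold for every pixel. All names and values are illustrative.

```python
def block_thresholds(image, block, threshold_fn):
    """Compute one threshold per non-overlapping block x block tile."""
    rows, cols = len(image), len(image[0])
    grid = []
    for r0 in range(0, rows, block):
        grid_row = []
        for c0 in range(0, cols, block):
            pixels = [
                image[r][c]
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            grid_row.append(threshold_fn(pixels))
        grid.append(grid_row)
    return grid


def interpolate_thresholds(grid, rows, cols, block):
    """Bilinearly interpolate block thresholds to a per-pixel surface."""
    def locate(pos, n_blocks):
        # Position of the pixel centre in units of block centres, clamped
        # so that pixels outside the outermost centres reuse the edge value.
        f = min(max((pos + 0.5) / block - 0.5, 0.0), n_blocks - 1.0)
        i0 = int(f)
        return i0, min(i0 + 1, n_blocks - 1), f - i0

    surface = []
    for r in range(rows):
        r0, r1, wr = locate(r, len(grid))
        row = []
        for c in range(cols):
            c0, c1, wc = locate(c, len(grid[0]))
            top = grid[r0][c0] * (1 - wc) + grid[r0][c1] * wc
            bottom = grid[r1][c0] * (1 - wc) + grid[r1][c1] * wc
            row.append(top * (1 - wr) + bottom * wr)
        surface.append(row)
    return surface


def mean(values):
    return sum(values) / len(values)


# 4 x 4 toy image: dark left half, brighter right half.
toy = [[10, 10, 30, 30] for _ in range(4)]
grid = block_thresholds(toy, 2, mean)
surface = interpolate_thresholds(grid, 4, 4, 2)
```

The interpolated surface changes smoothly from the left-block threshold to the right-block threshold instead of jumping at the block boundary, and no MLT information is needed to define the blocks.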
