Knowledge Mining and Soil Mapping using Maximum Likelihood Classifier with Gaussian Mixture Models

ECE539 final project • Instructor: Yu Hen Hu • Fall 2005 • Jian Liu • 12/13/2005

Overview: This study mines soil-landscape knowledge from existing soil survey maps and uses the mined knowledge for soil mapping.

Soil-landscape models • Soil is a product of the interaction of its surrounding environmental factors • “Soil-landscape model” (Hudson, 1992) • Soil can be predicted given the environment

Environmental variables • Environmental factors affecting soil formation: • Bedrock geology • Elevation, from a digital elevation model (DEM) • Slope gradient: 1st derivative along the steepest slope • Profile curvature: 2nd derivative along the steepest slope • Planform curvature: 2nd derivative perpendicular to the steepest slope, i.e. along the contour line
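
As a rough illustration of the first-derivative definition of slope gradient, the sketch below estimates it on a gridded DEM with central differences. The function name `slope_gradient`, the `cell_size` parameter, and the toy grid are assumptions made for this example; the slides do not say how the terrain derivatives were actually computed.

```python
import math

def slope_gradient(dem, row, col, cell_size=1.0):
    """Slope gradient (rise over run) at an interior DEM cell.

    Central differences give the partial derivatives in x and y; the
    magnitude of that gradient vector is the slope along the steepest
    direction.
    """
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2.0 * cell_size)
    dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2.0 * cell_size)
    return math.hypot(dz_dx, dz_dy)

# Toy DEM (hypothetical): a tilted plane z = 3*x + 4*y,
# whose true gradient magnitude is 5 everywhere.
dem = [[3.0 * x + 4.0 * y for x in range(4)] for y in range(4)]
print(slope_gradient(dem, 1, 1))  # 5.0
```

The curvature variables would need second differences along and across the steepest-slope direction, which this sketch does not attempt.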

Previous Approaches & Problems • Fuzzy system (Zhu 2001): elicits knowledge from a soil scientist and represents it with arbitrary curves; assumes the environmental variables are independent of each other • ANN (Zhu 2000; Behrens 2005; Scull 2005): black-box knowledge representation; the high-dimensional matrix is hard to comprehend • Decision trees (Bui 1999; Qi et al. 2003): the extracted knowledge is crisp (typical case only), with no information about gradation

Proposal – Knowledge Representation: A Gaussian mixture model (GMM) representation is more suitable because: • A probabilistic representation captures the physical gradation of the phenomenon well • The multivariate Gaussian distribution takes the interactions between environmental variables into account • A mixture model has great potential to capture the real distribution, since physically a soil type may have multiple instances
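
To make the points about variable interaction and multiple instances concrete, here is a minimal sketch of a two-variable Gaussian mixture density. Every number in it (weights, means, covariances, query points) is invented for illustration, and the helper names `gauss2d` and `gmm_density` are hypothetical.

```python
import math

def gauss2d(x, mean, cov):
    """Density of a 2-D Gaussian with full covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean) for a 2x2 matrix.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def gmm_density(x, components):
    """Mixture density: weighted sum of the component Gaussians."""
    return sum(w * gauss2d(x, m, c) for w, m, c in components)

# Hypothetical soil type with two "instances" (mixture components) in a
# standardized two-feature space; all values are made up.
soil = [
    (0.6, (0.0, 0.0), ((1.0, 0.8), (0.8, 1.0))),   # strongly correlated features
    (0.4, (3.0, -1.0), ((0.5, 0.0), (0.0, 0.5))),
]
# Same mixture with off-diagonal terms dropped (independence assumed).
soil_indep = [(w, m, ((c[0][0], 0.0), (0.0, c[1][1]))) for w, m, c in soil]

along = (1.0, 1.0)      # consistent with the positive correlation
against = (1.0, -1.0)   # contradicts it

ratio_full = gmm_density(along, soil) / gmm_density(against, soil)
ratio_indep = gmm_density(along, soil_indep) / gmm_density(against, soil_indep)
print(ratio_full, ratio_indep)
```

With the full covariance the two query points receive very different densities, while the independence version scores them almost equally, which is the kind of interaction effect the multivariate Gaussian is meant to capture.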

Proposal – Maximum Likelihood Classifier • Maximum likelihood example: P(A|Class1) = 0.8 and P(A|Class2) = 0.5, so A is classified into Class1 based on “maximum likelihood” • Naturally evaluates the composite effect that the environmental variables have on the probability of soil formation
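
The decision rule in this example can be written in a few lines. The function name `classify_max_likelihood` is an assumption for illustration; the likelihood values are the ones given on the slide.

```python
def classify_max_likelihood(likelihoods):
    """Return the class whose likelihood P(x | class) is largest."""
    return max(likelihoods, key=likelihoods.get)

# Values from the slide: P(A|Class1) = 0.8, P(A|Class2) = 0.5.
print(classify_max_likelihood({"Class1": 0.8, "Class2": 0.5}))  # Class1
```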

Algorithm Training procedure: Testing procedure:
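
The slide text does not spell out the two procedures. A minimal sketch of one plausible reading follows, assuming that training fits one full-covariance GMM per soil type with plain EM and that testing assigns each sample to the soil type with the highest mixture likelihood. The function names, the random initialization, the regularization term `reg`, and the synthetic data are all assumptions, not the project's actual implementation.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Log density of a multivariate Gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def component_logs(X, gmm):
    """Rows: log(weight_j) + log N(x | mean_j, cov_j) for each component j."""
    weights, means, covs = gmm
    return np.stack([np.log(w) + log_gauss(X, m, c)
                     for w, m, c in zip(weights, means, covs)])

def gmm_log_likelihood(X, gmm):
    """Log mixture density at each row of X (log-sum-exp for stability)."""
    logs = component_logs(X, gmm)
    top = logs.max(axis=0)
    return top + np.log(np.exp(logs - top).sum(axis=0))

def fit_gmm(X, k, n_iter=50, reg=1e-6, seed=0):
    """Fit a k-component full-covariance GMM to X (n x d, d >= 2) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X, rowvar=False) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        logs = component_logs(X, (weights, means, covs))
        resp = np.exp(logs - logs.max(axis=0))
        resp /= resp.sum(axis=0)
        # M-step: re-estimate weights, means, and covariances.
        nk = resp.sum(axis=1) + 1e-12
        weights = nk / n
        means = (resp @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[j, :, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs

def train(X, y, k):
    """Training: one GMM per class, fitted on that class's samples only."""
    return {label: fit_gmm(X[y == label], k) for label in np.unique(y)}

def classify(X, models):
    """Testing: assign each sample to the class with the highest likelihood."""
    labels = list(models)
    scores = np.stack([gmm_log_likelihood(X, models[c]) for c in labels])
    return np.array(labels)[scores.argmax(axis=0)]

# Synthetic two-class data standing in for (feature vector, soil type) pairs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([4.0, 4.0], 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
models = train(X, y, k=2)
accuracy = (classify(X, models) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Working in log space avoids underflow when several features are combined, and the small `reg` term keeps each covariance matrix invertible if a component collapses onto a few samples.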

Case Study [Figure: map layers for the training set and the testing set: geology, elevation, slope gradient, profile curvature, planform curvature, soil map]

Evaluation of the GMM representation: The GMM representations capture the gradation of soil on the landscape well, in agreement with expert knowledge, e.g. Council at the footslope and Elbaville at the backslope.

Training accuracy & testing accuracy • Overall, 80% classification accuracy against the testing data • Increasing the number of mixtures leads to higher classification accuracy, at the expense of exponentially increasing storage and computational load

[Figure: Classification Accuracy vs. # of Mixtures; axis: classification accuracy (%)]

Mapping accuracy based on field data • 64 of 83 field sample points are correctly classified (77%), higher than traditional manual soil survey (usually 60%) [Figure: classification result using 8 mixtures; the dark blue areas are not mapped]

More comments • Standardization of the feature dimensions is very effective: it improves mapping accuracy from 55% to 80% • Preprocessing such as the data cleaning required by decision trees is not critical for the maximum likelihood (ML) classifier, which is less sensitive to training errors as long as they are not numerous.
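
The slides do not say which form of standardization was used. One common choice is z-scoring each feature dimension, sketched below under that assumption; the function name `standardize` and the sample values are invented for illustration.

```python
import statistics

def standardize(rows):
    """Z-score each feature column: subtract its mean, divide by its std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Invented samples: elevation in metres and slope gradient as a ratio,
# two features on very different numeric scales.
samples = [[250.0, 0.02], [300.0, 0.10], [350.0, 0.30], [400.0, 0.18]]
scaled = standardize(samples)
for row in scaled:
    print([round(v, 2) for v in row])
```

After this step every feature has zero mean and unit standard deviation, so no single variable dominates merely because of its units.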

Conclusion • GMM is suitable for representing soil-landscape knowledge • The ML classifier with GMMs is promising for soil knowledge mining and soil mapping

Future improvement? • Reduce the storage and computational load so that a larger number of mixtures can be used to improve classification accuracy • Replace the full covariance matrix with a diagonal matrix (after de-correlating the features)?
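
A quick way to see what the diagonal-covariance idea would save is to count the free parameters of one soil type's GMM. The sketch below assumes d = 5 features (the five environmental variables listed earlier, treating geology as a single numeric feature) and k = 8 mixtures; both are assumptions for illustration, and `gmm_parameter_count` is a hypothetical helper.

```python
def gmm_parameter_count(k, d, diagonal=False):
    """Number of free parameters in a k-component GMM over d features."""
    cov_params = d if diagonal else d * (d + 1) // 2
    # Per component: d means plus covariance terms; k - 1 free mixture weights.
    return k * (d + cov_params) + (k - 1)

print(gmm_parameter_count(8, 5))                 # 167
print(gmm_parameter_count(8, 5, diagonal=True))  # 87
```

The saving grows with the number of features, because a full covariance matrix needs d(d+1)/2 terms per component while a diagonal one needs only d.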
