
Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions. Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr. Worcester Polytechnic Institute, Worcester, MA, USA. ACM CIKM 2006, Arlington, VA, USA.



Presentation Transcript


  1. Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions • Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr. • Worcester Polytechnic Institute, Worcester, MA, USA • ACM CIKM 2006, Arlington, VA, USA

  2. Introduction • Clustering often groups data with mixed attributes • Numeric • Categorical • Ordinal • Examples: PDAs, Web Pages, Scientific Experiments • Cluster Representatives: depictions of each cluster • Randomly selected representatives not enough in • Capturing cluster information • Providing ease of interpretation • Incorporating different user interests • Need for Designing Cluster Representatives

  3. Motivating Example • Scientific experiments clustered based on results • Clustering criteria learned based on input conditions • Representative of conditions used to characterize a cluster • Problem with randomly selected representative • Distinct combinations of conditions could lead to a given cluster • [Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)]

  4. Goals • Need to Design Semantics-Preserving Cluster Representatives that • Capture relevant information in cluster • Avoid visual clutter and are easy to interpret • Take into account various user interests in targeted applications

  5. Proposed Approach: DesCond • Given: clusters of experiments, conditions leading to clusters • Define notion of distance for conditions incorporating domain semantics • Build candidate representatives with increasing levels of detail • Compare candidates using MDL-based encoding capturing user interests • Return candidate with lowest encoding as best for each cluster

  6. Main Tasks in DesCond • Defining a notion of distance for the input conditions • Obtaining suitable candidate representatives for each cluster • Proposing an encoding to compare candidates and find a winner

  7. Notion of Distance • Example: Heat Treating of Materials • Quenchant: Cooling Medium • Part: The material being treated • Probe: Characterizes shape, dimension • Oxide: Thickness of oxide on surface • Agitation: Extent of agitation of cooling medium • Quenchant Temperature: Starting temperature of cooling medium • Define domain-specific distance metric for conditions incorporating • Data types of attributes • Distance between attribute values • Weights of the attributes

  8. Data Types of the Attributes • Categorical • Characters or strings with descriptive information • E.g., Quenchant Name, Part Material, Probe Type • Numerical • Integers or real numbers • E.g., Quenchant Temperature • Ordinal • Where order matters • E.g., Oxide Layer, Agitation Level

  9. Distance Between the Attribute Values • Categorical • Different = 1 • Same = 0 • Numerical • Absolute difference between • Values or • Mean values of ranges • Ordinal • Map values to integers • E.g., Oxide Layer: none = 0, thin = 1, thick = 2 • Absolute difference between mapped values
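The three per-attribute rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the use of a `(low, high)` tuple for a numeric range, and the `OXIDE_LAYER` dictionary name are assumptions made for the example. Only the 0/1 rule, the absolute differences, and the none/thin/thick mapping come from the slide.

```python
# Hypothetical helper names; only the rules themselves come from the slide.

# Ordinal mapping given on the slide for Oxide Layer.
OXIDE_LAYER = {"none": 0, "thin": 1, "thick": 2}


def categorical_distance(a, b):
    """Return 0 if the two categorical values are the same, 1 otherwise."""
    return 0 if a == b else 1


def _midpoint(value):
    """Treat a (low, high) tuple as a range and return its mean value."""
    if isinstance(value, tuple):
        return sum(value) / 2
    return float(value)


def numerical_distance(a, b):
    """Absolute difference between values, or between mean values of ranges."""
    return abs(_midpoint(a) - _midpoint(b))


def ordinal_distance(a, b, scale):
    """Map ordinal values to integers, then take the absolute difference."""
    return abs(scale[a] - scale[b])
```

For instance, `numerical_distance(30, (40, 60))` compares 30 against the range mean of 50, and `ordinal_distance("none", "thick", OXIDE_LAYER)` compares the mapped values 0 and 2.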

  10. Weights of the Attributes • Attribute has higher weight if it • Is at higher level in tree • Belongs to a shorter path • Has more experiments in its corresponding cluster • Decision Tree Weight Heuristic: $W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} \, G_j$
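The transcript gives the weight heuristic but does not define its symbols, so the sketch below rests on a loudly stated assumption: P is read as the number of decision tree paths, H_{i,j} as the height of attribute i on path j (0 if the attribute is absent from that path), H_j as the height of path j, and G_j as a value reflecting how many experiments fall in the cluster at the end of path j. This reading is only inferred from the three bullets on the slide, and the function name and input layout are hypothetical.

```python
def attribute_weight(attribute, paths):
    """Evaluate W_i = (1/P) * sum_j (H_ij / H_j) * G_j for one attribute.

    ASSUMPTION: each entry of `paths` describes one decision tree path as
    {"heights": {attribute: H_ij}, "path_height": H_j, "g": G_j}.
    The meaning of H and G is an interpretation, not stated in the slides.
    """
    total = 0.0
    for path in paths:
        h_ij = path["heights"].get(attribute, 0)  # 0 if attribute not on path
        total += (h_ij / path["path_height"]) * path["g"]
    return total / len(paths)


# Hypothetical two-path tree, used only to exercise the formula.
example_paths = [
    {"heights": {"quenchant": 3, "part": 2}, "path_height": 3, "g": 0.6},
    {"heights": {"quenchant": 2}, "path_height": 2, "g": 0.4},
]
quenchant_weight = attribute_weight("quenchant", example_paths)
part_weight = attribute_weight("part", example_paths)
```

Under this reading, an attribute that sits at the root of every path receives a larger weight than one that appears lower down on only some paths, which matches the qualitative bullets on the slide.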

  11. Candidate Representatives in Levels of Detail • Level 1: Single Conditions Representative (SCR) • One set of conditions preserving cluster information • Level 2: Multiple Conditions Representative (MCR) • Summary of information in cluster • Level 3: All Conditions Representative (ACR) • All information in cluster abstracted suitably

  12. Single Conditions Representative • Return set of conditions closest to all others in cluster • Notion of distance: Domain-specific distance metric for conditions • [Figure panels: Input conditions in Cluster A; SCR for Cluster A]
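Picking the set of conditions "closest to all others" can be read as choosing the cluster medoid. The sketch below assumes that "closest" means the smallest summed distance to every other set of conditions, which the slide does not spell out. The `mismatch_distance` function is a hypothetical stand-in for the domain-specific metric, and the sample cluster is invented for illustration.

```python
def single_conditions_representative(cluster, distance):
    """Return the member of `cluster` with the smallest total distance
    to all members (a medoid). ASSUMPTION: 'closest to all others' is
    interpreted as minimum summed distance."""
    return min(cluster, key=lambda c: sum(distance(c, other) for other in cluster))


def mismatch_distance(a, b):
    """Hypothetical stand-in metric: count attributes whose values differ."""
    return sum(1 for key in a if a[key] != b[key])


# Invented sets of input conditions for one cluster.
cluster_a = [
    {"quenchant": "water", "agitation": "high"},
    {"quenchant": "water", "agitation": "low"},
    {"quenchant": "oil", "agitation": "low"},
]
scr = single_conditions_representative(cluster_a, mismatch_distance)
```

Here the second set of conditions is selected, since it differs from each of the other two in only one attribute.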

  13. Multiple Conditions Representative • Build sub-clusters of conditions using domain knowledge • Return nearest sub-cluster representatives • Sort them • [Figure panels: Cluster A; Sub-clusters within Cluster A; MCR for Cluster A]

  14. All Conditions Representative • Return all sets of conditions • Sort them in ascending order • [Figure panels: Cluster A; ACR for Cluster A]

  15. DesCond Encoding to Compare Candidates • Analogous to Minimum Description Length (MDL) • Theory: representative; Examples: sets of conditions in cluster • Complexity of representative (ease of interpretation): $\text{Complexity} = \log_2(AV)$ • $A$ = number of attributes, $V$ = number of values for each attribute • Distance of all items from representative (information loss): $\text{Distance} = \log_2\left(\frac{1}{s}\sum_{i=1}^{s} D(R, S_i)\right)$ • $D$: domain-specific distance metric for conditions • $s$: total number of items (sets of conditions) in cluster • $S_i$: each individual item • $R$: representative set of conditions • DesCond Encoding: $\text{Effectiveness} = UBC \cdot \text{Complexity} + UBD \cdot \text{Distance}$ • $UBC$, $UBD$: user bias % weights for complexity and distance
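A worked version of the encoding may help make the trade-off concrete. The sketch below reads "log2 AV" as the base-2 logarithm of the product A·V and takes the logarithm of the average distance, both of which are interpretations of the flattened slide text. Expressing the user bias weights as fractions (0.5 for 50%) and rejecting a zero average distance, where the logarithm is undefined, are further assumptions, and the function names are hypothetical.

```python
import math


def complexity_term(num_attributes, num_values):
    """Complexity = log2(A * V). ASSUMPTION: 'AV' is read as a product."""
    return math.log2(num_attributes * num_values)


def distance_term(distances):
    """Distance = log2 of the mean of D(R, S_i) over the s items in a cluster.

    ASSUMPTION: a zero mean is rejected because log2(0) is undefined;
    the slides do not say how that case is handled.
    """
    mean_distance = sum(distances) / len(distances)
    if mean_distance <= 0:
        raise ValueError("mean distance must be positive to take log2")
    return math.log2(mean_distance)


def descond_encoding(ubc, ubd, num_attributes, num_values, distances):
    """Effectiveness = UBC * Complexity + UBD * Distance (lower is better).

    ASSUMPTION: ubc and ubd are fractions, e.g. 0.5 for a 50% bias weight.
    """
    return (ubc * complexity_term(num_attributes, num_values)
            + ubd * distance_term(distances))


# Invented numbers: 4 attributes, 2 values each, and two items at distance 2.0.
example_encoding = descond_encoding(0.5, 0.5, 4, 2, [2.0, 2.0])
```

With these invented inputs the complexity term is log2(8) = 3 and the distance term is log2(2) = 1, so equal bias weights of 0.5 give an encoding of 2.0. Raising `ubc` penalizes larger representatives more heavily, which is consistent with the later observation that SCR wins as the weight for complexity increases.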

  16. Evaluation of DesCond with Domain Expert Interviews • Evaluated with real data in Heat Treating • User Bias weights in Encoding reflect interests in targeted applications • Different data sets and number of clusters • For each data set score calculated as follows • Consider winning candidate for each cluster • Based on DesCond Encoding • Score: Number of clusters in which candidate is winner • Example: Dataset of size 25 with 5 clusters • If SCR wins for 2 clusters, ACR for 3 • Score: SCR=2, ACR=3
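The scoring rule on this slide amounts to counting, per candidate type, the clusters in which that candidate has the lowest encoding. The sketch below reproduces the slide's example outcome (SCR wins 2 clusters, ACR wins 3). The encoding values and cluster names are invented solely to produce that outcome, and `score_candidates` is a hypothetical name.

```python
from collections import Counter


def score_candidates(encodings_by_cluster):
    """Count how many clusters each candidate wins.

    The winner of a cluster is the candidate with the lowest encoding.
    """
    winners = (min(encodings, key=encodings.get)
               for encodings in encodings_by_cluster.values())
    return Counter(winners)


# Invented encoding values for 5 clusters, chosen so that SCR wins 2 and ACR wins 3.
example_encodings = {
    "cluster_1": {"SCR": 1.2, "MCR": 1.9, "ACR": 2.5},
    "cluster_2": {"SCR": 1.0, "MCR": 1.4, "ACR": 1.8},
    "cluster_3": {"SCR": 2.6, "MCR": 2.1, "ACR": 1.5},
    "cluster_4": {"SCR": 3.0, "MCR": 2.2, "ACR": 2.0},
    "cluster_5": {"SCR": 2.4, "MCR": 2.3, "ACR": 1.1},
}
scores = score_candidates(example_encodings)
```

Because `Counter` returns 0 for missing keys, a candidate that never wins (MCR in this invented data) can still be queried directly.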

  17. Evaluation Results • Details • Data Set Size = 400, Number of Clusters = 20 • Experts provide UBC / UBD values in Encoding • Observations • Overall winner is MCR • As weight for complexity increases, SCR wins • Designed better than Random

  18. Evaluation with Formal User Surveys • DesCond used to design representatives for a trademarked estimation tool [ref CHTE: Center for Heat Treating Excellence] • Formal user surveys conducted in different applications of the system • Evaluation Process • Compare estimation with real data in test set • If they match, the estimation is accurate

  19. Evaluation Results • Different winners in different applications • Results of surveys tally with those of Encoding-based evaluation • Estimation Accuracy: 90 to 94% (better than earlier versions of tool) • [Figure panels: Parameter Selection Applications; Simulation Tool Applications; Intelligent Tutoring Applications; Decision Support Applications]

  20. Related Work • Image Rating: [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives • PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives

  21. Conclusions • Contributions of this work • Designing cluster representatives for scientific input conditions in levels of detail • Defining a domain-specific distance metric for conditions • Proposing an encoding to compare representatives • Conducting evaluation using encoding with real data from Heat Treating • Assessing use of representatives in applications of a CHTE trademarked estimation tool • Results • Designed Representatives better than random • Different designed representatives suit different applications • DesCond enhances accuracy of estimation tool
