Horatiu Dumitru, Adam Czauderna, Jane Cleland-Huang, DePaul University

Using Machine Learning to Automate Fault Detection in Flight Discrepancy and Software Problem Anomaly Reports
Horatiu Dumitru, Adam Czauderna, Jane Cleland-Huang, DePaul University
This project was jointly funded by NSF REU Supplement CCF: 0936417 and a grant from Lockheed Martin via the Software Engineering Research Consortium.
Dumitru.horatiu@gmail.com, jhuang@cs.depaul.edu • DePaul University, Systems and Requirements Engineering Center

Flight Maintenance
Hardware and software errors are reported by pilots and maintenance crew. They are entered into flight discrepancy reports and software problem anomaly reports. Analysts use basic search features to look through these reports for recurring problems.
Diagram labels: Flight incidents, test failures, etc.; Flight discrepancy reports; Software problem anomaly reports

Problem Statement
• Thousands of problem and anomaly reports are generated for each aircraft.
• Searching for and monitoring recurring faults is time-consuming and relies upon the intuition of the analysts.
• Many critical fault trends go undetected, leading to potential failures and lost opportunities to mitigate problems.

A More Automated Approach
Instead of relying upon analysts to search for recurring problems, use machine learning techniques to discover and monitor cross-cutting faults. Data mining tools detect occurrences of known faults and identify new fault trends. An analyst reviews the candidate faults.
Diagram labels: Flight incidents, test failures, etc.; Flight discrepancy reports; Software problem anomaly reports; Corpus of known faults

Sub-problems
• We identified two sub-problems:
• Identifying and monitoring known problem trends
  • Detecting recurrence of the fault
  • Determining when a fault has been successfully mitigated (and no longer recurs)
• Identifying previously unknown problems
  • These problems may never have been conceived of.
  • Once identified, a problem transitions to monitored status.

A Top Down Approach
• A fault pattern exhibits itself as a cross-cutting concern that cuts across problem reports and affects various hardware and/or software devices.
• A primary concern of a software system is a dominant aspect, such as a specific hardware or software feature.
  • Example: a feature to display medical records
• A cross-cutting concern is an aspect that is scattered across a number of more dominant concerns.
  • Example: a login feature

A Typical cross-cutting topic
Cluster name: 1123
- The login and password are sent to the server.
- The details of the health units and specialties are retrieved.
Cluster name: 1092
- The system shows the specific screen for each type of complaint.
- The system shows the login screen
- Error message should be showed.
- Show a message informing the employee of the missing/incorrect data.
Cluster name: 1126
- The result of the login attempt is presented to the employee on their local display.
- The query results are formatted and presented to the user on their local display.
Cluster name: 1124
- The system retrieves the employee details using the login as a unique identifier.
- The unique identifier is used to retrieve the complaint entry.
- The unique identifier is used to retrieve the disease type to query.
- The unique identifier is used to retrieve the list of health units which are associated with the selected specialty.
- The unique identifier is used by the system to search the repository for the selected health unit.

Identify and remove dominant terms
• The system retrieves the employee details using the login as a unique identifier.
• The unique identifier is used to retrieve the complaint entry.
• The unique identifier is used to retrieve the disease type to query.
• The unique identifier is used to retrieve the list of health units which are associated with the selected specialty.
• The unique identifier is used by the system to search the repository for the selected health unit.
Legend: dominant terms, stop words, recessive terms

Recluster around recessive terms
• The system retrieves the employee details using the login as a unique identifier.
• The system shows the login screen.
• The login and password are sent to the server.
• The employee provides the login and password.
• The result of the login attempt is presented to the employee on their local display.
Step 2: Dominant terms are removed and requirements are re-clustered around weaker terms.

An overview of our solution
Step 1: Preprocess data to remove stop words and stem words to root forms.
Step 2: Cluster the problem reports using an unsupervised clustering method.
Step 3: Compute cohesion and size metrics and use them to select the best cluster.
Step 4: Identify the key terms for the selected cluster.
Step 5: Create a problem topic from identified terms and add to topic list.
Step 6: Remove the identified terms from ALL problem reports.
Step 7: Repeat steps 2-6 until no more clusters are found.
Step 8: Present topic list to analyst for review.
Diagram labels: Nozzle, repair, cracked; Topic list: nozzle cracked, repair nozzle

Step 1: Preprocessing
• Parse each feature request and stem each word to its root form, so that similar words can be matched.
• Remove common words, known as ‘stop words’, as these are not useful in computing similarity between documents.
• Remove any words that appear only once, as these are not helpful in the clustering process.
• Use a term frequency-inverse document frequency (tf-idf) model to represent each feature request a as a weighted vector of terms (t1, t2, ..., tn).
Step 1: Preprocess data to remove stop words and stem words to root forms.
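The preprocessing step can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word list is a tiny hypothetical one, `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and the tf-idf weighting (raw term count times log(N/df), then length-normalized) is one common variant, not necessarily the one used in this project.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real list is much longer.
STOP_WORDS = {"a", "an", "and", "be", "for", "i", "in", "is", "me", "of", "the", "to"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ions", "ion", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one unit-length {term: weight} dict per document."""
    token_lists = [tokenize(d) for d in documents]
    # Remove terms that occur only once in the whole corpus.
    corpus_counts = Counter(t for tokens in token_lists for t in tokens)
    token_lists = [[t for t in tokens if corpus_counts[t] > 1] for tokens in token_lists]
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        raw = {t: w for t, w in raw.items() if w > 0}
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / length for t, w in raw.items()} if length else {})
    return vectors
```

For example, `tfidf_vectors(["Provide weather forecasts", "Display the weather forecast", "Reserve a hotel", "Hotel reservations"])` keeps only `weather`, `forecast` and `hotel`, because every other stem occurs just once.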

Step 2: Clustering
Our approach uses an underlying clustering method known as SPK-Means (two-stage spherical K-means clustering).
Input: unlabelled instances, number of clusters K, initial centroids I, convergence condition
Output: crisp K-partition
Steps:
• Initialization: initialize the centroids using I.
• Batch instance assignment and centroid update, until convergence:
  • assign each instance to the nearest cluster, i.e. the one whose centroid has the largest cosine similarity to it
  • update each centroid
• Incremental optimization of the objective function, until convergence:
  • randomly select an instance
  • move it to the cluster that maximizes the gain in the objective function
  • update each centroid
Step 2: Cluster the problem reports using an unsupervised clustering method.
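A minimal sketch of two-stage spherical K-means follows, written from the textbook description of the algorithm rather than from the project's code. The assumptions are: instances are dense lists of floats, the objective is the sum of cosine similarities between instances and their cluster centroids, stage 1 stops when assignments no longer change, and stage 2 simply tries a fixed number of random moves (`n_moves`, a hypothetical parameter) instead of testing a convergence condition.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_sum(vectors, dim):
    total = [0.0] * dim
    for v in vectors:
        for j, x in enumerate(v):
            total[j] += x
    return total

def spk_means(instances, initial_centroids, max_iter=100, n_moves=200, seed=0):
    """Return one cluster label per instance."""
    X = [unit(x) for x in instances]
    dim, k = len(X[0]), len(initial_centroids)
    centroids = [unit(c) for c in initial_centroids]
    labels = [-1] * len(X)

    # Stage 1: batch assignment and centroid update until labels stop changing.
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda i: dot(x, centroids[i])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [x for x, lab in zip(X, labels) if lab == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = unit(vec_sum(members, dim))

    # Stage 2: incremental moves. For unit vectors the objective equals the
    # sum over clusters of the norm of the summed member vectors.
    sums = [vec_sum([x for x, lab in zip(X, labels) if lab == i], dim) for i in range(k)]
    rng = random.Random(seed)
    for _ in range(n_moves):
        idx = rng.randrange(len(X))
        x, a = X[idx], labels[idx]
        minus = [s - xi for s, xi in zip(sums[a], x)]
        loss = norm(minus) - norm(sums[a])
        best, best_gain = a, 0.0
        for b in range(k):
            if b == a:
                continue
            plus = [s + xi for s, xi in zip(sums[b], x)]
            gain = loss + norm(plus) - norm(sums[b])
            if gain > best_gain + 1e-12:
                best, best_gain = b, gain
        if best != a:
            sums[a] = minus
            sums[best] = [s + xi for s, xi in zip(sums[best], x)]
            labels[idx] = best
    return labels
```

Keeping a running vector sum per cluster makes each candidate move cheap to evaluate, since the centroid is just the normalized sum.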

Step 2: Clustering (continued)
A consensus approach is taken in which n clusterings are generated. For each clustering:
• 70% of the fault reports are randomly selected and clustered using SPK-Means.
• The remaining 30% of the fault reports are classified into the generated clusters.
A co-association matrix is then generated that records the number of times each pair of faults occurs in the same cluster. Finally, the faults are re-clustered using a simple hierarchical clustering scheme in which the values in the co-association matrix represent the proximities between faults.
Step 2: Cluster the problem reports using an unsupervised clustering method.
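The consensus step can be illustrated as follows. The sketch assumes each of the n clusterings has already been extended to every fault (sampled and classified alike) and is given as a list of labels. The slides do not say which hierarchical scheme was used, so single linkage cut at a hypothetical `min_count` threshold is used here purely as an example.

```python
def co_association(labelings):
    """counts[i][j] = number of clusterings in which items i and j share a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts

def single_link_clusters(counts, min_count):
    """Single-linkage clustering cut at min_count: items end up together when a
    chain of pairs, each co-occurring at least min_count times, connects them."""
    n = len(counts)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if counts[i][j] >= min_count:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Cluster labels need not agree across runs: only co-membership within a run is counted, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` contribute identically.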

Step 3: Find the best cluster
Our goal is to find the single most cohesive cluster in each iteration of the process.
1. The cosine distance of each problem statement to the centroid of its cluster is computed.
2. For each cluster, all distances are summed.
3. The average distance is computed for each cluster.
4. The two values from (2) and (3) are normalized and used to determine the best cluster.
Step 3: Compute cohesion and size metrics and use them to select the best cluster.
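One possible reading of this scoring step is sketched below. Two assumptions are made that the slides do not confirm: the cosine measure is treated as a similarity (higher means closer to the centroid), and the two normalized values are combined by dividing each by its maximum over all clusters and adding them. The summed value rewards size and the average rewards cohesion.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def best_cluster(clusters):
    """clusters maps a cluster name to a non-empty list of term vectors.
    Returns the name with the highest combined score (assumed combination)."""
    sums, means = {}, {}
    for name, vectors in clusters.items():
        c = centroid(vectors)
        sims = [cosine(v, c) for v in vectors]
        sums[name] = sum(sims)                 # grows with cluster size
        means[name] = sums[name] / len(sims)   # cohesion
    max_sum, max_mean = max(sums.values()), max(means.values())
    scores = {n: sums[n] / max_sum + means[n] / max_mean for n in clusters}
    return max(scores, key=scores.get)
```

With this scoring, a large tight cluster beats both a loose cluster and a perfectly cohesive singleton, which matches the stated intent of balancing cohesion against size.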

Step 4: Identify key terms
• Determine the cluster’s dominant terms:
  • Find the terms with the highest weight according to the centroids.
  • Only pick terms above a certain threshold.
  • Add the dominant terms to a list.
Please note: Due to export control regulations and non-disclosure agreements, we are unable to illustrate our approach with the Lockheed Martin data. Instead, we illustrate it with the user requirements for an airport kiosk.
Cluster: (weather, forecast, condit)
- To be informed of current weather conditions
- Be able to check the weather
- Provide a weather forecast for the length of the traveler's stay.
- Need a service to show me the current weather and forecast.
- Local weather information.
- Provide local weather conditions and forecasts.
- View current weather conditions
- Provide weather information for various destinations
- Be able to know the information about Weather.
- Provide weather forecasts
- Display the weather forecast.
- Provide local weather information for the week.
Step 4: Identify the key terms for the selected cluster.
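Selecting dominant terms from a centroid can be sketched as below. The sketch assumes each cluster member is a `{term: weight}` dict (for instance a tf-idf vector), that the centroid is the plain average of those weights, and that `threshold` is a value chosen by the analyst; the weights in the usage note are made up for illustration.

```python
from collections import defaultdict

def centroid_weights(member_vectors):
    """Average the term weights over a cluster's member vectors."""
    totals = defaultdict(float)
    for vec in member_vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / len(member_vectors) for term, total in totals.items()}

def dominant_terms(member_vectors, threshold):
    """Terms whose centroid weight exceeds the threshold, heaviest first."""
    weights = centroid_weights(member_vectors)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, weight in ranked if weight > threshold]
```

With hypothetical members `{"weather": 0.8, "forecast": 0.6}`, `{"weather": 0.9, "condit": 0.4}` and `{"weather": 0.7, "forecast": 0.5, "local": 0.1}` and a threshold of 0.1, the call returns `weather`, `forecast` and `condit`, while `local` falls below the cut.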

Step 5: Add problem topic to list
• Take the dominant terms identified from the ‘best cluster’ in the last iteration and add them as a group to the topic list.
• Four sample topics from the airport kiosk:
  • weather, forecast, condit
  • reserv, hotel
  • destin, map, direct
  • flight, connect, inform
Step 5: Create a problem topic from identified terms and add to topic list.

Step 6: Remove topics
• Remove the dominant terms of the new topic from all of the problem reports.
• Why? Because we would like to form additional clusters around the remaining concepts.
Cluster: (weather, forecast, condit)
- To be informed of current weather conditions
- Be able to check the weather
- Provide a weather forecast for the length of the traveler's stay.
- Need a service to show me the current weather and forecast.
- Local weather information.
- Provide local weather conditions and forecasts.
- View current weather conditions
- Provide weather information for various destinations
- Be able to know the information about Weather.
- Provide weather forecasts
- Display the weather forecast.
- Provide local weather information for the week.
Diagram labels: nozzle cracked; repair nozzle
Step 6: Remove the identified terms from ALL problem reports
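Term removal itself is simple. A minimal sketch, assuming each report has already been reduced to a list of stemmed tokens (the `remove_terms` name is hypothetical):

```python
def remove_terms(reports, terms):
    """Drop every occurrence of the given stemmed terms from each tokenized report."""
    blocked = set(terms)
    return [[token for token in report if token not in blocked] for report in reports]
```

Reports that lose all their tokens stay in the list as empty lists, so report indices remain stable across iterations.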

Step 7: Repeat
• Once dominant terms have been removed, re-cluster around the remaining terms.
• Repeat steps 2-6 until a stopping condition is met.
• Candidate stopping conditions:
  • No additional interesting topics remain.
  • Individual problem statements contain only stop words.
• Note: This approach generates fuzzy clusters, i.e. a single statement can be placed into multiple clusters.
Step 7: Repeat steps 2-6 until no more clusters are found.
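The outer loop can be sketched as follows. To keep the example self-contained, steps 2-4 are replaced by `toy_find_topic`, a deliberately simple stand-in that is NOT SPK-Means: it takes the most frequent remaining term and the terms that accompany it in more than half of its reports. All names and the `max_topics` and `min_support` parameters are hypothetical.

```python
from collections import Counter

def toy_find_topic(reports, min_support=2):
    """Stand-in for steps 2-4 (not the clustering used in the project)."""
    doc_freq = Counter(t for r in reports for t in set(r))
    if not doc_freq:
        return []
    # Sorting first makes ties resolve alphabetically.
    seed_term, support = max(sorted(doc_freq.items()), key=lambda kv: kv[1])
    if support < min_support:
        return []
    members = [r for r in reports if seed_term in r]
    co_counts = Counter(t for r in members for t in set(r))
    return sorted(t for t, c in co_counts.items() if 2 * c > len(members))

def mine_topics(reports, find_topic=toy_find_topic, max_topics=10):
    """Repeat: find a topic, record it, remove its terms from ALL reports."""
    topics = []
    remaining = [list(r) for r in reports]
    while len(topics) < max_topics:
        terms = find_topic(remaining)      # steps 2-4
        if not terms:                      # stopping condition: nothing left
            break
        topics.append(terms)               # step 5
        blocked = set(terms)               # step 6
        remaining = [[t for t in r if t not in blocked] for r in remaining]
    return topics
```

Because terms are stripped from every report rather than whole reports being removed, a report that mentions both weather and hotels can contribute to two topics, which is what makes the resulting clusters overlap.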

Step 8: Analysis and Review
• Engineers review the candidate list of faults.
• Engineers mark each detected fault as:
  • Valid
  • Invalid
  • Insignificant
Step 8: Present topic list to analyst for review.

Sample Results (from Airport Kiosk)
Cluster: (flight, connect, inform)
- Provide flight information including departure times and gate numbers.
- Provide up to date information about flight delays.
- To be informed of connecting flights
- To provide connecting flight information
- The kiosk should have secured network access to get ongoing flight’s information.
- Get flight status
- Check in for flight
- Provide current flight information for O’Hare and other airports around the country.
- I need to be able to check for my flight information.
Cluster: (nearby, restaur)
- To provide nearby restaurant listings
- To provide nearby traffic conditions
- Locate restaurants
- Display the location of food courts and other restaurants on the airport map.
- Provide information on nearby businesses, hotels, and restaurants, and their relation to the airport.
- Be able to see some reviews on the certain restaurants or hotels
- Make reservation at restaurant near the hotel.
- Create list of restaurants nearby attractions

Evaluation against Answer Set
• A standard metric (the Jaccard index) was used to compare our results against the answer-set clusterings for the airport kiosk. (Note that this metric returns relatively low values even for fairly similar clusterings.)
• However, our observations suggest that many of the additional clusters discovered by our tool represent good topics that were not discovered manually by human analysts. Additional evaluation is needed to confirm this hypothesis.
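For reference, one common way to apply the Jaccard index to two clusterings is the pair-counting form sketched below. It is an assumption that this is the variant used in the evaluation, and the sketch handles only crisp labels, so overlapping clusters would need to be adapted first.

```python
from itertools import combinations

def jaccard_clustering_index(labels_a, labels_b):
    """Pairs grouped together in both clusterings, divided by pairs grouped
    together in at least one of them."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0
```

Splitting a single pair already costs a lot: `jaccard_clustering_index([0, 0, 1, 1], [0, 0, 1, 2])` is 0.5, which illustrates why the metric reports low values even for fairly similar clusterings.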

Results on C-130 Data
• Results from our clustering process were presented to engineers at Lockheed Martin.
• Engineers were well satisfied with the results for several reasons:
  • The iterative clustering approach pushes the BEST clusters to the top of the list and appears to produce higher-quality clusters than standard SPK-Means approaches.
  • In an initial review session, our process identified at least one recurring fault that engineers may previously have been unaware of.

Future Steps
• Incorporate techniques that use real-time feedback to improve the fault detection algorithms and decrease the ranking of rejected clusters.
• Incorporate techniques such as acronym expansion and synonym recognition to reduce redundancy in results.
• Deliver GUI-based tools to Lockheed Martin that can be incorporated into their fault management process.
• Arrange an onsite visit by DePaul researchers to Lockheed Martin to test the same techniques on additional datasets.

AUTOMATED MINING OF CROSS-CUTTING CONCERNS FROM PROBLEM REPORTS AND REQUIREMENTS SPECIFICATIONS
Horatiu Dumitru, Adam Czauderna, Jane Cleland-Huang, DePaul University
