Data Cleansing: Filling Missing Values in Data

Data Cleansing: Filling Missing Values in Data
Class Presentation, CIS 764
Instructor: Dr. William Hankley
Presented by: Gaurav Chauhan

Overview
• Problems caused
• Methods for retrieving missing values
• Predicting values
  • The average way
  • The probabilistic way
  • By leveraging the relational network structure
• Conclusions
CIS 764-Gaurav Chauhan

Problems Caused
Missing values in the data cause problems in the following data-analysis tasks:
• Summarizing variables
• Computing new variables
• Comparing variables
• Combining variables
• Time series analysis
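As a small illustration of the first and third problems, the sketch below shows how a single missing reading, represented here as NaN, affects a summary and a comparison. The readings are hypothetical and the NaN representation is an assumption; the slides do not say how missing values are stored.

```python
import math

# Hypothetical readings; float("nan") stands in for the missing value.
readings = [32.0, float("nan"), 25.0]

# Summarizing: one NaN propagates through the sum, so the mean is NaN too.
naive_mean = sum(readings) / len(readings)

# Summarizing only the values that are present gives a usable number.
present = [r for r in readings if not math.isnan(r)]
observed_mean = sum(present) / len(present)

# Comparing: every ordered comparison against NaN is False.
missing = readings[1]
above_threshold = missing > 30.0
at_or_below_threshold = missing <= 30.0

print(math.isnan(naive_mean), observed_mean)
print(above_threshold, at_or_below_threshold)
```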

Methods for retrieving missing values
• Averaging the available values to predict the missing one
• Using a probabilistic approach for value prediction
• Leveraging the relational network structure of the data to predict values

Predicting Values: the average way

To find the rainfall for the years 1938 and 1942, take the average of the two neighboring years:
• Rainfall in 1938 = average of 1937 and 1939 = (32 + 25)/2 cm = 28.5 cm
• Rainfall in 1942 = average of 1941 and 1943 = (30 + 28)/2 cm = 29 cm
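The same arithmetic can be written as a short Python sketch. The four known rainfall values are the ones quoted on the slide; the function name `neighbor_average` and the dictionary layout are illustrative assumptions, not part of the presentation.

```python
def neighbor_average(rainfall_cm: dict[int, float], year: int) -> float:
    """Estimate a missing year as the mean of the years just before and after it."""
    return (rainfall_cm[year - 1] + rainfall_cm[year + 1]) / 2


# Known rainfall in cm, as given on the slide; 1938 and 1942 are missing.
rainfall_cm = {1937: 32, 1939: 25, 1941: 30, 1943: 28}

estimates = {year: neighbor_average(rainfall_cm, year) for year in (1938, 1942)}
print(estimates)  # {1938: 28.5, 1942: 29.0}
```

This sketch assumes both neighboring years are present; it raises a KeyError if either one is also missing.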

Predicting Values: the probabilistic way
• Assume that we have n values and must predict the (n+1)th value, v_(n+1).
• For every i from 1 to n, p(v_i) is the probability that a data instance has the value v_i.
• Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.
• v_(n+1) is then picked at random such that p(v_(n+1) = v_i) > p(v_(n+1) = v_j) whenever p(v_i) > p(v_j).
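One way to satisfy the condition above is to draw the new value with probability proportional to its observed frequency. The sketch below does that with the standard library; the function name `predict_by_frequency` and the sample data are hypothetical, and proportional sampling is only one scheme that meets the stated ordering of probabilities.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng=None):
    """Draw a value at random, weighted by how often it occurs in `observed`."""
    if rng is None:
        rng = random.Random()
    counts = Counter(observed)
    values = list(counts)
    weights = [counts[v] for v in values]
    # random.choices samples with replacement using the given weights.
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observed values: "A" occurs most often, "C" least often.
observed = ["A", "A", "A", "B", "B", "C"]

rng = random.Random(0)  # seeded only to make the run repeatable
draws = Counter(predict_by_frequency(observed, rng) for _ in range(10_000))
print(draws.most_common())
```

Over many draws, more frequent values are predicted more often, which is the property the last bullet asks for.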

Predicting Values by leveraging the relational network
• This technique applies only to relational data.
• A missing value is predicted as the mode of the peers that fit the relational network and have no missing values.


Predicting Values by leveraging the relational network
• Example 1 (diagram): Book A, Book B, and Book C are shown with Category A, Category B, and Category C. In the second panel one category is missing (?) while Category C and Category B are present; the missing category is predicted as A.

Predicting Values by leveraging the relational network
• Example 2 (diagram): a Teacher is linked to Student 1, Student 2, Student 3, and Student 4. The ages, in slide order, are Age(19), ?, Age(18), Age(19); the missing age is predicted as 19.
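Example 2 can be reproduced with a minimal sketch that takes the mode of the peers' known values. It assumes the ages appear in the same order as the students on the slide, so Student 2 is the one with the missing age, and that the peers are the other students linked to the same teacher. The function name `predict_from_peers` and the tie-breaking rule (first mode encountered) are assumptions.

```python
from statistics import multimode


def predict_from_peers(peer_values):
    """Return the mode of the peers' known values, or None if no peer has one."""
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    # multimode lists all modes in first-seen order; taking the first one
    # is an arbitrary tie-break chosen for this sketch.
    return multimode(known)[0]


# Ages of the students linked to the same teacher; None marks the missing age.
student_ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}

target = "Student 2"
peers = [age for name, age in student_ages.items() if name != target]
predicted_age = predict_from_peers(peers)
print(predicted_age)  # 19
```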

Conclusion
• Missing values are harmful when the data is used for analysis, learning, or mining.
• Various techniques aim to predict missing values, but none has reached 100% accuracy.
• An average prediction accuracy of 90% is still considered acceptable.


Questions, anyone?
• "I am shivering not because of nervousness but because of the cold room temperature." (one nervous student)
