DATA MINING

Presentation Transcript


  1. DATA MINING: Handling Missing Attribute Values and Knowledge Discovery • Shahzeb Kamal, Olov Junker • Amsterdam / Uppsala • HASCO 2014

  2. DATA MINING • The process of extracting information from a database and transforming it into an understandable structure • Why? Because real-world data is ugly: • Incomplete • Contains errors • Inconsistent

  3. Handling missing attribute values • Techniques, consistency, various algorithms • KDD (Knowledge Discovery in Databases): exploring patterns in data sets • The core of KDD is DM

  4. Missing Data • Data is not always available: • Machine malfunction • Inconsistency with other recorded data • Data not entered due to misunderstanding • Certain data not considered important at the time of entry • Mistakenly changed/erased data

  5. How to handle missing data • Goal: rule induction (extracting rules by observing the data) • Sequential methods: preprocess the data, i.e. fill in the missing attribute values before the main process (e.g. rule induction) • Parallel methods: extract rules directly from the original, incomplete data sets

  6. Sequential methods • Case-wise deletion: discard every case that contains a missing attribute value (a minimal sketch follows)
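A minimal sketch of case-wise deletion in Python; the toy table and the "?" marker for missing values are assumptions for illustration:

```python
def casewise_deletion(table):
    # Keep only the cases (rows) with no missing attribute value.
    return {case: row for case, row in table.items()
            if "?" not in row.values()}

# Hypothetical table; "?" marks a missing attribute value.
table = {
    1: {"Temp": "high",   "Headache": "?",   "Nausea": "no"},
    2: {"Temp": "normal", "Headache": "yes", "Nausea": "yes"},
    3: {"Temp": "?",      "Headache": "no",  "Nausea": "no"},
}
print(casewise_deletion(table))   # only case 2 survives
```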

  7. Most common value of an attribute • Or: most common value of an attribute restricted to a concept (a concept is the set of all cases with the same decision) • E.g. case 1 belongs to the concept {1, 2, 4, 8}, so Headache = yes
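A sketch of most-common-value imputation. Only the concept {1, 2, 4, 8} and the answer "yes" are quoted on the slide; the Headache values of cases 2, 4 and 8 are assumptions consistent with that answer:

```python
from collections import Counter

def most_common_value(table, attr, cases=None):
    # Most frequent known value of `attr` among `cases`
    # (all cases when `cases` is None).
    pool = cases if cases is not None else table.keys()
    known = [table[c][attr] for c in pool if table[c][attr] != "?"]
    return Counter(known).most_common(1)[0][0]

table = {
    1: {"Headache": "?"},
    2: {"Headache": "yes"},
    4: {"Headache": "yes"},
    8: {"Headache": "yes"},
}
table[1]["Headache"] = most_common_value(table, "Headache", {1, 2, 4, 8})
print(table[1]["Headache"])   # -> yes
```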

  8. Assigning all possible values to a missing attribute value • Or: all possible values of an attribute restricted to a concept
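A sketch of this expansion with a hypothetical three-case table: the incomplete case is replaced by one copy per possible value of each missing attribute; passing a concept restricts the value pool as in the restricted variant:

```python
def expand_case(table, case, attrs, concept=None):
    # Replace one incomplete case by a copy for every possible value of
    # each missing attribute; pass a concept to restrict the value pool.
    pool = concept if concept is not None else set(table)
    rows = [dict(table[case])]
    for a in attrs:
        if table[case][a] == "?":
            values = {table[c][a] for c in pool if table[c][a] != "?"}
            rows = [dict(r, **{a: v}) for r in rows for v in values]
    return rows

table = {
    1: {"Temp": "high",   "Headache": "?"},
    2: {"Temp": "normal", "Headache": "yes"},
    3: {"Temp": "high",   "Headache": "no"},
}
print(expand_case(table, 1, ["Temp", "Headache"]))
# case 1 becomes two cases: Headache = "yes" and Headache = "no"
```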

  9. Assigning mean value (for numerical attributes) • For symbolic attributes -> replace by the most common value

  10. Assigning mean value restricted to a concept • For symbolic attributes -> replace by the most common value within the concept (a sketch covering both variants follows)
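A sketch covering both slide 9 and slide 10: the mean of the known numerical values, optionally restricted to a concept. The temperatures are illustrative, not from the slides:

```python
def mean_value(table, attr, cases=None):
    # Mean of the known numerical values of `attr`; pass a concept as
    # `cases` to get the concept-restricted variant of slide 10.
    pool = cases if cases is not None else table.keys()
    known = [table[c][attr] for c in pool if table[c][attr] != "?"]
    return sum(known) / len(known)

table = {1: {"Temp": 100.2}, 2: {"Temp": 102.6}, 3: {"Temp": "?"}}
table[3]["Temp"] = mean_value(table, "Temp")  # (100.2 + 102.6) / 2 = 101.4
```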

  11. Global Closest Fit • Replace the missing attribute value by the known value in the other case that resembles the case with the missing value as closely as possible • We compute a distance; the smallest distance marks the closest case • Distance(x, y) = Σᵢ distance(xᵢ, yᵢ), summed over all attributes i, where distance(xᵢ, yᵢ) = 0 if xᵢ = yᵢ; 1 if xᵢ and yᵢ are symbolic and xᵢ ≠ yᵢ, or xᵢ = ? or yᵢ = ?; |xᵢ − yᵢ| / r if xᵢ and yᵢ are numerical and xᵢ ≠ yᵢ, where r is the range of the attribute

  12. For example: Distance(1, 2) = |100.2 − 102.6| / |102.6 − 96.4| + 1 + 1 ≈ 2.39
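A sketch of the distance from slide 11 that reproduces this 2.39. Only the temperatures and the range are quoted on the slide; the symbolic values of the two cases are assumptions chosen to yield the two "+1" terms:

```python
def attr_distance(x, y, r=None):
    # Per-attribute distance from slide 11's definition.
    if x == "?" or y == "?":
        return 1.0                      # any missing value counts as 1
    if x == y:
        return 0.0
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / r           # r = range of the numerical attribute
    return 1.0                          # unequal symbolic values

def distance(case_x, case_y, attrs, ranges):
    return sum(attr_distance(case_x[a], case_y[a], ranges.get(a))
               for a in attrs)

c1 = {"Temp": 100.2, "Headache": "?",   "Nausea": "no"}
c2 = {"Temp": 102.6, "Headache": "yes", "Nausea": "yes"}
ranges = {"Temp": 102.6 - 96.4}         # range r of Temp, as on the slide
print(distance(c1, c2, ["Temp", "Headache", "Nausea"], ranges))  # ~2.39
```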

  13. Concept Closest Fit • First split the data set into subsets with the same concept • Replace the missing attribute value by the known value in the case of the same subset that resembles the case with the missing value as closely as possible • Merge the data subsets
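A sketch of concept closest fit following those three bullets (symbolic attributes only, hypothetical data): split by decision value, fill within each subset, merge:

```python
def closest_fit_fill(subset, attrs):
    # Fill each "?" from the closest other case in the same subset
    # (symbolic attributes only, for brevity).
    def dist(x, y):
        return sum(1 for a in attrs
                   if x[a] != y[a] or x[a] == "?" or y[a] == "?")
    filled = {c: dict(r) for c, r in subset.items()}
    for c, row in subset.items():
        for a in attrs:
            if row[a] == "?":
                donors = [o for o in subset
                          if o != c and subset[o][a] != "?"]
                if donors:
                    best = min(donors, key=lambda o: dist(row, subset[o]))
                    filled[c][a] = subset[best][a]
    return filled

def concept_closest_fit(table, decisions, attrs):
    # Split by concept, run closest fit within each subset, merge.
    merged = {}
    for d in set(decisions.values()):
        subset = {c: r for c, r in table.items() if decisions[c] == d}
        merged.update(closest_fit_fill(subset, attrs))
    return merged

decisions = {1: "yes", 2: "yes", 3: "no", 4: "no"}
table = {
    1: {"Headache": "?",   "Nausea": "no"},
    2: {"Headache": "yes", "Nausea": "no"},
    3: {"Headache": "no",  "Nausea": "yes"},
    4: {"Headache": "no",  "Nausea": "?"},
}
print(concept_closest_fit(table, decisions, ["Headache", "Nausea"]))
```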

  14. [Figure only; no text survives in the transcript.]

  15. Other methods of filling in the missing values • A number of methods handle missing attribute values based on the dependence between known and missing values • Chase algorithm: for each case with missing data, a new data subset is created and the missing value is treated as the decision value; the data sets are then merged • Maximum likelihood estimation • Monte Carlo method: missing values are replaced by many possible values; each completed data set is analyzed and the results are combined
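A toy sketch of the Monte Carlo idea as described above: impute each missing value many times by sampling from the observed values, analyze every completed copy (here the "analysis" is just a mean, purely for illustration), and combine the estimates:

```python
import random
import statistics

def monte_carlo_estimate(values, n_copies=1000, seed=0):
    # `values` is a list with "?" for missing numerical entries.
    rng = random.Random(seed)
    known = [v for v in values if v != "?"]
    estimates = []
    for _ in range(n_copies):
        # Complete the data by sampling each "?" from the observed values,
        # then analyze the completed copy.
        completed = [v if v != "?" else rng.choice(known) for v in values]
        estimates.append(statistics.mean(completed))
    return statistics.mean(estimates)   # combine the per-copy results

print(monte_carlo_estimate([100.2, 102.6, "?", 96.4, 99.5]))
```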

  16. Parallel methods • Subsets first, then rule induction • 2 types of missing values: • "lost": needed but gone • "do not care": irrelevant

  17. Concepts • All cases with the same decision value • C1 = {1, 2, 4, 8} • C2 = {3, 5, 6, 7}

  18. Parallel method, "lost" values • "Lost" values do not belong to any block • Blocks (colored on the slide): all cases with the same value for a given attribute • e.g. [(Temp, high)] = {1, 4, 5} • [(Nausea, yes)] = {2, 4, 5, 7} • Characteristic sets: the intersection of the blocks containing a given case • e.g. K(4) = {4}, K(5) = {4, 5} • Use these to create lower and upper approximations of concepts: • Lower({1, 2, 4, 8}) = {1, 2, 4} • Upper({1, 2, 4, 8}) = {1, 2, 4, 6, 8} • -> Rule induction (a code sketch covering both interpretations follows slide 19)

  19. Parallel method, "do not care" values • "Do not care" values belong to every block of their attribute • Blocks (colored on the slide): all cases with the same value for a given attribute • e.g. [(Temp, high)] = {1, 3, 4, 5, 8} • [(Nausea, yes)] = {2, 4, 5, 7, 8} • Characteristic sets: the intersection of the blocks containing a given case • e.g. K(4) = {4, 5, 8}, K(5) = {4, 5, 8} • Use these to create lower and upper approximations of concepts: • Lower({1, 2, 4, 8}) = {2, 8} • Upper({1, 2, 4, 8}) = {1, 2, 3, 4, 5, 6, 8} • -> Rule induction
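A sketch covering both interpretations. The transcript does not contain the slides' data table, so the one below is reconstructed from the blocks, concepts and characteristic sets quoted on slides 7, 17 and 18 and should be treated as an assumption; with the "lost" interpretation it reproduces slide 18's numbers:

```python
# "?" is the missing value; interpret it as "lost" (belongs to no block)
# or "do not care" (belongs to every block of its attribute).
TABLE = {
    1: {"Temp": "high",      "Headache": "?",   "Nausea": "no"},
    2: {"Temp": "very_high", "Headache": "yes", "Nausea": "yes"},
    3: {"Temp": "?",         "Headache": "no",  "Nausea": "no"},
    4: {"Temp": "high",      "Headache": "yes", "Nausea": "yes"},
    5: {"Temp": "high",      "Headache": "?",   "Nausea": "yes"},
    6: {"Temp": "normal",    "Headache": "yes", "Nausea": "no"},
    7: {"Temp": "normal",    "Headache": "no",  "Nausea": "yes"},
    8: {"Temp": "?",         "Headache": "yes", "Nausea": "?"},
}
FLU = {1: "yes", 2: "yes", 3: "no", 4: "yes",
       5: "no", 6: "no", 7: "no", 8: "yes"}
ATTRS = ["Temp", "Headache", "Nausea"]

def blocks(table, attrs, do_not_care):
    # [(a, v)] = cases whose value of a is v; "?" joins every block of its
    # attribute only under the "do not care" interpretation.
    blk = {}
    for a in attrs:
        for v in {r[a] for r in table.values() if r[a] != "?"}:
            blk[(a, v)] = {c for c, r in table.items()
                           if r[a] == v or (do_not_care and r[a] == "?")}
    return blk

def characteristic_set(case, table, blk, attrs):
    # Intersection of the blocks [(a, v)] for the case's known values.
    ks = set(table)
    for a in attrs:
        if table[case][a] != "?":
            ks &= blk[(a, table[case][a])]
    return ks

def approximations(concept, K):
    lower = {x for x in concept if K[x] <= concept}
    upper = set().union(*(K[x] for x in concept))
    return lower, upper

blk = blocks(TABLE, ATTRS, do_not_care=False)   # "lost"; True -> slide 19
K = {c: characteristic_set(c, TABLE, blk, ATTRS) for c in TABLE}
concept = {c for c, d in FLU.items() if d == "yes"}   # {1, 2, 4, 8}
print(K[4], K[5])                   # {4} and {4, 5}, as on slide 18
print(approximations(concept, K))   # ({1, 2, 4}, {1, 2, 4, 6, 8})
```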

  20. Rule Induction: the MLEM2 algorithm • Rules describing the cases are induced from the decision table • Possible rules: from the upper approximation of a concept • Certain rules: from the lower approximation of a concept • With missing values interpreted as lost: • Possible: (Temp, normal) -> (Flu, no) • (Headache, no) -> (Flu, no) • Certain: (Temp, high) & (Nausea, no) -> (Flu, yes) • (Headache, yes) & (Nausea, yes) -> (Flu, yes)

  21. KDD • An organized and automated process of exploring patterns in large data sets • More general than Data Mining • The core of KDD is DM

  22. The 9 steps • Iterative and interactive • Not yet available: a best solution for each kind of problem at each step

  23. Step 1: Understand and specify the goals of the end user • Preprocessing part: • Step 2: Select and create the data set • Step 3: Preprocessing and cleaning => enhanced reliability • Step 4: Data transformation => better data for DM

  24. Data Mining part: • Step 5: Choosing the appropriate DM task • Step 6: Choosing the DM algorithm (precision vs. understandability) • Step 7: Employing the DM algorithm • Step 8: Evaluating and interpreting the mined patterns

  25. Step 9: Using the discovered knowledge • The success of the entire KDD process is determined by this step • Challenges arise, e.g. when leaving laboratory conditions

  26. END
