ADVENTURES IN DATA MINING

ADVENTURES IN DATA MINING. Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841


Presentation Transcript


  1. ADVENTURES IN DATA MINING Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; eamonn@cs.ucr.edu

  2. The 2000 ozone hole over the Antarctic, as seen by EPTOMS http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

  3. Data Mining Outline • Introduction • Techniques • Classification • Clustering • Association Rules • Examples Explore some interesting data mining applications

  4. Introduction • Data is growing at a phenomenal rate • Users expect more sophisticated information • How? UNCOVER HIDDEN INFORMATION DATA MINING

  5. But it isn’t Magic • You must know what you are looking for • You must know how to look for it Suppose you knew that a specific cave had gold: • What would you look for? • How would you look for it? • Might need an expert miner

  6. Description Behavior Associations “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.” Classification Clustering Link Analysis (Profiling) (Similarity)

  7. CLASSIFICATION Assign data into predefined groups or classes.

  8. Classification Ex: Grading • x >= 90: A • 80 <= x < 90: B • 70 <= x < 80: C • 60 <= x < 70: D • x < 60: F
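The grading tree on this slide can be sketched as a chain of threshold tests. This is only an illustrative sketch: the cutoffs 90, 80 and 70 come from the slide, the final cutoff of 60 is an assumption, and the function name `grade` and the sample scores are hypothetical.

```python
def grade(score: float) -> str:
    """Walk the decision tree from the root: test the highest cutoff first."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:  # assumed cutoff for the last split
        return "D"
    return "F"


# Hypothetical scores, one per predefined class.
scores = [95, 83, 71, 64, 42]
grades = [grade(s) for s in scores]
print(grades)
```

Each score falls into exactly one predefined class, which is what distinguishes classification from clustering later in the talk.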

  9. Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. (c) Eamonn Keogh, eamonn@cs.ucr.edu

  10. The classification problem can now be expressed as: • Given a training database, predict the class label of a previously unseen instance. (c) Eamonn Keogh, eamonn@cs.ucr.edu
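One simple way to predict the label of an unseen instance is a nearest-neighbor rule. The slides do not say which classifier is used, so this is a swapped-in illustration, not the presenter's method. The measurements below (antenna length, abdomen length) are hypothetical values, not data from the talk.

```python
import math

# Hypothetical training database: (antenna length, abdomen length, class label).
training = [
    (8.0, 3.0, "Katydid"), (9.0, 4.5, "Katydid"), (7.5, 2.0, "Katydid"),
    (8.5, 6.0, "Katydid"), (9.5, 5.0, "Katydid"),
    (2.0, 3.0, "Grasshopper"), (3.0, 5.5, "Grasshopper"), (1.5, 7.0, "Grasshopper"),
    (2.5, 8.0, "Grasshopper"), (4.0, 6.0, "Grasshopper"),
]


def nearest_neighbor(instance, database):
    """Return the label of the training row closest to `instance` (Euclidean distance)."""
    closest = min(database, key=lambda row: math.dist(instance, row[:2]))
    return closest[2]


unseen_a = (7.0, 4.0)
unseen_b = (3.0, 6.5)
print(nearest_neighbor(unseen_a, training))
print(nearest_neighbor(unseen_b, training))
```

The rule needs no training step beyond storing the labeled instances, which makes it a convenient first illustration of the problem statement.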

  11. [Scatter plot of Katydids and Grasshoppers; axes: Antenna Length and Abdomen Length, each from 1 to 10] (c) Eamonn Keogh, eamonn@cs.ucr.edu

  12. Facial Recognition (c) Eamonn Keogh, eamonn@cs.ucr.edu

  13. Handwriting Recognition: George Washington Manuscript [plot omitted] (c) Eamonn Keogh, eamonn@cs.ucr.edu

  14. Rare Event Detection

  15. Dallas Morning News October 7, 2005

  16. CLUSTERING Partition data into previously undefined groups.

  17. http://149.170.199.144/multivar/ca.htm

  18. What is Similarity? (c) Eamonn Keogh, eamonn@cs.ucr.edu

  19. Two Types of Clustering Partitional Hierarchical (c) Eamonn Keogh, eamonn@cs.ucr.edu

  20. Hierarchical Clustering Example: Iris Data Set (Versicolor, Setosa, Virginica) The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188. Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster.
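A minimal sketch of agglomerative (bottom-up) hierarchical clustering, assuming single-linkage distance and stopping at a chosen number of clusters. The nine two-dimensional points are hypothetical flower-like measurements, not the actual Iris data, and this is not the algorithm used by the Hierarchical Clustering Explorer tool.

```python
import math


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # j > i, so index i is unaffected
    return clusters


# Hypothetical (petal length, petal width) measurements.
points = [
    (1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
    (4.5, 1.5), (4.7, 1.4), (4.4, 1.3),
    (6.0, 2.5), (5.9, 2.1), (6.1, 2.3),
]
clusters = agglomerate(points, 3)
for cluster in clusters:
    print(sorted(cluster))
```

Recording the order of merges instead of stopping at `k` would give the dendrogram that hierarchical clustering tools typically display.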

  21. ASSOCIATION RULES/ LINK ANALYSIS Find relationships between data

  22. ASSOCIATION RULES EXAMPLES • People who buy diapers also buy beer • If gene A is highly expressed in this disease then gene B is also expressed • Relationships between people • Book Stores • Department Stores • Advertising • Product Placement http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
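A rule such as "diapers => beer" is usually judged by its support and confidence. The sketch below computes both over a handful of hypothetical market-basket transactions; the baskets and the function names are illustrative assumptions, not data or code from the talk.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here two of five baskets contain both items, and two of the three diaper baskets also contain beer, so the rule holds in most but not all cases.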

  23. Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.

  24. Data Mining Outline • Introduction • Techniques • Examples • Vision Mining • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior,…) • Bioinformatics

  25. Vision Mining • License Plate Recognition • Red Light Cameras • Toll Booths • http://www.licenseplaterecognition.com/ • Computer Vision • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/

  26. How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm

  27. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

  28. No/Little Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

  29. Rampant Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

  30. Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

  31. http://www.time.com/time/magazine/article/0,9171,1541283,00.html

  32. DNA http://www.visionlearning.com/library/module_viewer.php?mid=63 Basic building blocks of organisms Located in nucleus of cells Composed of 4 nucleotides Two strands bound together

  33. DNA transcription RNA translation Protein Central Dogma: DNA -> RNA -> Protein CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA Amino Acid www.bioalgorithms.info; chapter 6; Gene Prediction
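The two sequences on this slide differ only in T versus U, so the transcription step can be sketched as a character substitution on the coding strand. This is a simplification (in the cell, RNA is synthesized from the complementary template strand), and the function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


dna = "CCTGAGCCAACTATTGATGAA"  # sequence shown on the slide
rna = transcribe(dna)

# Translation reads the mRNA three bases (one codon) at a time.
codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
print(rna)
print(codons)
```

Each codon would then be mapped to an amino acid through the genetic code table, which is omitted here.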

  34. Human Genome Scientists originally thought there would be about 100,000 genes Appear to be about 20,000 WHY? Almost identical to that of Chimps. What makes the difference? Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

  35. RNAi – Nobel Prize in Medicine 2006 siRNA may be artificially added to cell! Double stranded RNA Short Interfering RNA (~20-25 nt) RNA-Induced Silencing Complex Binds to mRNA Cuts RNA Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

  36. miRNA • Short (20-25nt) sequence of noncoding RNA • Known since 1993 but significance not widely appreciated until 2001 • Impact / Prevent translation of mRNA • Generally reduce protein levels without impacting mRNA levels (animal cells) • Functions • Causes some cancers • Guide embryo development • Regulate cell Differentiation • Associated with HIV • …

  37. TCGR – Mature miRNA (Window=5; Pattern=3) • Panels: C. elegans, Homo sapiens, Mus musculus, All Mature • Patterns: ACG, CGC, GCG, UCG

  38. TCGRs for Xue Training Data C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

  39. Affymetrix GeneChip® Array http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

  40. Microarray Data Analysis • Each probe location associated with gene • Measure the amount of mRNA • Color indicates degree of gene expression • Compare different samples (normal/disease) • Track same sample over time • Questions • Which genes are related to this disease? • Which genes behave in a similar manner? • What is the function of a gene? • Clustering • Hierarchical • K-means
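The question "Which genes behave in a similar manner?" can be sketched with K-means over expression profiles. The gene names and expression values below are hypothetical, and the initialization (first `k` profiles as starting centroids) is a simplifying assumption chosen to keep the example deterministic.

```python
import math

# Hypothetical expression levels of four genes measured at three time points.
profiles = {
    "g1": (0.1, 0.9, 2.0),  # rising
    "g2": (2.1, 1.0, 0.1),  # falling
    "g3": (0.2, 1.1, 2.2),  # rising
    "g4": (1.9, 0.8, 0.2),  # falling
}


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


names = list(profiles)
labels = kmeans(list(profiles.values()), k=2)
assignment = dict(zip(names, labels))
print(assignment)
```

Genes that end up in the same cluster have similar expression patterns across the samples, which is the starting point for asking whether they share a function.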

  41. Microarray Data - Clustering "Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

  42. BIG BROTHER ? • Total Information Awareness • http://infowar.net/tia/www.darpa.mil/iao/index.htm • http://www.govtech.net/magazine/story.php?id=45918 • http://en.wikipedia.org/wiki/Information_Awareness_Office • Terror Watch List • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/ • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html • CAPPS • http://www.theregister.co.uk/2004/04/26/airport_security_failures/ • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/ • http://en.wikipedia.org/wiki/CAPPS

  43. http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

  44. Thanks!
