
Data Analysis and Mining


Presentation Transcript


  1. Data Analysis and Mining Peter Fox Data Science – ITEC/CSCI/ERTH-6961-01 Week 7, October 18, 2011

  2. Reading assignment • Brief Introduction to Data Mining • Longer Introduction to Data Mining and slide sets • Software resources list • Data Analysis Tutorial • Example: Data Mining

  3. Contents • Preparing for data analysis, completing and presenting results • Visualization as an information tool • Visualization as an analysis tool • New visualization methods (new types of data) • Managing the output of viz/ analysis • Enabling discovery • Use, citation, attribution and reproducibility

  4. Contents • Data Mining – what it is, is not, types • Distributed applications – modern data mining • Science example • A specific toolkit set of examples (next week) • Classifier • Image analysis – clouds • Assignment 3 and week 7 reading • Week 8

  5. Types of data

  6. Data types • Time-based, space-based, image-based, … • Encoded in different formats • May need to manipulate the data, e.g. • In our Data Mining tutorial, conversion to ARFF (see the sketch below) • Coordinates • Units • Higher order, e.g. derivative, average
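
For illustration, here is a minimal sketch (not part of the tutorial itself) of converting a comma-separated table into Weka's ARFF format using only the Python standard library; the file name, attribute names, and types are hypothetical placeholders.

    import csv

    # Hypothetical input: a CSV with two numeric columns and a class label.
    attributes = [("temperature", "NUMERIC"),
                  ("pressure", "NUMERIC"),
                  ("label", "{clear,cloudy}")]

    with open("observations.csv") as src, open("observations.arff", "w") as dst:
        dst.write("@RELATION observations\n\n")
        for name, arff_type in attributes:
            dst.write("@ATTRIBUTE %s %s\n" % (name, arff_type))
        dst.write("\n@DATA\n")
        reader = csv.reader(src)
        next(reader)                      # skip the CSV header row
        for row in reader:
            dst.write(",".join(row) + "\n")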

  7. Induction or deduction? • Induction: The development of theories from observation • Qualitative – usually information-based • Deduction: The testing/application of theories • Quantitative – usually numeric, data-based

  8. ‘Signal to noise’ • Understanding accuracy and precision • Accuracy • Precision • Affects choices of analysis • Affects interpretations (GIGO – garbage in, garbage out) • Leads to data quality and assurance specification • Signal and noise are context dependent

  9. Other considerations • Continuous or discrete • Underlying reference system • Oh yeah: metadata standards and conventions • The underlying data structures are important at this stage but there is a tendency to read in partial data • Why is this a problem? • How to ameliorate any problems?

  10. Outlier • An extreme, or atypical, data value in a sample. • Outliers should be considered carefully before exclusion from analysis. • For example, data values may be recorded erroneously, and hence they may be corrected. • However, in other cases they may just be surprisingly different, but not necessarily 'wrong'.
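
As a hedged aside (not part of the slides), a minimal sketch that flags candidate outliers with the common 1.5×IQR rule; flagged values should be inspected, not discarded automatically.

    import numpy as np

    def flag_outliers_iqr(values, k=1.5):
        """Return a boolean mask marking values outside the k*IQR fences."""
        values = np.asarray(values)
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        return (values < lower) | (values > upper)

    data = [9.8, 10.1, 10.0, 9.9, 10.2, 17.3]    # 17.3 looks atypical
    print(flag_outliers_iqr(data))               # [False False False False False  True]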

  11. Special values in data • Fill value • Error value • Missing value • Not-a-number • Infinity • Default • Null • Rational numbers
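
A minimal sketch, assuming NumPy is available, of turning special values into missing data before computing statistics; the fill value -9999 is an example convention, not one mandated here.

    import numpy as np

    FILL_VALUE = -9999.0                      # hypothetical fill/missing-data marker

    raw = np.array([12.3, -9999.0, 11.8, np.inf, 12.1])
    clean = raw.copy()
    clean[clean == FILL_VALUE] = np.nan       # fill value -> missing
    clean[np.isinf(clean)] = np.nan           # infinities -> missing

    print(np.nanmean(clean))                  # mean over the valid values only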

  12. Errors • Three main types: personal error, systematic error, and random error • Personal errors are mistakes on the part of the experimenter. It is your responsibility to make sure that there are no errors in recording data or performing calculations • Systematic errors tend to decrease or increase all measurements of a quantity (for instance, all of the measurements are too large), e.g. calibration errors

  13. Errors • Random errors are also known as statistical uncertainties, and are a series of small, unknown, and uncontrollable events • Statistical uncertainties are much easier to assign, because there are rules for estimating the size • E.g. If you are reading a ruler, the statistical uncertainty is half of the smallest division on the ruler. Even if you are recording a digital readout, the uncertainty is half of the smallest place given. This type of error should always be recorded for any measurement

  14. Standard measures of error • Absolute deviation • is the absolute value of the difference between an experimentally determined value and the accepted value • Relative deviation • is a more meaningful value than the absolute deviation because it accounts for the relative size of the error. The relative percentage deviation is given by the absolute deviation divided by the accepted value and multiplied by 100% • Standard deviation • standard definition
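
In symbols, with x the measured value and x_acc the accepted value, the two deviations above are:

    \text{absolute deviation} = |x - x_{\text{acc}}|, \qquad
    \text{relative deviation} = \frac{|x - x_{\text{acc}}|}{x_{\text{acc}}} \times 100\%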

  15. Standard deviation • The average value is found by summing the determinations and dividing by their number. Then the residuals are found by taking the difference between each determination and the average value. Third, square the residuals and sum them. Last, divide the result by the number of determinations - 1 and take the square root.
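
These steps give the usual sample standard deviation,

    s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}

and a minimal sketch in plain Python (illustrative only, not the course's tooling) follows the slide's steps directly:

    from math import sqrt

    def sample_std(values):
        n = len(values)
        mean = sum(values) / n                     # step 1: average value
        residuals = [x - mean for x in values]     # step 2: residuals
        ss = sum(r * r for r in residuals)         # step 3: square and sum
        return sqrt(ss / (n - 1))                  # step 4: divide by N-1, take square root

    print(sample_std([2.1, 2.4, 2.2, 2.3]))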

  16. Propagating errors • This is an unfortunate term – it means making sure that the result of the analysis carries with it a calculation (rather than an estimate) of the error • E.g. if C=A+B (your analysis), then δC=δA+δB • E.g. if C=A-B (your analysis), then δC=δA+δB! • Exercise – it’s not as simple for other calcs. • When the function is not merely addition, subtraction, multiplication, or division, the error propagation must be defined by the total derivative of the function.
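
For a general result C = f(A, B), the total-derivative rule the slide refers to adds the contributions in magnitude:

    \delta C = \left|\frac{\partial f}{\partial A}\right| \delta A
             + \left|\frac{\partial f}{\partial B}\right| \delta B

For C = A + B or C = A - B both partial derivatives have magnitude 1, recovering δC = δA + δB; for products and quotients the same rule makes the relative errors add.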

  17. Types of analysis • Preliminary • Detailed • Summary • Reporting the results and propagating uncertainty • Qualitative v. quantitative, e.g. see http://hsc.uwe.ac.uk/dataanalysis/index.asp

  18. What is preliminary analysis? • Self-explanatory…? • Down sampling…? • The more measurements that can be made of a quantity, the better the result • Reproducibility is an axiom of science • When time is involved, e.g. a signal – the ‘sampling theorem’ – having an idea of the hypothesis is useful, e.g. periodic versus aperiodic or other… • http://en.wikipedia.org/wiki/Nyquist–Shannon_sampling_theorem
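
For reference, the sampling theorem's condition can be stated compactly: a signal with no frequency content above f_max is fully determined by samples taken at a rate

    f_s \ge 2 f_{\max}

which is why having a hypothesis about the expected periodicity helps judge whether the available sampling is adequate.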

  19. Detailed analysis • The most important distinction between the initial and the main analysis is that during initial data analysis one refrains from any analysis aimed at answering the original research question • Basic statistics of important variables • Scatter plots • Correlations • Cross-tabulations • Dealing with quality, bias, uncertainty, accuracy, precision limitations - assessing • Dealing with under- or over-sampling • Filtering, cleaning
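
A minimal sketch of those first-look steps, assuming pandas and matplotlib are available (they are not among the course's listed tools) and using a hypothetical file and column names:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("measurements.csv")             # hypothetical input file

    print(df.describe())                             # basic statistics of important variables
    print(df[["aot", "chlorophyll"]].corr())         # pairwise correlations

    df.plot.scatter(x="aot", y="chlorophyll")        # scatter plot of two columns
    plt.show()

    # cross-tabulation of two categorical columns
    print(pd.crosstab(df["region"], df["season"]))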

  20. Summary analysis • Collecting the results and accompanying documentation • Repeating the analysis (yes, it’s obvious) • Repeating with a subset • Assessing significance, e.g. the confusion matrix we used in the supervised classification example for data mining, p-values (null hypothesis probability)
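
As a hedged illustration of the significance-assessment step, a tiny confusion matrix built with the standard library; the labels are made up and this is not the Weka output used in the data mining example.

    from collections import Counter

    true_labels = ["cloud", "cloud", "clear", "clear", "cloud", "clear"]
    predicted   = ["cloud", "clear", "clear", "clear", "cloud", "cloud"]

    counts = Counter(zip(true_labels, predicted))
    labels = sorted(set(true_labels))

    print("true\\pred", *labels)
    for t in labels:
        print(t, *[counts[(t, p)] for p in labels])

    accuracy = sum(counts[(l, l)] for l in labels) / len(true_labels)
    print("accuracy:", accuracy)                 # 4 of 6 correct here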

  21. Reporting results/ uncertainty • Consider the number of significant digits in the result, which is indicative of the certainty of the result • The number of significant digits depends on the measuring equipment you use and the precision of the measuring process - do not report digits beyond what was recorded • The number of significant digits in a value implies the precision of that value

  22. Reporting results… • In calculations, it is important to keep enough digits to avoid round off error. • In general, keep at least one more digit than is significant in calculations to avoid round off error • It is not necessary to round every intermediate result in a series of calculations, but it is very important to round your final result to the correct number of significant digits.  

  23. Uncertainty • Results are usually reported as result ± uncertainty (or error) • The uncertainty is given to one significant digit, and the result is rounded to that place • For example, a result might be reported as 12.7 ± 0.4 m/s². A more precise result would be reported as 12.745 ± 0.004 m/s². A result should not be reported as 12.70361 ± 0.2 m/s² • Units are very important to any result
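
A minimal sketch of that reporting rule (an illustrative helper, not a standard function): round the uncertainty to one significant digit, then round the result to the same decimal place.

    import math

    def report(value, uncertainty):
        # decimal place of the uncertainty's single significant digit
        place = -int(math.floor(math.log10(abs(uncertainty))))
        digits = max(place, 0)
        return f"{round(value, place):.{digits}f} ± {round(uncertainty, place):.{digits}f}"

    print(report(12.70361, 0.2))      # 12.7 ± 0.2
    print(report(12.745, 0.004))      # 12.745 ± 0.004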

  24. Secondary analysis • Depending on where you are in the data analysis pipeline (i.e. do you know?) • Having a clear enough awareness of what has been done to the data (either by you or others) prior to the next analysis step is very important – it is very similar to sampling bias • Read the metadata (or create it) and documentation

  25. Tools • 4GL • Matlab • IDL • Ferret • NCL • Many others • Statistics • SPSS • Gnu R • Excel • What have you used?

  26. Considerations for viz. as analysis • What is the improvement in the understanding of the data as compared to the situation without visualization? • Which visualization techniques are suitable for one's data? • E.g. Are direct volume rendering techniques to be preferred over surface rendering techniques?

  27. Why visualization? • Reducing amount of data, quantization • Patterns • Features • Events • Trends • Irregularities • Leading to presentation of data, i.e. information products • Exit points for analysis

  28. Types of visualization • Color coding (including false color) • Classification of techniques is based on • Dimensionality • Information being sought, i.e. purpose • Line plots • Contours • Surface rendering techniques • Volume rendering techniques • Animation techniques • Non-realistic, including ‘cartoon/ artist’ style

  29. Compression (any format) • Lossless compression methods are methods for which the original, uncompressed data can be recovered exactly. Examples of this category are run-length encoding and the Lempel-Ziv-Welch algorithm. • Lossy methods - in contrast to lossless compression, the original data cannot be recovered exactly after a lossy compression of the data. An example of this category is the Color Cell Compression method. • Lossy compression techniques can reach reduction rates of 0.9, whereas lossless compression techniques normally have a maximum reduction rate of 0.5.
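
A minimal sketch of the run-length encoding idea named above (lossless: decoding recovers the input exactly); real image codecs use more elaborate variants.

    def rle_encode(data):
        """Collapse runs of equal symbols into [symbol, count] pairs."""
        runs = []
        for symbol in data:
            if runs and runs[-1][0] == symbol:
                runs[-1][1] += 1
            else:
                runs.append([symbol, 1])
        return runs

    def rle_decode(runs):
        return "".join(symbol * count for symbol, count in runs)

    encoded = rle_encode("AAAABBBCCD")
    print(encoded)                      # [['A', 4], ['B', 3], ['C', 2], ['D', 1]]
    print(rle_decode(encoded))          # AAAABBBCCD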

  30. Remember - metadata • Many of these formats already contain metadata or fields for metadata, use them!

  31. Tools • Conversion • Imtools • GraphicConverter • Gnu convert • Many more • Combination/Visualization • IDV • Matlab • Gnuplot • http://disc.sci.gsfc.nasa.gov/giovanni

  32. New modes • http://www.actoncopenhagen.decc.gov.uk/content/en/embeds/flash/4-degrees-large-map-final • http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/ • Many modes: • http://www.siggraph.org/education/materials/HyperVis/domik/folien.html

  33. Periodic table

  34. Publications, web sites • www.jove.com - Journal of Visualized Experiments • www.visualizing.org - • logd.tw.rpi.edu -

  35. Managing visualization products • The importance of a ‘self-describing’ product • Visualization products are not just consumed by people • How many images, graphics files do you have on your computer for which the origin, purpose, use is still known? • How are these logically organized?

  36. (Class 2) Management • Creation of logical collections • Physical data handling • Interoperability support • Security support • Data ownership • Metadata collection, management and access. • Persistence • Knowledge and information discovery • Data dissemination and publication

  37. Use, citation, attribution • Think about and implement a way for others (including you) to easily use, cite, attribute any analysis or visualization you develop • This must include suitable connections to the underlying (aka backbone) data – and note this may not just be the full data set! • Naming, logical organization, etc. are key • Make them a resource, e.g. URI/ URL

  38. Producibility/ reproducibility • The documentation around procedures used in the analysis and visualization is very often neglected – DO NOT make this mistake • Treat this just like a data collection (or generation) exercise • Follow your management plan • Despite the lack of (or minimal) metadata/ metainformation standards, capture and record it • Get someone else to verify that it works

  39. Data Mining – What it is • Extracting knowledge from large amounts of data • Motivation • Our ability to collect data has expanded rapidly • It is impossible to analyze all of the data manually • Data contains valuable information that can aid in decision making • Uses techniques from: • Pattern Recognition • Machine Learning • Statistics • High Performance Database Systems • OLAP • Plus techniques unique to data mining (Association rules) • Data mining methods must be efficient and scalable

  40. Data Mining – What it isn’t • Small Scale • Data mining methods are designed for large data sets • Scale is one of the characteristics that distinguishes data mining applications from traditional machine learning applications • Foolproof • Data mining techniques will discover patterns in any data • The patterns discovered may be meaningless • It is up to the user to determine how to interpret the results • “Make it foolproof and they’ll just invent a better fool” • Magic • Data mining techniques cannot generate information that is not present in the data • They can only find the patterns that are already there

  41. Data Mining – Types of Mining • Classification (Supervised Learning) • Classifiers are created using labeled training samples • Training samples created by ground truth / experts • Classifier later used to classify unknown samples • Clustering (Unsupervised Learning) • Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes • Discover overall distribution patterns and relationships between attributes • Association Rule Mining • Initially developed for market basket analysis • Goal is to discover relationships between attributes • Uses include decision support, classification and clustering • Other Types of Mining • Outlier Analysis • Concept / Class Description • Time Series Analysis
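
As a hedged, generic illustration of the first two types (the course's own toolkit examples, in Weka, come next week), a sketch assuming scikit-learn and NumPy are installed; the data are synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)            # labels from "ground truth"

    # Supervised: train a classifier on labeled samples, classify unknown samples
    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[4.8, 5.1], [0.2, -0.3]]))

    # Unsupervised: group the same points without using the labels
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_[:5], km.labels_[-5:])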

  42. Data Mining in the ‘new’ Distributed Data/Services Paradigm

  43. Science Motivation • Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production • Plankton plays an important role in the carbon cycle • Plankton growth is strongly influenced by nutrient availability (Fe/Ph) • Dust deposition is an important source of Fe over the ocean • Satellite data is an effective tool for monitoring the effects of dust fertilization

  44. Hypothesis • In remote ocean locations there is a positive correlation between the area averaged atmospheric aerosol loading and oceanic chlorophyll concentration • There is a time lag between oceanic dust deposition and the photosynthetic activity
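
One way to test both parts of this hypothesis is a lagged cross-correlation between area-averaged AOT and chlorophyll time series; a minimal sketch with synthetic monthly series (the real analysis uses the SeaWiFS/MODIS data described below):

    import numpy as np

    rng = np.random.default_rng(1)
    months = 84                                        # e.g. 2000-2006, monthly means
    aot = rng.random(months)
    chl = np.roll(aot, 2) + 0.3 * rng.random(months)   # chlorophyll lags AOT by ~2 steps

    def lagged_corr(x, y, lag):
        """Pearson correlation of x[t] against y[t + lag]."""
        if lag > 0:
            x, y = x[:-lag], y[lag:]
        return np.corrcoef(x, y)[0, 1]

    for lag in range(5):
        print(lag, round(lagged_corr(aot, chl, lag), 2))   # peaks near lag = 2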

  45. Primary sources of ocean nutrients (figure): ocean upwelling, wind-blown dust from the Sahara, and sediments from rivers

  46. Factors modulating the dust-ocean photosynthetic effect (figure): clouds, SST, chlorophyll, dust, nutrients, Sahara

  47. Objectives • Use satellite data to determine if atmospheric dust loading and phytoplankton photosynthetic activity are correlated • Determine the physical processes responsible for the observed relationship

  48. Data and Method • Data sets obtained from SeaWiFS and MODIS during 2000–2006 are employed • MODIS-derived AOT • (Figure labels: SeaWiFS, MODIS, AOT)

  49. The areas of study (Figure: annual SeaWiFS chlorophyll image for 2001, with eight numbered regions) • 1 - Tropical North Atlantic Ocean • 2 - West coast of Central Africa • 3 - Patagonia • 4 - South Atlantic Ocean • 5 - South Coast of Australia • 6 - Middle East • 7 - Coast of China • 8 - Arctic Ocean

  50. Tropical North Atlantic Ocean – dust from the Sahara Desert (Figure: chlorophyll and AOT panels annotated with correlation values ranging from about -0.09 to -0.86)
