1 / 16

Data Visualization & Exploration

Data Visualization & Exploration

dwikatmo
Télécharger la présentation

Data Visualization & Exploration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. بيقنتلاو بيقنتلاو يف يف WA AR RE EH HO OU US SI IN NG G & &D DA AT TA A M MI IN NI IN NG G حاجنلا حاجنلا ةينطولا ةينطولا - - نيطسلف نيطسلف A AN N N NA AJ JA AH H N NA AT TI IO ON NA AL L U UN NI IV VE ER RS SI IT TY Y M MA AN NA AG GE EM ME EN NT T I IN NF FO OR RM MA AT TI IO ON N S SY YS ST TE EM MS S D DE EP PA AR RT TM ME EN NT T P PR RE EP PA AR RE ED D B BY Y: : M MO OH HA AM MM ME ED D A AB BD DE EL L K KH HA AL LE EQ Q D DW 1 18 8/ /0 09 9/ /2 20 01 17 7 تانايبلا تانايبلا D DA AT TA A W تانايبلا تانايبلا تاعدوتسم تاعدوتسم ةعماج ةعماج WI IK KA AT T D DA AT TA A V VI IS SU UA AL LI IZ ZA AT TI IO ON N T TE EC CH HN NI IQ QU UE ES S F FR RE EQ QU UE EN NC CI IE ES S T TA AB BL LE ES S, ,C CR RO OS SS S T TA AB BU UL LA AT TI IO ON N S SU UM MM MA AR RI IE ES S F FO OR R C CE EN NT TR RA AL L T TE EN ND DE EN NC CY Y A AN ND D D DI IS SP PE ER RS SI IO ON N P PI IE E, ,B BA AR R, ,H HI IS ST TO OG GR RA AM M C CH HA AR RT TS S L LI IN NE ES S, , S SC CA AT TT TE ER R, , B BO OX XP PL LO OT TS S AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:1 OF 16

  2. V VI IS SU UA AL LI IZ ZA AT TI IO ON N Visualization is defined in the dictionary as "a mental image". It Is to display of information in graphical or tabular format, it helps people to quickly absorb large amounts of visual information and find patterns in it. Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns Example: Sea Surface Temperature  The following shows the Sea Surface Temperature (SST) for July 1982 – Tens of thousands of data points are summarized in a single figure AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:2 OF 16

  3. R RE EP PR RE ES SE EN NT TA AT TI IO ON N Is the mapping of information to a visual format Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors. Example:  Objects are often represented as points  Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape  If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, is easily perceived. A AR RR RA AN NG GE EM ME EN NT T  Is the placement of visual elements within a display  Can make a large difference in how easy it is to understand the data Example: S SE EL LE EC CT TI IO ON N Is the elimination or the de-emphasis of certain objects and attributes, it may involve the choosing a subset of attributes – Dimensionality reduction is often used to reduce the number of dimensions to two or three – Alternatively, pairs of attributes can be considered Selection may also involve choosing a subset of objects – A region of the screen can only show so many points – Can sample, but want to preserve points in sparse areas AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:3 OF 16

  4. V VI IS SU UA AL LI IZ ZA AT TI IO ON N T TE EC CH HN NI IQ QU UE ES S Visualization techniques can be classified in a number of ways. They can be classified as to whether their focus is information (geometric) or scientific (symbolic- physical data) , whether the stimulus is 2D, 3D, or n-D, or whether the displayis static or dynamic, and whether the objective is exploratory, confirmatory or manipulative. T TH HR RE EE E T TY YP PE ES S O OF F V VI IS SU UA AL LI IZ ZA AT TI IO ON N A AC CC CO OR RD DI IN NG G T TO O O OB BJ JE EC CT TI IV VE ES S A. E EX XP PL LO OR RA AT TO OR RY Y V VI IS SU UA AL LI IZ ZA AT TI IO ON NS S the user does not necessarily know what s/he is looking for B B. .C CO ON NF FI IR RM MA AT TO OR RY Y V VI IS SU UA AL LI IZ ZA AT TI IO ON NS S the user has a hypothesis that needs only to be tested C C. .M MA AN NI IP PU UL LA AT TI IV VE E ( (P PR RO OD DU UC CT TI IO ON N) ) V VI IS SU UA AL LI IZ ZA AT TI IO ON NS S the user has a validated hypothesis and so knows exactly what is to be presented. Therefore, he focuses on refining the visualization to optimize the presentation Visualization techniques are divided roughly into two classes according to focus A A. .S SC CI IE EN NT TI IF FI IC C V VI IS SU UA AL LI IZ ZA AT TI IO ON N focuses primarily on physical data such as the human body, the earth, molecules, and so on. Scientific visualization also deals with multidimensional data, but most of the data sets used in this field use the spatial attributes of the data for visualization purposes; e.g., Computer-Aided Tomography(CAT) and Computer-Aided Design(CAD). Also, many of the Geographical Information Systems (GIS) use either the Cartesian coordinate system or some modified geographical coordinates to achieve a reasonable visualization of the data. B B. .I IN NF FO OR RM MA AT TI IO ON N V VI IS SU UA AL LI IZ ZA AT TI IO ON N Focuses on abstract, nonphysical data such as text, hierarchies, and statistical data. Data-mining techniques are primarily oriented toward information visualization. The challenge for nonphysical data is in designing a visual representation of multidimensional samples (where the number of dimensions is greater than three). Multidimensional-information visualizations present data that is not primarily plenary or spatial. One-, two-, and three-dimensional, but also temporal information- visualization schemes can be viewed as a subset of multidimensional information visualization. One approach is to map the nonphysical data to a virtual object such as a cone tree, which can be manipulated as if it were a physical object. Another AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:4 OF 16

  5. approach is to map the nonphysical data to the graphical properties of points, lines, and areas. A star display for data on seven quality-of-life measures for three states AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:5 OF 16

  6. I IN NF FO OR RM MA AT TI IO ON N V VI IS SU UA AL LI IZ ZA AT TI IO ON N For discrete data values most suitable is frequencies tables (counts) and percentages 1 1. .T TA AB BU UL LA AR R F FO OR RM MA AT T Frequencies, crosstab Example GenderFrequencies Frequency Percent Valid Percent Cumulative Percent Male 100 50.0 50.0 50.0 Valid Female 100 50.0 50.0 100.0 100.0 Total 200 100.0 Gender * Result Crosstabulation Result Total Pass Fail Count 88 12 100 % within Gender 88.0% 12.0% 100.0% Male % within Result 48.6% 63.2% 50.0% % of Total 44.0% 6.0% 50.0% Gender Count 93 7 100 % within Gender 93.0% 7.0% 100.0% Female % within Result 51.4% 36.8% 50.0% % of Total 46.5% 3.5% 50.0% Count 181 19 200 % within Gender 90.5% 9.5% 100.0% Total % within Result 100.0% 100.0% 100.0% % of Total 90.5% 9.5% 100.0% AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:6 OF 16

  7. For continuous data values ( and even discrete scale data) most suitable is A A. .C CE EN NT TR RA AL L T TE EN ND DE EN NC CY Y M ME EA AS SU UR RE ES S Measures as Arithmetic mean, Geometric Mean, Harmonic Mean, Mode, Median, Max Value, Min Value B B. .D DI IS SP PE ER RS SI IO ON N Measures as Variance, Standard Deviation, range (Max –Min)… 2 2. .C CH HA AR RT TS S For discrete data values, most suitable charts are pie charts, bar charts and Histogram could be used either AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:7 OF 16

  8. H HI IS ST TO OG GR RA AM MS S: : What is histogram: it is a visual representation that Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data –Usually shows the distribution of values of a single variable –Divide the values into bins and show a bar plot of the number of objects in each bin. –The height of each bar indicates the number of objects –Shape of histogram depends on the number of bins Example: Petal Width (10 and 20 bins, respectively) T TW WO O- -D DI IM ME EN NS SI IO ON NA AL L H HI IS ST TO OG GR RA AM MS S  Show the joint distribution of the values of two attributes  Example: petal width and petal length – What does this tell us? AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:8 OF 16

  9. B BO OX X P PL LO OT TS S – Invented by J. Tukey – Another way of displaying the distribution of data – Following figure shows the basic part of a box plot  Box plots can be used to compare attributes AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:9 OF 16

  10. For boxplots: dark line in the middle represents the median, button represents 25th percentile, top represents 75th percentile (25% Below box, 25% above it) This means that 50% of the case/rows lie within the box Short box: means values less variant Tall box: means values vary a lot T-bars that extend from the boxes are called inner fences: If the data are distributed normally, approximately 95% or the data are expected to lie between the inner fences The points with O are outliers. These are defined as values that do not fall in the inner fences. extreme values. The asterisks or stars * are extreme outliers. These represent cases/rows that have values more than three times the height of the boxes. AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:10 OF 16

  11. For continuous data values, most suitable graphs are lines, areas, scatter, boxplots, etc.. line graph for Y = X2 line Box Plots for Age attribute Boxplot Scatter plots Scatter plots Iteration 5 3 2.5 2 1.5 y 1 0.5 0 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 x S SC CA AT TT TE ER R P PL LO OT TS S  Attributes values determine the position  Two-dimensional scatter plots most common, but can have three-dimensional scatter plots  Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects  It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes  See next example AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:11 OF 16

  12. S SC CA AT TT TE ER R P PL LO OT T A AR RR RA AY Y O OF F I IR RI IS S A AT TT TR RI IB BU UT TE ES S C CO ON NT TO OU UR R P PL LO OT TS S  Useful when a continuous attribute is measured on a spatial grid  They partition the plane into regions of similar values  The contour lines that form the boundaries of these regions connect points with equal values  The most common example is contour maps of elevation  Can also display temperature, rainfall, air pressure, etc.  An example for Sea Surface Temperature (SST) is provided C CO ON NT TO OU UR R P PL LO OT T E EX XA AM MP PL LE E: :S SS ST TD DE EC C, ,1 19 99 98 8 AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:12 OF 16

  13. M MA AT TR RI IX X P PL LO OT TS S  Can plot the data matrix  This can be useful when objects are sorted according to class  Typically, the attributes are normalized to prevent one attribute from dominating the plot  Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects  Examples of matrix plots P PA AR RA AL LL LE EL L C CO OO OR RD DI IN NA AT TE ES S  Used to plot the attribute values of high-dimensional data  Instead of using perpendicular axes, use a set of parallel axes  The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line  Thus, each object is represented as a line  Often, the lines representing a distinct class of objects group together, at least for some attributes  Ordering of attributes is important in seeing such groupings AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:13 OF 16

  14. S ST TA AR R P PL LO OT TS S  Similar approach to parallel coordinates, but axes radiate from a central point  The line connecting the values of an object is a polygon AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:14 OF 16

  15. G GO OO OD D A AN ND D B BA AD D V VI IS SU UA AL LI IZ ZA AT TI IO ON N B BA AD D V VI IS SU UA AL LI IZ ZA AT TI IO ON N: :Spreadsheet with misleading Y –axis Year 1999 2000 Sales 2,11 0 5 B BA AD D V VI IS SU UA AL LI IZ ZA AT TI IO ON N Y-Axis scale gives WRONG impression of big change 2001 2,12 2002 2,12 2003 2,12 2,10 0 1 4 Sales 2130 2125 2120 2115 Sales 2110 2105 2100 2095 1999 2000 2001 2002 2003 B BE ET TT TE ER R V VI IS SU UA AL LI IZ ZA AT TI IO ON N Axis from 0 to 2000 scale gives correct impression of small change Sales 3000 2500 2000 1500 Sales 1000 500 0 1999 2000 2001 2002 2003 AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:15 OF 16

  16. L LI IE E F FA AC CT TO OR R Tufte requirement: 0.95<Lie Factor<1.05 T TU UF FT TE E’ ’S S P PR RI IN NC CI IP PL LE ES S O OF F G GR RA AP PH HI IC CA AL L E EX XC CE EL LL LE EN NC CE E : : S SH HO OU UL LD D T TE EL LL L T TH HE E T TR RU UT TH H A AB BO OU UT T T TH HE E D DA AT TA A! ! B BY Y G GI IV VI IN NG G T TH HE E V VI IE EW The greatest number of ideas In the shortest time With the least ink in the smallest space. B BI IB BL LI IO OG GR RA AP PH HY Y (LAROSE, 2005) Works Cited David, O. L., & Delen, D. (2008). Advanced Data Mining Techniques. Springer-Verlag Berlin Heidelberg. http://changingminds.org/. (2011, 01 01). Retrieved 01 01, 2011, from http://changingminds.org/explanations/research/measurement/types_data.htm#nom Kantardzic, M. (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. LAROSE, D. T. (2005). DISCOVERING KNOWLEDGE IN DATA An Introduction to Data Mining. New Jersey.: A JOHN WILEY & SONS. size of effect shown in graphic   Lie Factor size of effect in data 6 . 0  3 . 5 ( ) . 7 833 6 . 0     14 8 . ( 27 5 . 18 0 . ) . 0 528 18 WE ER R - - AUTHOR:MOHAMMED ABDUL KHALIQ DWIKATEMAIL:dwikatmo@gmail.com TOPIC:DATA MINING /DATA VISUALIZATION DATE:14/01/2019PAGE:16 OF 16

More Related