Data Integration Techniques for Improved Analysis and Understanding

Patients.txt

Variable Name   Description                Type        Valid Values
PATNO           Patient Number             Character   Numerals
GENDER          Gender                     Character   'M' or 'F'
VISIT           Visit Date                 MMDDYY10    Any valid date
HR              Heart Rate                 Numeric     40 to 100
SBP             Systolic Blood Pressure    Numeric     80 to 200
DBP             Diastolic Blood Pressure   Numeric     60 to 120
DX              Diagnosis Code             Character   1 to 3 digits
AE              Adverse Event              Character   '0' or '1'

Distribution

Some invalid values

HR - Heart Rate (between 40 and 100) • SBP - Systolic Blood Pressure (between 80 and 200) • DBP - Diastolic Blood Pressure (between 60 and 120)

DBP - Diastolic Blood Pressure (between 60 and 120)

SBP - Systolic Blood Pressure (between 80 and 200)

HR - Heart Rate (between 40 and 100)
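
The range checks for HR, SBP and DBP could be sketched as follows. This is a minimal illustration, not the procedure used in the slides: the function name `find_invalid` and the sample records are hypothetical, and treating a missing value as invalid is an assumption.

```python
# Valid ranges taken from the Patients.txt variable list.
VALID_RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}


def find_invalid(records):
    """Return (PATNO, variable, value) for each out-of-range or missing value."""
    problems = []
    for rec in records:
        for var, (low, high) in VALID_RANGES.items():
            value = rec.get(var)
            # Assumption: a missing value is reported as invalid too.
            if value is None or not (low <= value <= high):
                problems.append((rec["PATNO"], var, value))
    return problems


# Hypothetical records, invented for illustration only.
sample = [
    {"PATNO": "001", "HR": 88, "SBP": 140, "DBP": 80},
    {"PATNO": "002", "HR": 210, "SBP": 120, "DBP": 20},
]

for patno, var, value in find_invalid(sample):
    print(f"Patient {patno}: {var} = {value} is outside the valid range")
```

Keeping the ranges in one dictionary means a new numeric variable can be checked by adding a single entry.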

Data integration • Combining/merging data from heterogeneous data sources. • It is the process of combining data residing at different sources (internal and external data sources) and • providing the user with a unified view of these data.

SCHEMA INTEGRATION • Different sources may use different representations or definitions of a schema while referring to, or representing, the same information. • This is known as the entity identification problem.

For example • How can we identify that customer_id in one data set and customer_no in another refer to the same entity?

Schema matching • Currently, most schema matching is done manually, which is • tedious, • time-consuming, • error-prone.

We need automated support for schema matching to make it • faster, • error-free, and • less labor-intensive.
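
One possible form of automated support is sketched below. It is an illustrative heuristic only, not a method described in the slides: it scores each pair of columns by combining name similarity with the overlap of their values. The function names, the equal weighting, the 0.6 threshold and the sample tables are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two column names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values in two columns."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def match_columns(table_a, table_b, threshold=0.6):
    """Return candidate column pairs whose combined score reaches the threshold."""
    matches = []
    for col_a, vals_a in table_a.items():
        for col_b, vals_b in table_b.items():
            # Assumption: names and values are weighted equally.
            score = 0.5 * name_similarity(col_a, col_b) + 0.5 * value_overlap(vals_a, vals_b)
            if score >= threshold:
                matches.append((col_a, col_b, round(score, 2)))
    return matches


# Hypothetical data sets, invented for illustration only.
orders = {"customer_id": [101, 102, 103], "amount": [20, 35, 50]}
billing = {"customer_no": [101, 102, 104], "city": ["Oslo", "Lima", "Pune"]}

print(match_columns(orders, billing))
```

With this sample data only the `customer_id` / `customer_no` pair passes the threshold. A candidate produced this way would still need to be confirmed by a person.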

A mapping between Global Schema and Local Schema

The architecture for data integration

Correlation Analysis • Redundancy among attributes • can be detected by applying correlation analysis.

Correlation Analysis • Given two attributes (X1, X2); • Measure the correlation of one attribute (X1) to another attribute (X2).

Correlation Analysis

Correlation Analysis • Table 2 is generated using the following criteria: • i) Number of bytes in the attribute: if the total number of bytes is less than or equal to 8 bytes, we record it as 1; otherwise it is 0. • ii) Access frequency of an attribute: we sum the access frequencies of the attribute, which gives (6 + 1 + 2) = 9. The average access frequency is 9 / 3 = 3. Any value less than the average access frequency is converted to 0; otherwise it is 1.
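
The two conversion criteria can be sketched as follows. The frequencies 6, 1 and 2 come from the slide; the byte lengths and the function names are hypothetical, chosen only to show both outcomes.

```python
def binarize_length(num_bytes, limit=8):
    """Criterion i): 1 if the attribute occupies at most `limit` bytes, else 0."""
    return 1 if num_bytes <= limit else 0


def binarize_frequencies(frequencies):
    """Criterion ii): 0 if a frequency is below the average, else 1."""
    average = sum(frequencies) / len(frequencies)
    return [0 if f < average else 1 for f in frequencies]


frequencies = [6, 1, 2]   # from the slide: sum 9, average 3
lengths = [4, 8, 20]      # hypothetical byte lengths

print(binarize_frequencies(frequencies))      # [1, 0, 0]
print([binarize_length(n) for n in lengths])  # [1, 1, 0]
```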

Correlation Analysis • We apply correlation analysis to find out which pairs of attributes are redundant.

Correlation Analysis • If the resulting value is greater than 0, then X2 and X3 are positively correlated. • The higher the value (approaching 1), the more each attribute implies the other. • Therefore, either X2 or X3 may be removed, since one of them is redundant.
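
A minimal sketch of the correlation check follows. The slides do not state which coefficient is used, so the Pearson coefficient is an assumption here, and the 0/1 values for X2 and X3 are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 0/1 encodings of two attributes, invented for illustration.
x2 = [1, 0, 1, 1, 0]
x3 = [1, 0, 1, 0, 0]

r = pearson(x2, x3)
print(f"r = {r:.3f}")
if r > 0:
    print("X2 and X3 are positively correlated")
```

How close to 1 the value must be before an attribute is dropped is a judgment call that the slides leave open.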

Clustering • To explain how we apply a clustering algorithm to generate clusters, • we assume that a relation has 10 attributes involved in query processing. • Furthermore, we assume that one disk page can hold fewer than 100 bytes.

Clustering • Table 6.1 shows the length of each attribute. • We use a frequent access table to keep track of the number of times users access each attribute in a particular relation, as shown in Table 6.1. • When users access the relation, the frequent access table is updated. The frequent access table also shows the length of each attribute.

Clustering

Clustering • From Table 6.1, we would like to convert the numeric figures into Y or N conditions based on some criteria. • We propose the following conversion scheme: • For the number of bytes in an attribute, if the total number of bytes is less than one fetch of the instruction cycle, say 100 bytes, we record it as Y; otherwise it is N. • For the access frequency of an attribute, we sum the access frequencies, which gives (7 + 2 + 4 + 3 + 2 + 8 + 5 + 4 + 9 + 3) = 47. • The average access frequency is 47 / 10 = 4.7. • Any value less than the average access frequency is converted to N; otherwise it is Y.
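
The conversion scheme can be sketched as follows. The ten frequencies are the ones summed on the slide; the attribute lengths are hypothetical, because Table 6.1 is not reproduced in this transcript.

```python
PAGE_BYTES = 100  # one fetch, as assumed on the slide

# Access frequencies from the slide (sum 47, average 4.7).
frequencies = [7, 2, 4, 3, 2, 8, 5, 4, 9, 3]
# Hypothetical attribute lengths in bytes, invented for illustration.
lengths = [10, 120, 30, 8, 250, 40, 16, 100, 4, 60]

average = sum(frequencies) / len(frequencies)

# Frequency rule: below the average becomes N, otherwise Y.
freq_flags = ["N" if f < average else "Y" for f in frequencies]
# Length rule: fewer than PAGE_BYTES bytes becomes Y, otherwise N.
length_flags = ["Y" if n < PAGE_BYTES else "N" for n in lengths]

print(average)       # 4.7
print(freq_flags)    # ['Y', 'N', 'N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N']
print(length_flags)  # ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y']
```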

Clustering

DATA TRANSFORMATION • In metadata, a data transformation converts data from a source data format into a destination data format.
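
As a small illustration, the VISIT date from Patients.txt could be transformed from its source format (MMDDYY10, i.e. mm/dd/yyyy) into another format. The choice of ISO 8601 as the destination format, the function name and the sample value are assumptions, not taken from the slides.

```python
from datetime import datetime


def visit_to_iso(visit):
    """Convert a date string from mm/dd/yyyy to ISO 8601 (yyyy-mm-dd)."""
    # strptime raises ValueError if the input is not a valid date.
    return datetime.strptime(visit, "%m/%d/%Y").date().isoformat()


# Hypothetical visit date, invented for illustration.
print(visit_to_iso("10/21/1998"))  # 1998-10-21
```

Because `strptime` rejects impossible dates, the same step also enforces the "any valid date" rule for VISIT.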
