Semantic Integration in Heterogeneous Databases Using Neural Networks


Presentation Transcript


  1. Semantic Integration in Heterogeneous Databases Using Neural Networks
     Wen-Syan Li, Chris Clifton
     Presentation by Jeff Roth

  2. Introduction
     • Basic schema matching problem
     • GTE’s data integration project included 27,000 data elements
     • Matching them by hand took about 4 hours per data element, or 25 full-time employees 2 years to complete
     • This method takes about 0.1 seconds per data element, roughly 144,000 times faster
     • The method learns how to match rather than being told: “how to match knowledge is discovered”
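
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the derived hours per employee per year is a consistency check, not a figure from the presentation.

```python
# Figures quoted on the slide.
DATA_ELEMENTS = 27_000
MANUAL_HOURS_PER_ELEMENT = 4
AUTOMATED_SECONDS_PER_ELEMENT = 0.1
EMPLOYEES = 25
YEARS = 2

# Speedup: manual time per element divided by automated time per element.
manual_seconds_per_element = MANUAL_HOURS_PER_ELEMENT * 3600
speedup = manual_seconds_per_element / AUTOMATED_SECONDS_PER_ELEMENT

# Total manual effort, and what it implies per employee per year.
total_manual_hours = DATA_ELEMENTS * MANUAL_HOURS_PER_ELEMENT
hours_per_employee_year = total_manual_hours / (EMPLOYEES * YEARS)

print(round(speedup))            # speedup factor
print(total_manual_hours)        # total hours of manual matching
print(hours_per_employee_year)   # implied workload per employee per year
```

The implied workload of about 2,160 hours per employee per year is close to a full-time schedule, so the two ways of stating the manual cost agree with each other.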

  3. Method Outline
     “The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”

  4. Automated semantic integration methods
     • Attribute name comparison: not used in this paper
     • Attribute values and domains comparison: usually expressed as Equal, Contains, Overlap, Contained-in, and Disjoint. Attribute values are used in this paper, but not with these measures
     • Field specifications: data type, field length, constraints, and others. Also used in this method

  5. Field Specifications
     The following measures are used:
     • Data types: each possible data type has a network input; the input for the field’s data type is 1 and all the others are 0
     • Field length: Length = 2 * (1/(1 + k^(-length)) - 0.5)
     • Format specifications: encoded similarly to data type
     • Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.): encoded similarly to data type
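
As a rough illustration of this encoding, the sketch below builds a network input vector from a data type and a field length. It reads the slide's length formula as 2 * (1/(1 + k^(-length)) - 0.5), a sigmoid-style function that maps lengths of 0 and up into the range 0 to 1. The type list `DATA_TYPES`, the function names, and the default `k = 1.01` are assumptions made for the example, not values given in the presentation.

```python
# Hypothetical list of data types; a real system would use the DBMS's own types.
DATA_TYPES = ["char", "integer", "float", "date"]


def one_hot(value, vocabulary):
    """Return one input per vocabulary entry: 1.0 for the match, 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def normalize_length(length, k=1.01):
    """Squash a non-negative field length into [0, 1).

    k is an assumed constant slightly above 1; larger k saturates sooner.
    """
    return 2.0 * (1.0 / (1.0 + k ** (-length)) - 0.5)


def encode_field(data_type, length):
    """Concatenate the one-hot data type with the normalized length."""
    return one_hot(data_type, DATA_TYPES) + [normalize_length(length)]


vector = encode_field("integer", 4)
print(vector)
```

Format specifications and constraints would be appended the same way, with one 0/1 input per possible value.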

  6. Attribute Values and Domains
     Measures are divided into those for character fields and those for numeric fields.
     • Patterns for character fields
       1. Ratio of numerical characters: the address “146 South 920 West” would score 6/18
       2. Ratio of white space: the address “146 South 920 West” would score 3/18
       3. Length statistics: average, variance, and coefficient of variation of the “used” length relative to the maximum length
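
The character-field measures are simple to compute. The sketch below reproduces the two ratios for the slide's address example and adds the length statistics; the function names are illustrative, and using population variance (rather than sample variance) is an assumption.

```python
import math
import statistics


def digit_ratio(text):
    """Fraction of characters that are digits."""
    return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0


def whitespace_ratio(text):
    """Fraction of characters that are white space."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


def length_stats(values, max_length):
    """Mean, variance, and coefficient of variation of used length / max length."""
    used = [len(v) / max_length for v in values]
    mean = statistics.mean(used)
    variance = statistics.pvariance(used)
    cv = math.sqrt(variance) / mean if mean else 0.0
    return mean, variance, cv


address = "146 South 920 West"
print(digit_ratio(address))       # 6 digits out of 18 characters
print(whitespace_ratio(address))  # 3 spaces out of 18 characters
print(length_stats(["ab", "abcd"], max_length=4))
```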

  7. Attribute Values and Domains (cont.)
     • Patterns for numeric fields
       1. Average (mean)
       2. Variance
       3. Coefficient of variation: recognizes similarity between values of different units and granularity; this can also help recognize which fields may need unit conversions
       4. Grouping: for example, area code, zip code, first three digits of SSN
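
The coefficient of variation (standard deviation divided by mean) is unchanged when every value is multiplied by the same factor, which is why it can match fields stored in different units. The sketch below shows this with made-up sample values; the function name and the data are illustrative only.

```python
import math
import statistics


def numeric_profile(values):
    """Mean, variance, and coefficient of variation for a numeric field."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)
    cv = statistics.pstdev(values) / mean if mean else 0.0
    return {"mean": mean, "variance": variance, "cv": cv}


# Hypothetical data: the same heights stored in meters and in centimeters.
heights_m = [1.5, 1.6, 1.7, 1.8]
heights_cm = [value * 100 for value in heights_m]

profile_m = numeric_profile(heights_m)
profile_cm = numeric_profile(heights_cm)

# Mean and variance differ by the unit factor, but the CV stays the same.
print(profile_m)
print(profile_cm)
print(math.isclose(profile_m["cv"], profile_cm["cv"]))
```

A matching CV combined with a very different mean is the kind of signal that suggests a unit conversion between the two fields.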

  8. Self-Organizing Grouping Algorithm
     • N = number of possible discriminators
     • M = number of categories; this can be adjusted by the user. “ideally this is |attributes| - |foreign keys|”
     • This is unsupervised, i.e. you don’t have to provide a correct classification; it simply groups based on similarity
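
To make the idea of unsupervised grouping concrete, the sketch below uses a simple threshold ("leader") clustering as a stand-in. It is not the self-organizing algorithm from the paper; it only illustrates grouping N-dimensional attribute vectors by similarity without labels. The function names, the `radius` parameter, and the sample vectors are assumptions.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_similarity(vectors, radius):
    """Assign each vector to the nearest cluster within `radius`, else start a new one."""
    centers = []  # running mean of each cluster
    members = []  # indices of the vectors in each cluster
    for index, vector in enumerate(vectors):
        best, best_dist = None, radius
        for c, center in enumerate(centers):
            dist = euclidean(vector, center)
            if dist <= best_dist:
                best, best_dist = c, dist
        if best is None:
            centers.append(list(vector))
            members.append([index])
        else:
            members[best].append(index)
            n = len(members[best])
            # Incremental update of the cluster mean.
            centers[best] = [c + (v - c) / n for c, v in zip(centers[best], vector)]
    return centers, members


# Hypothetical attribute vectors with N = 2 discriminators each.
attribute_vectors = [[0.0, 0.1], [0.05, 0.1], [0.9, 0.8], [0.95, 0.85]]
centers, members = group_by_similarity(attribute_vectors, radius=0.2)
print(members)
```

In this stand-in, `radius` plays the role of the user-adjustable control over M: a smaller radius yields more categories, a larger one fewer.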

  9. Training the Back-Prop Network
     • Inputs (N) are identical to the classifier’s
     • Outputs (M) are trained using back-propagation, with the classifier’s results as targets
     • Categories are labeled with the attributes they grouped together*
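
A minimal sketch of this training step follows: a one-hidden-layer network with N inputs and M outputs, trained by back-propagation on category labels such as those produced by the grouping step. Everything about the architecture here (hidden layer size, sigmoid units, squared-error loss, learning rate, epoch count, seed) is an assumption for illustration, not the configuration used in the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_backprop(inputs, labels, n_categories, hidden=4, epochs=2000, rate=0.5, seed=0):
    """Train a small network and return a function mapping an input to M output scores."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Each weight row carries one extra entry for the bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(n_categories)]

    def forward(x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
        hb = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
        return h, o

    for _ in range(epochs):
        for x, label in zip(inputs, labels):
            target = [1.0 if c == label else 0.0 for c in range(n_categories)]
            h, o = forward(x)
            # Error terms for squared error with sigmoid units.
            d_out = [(o[c] - target[c]) * o[c] * (1.0 - o[c]) for c in range(n_categories)]
            d_hid = [
                h[j] * (1.0 - h[j]) * sum(d_out[c] * w_out[c][j] for c in range(n_categories))
                for j in range(hidden)
            ]
            # Gradient-descent updates, output layer then hidden layer.
            for c in range(n_categories):
                for j, v in enumerate(h + [1.0]):
                    w_out[c][j] -= rate * d_out[c] * v
            for j in range(hidden):
                for i, v in enumerate(list(x) + [1.0]):
                    w_hidden[j][i] -= rate * d_hid[j] * v

    return lambda x: forward(x)[1]


# Hypothetical training data: attribute vectors and their category indices.
train_vectors = [[0.0, 0.1], [0.05, 0.1], [0.9, 0.8], [0.95, 0.85]]
train_labels = [0, 0, 1, 1]
predict = train_backprop(train_vectors, train_labels, n_categories=2)

scores = predict([0.92, 0.82])  # an attribute from another database
print(scores.index(max(scores)))  # index of the best-matching category
```

The output scores can be read as similarity to each category, so an attribute from a second database is matched to the category with the highest score.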

  10. What is the classifier for?
     • Ease of training: “ideally [M] is |attributes| - |foreign keys|” and it is less computationally expensive to train M classifications where M < |attributes| - |foreign keys|
     • It is less computationally complex to compare new elements to the M classifications than to every attribute of the training database or |attributes| - |foreign keys|
     • Networks can still be trained when some attributes are identical, since identical attributes fall into the same category

  11. Integration Procedure
     1. DBMS-specific parser
     2. Classify (categorize) training data
     3. Train neural network
     4. DBMS-specific parser
     5. Classification by neural network
     6. User checks results

  12. Results

  13. Conclusion and Future Work
     • Human effort needed for semantic integration is minimized
     • Different systems have different attribute properties available, which calls for an automated solution
     • Extend to automated information integration
     • C source code available at eecs.nwu.edu/pub/semint
