This presentation explores the challenges and methods of schema matching using structural analysis and natural language processing. It stresses the importance of schema matching in data integration, for example when enterprises merge databases or when environmental data from several sources is combined. It reviews known methods such as the Similarity Flooding algorithm and describes an approach that matches elements through semantic analysis of definitions and graph structure. Challenges such as multi-word terms and differently worded definitions are addressed, along with future work that aims to incorporate additional matching methods.
Schema Matching through Structural Analysis and Natural Language Processing • Ming Xiao, Skidmore College • Faculty Mentor: Dr. Longzhuang Li
Outline • Background • Importance • Known Methods • Approach • Challenges • Future Work • Acknowledgements • Questions
Schema Matching • Schema = Description of a Table • Match elements that are related
Importance • Enterprise • Merging two companies, one database • Environmental Data Collection • Merging data to provide overall picture • Storm Tracking
Structural Matching • Similarity Flooding Algorithm • Neighbors are similar to each other • Example groups: A = {Alex, Aly}, B = {Ben, Beth}, C = {Carl, Cam} • [Figure: graph of nodes A, B, and C labeled with the name groups] • Source: S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
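The propagation idea can be sketched in a few lines of Python. This is a simplified illustration, not the exact formulation from the Similarity Flooding paper: the function name `flood`, the update rule (initial score plus an equal share of each neighbor's score, then division by the maximum), and all starting scores are assumptions made for this example.

```python
def flood(neighbors, initial, iterations=10):
    """Propagate similarity between neighboring pair-nodes (simplified sketch)."""
    sigma = dict(initial)
    for _ in range(iterations):
        raw = {}
        for node in sigma:
            # Each neighbor passes on an equal share of its current score.
            incoming = sum(
                sigma[m] / len(neighbors[m]) for m in neighbors.get(node, [])
            )
            # Re-adding the initial score keeps the starting evidence in play.
            raw[node] = initial[node] + incoming
        top = max(raw.values())
        # Normalize so the best-scoring pair has similarity 1.0.
        sigma = {n: (v / top if top else 0.0) for n, v in raw.items()}
    return sigma


# Hypothetical pair-nodes and starting scores, for illustration only.
P = ("Client", "Customer")
Q = ("CID", "CustID")
R = ("CID", "Phone")
neighbors = {P: [Q, R], Q: [P], R: [P]}
initial = {P: 0.5, Q: 0.3, R: 0.0}

scores = flood(neighbors, initial)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

In this toy run the pair (CID, Phone) starts at zero but still gains some similarity from its neighbor, while (CID, CustID) stays ahead because of its higher starting score.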
Table to Graph • Name of table is the head of the graph • Client graph: Client, CID, CName, First Name, Last Name • Customer graph: Customer, Company, CustID, Contact, Phone
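A minimal sketch of the table-to-graph step, assuming a flat layout in which every column hangs directly off the table name. The original figure may nest some elements (for example under CName), which this sketch does not try to reproduce. The helper name `table_to_graph` is hypothetical.

```python
def table_to_graph(table_name, columns):
    """Build an adjacency dict with the table name as the head node."""
    graph = {table_name: list(columns)}
    for column in columns:
        # Columns are leaves in this flat sketch.
        graph.setdefault(column, [])
    return graph


client = table_to_graph("Client", ["CID", "CName", "First Name", "Last Name"])
customer = table_to_graph("Customer", ["Company", "CustID", "Contact", "Phone"])

print(client["Client"])
print(customer["Customer"])
```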
The Neighborhood • Cross product of elements in original graphs • (Client, Customer), (Client, Company), (Client, CustID), (Client, Contact), (Client, Phone), ... • Choose similar pairs • [Figure: Client graph (CID, CName, First Name, Last Name) and Customer graph (Company, CustID, Contact, Phone)]
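The cross product and the filtering step could look roughly like this. The function names, the threshold, and especially the placeholder scorer `same_initial` are assumptions for illustration; in practice the score would come from a real similarity measure.

```python
from itertools import product

client_elements = ["Client", "CID", "CName", "First Name", "Last Name"]
customer_elements = ["Customer", "Company", "CustID", "Contact", "Phone"]


def candidate_pairs(left, right):
    """Every element of one graph paired with every element of the other."""
    return list(product(left, right))


def choose_similar(pairs, score, threshold):
    """Keep only the pairs whose score reaches the threshold."""
    return [(x, y) for x, y in pairs if score(x, y) >= threshold]


def same_initial(x, y):
    """Placeholder scorer: 1.0 if both names start with the same letter."""
    return 1.0 if x[0] == y[0] else 0.0


pairs = candidate_pairs(client_elements, customer_elements)
kept = choose_similar(pairs, same_initial, threshold=0.5)
print(len(pairs), "candidate pairs,", len(kept), "kept")
```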
Determining Pairs • Semantic • Determine correlation between two definitions • Client → {person, pays, services, goods, seeks, advice, lawyer, ...} • Customer → {someone, pays, goods, services} • Company → {institution, conduct, business, ...} • Cosine Similarity, value between 0 and 1 • Compared pairs: (Client, Customer), (Client, Company) • Letter Pair • CustID → {Cu, us, st, tI, ID} • CID → {CI, ID} • Phone → {Ph, ho, on, ne} • Letter Pair Similarity • Compared pairs: (CustID, CID), (CustID, Phone)
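Both measures can be sketched as follows. Two assumptions apply: the definitions are cut down to the words listed on the slide (the originals are truncated with "..."), and letter pair similarity is computed as twice the number of shared pairs divided by the total number of pairs, a common choice that the slide itself does not spell out.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def letter_pairs(term):
    """Adjacent character pairs, e.g. 'CID' -> ['CI', 'ID']."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def letter_pair_similarity(term_a, term_b):
    """Assumed formula: 2 * shared pairs / total pairs (case-sensitive)."""
    pairs_a, pairs_b = Counter(letter_pairs(term_a)), Counter(letter_pairs(term_b))
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    shared = sum((pairs_a & pairs_b).values())
    return 2 * shared / total if total else 0.0


# Only the words shown on the slide; the full definitions are longer.
client = ["person", "pays", "services", "goods", "seeks", "advice", "lawyer"]
customer = ["someone", "pays", "goods", "services"]
company = ["institution", "conduct", "business"]

print(round(cosine_similarity(client, customer), 3))      # 3 shared words
print(cosine_similarity(client, company))                 # no shared words
print(round(letter_pair_similarity("CustID", "CID"), 3))  # shares only 'ID'
print(letter_pair_similarity("CustID", "Phone"))          # no shared pairs
```

With these truncated word lists, Client and Customer score well above Client and Company, and CustID is closer to CID than to Phone, which matches the pairings suggested on the slide.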
Neighbors • Each node, N, is a pair (x, y) • One element from each graph, e.g. N = (Client, Customer) • There is an edge N → N' when x → x' and y → y', e.g. N' = (CID, CustID)
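The edge rule can be sketched directly from two edge lists. The edges used here assume that each table name points straight at its columns, and the sketch ignores edge labels, which the original Similarity Flooding algorithm also requires to match. The function name `pair_graph_edges` is hypothetical.

```python
def pair_graph_edges(edges_a, edges_b):
    """Edge (x, y) -> (x2, y2) exists when x -> x2 in A and y -> y2 in B."""
    return [
        ((x, y), (x2, y2))
        for x, x2 in edges_a
        for y, y2 in edges_b
    ]


# Assumed edges: each table name points to two of its columns.
edges_client = [("Client", "CID"), ("Client", "CName")]
edges_customer = [("Customer", "CustID"), ("Customer", "Phone")]

edges = pair_graph_edges(edges_client, edges_customer)
for source, target in edges:
    print(source, "->", target)
```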
Challenges • Multi-word terms • Air Temperature • Air → {mixture, gases, oxygen, required, breathing, ...} • Temperature → {degree, hotness, coldness, body, ...} • Similar meaning, different defining words • Hurricane vs. Cyclone • Hurricane → {severe, heavy, rain, ...} • Cyclone → {violent, windstorm, ...}
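Both challenges show up as soon as definitions are compared word by word. The sketch below uses only the words listed on the slide (the real definitions are longer), and the helper `define_multiword`, which simply merges the definitions of each word in a term, is one naive possibility rather than the method used in this work.

```python
def shared_words(definition_a, definition_b):
    """Words that appear in both definitions."""
    return set(definition_a) & set(definition_b)


def define_multiword(term, lookup):
    """Naive option: merge the definitions of each word in the term."""
    merged = set()
    for word in term.lower().split():
        merged |= lookup.get(word, set())
    return merged


# Truncated definitions, limited to the words shown on the slide.
lookup = {
    "air": {"mixture", "gases", "oxygen", "required", "breathing"},
    "temperature": {"degree", "hotness", "coldness", "body"},
    "hurricane": {"severe", "heavy", "rain"},
    "cyclone": {"violent", "windstorm"},
}

print(shared_words(lookup["hurricane"], lookup["cyclone"]))  # empty overlap
print(sorted(define_multiword("Air Temperature", lookup)))
```

The empty overlap between Hurricane and Cyclone means a purely word-based measure scores them as unrelated, even though the terms are close in meaning.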
Future Work • Incorporate other methods • Gender, Sex → {M,F}
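One reading of the Gender, Sex example is that two columns can be matched by the values they hold rather than by their names. The sketch below assumes that reading and uses Jaccard overlap of the distinct values; the sample column contents are made up for illustration.

```python
def jaccard(values_a, values_b):
    """Overlap of the distinct values found in two columns."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical column contents.
gender_column = ["M", "F", "F", "M"]
sex_column = ["F", "M", "M"]
phone_column = ["555-0100", "555-0101"]

print(jaccard(gender_column, sex_column))    # same value set {M, F}
print(jaccard(gender_column, phone_column))  # nothing in common
```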
References • E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching • S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
Acknowledgements • Dr. Longzhuang Li • Dr. Dulal Kar • Dr. Ahmed Mahdy • Huy Tran • National Science Foundation