Explore schema matching, lexical annotation, & word sense disambiguation for effective data integration. Study conducted at International Doctorate School in Information and Communication Technologies, Università degli Studi di Modena e Reggio Emilia.
International Doctorate School in Information and Communication Technologies
Università degli Studi di Modena e Reggio Emilia
Label Normalization and Lexical Annotation for Schema and Ontology Matching
Serena Sorrentino
XXIII Cycle, Computer Engineering and Science
Advisor: Prof. Sonia Bergamaschi
Co-Advisor: Prof. Sanda Harabagiu
Outline
• Overview
• Schema Matching
• Lexical Annotation
• The MOMIS Data Integration System
• Open Problems and Contributions
• Semi-Automatic Lexical Annotation
• Schema Label Normalization
• Uncertainty in Automatic Annotation
• Conclusion & Future Work
Schema Matching - Definition
• Schema matching is the task of finding the semantic correspondences (mappings) between elements of two schemata
• Input: Schema S1, Schema S2 and auxiliary information
• Schema Information: element names, data types, constraints…
• Instance Information: used to characterize the content and semantics of schema elements
• Auxiliary Information: dictionaries, thesauri, user input…
• Output: the Match Result, defined as a set of mapping elements, each of which specifies that certain elements of S1 are mapped to certain elements of S2
[Figure: Schema S1, Schema S2 and Auxiliary Information are the input of the Matcher, which outputs the Match Result]
Lexical Annotation for Schema Matching
• DBGroup approach: starting from the “hidden” meanings associated with schema labels (i.e. class and attribute names, also called terms), it is possible to discover lexical relationships among schema elements
• Lexical annotation of schema labels is the explicit assignment of meanings w.r.t. a reference lexical thesaurus (WordNet in our case)
• Lexical relationships (inter-schema knowledge):
  • SYN (Synonym-of) between two synonymous terms
  • BT (Broader Term) between two terms where the first generalizes the second (the opposite is NT, Narrower Term)
  • RT (Related Term) between two terms that are generally used together in the same context
• Schema-derived relationships (intra-schema knowledge):
  • BT/NT (from ISA relationships, and from a Foreign Key (FK) in relational sources when it is a Primary Key in both the original and the referenced relation)
  • RT (from nested elements in XML files and from FKs in relational sources)
[S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano. Semantic integration of heterogeneous information sources. DKE Journal, 2001]
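Lexical Annotation - Example
• SYN: synonym in WordNet (same synset)
• BT/NT: hypernym/hyponym relationship in WordNet
• RT: meronym relationship (part of) or sibling in WordNet
[Figure: lexical annotation followed by lexical relationship discovery. The labels Customer and Client are annotated with Customer#1 and Client#1, which belong to the same synset, so a SYN relationship is discovered between them. The figure also shows the senses Client#2 and Client#3 together with hypernym, hyponym, holonym and meronymy links.]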
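The mapping from WordNet relations to SYN, BT/NT and RT can be sketched as follows. This is only an illustration under stated assumptions, not the MOMIS implementation: the small `HYPERNYM` and `MERONYM` tables stand in for WordNet, their contents are invented, and the function names are hypothetical.

```python
from typing import List, Optional

# Hypothetical mini-thesaurus standing in for WordNet (invented data).
# Maps a synset to its direct hypernym.
HYPERNYM = {
    "customer#1": "consumer#1",
    "patron#1": "customer#1",
    "shopper#1": "customer#1",
    "consumer#1": "person#1",
}
# Invented (part, whole) meronym pairs.
MERONYM = {("wheel#1", "car#1")}


def ancestors(synset: str) -> List[str]:
    """Return all hypernyms of a synset, nearest first."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def lexical_relationship(a: str, b: str) -> Optional[str]:
    """Derive SYN, BT, NT or RT between two annotated synsets, or None."""
    if a == b:
        return "SYN"  # same synset
    if a in ancestors(b):
        return "BT"   # a generalizes b
    if b in ancestors(a):
        return "NT"   # a is narrower than b
    if (a, b) in MERONYM or (b, a) in MERONYM:
        return "RT"   # part-of link
    parent_a, parent_b = HYPERNYM.get(a), HYPERNYM.get(b)
    if parent_a is not None and parent_a == parent_b:
        return "RT"   # siblings under the same hypernym
    return None


# Two labels annotated with the same synset yield a SYN relationship.
annotations = {"S1.Customer": "customer#1", "S2.Client": "customer#1"}
print(lexical_relationship(annotations["S1.Customer"], annotations["S2.Client"]))
```

With a real thesaurus the two tables would be replaced by lookups into WordNet; the decision order (SYN, then BT/NT, then RT) is a design choice of this sketch.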
The MOMIS Data Integration System
• The MOMIS System (Mediator envirOnment for Multiple Information Sources) is an I3 framework designed for the integration of structured and semi-structured data sources
[Figure: MOMIS architecture. Wrapping of the local schemata 1…N (e.g. XML data, RDB); manual or automatic lexical annotation with WordNet synsets; Common Thesaurus generation from schema-derived, lexical, user-supplied and inferred relationships; cluster generation; global schema generation with global classes and mapping tables.]
Open Problems and Contributions: Automatic Lexical Annotation
• Manual annotation is a tedious and non-scalable task: we need a method to perform automatic or semi-automatic annotation
• Non-dictionary words, i.e., Compound Nouns (CNs), abbreviations and acronyms: schema labels need to be normalized
• Fully automatic annotation (i.e. “on-the-fly”) is intrinsically uncertain: uncertain annotations need to be dealt with
[Figure: two example schemata, S1 and S2, with labels such as Customer, CLIENT, CLIENT_ID, NAME, ADDRESS, STREET_ADDRESS, CITY, COUNTRY, PURCHASE_ORDER, PO_ID, Order, PRODUCT_CODE, PRICE, TSP_INFO, QTY, INVOCE_NR]
Word Sense Disambiguation for Semi-Automatic Lexical Annotation
• WSD (Word Sense Disambiguation) is the ability to identify the meanings of words in context by a computational technique [R. Navigli, Word sense disambiguation: A survey. ACM Comput. Surv., 2009]
• The semi-automatic CWSD (Combined Word Sense Disambiguation) method:
  • associates one or more WordNet meanings with each label
  • combines two WSD algorithms:
    • SD (Structural Disambiguation) exploits the schema-derived relationships
    • WND (WordNet Domains Disambiguation) exploits WordNet Domains [B. Magnini et al., The role of domain information in Word Sense Disambiguation, Journal of Natural Language Engineering, 2002]
The CWSD method
[Figure: CWSD workflow. Automatic wrapping of the sources extracts the class and attribute names and the schema-derived relationships; the integration designer selects the relevant domains; CWSD applies the SD and WND algorithms and produces the annotated schemata; the lexical relationships are added to the Common Thesaurus.]
Experimental Evaluation
• We evaluated CWSD on a real data set: three levels of a subtree of the Yahoo and Google directories (“society and culture” and “society”, respectively)
• Publications related to CWSD:
  • S. Bergamaschi, L. Po, S. Sorrentino. Automatic Annotation in Data Integration Systems. OTM Workshops 2007
  • S. Bergamaschi, L. Po, A. Sala, S. Sorrentino. Data source annotation in data integration systems. DBISP2P 2007
Schema Label Normalization
• Schema label normalization is the reduction of each label to some standardized form that can be easily recognized
• In our case: the process of abbreviation expansion and CN (Compound Noun) annotation
[Figure: SYN relationships discovered between labels such as PO and PurchaseOrder, (a) without schema normalization and (b) with schema normalization. Legend: right relationship, false negative relationship, false positive relationship.]
The Schema Label Normalization method
• We propose a semi-automatic schema label normalization method, which is composed of three phases:
• Selecting the labels to be normalized
• Tokenizing labels into separate words
• Identifying abbreviations and CNs among the tokenized words
(Maciej Gawinecki’s presentation)
• Interpreting CNs
• Creating new WordNet entries and meanings for the CNs
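The tokenization and identification steps can be illustrated with a short sketch. It is not the NORMS implementation: the splitting rules, the tiny `DICTIONARY` word list and the function names are assumptions made for the example, and a real system would look words up in WordNet rather than in a hard-coded set.

```python
import re
from typing import List, Tuple

# Hypothetical word list standing in for a dictionary lookup (e.g. WordNet).
DICTIONARY = {
    "customer", "client", "name", "address", "street", "city",
    "country", "purchase", "order", "product", "code", "price",
}


def tokenize(label: str) -> List[str]:
    """Split a schema label on separators and camelCase boundaries."""
    tokens: List[str] = []
    for part in re.split(r"[_\-\s.]+", label):
        # Upper-case runs (acronyms), capitalized or lower-case words, digits.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def classify(label: str) -> Tuple[List[str], List[str], bool]:
    """Return (tokens, candidate abbreviations, is_compound_noun)."""
    tokens = tokenize(label)
    # Tokens missing from the dictionary are candidate abbreviations/acronyms.
    abbreviations = [t for t in tokens if t not in DICTIONARY]
    # A label made only of two or more dictionary words is treated as a CN.
    is_cn = len(tokens) >= 2 and not abbreviations
    return tokens, abbreviations, is_cn


for label in ["PO_ID", "PurchaseOrder", "STREET_ADDRESS", "QTY"]:
    print(label, classify(label))
```

In this sketch `PO_ID` yields two candidate abbreviations that would be passed to abbreviation expansion, while `PurchaseOrder` and `STREET_ADDRESS` are flagged as compound nouns for the CN annotation steps.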
CN Annotation
• Compound Noun (CN): a term composed of two or more words, called constituents
• Endocentric CNs consist of a head (i.e. the part that contains the basic meaning of the CN) and modifiers, which restrict this meaning, e.g. “delivery company”
• Our method can be summed up in four main steps
CN constituent disambiguation & pruning
• 1. CN constituent disambiguation
  • head and modifier disambiguation, by applying CWSD
• 2. Redundant constituent identification and pruning
  • Redundant words: words that do not contribute new information, i.e. information that can already be derived from the schema or from the lexical thesaurus
  • E.g. for the attribute “company address” of the class “company”, the constituent “company” is not considered, as the relationship holding between a class and its attributes is implicit in the schema
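The schema-derived case of redundant constituent pruning can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: it only handles constituents repeated from the class name, ignores redundancy derived from the lexical thesaurus, and the function name is hypothetical.

```python
from typing import List


def prune_redundant(class_label: str, attribute_label: str) -> List[str]:
    """Drop attribute constituents that merely repeat the class name."""
    class_words = set(class_label.lower().split())
    constituents = attribute_label.lower().split()
    kept = [w for w in constituents if w not in class_words]
    # Never prune everything: keep the original constituents as a fallback.
    return kept if kept else constituents


# The attribute "company address" of the class "company" reduces to "address".
print(prune_redundant("company", "company address"))
```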
CN interpretation via semantic relationships
• 3. CN interpretation: selecting, among a set of predefined semantic relationships (in our case the nine Levi’s relationships: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM [Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978]), the one that best captures the relationship between the head and the modifier
• Intuition: the semantic relationship between head and modifier is the same as the one holding between their unique beginners (i.e., the 25 top concepts in the noun WordNet hierarchy), so we manually select the correct Levi’s relationship only for the pairs of unique beginners
• Why Levi’s relationships?
  • they are suitable for interpreting pairs of unique beginners
  • a detailed and fine-grained interpretation is not required in our context
  • they can be used during the CN gloss definition
[Figure: Company#1 is a hyponym (through Institution#1, …) of the unique beginner Group#1, and Delivery#1 is a hyponym (through Transport#1, …) of the unique beginner Act#2. The relationship MAKE, selected for the pair Group#1 and Act#2, is transferred to the pair Company#1 and Delivery#1.]
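The intuition of this step, climbing from each constituent to its unique beginner and reusing the relationship chosen for that pair, can be sketched as below. The hypernym chains are shortened versions of the ones in the slide (the elided intermediate synsets are simply omitted), the lookup table contains only the single pair shown there, and all names are hypothetical.

```python
from typing import Optional

# Simplified hypernym chains taken from the slide example; the intermediate
# synsets elided on the slide ("…") are left out here.
HYPERNYM = {
    "company#1": "institution#1",
    "institution#1": "group#1",
    "delivery#1": "transport#1",
    "transport#1": "act#2",
}
# Only the two unique beginners that appear in the example.
UNIQUE_BEGINNERS = {"group#1", "act#2"}
# Manually selected Levi's relationship per (head beginner, modifier beginner).
LEVI_FOR_BEGINNERS = {("group#1", "act#2"): "MAKE"}


def unique_beginner(synset: str) -> Optional[str]:
    """Climb the hypernym chain until a unique beginner is reached."""
    current = synset
    while current not in UNIQUE_BEGINNERS:
        if current not in HYPERNYM:
            return None  # unknown synset or broken chain
        current = HYPERNYM[current]
    return current


def interpret_cn(head: str, modifier: str) -> Optional[str]:
    """Return the Levi's relationship inherited from the unique beginners."""
    head_top = unique_beginner(head)
    modifier_top = unique_beginner(modifier)
    if head_top is None or modifier_top is None:
        return None
    return LEVI_FOR_BEGINNERS.get((head_top, modifier_top))


# "delivery company": head Company#1, modifier Delivery#1.
print(interpret_cn("company#1", "delivery#1"))
```

Because the table is keyed by unique beginners, the manual effort is bounded by the number of beginner pairs rather than by the number of compound nouns.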
Creation of a new WN meaning for a CN
• 4.a Gloss definition: the CN gloss is built as head gloss + Levi’s relationship + modifier gloss
  • Head Company#1, gloss: “an institution created to conduct business”
  • Modifier Delivery#1, gloss: “the act of delivering or distributing something”
  • Delivery_Company gloss: “an institution created to conduct business make the act of delivering or distributing something”
• 4.b Inclusion of the new CN meaning in WN: the new meaning Delivery_Company#1 is linked to Company#1 by a hypernym/hyponym relationship and to Delivery#1 by a Related Term relationship
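Steps 4.a and 4.b can be sketched with a small data structure. The `Meaning` class, its field names and the `#1` sense number given to the new entry are assumptions of this example; the glosses and the resulting links come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Meaning:
    """A hypothetical, minimal stand-in for a WordNet meaning."""
    name: str
    gloss: str
    hypernyms: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)


def create_cn_meaning(head: Meaning, modifier: Meaning, relation: str) -> Meaning:
    """Build the CN meaning: head gloss + relationship + modifier gloss."""
    head_lemma = head.name.split("#")[0]
    modifier_lemma = modifier.name.split("#")[0]
    return Meaning(
        name=f"{modifier_lemma}_{head_lemma}#1",
        gloss=f"{head.gloss} {relation.lower()} {modifier.gloss}",
        hypernyms=[head.name],    # the CN is a hyponym of its head
        related=[modifier.name],  # and a related term of its modifier
    )


company = Meaning("Company#1", "an institution created to conduct business")
delivery = Meaning("Delivery#1", "the act of delivering or distributing something")
cn = create_cn_meaning(company, delivery, "MAKE")
print(cn.name)
print(cn.gloss)
```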
Experimental Evaluation
• Evaluation over five different data sets (including relational and XML schemata)
• Evaluating the lexical annotation process: [Figure: results]
• Evaluating the discovered lexical relationships: [Figure: results]
• Publications related to Schema Label Normalization:
  • S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Label Normalization for Improving Schema Matching, DKE Journal, 2010
  • S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Label Normalization for Improving Schema Matching, ER 2009
Uncertainty in Automatic Annotation
• In Automatic Lexical Annotation, uncertainty is assessed in terms of probability
• The PWSD (Probabilistic Word Sense Disambiguation) algorithm:
  • automatically associates one or more WordNet meanings with schema labels
  • automatically assigns to each annotation a probability value that indicates the reliability of the annotation itself
  • is based on a probabilistic combination of different WSD algorithms
  • uses the Dempster-Shafer theory [Shafer, G., A Mathematical Theory of Evidence, Princeton 1976] to combine the results of the different WSD algorithms
Example
[Figure: the schema labels are processed by several WSD algorithms (WSD Algorithm 1 with 70% confidence, WSD Algorithm 2 with 60% confidence, WSD Algorithm 3 with 50% confidence, …). The terms annotated with Algorithm 1, Algorithm 2, …, Algorithm N are combined through the Dempster-Shafer theory.]
Schema Element          Annotation    Prob. Value
Source1.Book            book#1        0.65
Source1.Book            book#3        0.17
Source2.Brochure        brochure#1    0.60
Source2.Book Heading    heading#2     0.48
…                       …             …
Probabilistic Lexical Relationships
• Starting from the probabilistic annotation, PWSD derives a set of probabilistic lexical relationships between schema elements
[Figure: the relationships discovered with the WordNet First Sense approach compared with those discovered by PWSD, each PWSD relationship being labeled with its probability value]
Experimental Results
• Evaluation on 2 relational schemata of the Amalgam integration benchmark and 3 ontologies from the OAEI’06 benchmark
• Evaluating the lexical annotation process: [Figure: results]
• Evaluating the discovered lexical relationships: [Figure: results; * Threshold = 0.2, * Threshold = 0.15]
• Publications related to PWSD:
  • L. Po, S. Sorrentino, Automatic generation of probabilistic relationships for improving schema matching, Information Systems Journal, 2011
  • L. Po, S. Sorrentino, S. Bergamaschi, D. Beneventano, Lexical knowledge extraction: an effective approach to schema and ontology matching, ECKM 2009
NORMS and ALA
• The Schema Label Normalization functionalities have been implemented in a tool called NORMS (NORMalizer of Schemata), which allows the designer to enhance the normalized labels by correcting potential errors [S. Sorrentino, S. Bergamaschi, M. Gawinecki, NORMS: an automatic tool to perform schema label normalization, ICDE 2011]
• CWSD and PWSD have been implemented in a tool called ALA (Automatic Lexical Annotator), which has been integrated within the MOMIS System [S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, Dealing with Uncertainty in Lexical Annotation, ERPD 2009]
Conclusion
• Automatic and semi-automatic methods to perform Label Normalization and Lexical Annotation have been presented:
  • CWSD
  • Schema Label Normalization
  • PWSD
• Automatic methods to extract (probabilistic) lexical relationships have been proposed, and their effectiveness in improving schema matching has been shown
• All the methods have been implemented in the context of the MOMIS Data Integration System. However, they can be applied in the general context of schema and ontology matching
Future Work
• New research lines:
  • inclusion and integration of other knowledge resources for automatic lexical annotation:
    • Domain-specific resources, such as domain ontologies and domain thesauri, to address the problem of domain-specific terms in schemata (e.g., the biomedical term “aromatase”, which is an enzyme involved in the production of estrogen)
    • Generic resources: Wikipedia, dictionaries, etc.
  • inclusion of instance-information extraction techniques to improve the automatic annotation and relationship discovery processes and to solve the problem of non-informative schema labels
    • The use of RELEVANT [S. Bergamaschi, C. Sartori, F. Guerra, M. Orsini, Extracting Relevant Attribute Values for Improved Search. IEEE Internet Computing, 2007], a tool to extract (and add to the schema) metadata about the relevant instance values of an attribute, is a promising direction
Publications
Journals:
• Po, L. and Sorrentino, S. (2011). Automatic generation of probabilistic relationships for improving schema matching. Information Systems Journal, Special Issue on Semantic Integration of Data, Multimedia, and Services, 36(2):192-208.
• Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2010). Schema label normalization for improving schema matching. DKE Journal, 69(12):1254-1273.
International Conferences and Workshops:
• Sorrentino, S., Bergamaschi, S., and Gawinecki, M. (2011). NORMS: an automatic tool to perform schema label normalization. In Press, Accepted Manuscript (Demo Paper), IEEE International Conference on Data Engineering, ICDE 2011, April 11-16, Hannover.
• Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2009). Schema normalization for improving schema matching. In Proceedings of the 28th International Conference on Conceptual Modeling, ER 2009, Gramado, Brazil, 9-12 November, pages 280-293.
• Beneventano, D., Bergamaschi, S., and Sorrentino, S. (2009). Extending WordNet with compound nouns for semi-automatic annotation in data integration systems. In Proceedings of the IEEE NLP-KE Conference, Dalian, China, 24-27 September 2009.
• Bergamaschi, S., Po, L., Sorrentino, S., and Corni, A. (2009). Dealing with Uncertainty in Lexical Annotation. Revista de Informática Teórica e Aplicada, RITA, ER 2009 Poster and Demonstrations Session, 16(2):93-96.
Publications
• Beneventano, D., Orsini, M., Po, L., Antonio, S., and Sorrentino, S. (2009). An ontology-based data integration system for data and multimedia sources. In Proceedings of the Third International Conference on Semantic Computing, IEEE ICSC 2009, Berkeley, CA, USA, September 14-16, pages 606-611. IEEE Computer Society.
• Beneventano, D., Orsini, M., Po, L., and Sorrentino, S. (2009). The MOMIS-STASIS approach for Ontology-Based Data Integration. In Proceedings of the 1st International Workshop on Interoperability through Semantic Data and Service Integration, ISDSI 2009, Camogli (GE), Italy, June 25.
• Po, L., Sorrentino, S., Bergamaschi, S., and Beneventano, D. (2009). Lexical knowledge extraction: an effective approach to schema and ontology matching. In Proceedings of the European Conference on Knowledge Management, ECKM 2009, 3-4 September, Vicenza.
• Bergamaschi, S., Po, L., Sala, A., and Sorrentino, S. (2007). Data source annotation in data integration systems. In Proceedings of the Fifth International Workshop on Databases, Information Systems and Peer-to-Peer Computing, DBISP2P, at the 33rd International Conference on Very Large Data Bases (VLDB 2007), University of Vienna, Austria, September 24.
• Bergamaschi, S., Po, L., and Sorrentino, S. (2007). Automatic Annotation in Data Integration Systems. In Proceedings of the OTM Workshops, Portugal, November 27-28.
Publications
National Conferences:
• S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, "Uncertainty in data integration systems: automatic generation of probabilistic relationships", VI Conference of the Italian Chapter of AIS, ITAIS 2009, Costa Smeralda, Italy, October 2-3 2009.
• S. Bergamaschi, S. Sorrentino, "Semi-automatic compound nouns annotation for data integration systems", Proceedings of the 17th Italian Symposium on Advanced Database Systems, SEBD 2009, Camogli (Genova), Italy, 21-24 June 2009.
• S. Bergamaschi, L. Po, and S. Sorrentino, "Automatic annotation for mapping discovery in data integration systems", Proceedings of the Sixteenth Italian Symposium on Advanced Database Systems, SEBD 2008, Mondello (Palermo), Italy, 22-25 June 2008 (pp. 334-341).
Book Chapters:
• Bergamaschi, S., Beneventano, D., Po, L., Sorrentino, S. (2011). Automatic Schema Mapping through Normalization and Annotation. In Press, in Second Search Computing Workshop: Challenges and Directions, 2010, LNCS State-of-the-Art Survey.
• Bergamaschi, S., Po, L., Sorrentino, S., Corni, A. "Uncertainty in data integration systems: automatic generation of probabilistic relationships", to appear in Management of the Interconnected World (A. D’Atri, M. De Marco, A. Maria Braccini, F. Cariddu eds.), Springer, ISBN/ISSN: 978-3-7908-2403-2, 2010.
Projects
• NeP4B - Networked Peers for Business, MIUR funded research project, FIRB 2005 (2006-2009) (http://www.dbgroup.unimo.it/nep4b)
• STASIS - SofTware for Ambient Semantic Interoperable Services, Project FP6-2005-IST-5-034980 (2006-2008) (http://www.dbgroup.unimo.it/stasis/)
• “Searching for a needle in mountains of data!”, project funded by the Fondazione Cassa di Risparmio di Modena within the Bando di Ricerca Internazionale (2008-2010) (http://www.dbgroup.unimo.it/keymantic)
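Evaluation Measures
TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative
• Recall: $\mathrm{Recall} = \dfrac{|TP|}{|FN| + |TP|}$
• Precision: $\mathrm{Precision} = \dfrac{|TP|}{|TP| + |FP|}$
• F-Measure: $\mathrm{F\text{-}Measure} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$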
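These measures are straightforward to compute when the discovered and the reference (gold standard) relationships are represented as sets. The sketch below assumes relationships are encoded as `(term, type, term)` tuples, which is a convention chosen for this example only, and the sample data is invented.

```python
from typing import Set, Tuple

Relationship = Tuple[str, str, str]  # (term, relationship type, term)


def evaluate(discovered: Set[Relationship],
             gold: Set[Relationship]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) of discovered vs. gold."""
    tp = len(discovered & gold)   # correct relationships found
    fp = len(discovered - gold)   # wrong relationships found
    fn = len(gold - discovered)   # correct relationships missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: 2 true positives, 2 false positives, 2 false negatives.
gold = {("PO", "SYN", "PurchaseOrder"), ("Customer", "SYN", "Client"),
        ("Order", "BT", "PurchaseOrder"), ("City", "RT", "Country")}
discovered = {("PO", "SYN", "PurchaseOrder"), ("Customer", "SYN", "Client"),
              ("Name", "SYN", "Address"), ("Price", "BT", "Code")}
print(evaluate(discovered, gold))
```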
Unique beginners
• The top-level concepts of the WordNet hierarchy are the 25 unique beginners (e.g., act, animal, artifact, etc.) for WordNet English nouns, defined in [Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., WordNet: An on-line lexical database. International Journal of Lexicography, 1990]
Levi’s relationships set
(M = Modifier, H = Head)
[Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978]
Dempster-Shafer theory
• The Dempster-Shafer theory is a mathematical theory of evidence that allows evidence from different sources to be combined: using this theory, for each algorithm we assign a probability mass function m(·) to the set of all possible meanings for the term under consideration
• The mass functions of the WSD algorithms are combined by using Dempster’s rule of combination
• In the end, to obtain the probability assigned to each meaning, the belief mass assigned to a set of meanings is split among the meanings in the set
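Dempster's rule of combination and the final splitting step can be sketched as follows. This is a generic illustration, not the PWSD implementation: the way each algorithm's confidence is turned into a mass function (confidence on the proposed meanings, remainder on the whole frame), the uniform split of set masses, and the example senses are all assumptions of this sketch.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def to_probabilities(m: Mass) -> Dict[str, float]:
    """Split the mass of each set uniformly among its meanings
    (the uniform split is an assumption of this sketch)."""
    probs: Dict[str, float] = {}
    for meanings, weight in m.items():
        for meaning in meanings:
            probs[meaning] = probs.get(meaning, 0.0) + weight / len(meanings)
    return probs


# Frame of discernment: all candidate senses of the label "book" (invented).
theta = frozenset({"book#1", "book#2", "book#3"})
# Algorithm 1 (70% confidence) proposes book#1; the rest goes to theta.
m1 = {frozenset({"book#1"}): 0.7, theta: 0.3}
# Algorithm 2 (60% confidence) proposes book#1 or book#3.
m2 = {frozenset({"book#1", "book#3"}): 0.6, theta: 0.4}

probabilities = to_probabilities(combine(m1, m2))
print(sorted(probabilities.items()))
```

Assigning the residual mass to the whole frame expresses ignorance rather than disagreement, so two algorithms that propose overlapping senses reinforce each other instead of conflicting.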