Schema-based Natural Language Semantic Parsing

Schema-based Natural Language Semantic Parsing Niculae Stratica and Bipin C. Desai Department of Computer Science Concordia University 1455 de Maisonneuve Blvd. West, Montreal, H3G 1M8, Canada nstratica@primus.ca, bcdesai@cs.concordia.ca

Introduction 1/2 • This paper addresses the mapping of Natural Language to SQL queries. • It details a methodology to build the SQL query based on: • The input sentence • A dictionary • A set of production rules

Introduction 2/2

SummaryEarly workTemplate based parsingToken matching parsingImplementationCapabilities and LimitationsFuture work

Early work Natural Language Processing through learning algorithms and statistical methods Ref. [3] Minker, J., Information storage and retrieval - a survey and functional description, SIGIR, 12, pp.1-108, 1997

Template-based parsing (NLIDB) Ref. [2] N. Stratica, L. Kosseim and B.C. Desai, NLIDB Templates for Semantic Parsing. Proceedings of Applications of Natural Language to Data Bases, NLDB’2003, pp. 235-241, June 2003, Burg, Germany.

Token-matching parsing Ref. N. Stratica and B.C. Desai, Schema-based Natural Language Semantic Mapping. Proceedings of Applications of Natural Language to Data Bases, NLDB’2004, June 2004, Manchester, England.

Template versus Token-matching parsing

From the NL input to the SQL query

The Database Schema Table AUTHORS (LONG id PRIMARY_KEY, CHAR name, DATE DOB) Table BOOKS (LONG id PRIMARY_KEY, CHAR title, LONG isbn, LONG pages, DATE date_published, ENUM type) Table BOOKAUTHORS (LONG aid references AUTHORS, LONG bid references BOOKS) Table STUDENTS (LONG id PRIMARY_KEY, CHAR name, DATE DOB) Table BORROWEDBOOKS (LONG bid references BOOKS, LONG sid references STUDENTS)

Wordnet and the Semantic Sets 1/2 For table AUTHORS WordNet returns the following list: WordNet 2.0 Search Overview for ‘author’ The noun ‘author’ has 2 senses in WordNet. 1. writer, author -- (writes (books or stories or articles or the like) professionally (for pay)) 2. generator, source, author -- (someone who originates or causes or initiates something; ‘he was the generator of several complaints’) Ref [1] Miller, G., WordNet: A Lexical Database for English, Communications of the ACM, 38 (1), pp. 39-41, November 1995

Wordnet and the Semantic Sets 2/2 Results for ‘Synonyms, hypernyms and hyponyms ordered by estimated frequency’ search of noun ‘authors’ 2 senses of authorSense 1 writer, author => communicatorSense 2 generator, source, author => maker, shaper The semantic set for AUTHORS becomes: {writer, author, generator, source, maker, communicator, shaper}

Indexing the ENUM Types TABLE=BOOKS ATTRIBUTE TYPE ENUM={NOVEL, POEM} If any of the TYPE values occurs in the input sentence, a new production rule is added to the SQL relating BOOKS.TYPE to VALUE such as the one in the example below: Sentence: “List all novels” NOVEL is a valid value for BOOK.TYPE The production rule is: WHERE BOOKS.TYPE=‘NOVEL’ The resulting SQL query is: SELECT BOOKS.* FROM BOOKS WHERE BOOKS.TYPE=‘NOVEL’

The Production Rules Based on the Database schema, the pre-processor builds the following production rule: IF AUTHORS in Table List AND BOOKS in Table List Then BOOKAUTHORS is in Table List and the following SQL template: SELECT Attribute List FROM Table List WHERE BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID

The Workflow

Capabilities and Limitations 1/5 ‘Show all novels written by Mark Twain and William Shakespeare that have been borrowed by John Markus’ The method retains the following tokens: ‘... ... novels ... .. Mark Twain ... William Shakespeare ... ... ... borrowed ... John Markus’ The token ‘novels’ point to table BOOKS. ‘novels’ is found in the INDEX files for ENUM values of the attribute BOOKS.TYPE

Capabilities and Limitations 2/5 ‘novels’ matches the ENUM value BOOKS.TYPE=’NOVEL’ ‘Mark Twain’ matches the AUTHORS.NAME=’Mark Twain’ ‘William Shakespeare’ matches the AUTHOR.NAME=’William Shakespeare’ ‘borrowed’ is disambiguated at run time through BORROWEDBOOKS ‘John Markus’ matches STUDENTS.NAME=’John Markus’ Schema correlates AUTHORS, BOOKS and BOOKAUTHORS Schema correlates STUDENT, BOOKS and BORROWEDBOOKS The table list is: AUTHORS, BOOKS, BOOKAUTHORS,STUDENTS, BORROWEDBOOKS

Capabilities and Limitations 3/5 The SQL Constraints are: BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID AND BOOKS.TYPE=’NOVEL’ AND (AUTHORS.NAME=’Mark Twain’ OR AUTHORS.NAME=’William Shakespeare’) AND STUDENTS.NAME=’John Markus’ AND BOOKS.ID=BORROWEDBOOKS.BID AND STUDENT.ID=BORROWEDBOOKS.SID The two constraints to the AUTHORS.NAME have been OR-ed because they point to the same attribute. The method allows to construct the correct SQL query. The method can address context and value ambiguities

Capabilities and Limitations 4/5 1. The current architecture does not support operators such as: greater then, less then, count, average and sum. 2. It does not resolve dates as in: before, after, between. 3. The generated SQL does not support imbricate queries. 4. The proposed method eliminates all tokens that cannot be matched with either the semantic sets or with the index files and it works for semantically stable databases. 5. The preprocessor must be used after each semantic update of the database in order to modify the index files.

Capabilities and Limitations 5/5 6. The context disambiguation is limited to the semantic sets related to a given schema. 7. Errors related to tokenizing, WordNet and the human intervention propagate in the SQL query. 8. The method completely disregards the unmatched tokes and thus it cannot correct the input query if it has errors. 9. However, the method correctly interprets the tokens that are found in the semantic sets or among the derivationally related terms at run time

Future Work The future work will focus on the operator resolution. We believe that the approach presented in this paper can give good results with a minimum of effort in implementation and avoids specific problems related to the various existing semantic analyses approaches. This is partly made possible by the highly organized data in the RDBMS. The method will be implemented and the results will be measured against complex sentences involving more than 4 tables from the database. A study will be done to show the performance dependency on the size of the database records and on the database schema.

References [1] Miller, G., WordNet: A Lexical Database for English, Communications of the ACM, 38 (1), pp. 39-41, November 1995 [2] N. Stratica, L. Kosseim and B.C. Desai, NLIDB Templates for Semantic Parsing. Proceedings of Applications of Natural Language to Data Bases, NLDB’2003, pp. 235-241, June 2003, Burg, Germany. [3] Minker, J., Information storage and retrieval - a survey and functional description, SIGIR, 12, pp.1-108, 1997 [4] Stuart H. Rubin, Shu-Ching Chen, and Mei-Ling Shyu, Field-Effect Natural Language Semantic Mapping, Proceedings of the 2003 IEEE International Conference on Systems, Man & Cybernetics, pp. 2483-2487, October 5-8, 2003, Washington, D.C., USA. [5] Lawrence J. Mazlack, Richard A. Feinauer, Establishing a Basis for Mapping Natural-Language Statements Onto a Database Query Language, SIGIR 1980: 192-202 [6] Allen, James, Natural Language Understanding, University of Rochester 1995 The Benjamin Cummings Publishing Company, Inc. ISBN: 0-8053-0334-0 [7] Kathryn Baker, Alexander Franz, and Pamela Jordan, Coping with Ambiguity in Knowledge-based Natural Language Analysis, Florida AI Research Symposium, pages 155-159, 1994 Pensacola, Florida [8] Hirst G., Semantic Interpretation and the Resolution of Ambiguity, Cambridge University Press 1986, Cambridge [9] Latent Semantic Analysis Laboratory at the Colorado University, http://lsa.colorado.edu/ site visited in March 2004 [10] Sleator D., Davy Temperley, D., Parsing English with A Link Grammar,Proceedings of the Third Annual Workshop on Parsing Technologies, 1993

Schema-based Natural Language Semantic Parsing