Natural Language Processing


Presentation Transcript


  1. Natural Language Processing Overview

  2. NLP systems • Fundamental goal: deep understanding of broad language • Not just string processing or keyword matching! • End systems that we want to build: • Ambitious: speech recognition, machine translation, question answering… • Modest: spelling correction, text categorization…

  3. Machine Translation

  4. NLP applications • Text Categorization • Classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative) • Spelling & Grammar Corrections • Information Extraction • Speech Recognition • Information Retrieval • Synonym Generation • Summarization • Machine Translation • Question Answering • Dialog Systems

  5. Why NLP is difficult • An NLP system needs to answer the question “who did what to whom” • Language is ambiguous • At all levels: lexical, phrase, semantic • Iraqi Head Seeks Arms • Word sense is ambiguous (head, arms) • Stolen Painting Found by Tree • Thematic role is ambiguous: is the tree the agent or the location? • Ban on Nude Dancing on Governor’s Desk • Syntactic structure is ambiguous: • (Ban on Nude Dancing) on Governor’s Desk • Ban on (Nude Dancing on Governor’s Desk) • Hospitals Are Sued by 7 Foot Doctors • Semantics is ambiguous: what does “7 foot” refer to?

  6. Why NLP is difficult • Key problems: • Representation of meaning • Language presupposes knowledge about the world • Language only reflects the surface of meaning • Language presupposes communication between people

  7. Meaning • What is meaning? • Physical referent in the real world • Semantic concepts, characterized also by relations • How do we represent and use meaning? • I am Italian • From a lexical database (WordNet) • Italian = a native or inhabitant of Italy; Italy = republic in southern Europe [..] • I am Italian • Who is “I”? • I know she is Italian / I think she is Italian • How do we represent “I know” and “I think”? • Does this mean that “I” is Italian? What does it say about the “I” and about the person speaking? • I thought she was Italian • How do we represent tenses?

  8. Ad hoc approaches to tackle the NLP problem • How can a machine understand these differences? • Decorate the cake with the frosting • Decorate the cake with the kids (i.e., with the kids' help) • Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules: • The verb decorate requires an animate being as agent • The object cake is formed by any of the following inanimate entities (cream, dough, frosting…) • Such approaches have been shown to be time-consuming to build, do not scale up well, and are very brittle when faced with new, unusual, or metaphorical uses of language • To swallow requires an animate being as agent/subject and a physical object as object • I swallowed his story • The supernova swallowed the planet
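The brittleness of hand-coded rules can be illustrated with a minimal Python sketch. The word lists and the `accepts` helper below are hypothetical, made up for illustration only; they are not taken from the presentation or from any real system.

```python
# Hypothetical hand-coded lexicon: every entry is an illustrative assumption.
ANIMATE = {"I", "kids", "chef", "dog"}
PHYSICAL_OBJECTS = {"cake", "pill", "planet", "frosting"}

# Selectional restrictions: verb -> (required agent class, required object class)
RULES = {
    "swallow": (ANIMATE, PHYSICAL_OBJECTS),
    "decorate": (ANIMATE, PHYSICAL_OBJECTS),
}


def accepts(agent, verb, obj):
    """Return True if the hand-coded rules license this agent/verb/object triple."""
    agent_class, object_class = RULES[verb]
    return agent in agent_class and obj in object_class


print(accepts("I", "swallow", "pill"))            # literal use: accepted
print(accepts("I", "swallow", "story"))           # metaphorical object: rejected
print(accepts("supernova", "swallow", "planet"))  # inanimate agent: rejected
```

Both rejected sentences are perfectly good English, so each one would need yet another hand-written exception, which is the scaling problem the slide describes.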

  9. Corpus-based statistical approaches to tackle the NLP problem • A statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from text collections (corpora) • Statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data • So: • Get large text collections • Compute statistics over those collections • (The bigger the collections, the better the statistics)
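As a rough illustration of "compute statistics over those collections", the sketch below counts bigrams in a toy corpus and estimates P(w2 | w1) by maximum likelihood. The three-sentence corpus and the `bigram_prob` helper are assumptions made for this example; a real corpus would be far larger.

```python
from collections import Counter

# Toy corpus reusing the example sentences; invented for illustration only.
corpus = [
    "decorate the cake with the frosting",
    "decorate the cake with the kids",
    "decorate the cake with the frosting",
]

contexts = Counter()  # how often each word appears as the first word of a bigram
bigrams = Counter()   # how often each (w1, w2) pair appears

for sentence in corpus:
    tokens = sentence.lower().split()
    contexts.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))


def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from the counts above."""
    if contexts[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / contexts[w1]


print(bigram_prob("the", "frosting"))
print(bigram_prob("the", "kids"))
```

With more text the estimates become more reliable, which is the point of the remark that bigger collections give better statistics.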

  10. Classify the document into semantic topics

  11. Counting word occurrences • From (labeled) corpora we can learn that: #(sport documents containing the word “Cup”) > #(disaster documents containing the word “Cup”), so the presence of “Cup” can serve as a feature • We then need a statistical model for the topic assignment
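The counting step, followed by one possible statistical model, might look like the sketch below. The labeled documents are invented, and the simplified Naive Bayes score (add-one smoothing, using only the words present in the document) is an assumption chosen for brevity; the slides do not say which model was used.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus; documents and labels are invented for illustration.
labeled_docs = [
    ("sport", "the team won the cup final"),
    ("sport", "the cup match ended in a draw"),
    ("sport", "fans cheered the winning goal"),
    ("disaster", "the flood destroyed the village"),
    ("disaster", "rescuers searched after the earthquake"),
    ("disaster", "a cup of water was all they had after the flood"),
]

doc_freq = defaultdict(Counter)  # topic -> word -> number of documents containing it
topic_docs = Counter()           # topic -> number of documents

for topic, text in labeled_docs:
    topic_docs[topic] += 1
    doc_freq[topic].update(set(text.lower().split()))


def classify(text):
    """Return the topic with the highest simplified Naive Bayes log score."""
    words = set(text.lower().split())
    total_docs = sum(topic_docs.values())
    best_topic, best_score = None, float("-inf")
    for topic, n_docs in topic_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        for word in words:
            # Add-one smoothed estimate of P(word present | topic)
            score += math.log((doc_freq[topic][word] + 1) / (n_docs + 2))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


print(doc_freq["sport"]["cup"], doc_freq["disaster"]["cup"])
print(classify("who won the cup"))
```

The document-frequency comparison for "cup" is exactly the kind of feature described above; the classifier simply combines many such features.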

  12. Title URL Information extraction

  13. System Architecture

  14. Text Zoning

  15. URL Finding Rules • Use patterns to capture URLs • Approaches for finding an event URL • Search the Summary zone • Search the whole document • Results
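A minimal sketch of pattern-based URL finding follows. The regular expression and the `find_event_url` helper are assumptions, and trying the summary zone first before falling back to the whole document is only one way to combine the two approaches listed; the original system's patterns are not shown in the slides.

```python
import re

# A deliberately simple URL pattern; real announcements may need a stricter one.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def find_event_url(summary_zone, whole_document):
    """Search the summary zone first, then fall back to the whole document."""
    for text in (summary_zone, whole_document):
        match = URL_PATTERN.search(text)
        if match:
            # Drop sentence punctuation that often trails a URL in running text.
            return match.group(0).rstrip(".,;)")
    return None


summary = "Workshop page: http://example.org/conf2010."
document = summary + " Sponsored by http://example.com/sponsor"
print(find_event_url(summary, document))
```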

  16. Date Finding Rules • Use patterns to capture dates • Use clues to find the corresponding date • submission-date < start-date <= end-date • no submission-date in a “Call for Participation” announcement • etc. • Results
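The date rules can be sketched in the same spirit. The single "Month D, YYYY" pattern and the helper names `find_dates` and `consistent` are assumptions for illustration; a real system would need many more date formats and contextual clues.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Matches dates such as "March 15, 2010"; other formats are not handled here.
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2}),\s*(\d{4})\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return every well-formed 'Month D, YYYY' date in order of appearance."""
    found = []
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            continue  # skip impossible dates such as "February 31, 2010"
    return found


def consistent(submission, start, end):
    """Check submission-date < start-date <= end-date.

    submission may be None, e.g. for a "Call for Participation" announcement.
    """
    if start > end:
        return False
    return submission is None or submission < start


text = "Submission deadline: March 15, 2010. Conference: July 11, 2010 to July 16, 2010."
dates = find_dates(text)
print(dates, consistent(*dates))
```

The ordering constraint is useful as a sanity check: if the candidate dates violate it, the assignment of roles to dates is probably wrong.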

  17. Current Performance

  18. Implementation Details • Python 2.6 • Gazetteer from http://world-gazetteer.com/ • Support Vector Machine http://svmlight.joachims.org/ • Natural Language Toolkit (NLTK) http://www.nltk.org/Home

  19. Web analytics • Data mining of weblogs, discussion forums, message boards, user groups, and other forms of user-generated media • Product marketing information • Political opinion tracking • Social network analysis • Buzz analysis (what’s hot, what topics people are talking about right now)

  20. Web analytics companies
