1 / 126

Natural Language Processing Applications

Natural Language Processing Applications. Lecture 7 Fabienne Venant Université Nancy2 / Loria. Information Retrieval. What is Information Retrieval?.

Sharon_Dale
Télécharger la présentation

Natural Language Processing Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Natural Language Processing Applications Lecture 7 Fabienne Venant Université Nancy2 / Loria

  2. Information Retrieval

  3. What is Information Retrieval? • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) • Applications: • Many universities and public libraries use IR systems to provide access to books journals and other documents. • Web search • Large volumes of unstable, unstructured dat • Speed is important • Cross-language IR • Finding documents written in another language • Touches on Machine translation • ....

  4. Concerns • The set of texts can be very large hence hence efficiency is a concern • Textual data is noisy, incomplete and untrustworthy hence robustness is a concern • Information may be hidden: • Need to derive information from raw data • Need to derive information from vaguely expressed needs

  5. IR Basic concepts • Information needs : queries and relevance • Indexing: helps speeding up retrieval • Retrieval models: describe how to search and recover relevant documents • Evaluation: IR systems are large and convincing evaluation is tricky

  6. Information needs

  7. Information needs • INFORMATION NEED : the topic about which the user desires to know more • QUERY : what the user conveys to the computer in an attempt to communicate the information need • RELEVANCE : a document is relevant if it is one that the user perceives as containing information of value wrt their personal information need Ex : • topic “pipeline leaks” • relevant documents : doesn’t matter if they use those words or express the concept with other words such a « pipeline rupture ».

  8. Capturing information needs • Information needs can be hard to capture • One possibility : use natural language • Advantage: expressive enough to allow all needs to be described • Drawbacks: • Semantic analysis of arbitrary NL is very hard • Users may not want to type full blown sentences into a search engine

  9. Queries

  10. Queries • Information needs are typically expressed as a query : • Where shall I go on holiday? holiday destinations • Two main types of possible queries • How much blood does the human heart pump in one minute? • Boolean queries :  heart AND blood AND minutes • Web types queries :  human biology

  11. Remarks • A query : • is usually quite short and incomplete; • may contain misspelled or poorly selected words • may contain too many or too few words • The information need : • may be difficult to describe precisely,especially when the user isn't familiar about the topic • Precise understanding of the document content is difficult.

  12. Persistent vs one-off Queries Queries might or not evolve over times • Persistent queries : • predefined and routinely performed : • Top ten performing shares today • Continuous queries : persistent queries that allow users to receive new results when they become available • typical of Information extraction and News Routing systems • One-off (or ad-hoc) queries • created to obtain information as the need arises • typical of Web searching

  13. Relevance • Relevance is subjective • ’python’ : ambiguous but not for user • Topicality vs. Utility: a document is relevant wrt a specific Goal  A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query. • Relevance is a gradual concept (a document is not just relevant or not; it is more or less relevant to a query) • IR systems usually rank retrieved documents by relevance • But many algorithm use a binary decision of relevance.

  14. The big picture

  15. Terminology • An IR system looks for data matching some criteria defined by the users in their queries. • The langage used to ask a question is called the query language. • These queries use keywords (atomic items characterizing some data). • The basic unit of data is a document (can be a file, an article, a paragraph, etc.). • A document corresponds to free text (may be unstructured). • All the documents are gathered into a collection (or corpus).

  16. Searching for a given word in a document • One way to do that is to start at the beginning and to read through all the text • Pattern matching (re) + speed of modern computer grepping through tex can be a very effective • Enough for simple querying of modest collections (millions of words) • But for many purposes, you do need more: • To process large document collections (billions ot trillions of words) quickly. • To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”. • To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words -- >You need an Index

  17. Index

  18. Motivation for Indexing • Extremely large dataset • Only a tiny fraction of the dataset is relevant to a given query • Speed is essential (0.25 second for web searching) • Indexing helps speedup retrieval

  19. Indexing documents • How to relate the user’s information need with some documents’ content ? • Idea : using an index to refer to documents • Usually an index is a list of terms that appear in a document, it can be represented mathematically as: index : doci→ {Uj keywordj} • Here, the kind of index we use maps keywords to the list of documents they appear in: index′ : keywordj → {Ui doci} • We call this an inverted index.

  20. Indexing documents • The set of keywords is usually called the dictionary (or vocabulary) • A document identifier appearing in the list associated with a keyword is called a posting • The list of document identifiers associated with a given keyword is called a posting list

  21. Inverted files The most common indexing technique • Source file: collection organised by documents • Inverted file: collection organised by terms

  22. Inverted Index • Given a dictionary of terms (also called vocabulary or vocabulary lexicon) • For each term, record in a list which documents the term occurs in • Each item in the list: • records that a term appeared in a document • and, later, often, the positions in the document • is conventionally called a posting • The list is then called a postings list (or inverted list),

  23. Inverted Index From « an introduction to information retrieval », C.D. Manning,P. Raghavan and H.Schütze

  24. Exercise Draw the inverted index that would be built for the following document collection • Doc 1 breakthrough drug for schizophrenia • Doc 2 new schizophrenia drug • Doc 3 new approach for treatment of schizophrenia • Doc 4 new hopes for schizophrenia patients For this document collection, what are the returned results for these queries: • schizophrenia AND drug • schizophrenia AND NOT(drug OR approach)

  25. Indexing documents • Arising questions: how to build an index automatically ? What are the relevant keywords ? • Some additional desiderata: • fast processing of large collections of documents, • having flexible matching operations (robust retrieval), • having the possibility to rank the retrieved document in terms of relevance • To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance • Note that the format of the index has a huge impact on the performances of the system

  26. Indexing documents NB: an index is built in 4 steps: • Gathering of the collection (each document is given a unique identifier) • Segmentation of each document into a list of atomic tokens  tokenization • Linguistic processing of the tokens in order to normalize them lemmatizing. • Indexing the documents by computing the dictionary and lists of postings

  27. Manual indexing • Advantages • Human judgement are most reliable • Retrieval is better • Drawbacks • Time consuming • Not always consistent • different people build different indexes for the same document.

  28. Automatic indexing • Using NLU? • Not fast enough in real world settings (e.g., web search) • Not robust enough (low coverage) • Difficulty : what to include and what to exclude. • Indexes should not contain headings for topics for which there is no information in the document • Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?

  29. Building the vocabulary

  30. Stop list • The members of which are discarded during indexing • some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. • These words are called STOP WORDS • Collection strategy : • Sort the terms by collection frequency (the total number of times each term appears in the document collection), • Take the most frequent terms • often hand-filtered for their semantic content relative to the domain of the documents being indexed • What counts as a stop word depends on the collection • in a collection of legal article law can be considered a stop word • Ex: • a an and are as at be by for from has he in is it its of on that the to was were will with

  31. Why eliminate stop words? • Efficiency • Eliminating stop words reduces the size of the index considerably • Eliminating stop words reduces retrieval time considerably • “Quality of results” • Most of the time not indexing stop words does little harm • keyword searches with terms like the and by don’t seem very useful • BUT, this is not true for phrase searches. • The phrase query “President of the United States” is more precise than President AND “United States”. • The meaning of “ flights to London “ is likely to be lost if the word to is stopped out. • .....

  32. Building the vocabulary • Processing a stream of characters to extract keywords • 1st task: tokenization, main difficulties: • token delimiters (ex: Chinese) • apostrophes (ex: O’neill, Finland’s capital) • hyphens (ex: Hewlett-Packard, state-of-the-art) • segmented compound nouns (ex: Los Angeles) • unsegmented compound nouns (icecream, breadknife) • numerical data (dates, IP addresses) • word order (ex: Arabic wrt nouns and numbers)

  33. Solutions for tokenization issues: • Using a pre-defined dictionary with largest matches and heuristics for unknown words • Using learning algorithms trained over hand-segmented words

  34. Choosing keywords • Selecting the words that are most likely to appear in a query • These words characterize the documents they appear in • Which are they?

  35. The bag of words approach • Extreme interpretation of the the principle of compositional semnaics • The meaning of documents resides solely in the words that are contained within them • The exact ordering of the terms in a document is ignored but the number of occurrences of each term is material

  36. BoW “Not the same thing a bit!” said the Hatter. “You might just as well say that ‘I see what Ieat’ is the same thing as ‘I eat what I see’!” “You might just as well say,” added the March Hare, “that ‘I like what I get’ is the same thing as ‘I get what I like’!” “You might just as well say,” added the Dormouse, who seemed to be talking in its sleep, “that ‘I breathe when I sleep’ is the same thing as ‘I sleep when I breathe’!”

  37. Bags of words • Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content..

  38. What’s in a bag of words? • Are all words in a document equally important? • stop words do not contribute in any way to retrieval and scoring • BoW contain terms • What should count as a term? • Words • Phrases (e.g., president of the US)

  39. Morphological normalization • Should index terms be word forms, lemmas or stems? • Matching morphological variants increase recall • Example morphological variants : • anticipate, anticipating, anticipated, anticipation • Company/Companies, sell/sold • USA vs U.S.A., • 22/10/2007 vs 10/22/2007 vs 2007/10/22 • university vs University • Idea: using equivalence classes of terms, • ex: { Opel, OPEL, opel }  opel • Two techniques: • stemming : refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time • Lemmatisation : refers to doing things,properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return a dictionary form of a word, which is known as the lemma. • NB: documents and queries have to be processed using the same tokenization process !

  40. Stemming and Lemmatization • Role: reducing inflectional forms to common base forms, • Example: • car, cars, car’s, cars’  car • am, are, is  be • Stemming removes suffixes (surface markers) to produce root forms • Lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser) • Illustration of the difficulty: • plurals (woman/women, crisis/crisis) • derivational morphology (automatize/automate) • English  Porter stemming algorithm (University of Cambridge, UK, 1980)

  41. Porter stemmer • Algorithm based on a set of context-sensitive rewriting rules http://tartarus.org/~martin/PorterStemmer/index.html http://tartarus.org/~martin/PorterStemmer/def.txt • Rules are composed of a pattern (left-hand-side) and a string (right-hand-side), example: (.*)sses  \1 ss sses  ss : caresses  caress (.* [aeiou].*)ies  \1i ies  i : ponies  poni, ties ti (.* [aeiou].*)ss  \1 ss ss  ss : caress  caress • Rules may be constrained by conditions on the word’s measure, example: (m > 1) (.*)ement  \1 replacement replac but not cement c (m>0) (.*)eed -> \1ee feed -> feed but agreed -> agree (*v*) ed -> \1 plastered -> plaster but bled -> bled (*v*) ing -> \1 motoring -> motor but sing -> sing

  42. Porter StemmerWord measure • Assumed that a list of consonants is denoted by C, and a list of vowels by V • Any word, or part of a word has one of the four forms: • CVCV ... C • CVCV ... V • VCVC ... C • VCVC ... V • These may all be represented by the single form • [C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents. • Using (VC)m to denote VC repeated m times, this may again be written as • [C](VC)m[V]. • m will be called the measure of any word or word part when represented in this form. • Here are some examples: • m=0 TR,   EE,   TREE,   Y,   BY • m=1 TROUBLE,   OATS,   TREES,   IVY • m=2 TROUBLES,   PRIVATE,   OATEN,   ORRERY. • (m > 1) EMENT -> • This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.

  43. Exercise • What is the Porter measure of the following words (give your computation) ? • crepuscular • rigorous • placement cr ep usc ul ar C VC VC VC VC m = 4 r ig or ous C VC VC VC m = 3 pl ac em ent C VC VC VC m = 3

  44. Stemming • Most stemmers also removes suffixes such as ed, ing, ational, ation, able, ism... • Relational  relate • Most stemmers don’t use lexical look up • There are shortcomings: • Stemming can result in non-words • Organization  Organ • Doing  doe • Unrelated words can be reduced to the same stem • police, policy polic

  45. Stemming • Popular stemmers • Porter’s • Lovin’s • Iterated Lovin’s • Kstem

  46. Lemmatization • Exceptions needs to be handled: • sought  seek, sheep sheep, feet foot • Computationally more expensive than stemming as it lookups words in a dictionnary • Lemmatizer for French • http://bach.arts.kuleuven.be/pmertens/morlex/ • FLEMM (F. Namer) • POS taggers with lemmatization: TreeTagger, LT-POS

  47. What is actually used? • Most retrieval systems use stemming/lemmatising and stop word lists • Stemming increases recall while harming precision • Most web search engines do use stop word lists but not stemming/lemmatising because • the text collection is extremely large so that the change of matching morphogical variants is higher • recall is not an issue • stemming is imperfect and the size and diversity of the web increase the chance of a mismatch • stemming/tokenising tools are available for few languages

  48. Example Text Representations Scientists have found compelling new evidence of possible ancient microscopic life on mars, derived from magnetic crystals in a meteorit that fell to Earth from the red planet, NASA anounced on Monday. Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, anounced, Monday Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, anounce, Monday

  49. Granularity • Document unit : • An index can map terms • ... to documents • ... to paragraphs in documents • ... to sentences in document • ... to positions in documents • An IR system should be designed to offer choices of granularity. • For now, we will henceforth assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.

  50. Index Content • The index usually stores some or all of the following information: • For each term: • Document count. How many documents the term occurs in. • Total Frequency count. How many times the term occurs accross all documents  “popularity measure” • For each term and for each document: • Frequency : How often the term occurs in that document. • Position. The offsets at which the term occurs in that document.

More Related