How do computers understand Texts?

Tobias Blanke

Presentation Transcript


  1. How do computers understand Texts? Tobias Blanke

  2. My contact details • Name: Tobias Blanke • Telephone: 020 7848 1975 • Email: tobias.blanke@kcl.ac.uk • Address: 51 Oakfield Road (!); N4 4LD

  3. Outline • How do computers understand texts so that you don’t have to read them? • The same steps throughout • We stay with searching for a long time • How to use text analysis for Linked Data • You will build your own Twitter miner

  4. Why? – A simple question … • Suppose you have a million documents and a question – what do you do? • Solution: the user reads all the documents in the store, retains the relevant documents and discards all the others – perfect retrieval… NOT POSSIBLE!!! • Alternative: use a high-speed computer to read the entire document collection and extract the relevant documents.

  5. Data Geeks are in demand New research by the McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent. http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters

  6. The Search Problem

  7. The problem of traditional text analysis is retrieval • Goal = find documents relevant to an information need from a large document set • [Diagram: an information need becomes a query; a “magic system” matches it against the document collection and returns an answer list.]

  8. Example: Google web search

  9. Search problem • First applications: in libraries (1950s). Example catalogue record: ISBN: 0-201-12227-8; Author: Salton, Gerard; Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer; Publisher: Addison-Wesley; Date: 1989; Content: <Text> • External attributes and internal attributes (content) • Search by external attributes = search in databases • IR: search by content

  10. Text Mining • Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text. • Task: discuss with your neighbour what a system needs to do to • Determine who is a terrorist • Determine the sentiment of a text

  11. The big picture • IR is easy… Let’s stay with search for a while

  12. Search is still the biggest application • Security applications: search for the villain • Biomedical applications: semantic search • Online media applications: disambiguate information • Sentiment analysis: find ‘nice’ movies • Human consumption is still key

  13. Why is the human so important? • Because we talk about information, and understanding remains a human domain • “There will be information on the Web that has a clearly defined meaning and can be analysed and traced by computer programs: there will be information, such as poetry and art, that requires the whole human intellect for an understanding that will always be subjective.” (Tim Berners-Lee, Spinning the Semantic Web) • “There is virtually no “semantics” in the semantic web. (…) Semantic content, in the Semantic Web, is generated by humans, ontologised by humans, and ultimately consumed by humans. Indeed, it is not unusual to hear complaints about how difficult it is to find and retain good ‘ontologists’.” (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)

  14. The Central Problem: The Human Information Seeker • [Diagram: authors encode concepts as document terms; searchers encode concepts as query terms. Do these represent the same concepts?]

  15. The Black Box • [Diagram: documents and a query go into a black box; results come out.] Slide is from Jimmy Lin’s tutorial

  16. Inside The IR Black Box • [Diagram: the query is turned into a query representation and the documents into document representations stored in an index; a comparison function matches the two to produce results.] Slide is from Jimmy Lin’s tutorial

  17. Possible approaches • 1. String matching (linear search in documents) – syntactical – difficult to improve • 2. Indexing – semantics – flexible, open to further improvement

  18. Indexing-based IR • [Diagram: documents and queries are both analysed and indexed into keyword representations; query evaluation then asks: “How is this document similar to the query/another document?”] Slide is from Jimmy Lin’s tutorial

  19. Main problems • Document indexing • How to best represent their contents? • Matching • To what extent does an identified information source correspond to a query/document? • System evaluation • How good is a system? • Are the retrieved documents relevant? (precision) • Are all the relevant documents retrieved? (recall)

  20. Indexing

  21. Document indexing • Goal = find the important meanings and create an internal representation • Factors to consider: • Accuracy to represent meanings (semantics) • Exhaustiveness (cover all the contents) • [Diagram: representations from string to word to phrase to concept trade coverage against accuracy.] Slide is from Jimmy Lin’s tutorial

  22. Text Representation Issues • In general, it is hard to capture such features from a text document • One, it is difficult to extract them automatically • Two, even if we did, it won’t scale! • One simplification is to represent documents as a bag of words • Each document is represented as a bag of the words it contains, and each component of the bag represents some measurement of the relative importance of a single word.

  23. Some immediate problems • How do we compare these bags of words to find out whether they are ‘similar’? • Let’s say we have three bags: • “House, Garden, House door” • “Household, Garden, Flat” • “House, House, House, Gardening” • How do we normalise these bags? • Why is normalisation needed? • What would we want to normalise? (See the sketch below.)
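
A minimal sketch of how such bags can be built and lightly normalised in Python, using the three bags above; the lowercasing and comma-stripping are our illustrative choices, not prescribed by the slide:

```python
from collections import Counter

# The three example "documents" from the slide.
docs = [
    "House, Garden, House door",
    "Household, Garden, Flat",
    "House, House, House, Gardening",
]

def bag_of_words(text):
    # One simple normalisation step: lowercase and drop commas before
    # counting, so "House" and "house" become the same term.
    tokens = text.lower().replace(",", " ").split()
    return Counter(tokens)

for doc in docs:
    print(bag_of_words(doc))
# Counter({'house': 2, 'garden': 1, 'door': 1})
# Counter({'household': 1, 'garden': 1, 'flat': 1})
# Counter({'house': 3, 'gardening': 1})
```

Note that lowercasing alone does not conflate ‘Garden’ with ‘Gardening’ or ‘House’ with ‘Household’ – that is what stemming (slide 30) addresses.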

  24. Keyword selection and weighting • How to select important keywords?

  25. Luhn’s Ideas • Frequency of word occurrence in a document is a useful measurement of word significance

  26. Zipf and Luhn

  27. Top 50 Terms • [Tables of the 50 most frequent terms in two collections:] the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences), and the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences)

  28. Scholarship and the Long Tail • Scholarship follows a long-tailed distribution: interest in relatively unknown items declines much more slowly than it would if popularity were described by a normal distribution • We have few statistical tools for dealing with long-tailed distributions • Other problems include ‘contested terms’ Graham White, "On Scholarship" (in Bartscherer ed., Switching Codes)

  29. Stopwords / Stoplist • Some words do not bear useful information. Common examples: of, in, about, with, I, although, … • A stoplist contains stopwords that are not to be used as index terms • Prepositions • Articles • Pronouns • http://www.textfixer.com/resources/common-english-words.txt
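
A minimal sketch of stopword filtering, using a tiny hand-picked stoplist for illustration; a real system would load a fuller list such as the one linked above:

```python
# A tiny illustrative stoplist; the slide links a fuller list
# (textfixer.com/resources/common-english-words.txt).
STOPWORDS = {"of", "in", "about", "with", "i", "although", "a", "the", "is", "this"}

def remove_stopwords(tokens):
    # Keep only the tokens that are not on the stoplist.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("This is a document in text analysis".split()))
# ['document', 'text', 'analysis']
```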

  30. Stemming • Reason: different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them • Stemming: removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput • Is it always good to stem? Give examples! Slide is from Jimmy Lin’s tutorial

  31. Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137) http://qaa.ath.cx/porter_js_demo.html • Step 1: plurals and past participles • SSES → SS: caresses → caress • (*v*) ING → ∅: motoring → motor • Step 2: adj→n, n→v, n→adj, … • (m>0) OUSNESS → OUS: callousness → callous • (m>0) ATIONAL → ATE: relational → relate • Step 3: • (m>0) ICATE → IC: triplicate → triplic • Step 4: • (m>1) AL → ∅: revival → reviv • (m>1) ANCE → ∅: allowance → allow • Step 5: • (m>1) E → ∅: probate → probat • (m>1 and *d and *L) → single letter: controll → control Slide is from Jimmy Lin’s tutorial
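
NLTK ships an implementation of Porter’s algorithm, so one way to experiment with it (assuming NLTK is installed) is:

```python
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "motoring", "callousness", "relational",
             "computing", "computation"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, motoring -> motor, computing -> comput
# (the full algorithm applies all five steps, so outputs can differ from
# the single-step examples on the slide, e.g. relational -> relat)
```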

  32. Lemmatization • Transform to the standard form according to syntactic category (produce vs. produc-), e.g. verb + ing → verb; noun + s → noun • Needs POS tagging • More accurate than stemming, but needs more resources Slide partly taken from Jimmy Lin’s tutorial
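
A small sketch using NLTK’s WordNet-based lemmatizer – one possible tool, assumed here for illustration; any POS-aware lemmatizer would do:

```python
# Requires: pip install nltk, plus the WordNet data:
#   python -m nltk.downloader wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The POS tag tells the lemmatizer which syntactic category to use.
print(lemmatizer.lemmatize("producing", pos="v"))  # produce
print(lemmatizer.lemmatize("documents", pos="n"))  # document
```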

  33. Index Documents (Bag of Words Approach) • [Diagram: the document “This is a document in text analysis” is indexed into the terms document, analysis, text, is, this.]

  34. Result of indexing • Each document is represented by a set of weighted keywords (terms): D1 → {(t1, w1), (t2, w2), …} e.g. D1 → {(comput, 0.2), (architect, 0.3), …} D2 → {(comput, 0.1), (network, 0.5), …} • Inverted file: comput → {(D1, 0.2), (D2, 0.1), …} The inverted file is used during retrieval for higher efficiency. Slide partly taken from Jimmy Lin’s tutorial

  35. Inverted Index Example • [Diagram: Doc 1 = “This is a sample document with one sample sentence”, Doc 2 = “This is another sample document”; a dictionary maps each term to a postings list of the documents in which it occurs.] Slide is from ChengXiang Zhai
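
A minimal sketch of building such a dictionary and postings lists from the two sample documents, using (doc id, term frequency) pairs as postings:

```python
from collections import defaultdict

docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}

# Dictionary: term -> postings list of (doc_id, term frequency) pairs.
index = defaultdict(list)
for doc_id, text in docs.items():
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    for term, freq in counts.items():
        index[term].append((doc_id, freq))

print(index["sample"])   # [(1, 2), (2, 1)]
print(index["another"])  # [(2, 1)]
```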

  36. Similarity

  37. Similarity Models • Boolean model • Vector-space model • Many more

  38. Boolean model • Document = logical conjunction of keywords • Query = Boolean expression of keywords, e.g. D = t1 ∧ t2 ∧ … ∧ tn, Q = (t1 ∧ t2) ∨ (t3 ∧ t4) • Problems: • often returns too many or too few documents • End-users cannot manipulate Boolean operators correctly, e.g. documents about poverty AND crime
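
With documents represented as sets of terms, Boolean AND reduces to set intersection. A toy sketch – the term sets are invented for illustration:

```python
# Toy documents as sets of terms (invented for illustration).
docs = {
    "D1": {"poverty", "crime", "statistics"},
    "D2": {"poverty", "policy"},
    "D3": {"crime", "fiction"},
}

def boolean_and(term1, term2):
    # AND = the set of documents whose term sets contain both terms.
    return {d for d, terms in docs.items() if term1 in terms and term2 in terms}

print(boolean_and("poverty", "crime"))  # {'D1'}
```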

  39. Vector space model • Vector space = all the keywords encountered <t1, t2, t3, …, tn> • Document D = < a1, a2, a3, …, an> ai = weight of ti in D • Query Q = < b1, b2, b3, …, bn> bi = weight of ti in Q • R(D,Q) = Sim(D,Q)

  40. Cosine Similarity • [Diagram: two document vectors dj and dk separated by angle θ.] • Similarity is calculated as the cosine of the angle between the two vectors
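
A direct translation of the cosine formula into Python; the example vectors are invented for illustration:

```python
import math

def cosine_similarity(d_j, d_k):
    # cos(theta) = (d_j . d_k) / (|d_j| * |d_k|)
    dot = sum(a * b for a, b in zip(d_j, d_k))
    norm_j = math.sqrt(sum(a * a for a in d_j))
    norm_k = math.sqrt(sum(b * b for b in d_k))
    return dot / (norm_j * norm_k)

# Two toy document vectors over a three-term vocabulary.
print(cosine_similarity([1, 2, 0], [2, 1, 1]))  # ~0.73
```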

  41. Tf/Idf • tf = term frequency • frequency of a term/keyword in a document: the higher the tf, the higher the importance (weight) for the doc • df = document frequency • no. of documents containing the term • distribution of the term • idf = inverse document frequency • the unevenness of term distribution in the corpus • the specificity of a term to a document: the more evenly a term is distributed, the less specific it is to a document • weight(t,D) = tf(t,D) * idf(t)
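
The slide gives weight(t,D) = tf(t,D) * idf(t) but no formula for idf; a common choice, assumed in this sketch, is idf(t) = log(N / df(t)):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    tf = doc_tokens.count(term)               # term frequency in this doc
    df = sum(1 for d in corpus if term in d)  # document frequency in the corpus
    # Assumed idf definition: the rarer the term, the higher the weight.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf                           # weight(t,D) = tf(t,D) * idf(t)

corpus = [["house", "garden"], ["house", "flat"], ["garden", "door"]]
print(tf_idf("garden", corpus[0], corpus))  # 1 * log(3/2) ~ 0.41
```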

  42. Exercise • (1) Define the term/document matrix for: • D1: The silver truck arrives • D2: The silver cannon fires silver bullets • D3: The truck is on fire • (2) Compute TF/IDF from Reuters
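
A possible starting sketch for part (1); the naive tokenisation (lowercase split, no stoplist, no stemming) is deliberate, since refining it is part of the exercise:

```python
# D1-D3 from the exercise above.
docs = [
    "The silver truck arrives",
    "The silver cannon fires silver bullets",
    "The truck is on fire",
]

tokenised = [d.lower().split() for d in docs]
vocabulary = sorted({t for doc in tokenised for t in doc})

# Term/document matrix: rows = terms, columns = D1-D3, cells = raw counts.
for term in vocabulary:
    print(f"{term:8s}", [doc.count(term) for doc in tokenised])
```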

  43. Let’s code our first text analysis engine search.pl

  44. Our corpus • A study on Kant’s critique of judgement • Aristotle's Metaphysics • Hegel’s Aesthetics • Plato’s Charmides • McGreedy’s War Diaries • Excerpts from the Royal Irish Society

  45. Text Analysis is an Experimental Science!

  46. Text Analysis is an Experimental Science! • Formulate a hypothesis • Design an experiment to answer the question • Perform the experiment • Does the experiment answer the question? • Rinse, repeat…

  47. Test Collections • Three components of a test collection: • a collection of documents • a set of topics • sets of relevant documents based on expert judgments • Metrics for assessing ‘performance’: • Precision • Recall

  48. Precision vs. Recall • [Diagram: Venn diagram of all docs, the retrieved set and the relevant set; precision and recall measure their overlap.] Slide taken from Jimmy Lin’s tutorial
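
A minimal sketch of both metrics; the answer sets are hypothetical:

```python
def precision_recall(retrieved, relevant):
    hits = set(retrieved) & set(relevant)
    precision = len(hits) / len(retrieved)  # share of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # share of relevant docs that were retrieved
    return precision, recall

# Hypothetical retrieved and relevant sets for illustration.
print(precision_recall({"D1", "D2", "D3"}, {"D2", "D3", "D4", "D5"}))
# (0.666..., 0.5)
```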

  49. The TREC experiments • Once per year • A set of documents and queries is distributed to the participants (the standard answers are unknown) (April) • Participants work (very hard) to construct and fine-tune their systems and submit their answers (1000/query) by the deadline (July) • NIST assessors manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August) • TREC conference (November)

  50. Towards Linked Data: Beyond the Simple Stuff
