How do computers understand Texts?

Tobias Blanke

Presentation Transcript


  1. How do computers understand Texts? Tobias Blanke

  2. My contact details • Name: Tobias Blanke • Telephone: 020 7848 1975 • Email: tobias.blanke@kcl.ac.uk • Address: 51 Oakfield Road (!); N4 4LD

  3. Outline • How do computers understand texts so that you don’t have to read them? • The same steps throughout • We stay with searching for a long time • How to use text analysis for Linked Data • You will build your own Twitter miner

  4. Why? – A simple question … • Suppose you have a million documents and a question – what do you do? • Solution: the user reads all the documents in the store, retains the relevant documents and discards all the others – perfect retrieval… NOT POSSIBLE!!! • Alternative: use a high-speed computer to read the entire document collection and extract the relevant documents.

  5. Data Geeks are in demand New research by the McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent. http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters

  6. The Search Problem

  7. The problem of traditional text analysis is retrieval • Goal = find documents relevant to an information need from a large document set • [Diagram: an information need becomes a query; a “magic system” matches it against the document collection and returns an answer list.]

  8. Example: Google web search

  9. Search problem • First applications: in libraries (1950s). Example catalogue record: ISBN: 0-201-12227-8; Author: Salton, Gerard; Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer; Publisher: Addison-Wesley; Date: 1989; Content: <Text> • External attributes and internal attributes (content) • Search by external attributes = search in databases • IR: search by content

  10. Text Mining • Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text. • Task: discuss with your neighbour what a system needs to do to • Determine who is a terrorist • Determine the sentiment of a text

  11. The big picture • IR is easy… Let’s stay with search for a while

  12. Search is still the biggest application • Security applications: search for the villain • Biomedical applications: semantic search • Online media applications: disambiguate information • Sentiment analysis: find ‘nice’ movies • Human consumption is still key

  13. Why is the human so important? • Because we talk about information, and understanding remains a human domain • “There will be information on the Web that has a clearly defined meaning and can be analysed and traced by computer programs: there will be information, such as poetry and art, that requires the whole human intellect for an understanding that will always be subjective.” (Tim Berners-Lee, Spinning the Semantic Web) • “There is virtually no “semantics” in the semantic web. (…) Semantic content, in the Semantic Web, is generated by humans, ontologised by humans, and ultimately consumed by humans. Indeed, it is not unusual to hear complaints about how difficult it is to find and retain good ‘ontologists’.” (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)

  14. The Central Problem: The Human Information Seeker • [Diagram: authors encode concepts as document terms; searchers encode concepts as query terms. Do these represent the same concepts?]

  15. The Black Box • [Diagram: documents and a query go into a black box; results come out.] Slide is from Jimmy Lin’s tutorial

  16. Inside The IR Black Box • [Diagram: the query is turned into a query representation and the documents into document representations stored in an index; a comparison function matches the two to produce results.] Slide is from Jimmy Lin’s tutorial

  17. Possible approaches • 1. String matching (linear search in documents) – syntactical – difficult to improve • 2. Indexing – semantics – flexible, open to further improvement

  18. Indexing-based IR • [Diagram: documents and queries are both analysed and indexed into keyword representations; query evaluation then asks: “How is this document similar to the query/another document?”] Slide is from Jimmy Lin’s tutorial

  19. Main problems • Document indexing • How to best represent their contents? • Matching • To what extent does an identified information source correspond to a query/document? • System evaluation • How good is a system? • Are the retrieved documents relevant? (precision) • Are all the relevant documents retrieved? (recall)

  20. Indexing

  21. Document indexing • Goal = find the important meanings and create an internal representation • Factors to consider: • Accuracy to represent meanings (semantics) • Exhaustiveness (cover all the contents) • [Diagram: representations from string to word to phrase to concept trade coverage against accuracy.] Slide is from Jimmy Lin’s tutorial

  22. Text Representation Issues • In general, it is hard to capture such features from a text document • One, it is difficult to extract them automatically • Two, even if we did, it won’t scale! • One simplification is to represent documents as a bag of words • Each document is represented as a bag of the words it contains, and each component of the bag represents some measurement of the relative importance of a single word.

  23. Some immediate problems • How do we compare these bags of words to find out whether they are ‘similar’? • Let’s say we have three bags: • “House, Garden, House door” • “Household, Garden, Flat” • “House, House, House, Gardening” • How do we normalise these bags? • Why is normalisation needed? • What would we want to normalise? (See the sketch below.)
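
A minimal sketch of how such bags can be built and lightly normalised in Python, using the three bags above; the lowercasing and comma-stripping are our illustrative choices, not prescribed by the slide:

```python
from collections import Counter

# The three example "documents" from the slide.
docs = [
    "House, Garden, House door",
    "Household, Garden, Flat",
    "House, House, House, Gardening",
]

def bag_of_words(text):
    # One simple normalisation step: lowercase and drop commas before
    # counting, so "House" and "house" become the same term.
    tokens = text.lower().replace(",", " ").split()
    return Counter(tokens)

for doc in docs:
    print(bag_of_words(doc))
# Counter({'house': 2, 'garden': 1, 'door': 1})
# Counter({'household': 1, 'garden': 1, 'flat': 1})
# Counter({'house': 3, 'gardening': 1})
```

Note that lowercasing alone does not conflate ‘Garden’ with ‘Gardening’ or ‘House’ with ‘Household’ – that is what stemming (slide 30) addresses.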

  24. Keyword selection and weighting • How to select important keywords?

  25. Luhn’s Ideas • Frequency of word occurrence in a document is a useful measurement of word significance

  26. Zipf and Luhn

  27. Top 50 Terms • [Tables of the 50 most frequent terms in two collections:] the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences), and the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences)

  28. Scholarship and the Long Tail • Scholarship follows a long-tailed distribution: interest in relatively unknown items declines much more slowly than it would if popularity were described by a normal distribution • We have few statistical tools for dealing with long-tailed distributions • Other problems include ‘contested terms’ Graham White, "On Scholarship" (in Bartscherer ed., Switching Codes)

  29. Stopwords / Stoplist • Some words do not bear useful information. Common examples: of, in, about, with, I, although, … • A stoplist contains stopwords that are not to be used as index terms • Prepositions • Articles • Pronouns • http://www.textfixer.com/resources/common-english-words.txt
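
A minimal sketch of stopword filtering, using a tiny hand-picked stoplist for illustration; a real system would load a fuller list such as the one linked above:

```python
# A tiny illustrative stoplist; the slide links a fuller list
# (textfixer.com/resources/common-english-words.txt).
STOPWORDS = {"of", "in", "about", "with", "i", "although", "a", "the", "is", "this"}

def remove_stopwords(tokens):
    # Keep only the tokens that are not on the stoplist.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("This is a document in text analysis".split()))
# ['document', 'text', 'analysis']
```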

  30. Stemming • Reason: different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them • Stemming: removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput • Is it always good to stem? Give examples! Slide is from Jimmy Lin’s tutorial

  31. Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137) http://qaa.ath.cx/porter_js_demo.html • Step 1: plurals and past participles • SSES → SS: caresses → caress • (*v*) ING → ∅: motoring → motor • Step 2: adj→n, n→v, n→adj, … • (m>0) OUSNESS → OUS: callousness → callous • (m>0) ATIONAL → ATE: relational → relate • Step 3: • (m>0) ICATE → IC: triplicate → triplic • Step 4: • (m>1) AL → ∅: revival → reviv • (m>1) ANCE → ∅: allowance → allow • Step 5: • (m>1) E → ∅: probate → probat • (m>1 and *d and *L) → single letter: controll → control Slide is from Jimmy Lin’s tutorial
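
NLTK ships an implementation of Porter’s algorithm, so one way to experiment with it (assuming NLTK is installed) is:

```python
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "motoring", "callousness", "relational",
             "computing", "computation"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, motoring -> motor, computing -> comput
# (the full algorithm applies all five steps, so outputs can differ from
# the single-step examples on the slide, e.g. relational -> relat)
```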

  32. Lemmatization • Transform to the standard form according to syntactic category (produce vs. produc-), e.g. verb + ing → verb; noun + s → noun • Needs POS tagging • More accurate than stemming, but needs more resources Slide partly taken from Jimmy Lin’s tutorial
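
A small sketch using NLTK’s WordNet-based lemmatizer – one possible tool, assumed here for illustration; any POS-aware lemmatizer would do:

```python
# Requires: pip install nltk, plus the WordNet data:
#   python -m nltk.downloader wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The POS tag tells the lemmatizer which syntactic category to use.
print(lemmatizer.lemmatize("producing", pos="v"))  # produce
print(lemmatizer.lemmatize("documents", pos="n"))  # document
```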

  33. Index Documents (Bag of Words Approach) • [Diagram: the document “This is a document in text analysis” is indexed into the terms document, analysis, text, is, this.]

  34. Result of indexing • Each document is represented by a set of weighted keywords (terms): D1 → {(t1, w1), (t2, w2), …} e.g. D1 → {(comput, 0.2), (architect, 0.3), …} D2 → {(comput, 0.1), (network, 0.5), …} • Inverted file: comput → {(D1, 0.2), (D2, 0.1), …} The inverted file is used during retrieval for higher efficiency. Slide partly taken from Jimmy Lin’s tutorial

  35. Inverted Index Example • [Diagram: Doc 1 = “This is a sample document with one sample sentence”, Doc 2 = “This is another sample document”; a dictionary maps each term to a postings list of the documents in which it occurs.] Slide is from ChengXiang Zhai
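
A minimal sketch of building such a dictionary and postings lists from the two sample documents, using (doc id, term frequency) pairs as postings:

```python
from collections import defaultdict

docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}

# Dictionary: term -> postings list of (doc_id, term frequency) pairs.
index = defaultdict(list)
for doc_id, text in docs.items():
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    for term, freq in counts.items():
        index[term].append((doc_id, freq))

print(index["sample"])   # [(1, 2), (2, 1)]
print(index["another"])  # [(2, 1)]
```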

  36. Similarity

  37. Similarity Models • Boolean model • Vector-space model • Many more

  38. Boolean model • Document = logical conjunction of keywords • Query = Boolean expression of keywords, e.g. D = t1 ∧ t2 ∧ … ∧ tn, Q = (t1 ∧ t2) ∨ (t3 ∧ t4) • Problems: • often returns too many or too few documents • End-users cannot manipulate Boolean operators correctly, e.g. documents about poverty AND crime
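
With documents represented as sets of terms, Boolean AND reduces to set intersection. A toy sketch – the term sets are invented for illustration:

```python
# Toy documents as sets of terms (invented for illustration).
docs = {
    "D1": {"poverty", "crime", "statistics"},
    "D2": {"poverty", "policy"},
    "D3": {"crime", "fiction"},
}

def boolean_and(term1, term2):
    # AND = the set of documents whose term sets contain both terms.
    return {d for d, terms in docs.items() if term1 in terms and term2 in terms}

print(boolean_and("poverty", "crime"))  # {'D1'}
```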

  39. Vector space model • Vector space = all the keywords encountered <t1, t2, t3, …, tn> • Document D = < a1, a2, a3, …, an> ai = weight of ti in D • Query Q = < b1, b2, b3, …, bn> bi = weight of ti in Q • R(D,Q) = Sim(D,Q)

  40. Cosine Similarity • [Diagram: two document vectors dj and dk separated by angle θ.] • Similarity is calculated as the cosine of the angle between the two vectors
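
A direct translation of the cosine formula into Python; the example vectors are invented for illustration:

```python
import math

def cosine_similarity(d_j, d_k):
    # cos(theta) = (d_j . d_k) / (|d_j| * |d_k|)
    dot = sum(a * b for a, b in zip(d_j, d_k))
    norm_j = math.sqrt(sum(a * a for a in d_j))
    norm_k = math.sqrt(sum(b * b for b in d_k))
    return dot / (norm_j * norm_k)

# Two toy document vectors over a three-term vocabulary.
print(cosine_similarity([1, 2, 0], [2, 1, 1]))  # ~0.73
```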

  41. Tf/Idf • tf = term frequency • frequency of a term/keyword in a document: the higher the tf, the higher the importance (weight) for the doc • df = document frequency • no. of documents containing the term • distribution of the term • idf = inverse document frequency • the unevenness of term distribution in the corpus • the specificity of a term to a document: the more evenly a term is distributed, the less specific it is to a document • weight(t,D) = tf(t,D) * idf(t)
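
The slide gives weight(t,D) = tf(t,D) * idf(t) but no formula for idf; a common choice, assumed in this sketch, is idf(t) = log(N / df(t)):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    tf = doc_tokens.count(term)               # term frequency in this doc
    df = sum(1 for d in corpus if term in d)  # document frequency in the corpus
    # Assumed idf definition: the rarer the term, the higher the weight.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf                           # weight(t,D) = tf(t,D) * idf(t)

corpus = [["house", "garden"], ["house", "flat"], ["garden", "door"]]
print(tf_idf("garden", corpus[0], corpus))  # 1 * log(3/2) ~ 0.41
```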

  42. Exercise • (1) Define the term/document matrix for: • D1: The silver truck arrives • D2: The silver cannon fires silver bullets • D3: The truck is on fire • (2) Compute TF/IDF from Reuters
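
A possible starting sketch for part (1); the naive tokenisation (lowercase split, no stoplist, no stemming) is deliberate, since refining it is part of the exercise:

```python
# D1-D3 from the exercise above.
docs = [
    "The silver truck arrives",
    "The silver cannon fires silver bullets",
    "The truck is on fire",
]

tokenised = [d.lower().split() for d in docs]
vocabulary = sorted({t for doc in tokenised for t in doc})

# Term/document matrix: rows = terms, columns = D1-D3, cells = raw counts.
for term in vocabulary:
    print(f"{term:8s}", [doc.count(term) for doc in tokenised])
```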

  43. Let’s code our first text analysis engine search.pl

  44. Our corpus • A study on Kant’s critique of judgement • Aristotle's Metaphysics • Hegel’s Aesthetics • Plato’s Charmides • McGreedy’s War Diaries • Excerpts from the Royal Irish Society

  45. Text Analysis is an Experimental Science!

  46. Text Analysis is an Experimental Science! • Formulate a hypothesis • Design an experiment to answer the question • Perform the experiment • Does the experiment answer the question? • Rinse, repeat…

  47. Test Collections • Three components of a test collection: • a collection of documents • a set of topics • sets of relevant documents based on expert judgments • Metrics for assessing ‘performance’: • Precision • Recall

  48. Precision vs. Recall • [Diagram: Venn diagram of all docs, the retrieved set and the relevant set; precision and recall measure their overlap.] Slide taken from Jimmy Lin’s tutorial
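
A minimal sketch of both metrics; the answer sets are hypothetical:

```python
def precision_recall(retrieved, relevant):
    hits = set(retrieved) & set(relevant)
    precision = len(hits) / len(retrieved)  # share of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # share of relevant docs that were retrieved
    return precision, recall

# Hypothetical retrieved and relevant sets for illustration.
print(precision_recall({"D1", "D2", "D3"}, {"D2", "D3", "D4", "D5"}))
# (0.666..., 0.5)
```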

  49. The TREC experiments • Once per year • A set of documents and queries is distributed to the participants (the standard answers are unknown) (April) • Participants work (very hard) to construct and fine-tune their systems and submit their answers (1000/query) by the deadline (July) • NIST assessors manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August) • TREC conference (November)

  50. Towards Linked Data: Beyond the Simple Stuff
