Text Retrieval and Mining + Software Engineering =?

ChengXiang (“Cheng”) Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign. The Sixth International Workshop on Software Mining, Oct. 30, 2017, Urbana, IL

Text data cover all kinds of topics on the Web • Topics: People, Events, Products, Services, … • Sources: Blogs, Microblogs, Forums, Reviews, … • Scale: 65M msgs/day, 53M blogs, 1307M posts, 45M reviews, 115M users, 10M groups, …

Text Data in Software Engineering • Throughout the software life cycle, we encounter all kinds of text data, e.g., • Requirements, specifications • Software documentation, comments • Communications among team members • After a product is put in use, we further encounter other kinds of text data, e.g., • Bug reports • Software reviews • Forum discussions • Also, literature in software engineering • … How can we manage and exploit all these data to improve software productivity, software quality, and user experience?

Main Techniques for Harnessing Big Text Data: Text Retrieval + Text Analysis • Big Text Data → (Text Retrieval) → Small Relevant Data → (Text Mining) → Knowledge → Many Applications

Conceptual Framework for Text Retrieval & Mining: Text Information Systems (TIS) • Layers, bottom to top: Text → Text Acquisition → Natural Language Content Analysis → Information Access, Information Organization, Text Mining (Search, Filtering, Categorization, Clustering, Extraction, Topic Analysis, Summarization, Visualization) → Retrieval Applications, Analytics Applications → Users

Elements of TIS: Natural Language Content Analysis • Natural Language Processing (NLP) is the foundation of TIS • Enables understanding of the meaning of text • Provides a semantic representation of text for TIS • Current NLP techniques mostly rely on statistical machine learning enhanced with limited linguistic knowledge • Shallow techniques are robust, but deeper semantic analysis is only feasible for very limited domains • Some TIS capabilities require deeper NLP than others • Most text information systems use very shallow NLP (“bag of words” representation)
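
To make the “bag of words” representation concrete, here is a minimal sketch. The tokenizer (lowercase, then split on anything that is not a letter, digit, or apostrophe) and the function name `bag_of_words` are assumptions for illustration, not something the talk specifies. The two example sentences show what this shallow representation loses: who is chasing whom.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

a = bag_of_words("A dog is chasing a boy on the playground")
b = bag_of_words("A boy is chasing a dog on the playground")

# Word order is discarded, so both sentences map to the same bag.
print(a == b)
print(a.most_common(1))
```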

Elements of TIS: Text Access • Search: take a user’s query and return relevant documents • Filtering/Recommendation: monitor an incoming stream and recommend to users relevant items (or discard non-relevant ones) • Categorization: classify a text object into one of the predefined categories • Summarization: take one or multiple text documents, and generate a concise summary of the essential content

Elements of TIS: Text Mining • Topic Analysis: take a set of documents, extract and analyze topics in them • Information Extraction: extract entities, relations of entities or other “knowledge nuggets” from text • Clustering: discover groups of similar text objects (terms, sentences, documents, …) • Visualization: visually display patterns in text data

Outline • Overview of Natural Language Processing (NLP) • Main Techniques for Text Retrieval and Mining • Selected Projects in Text Retrieval and Mining • Applications of Text Retrieval and Mining in Software Engineering • Summary

An Example of NLP: “A dog is chasing a boy on the playground” • Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun • Syntactic analysis (parsing): Noun Phrase + Complex Verb + Noun Phrase + Prep Phrase → Verb Phrase → Sentence • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). • Inference: Scared(x) if Chasing(_,x,_). → Scared(b1) • Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
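
The inference step in this example can be sketched in a few lines. This is only an illustration of applying the single rule Scared(x) if Chasing(_,x,_) to the facts listed on the slide. The tuple encoding of facts and the function name `infer_scared` are assumptions, not a general inference engine.

```python
# Facts from the slide, written as (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}

def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) once to every Chasing fact."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            # The second argument of Chasing is the one being chased.
            derived.add(("Scared", (args[1],)))
    return derived

print(infer_scared(facts))
```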

NLP Is Difficult! • Natural language is designed to make human communication efficient. As a result, • we omit a lot of “common sense” knowledge, which we assume the hearer/reader possesses • we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve • This makes EVERY step in NLP hard • Ambiguity is a “killer”! • Common sense reasoning is a prerequisite

Examples of Challenges • Word-level ambiguity: E.g., • “design” can be a noun or a verb (Ambiguous POS) • “root” has multiple meanings (Ambiguous sense) • Syntactic ambiguity: E.g., • “natural language processing” (Modification) • “A man saw a boy with a telescope.” (PP Attachment) • Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) • Presupposition: “He has quit smoking.” implies that he smoked before.

The State of the Art: “A dog is chasing a boy on the playground” • POS tagging: 97% • Parsing: partial >90%(?) • Semantics: some aspects (entity/relation extraction, word sense disambiguation, sentiment analysis) • Inference: ??? • Speech act analysis: ???

What We Can’t Do • 100% POS tagging • “He turned off the highway.” vs “He turned off the fan.” • General complete parsing • “A man saw a boy with a telescope.” • Precise deep semantic analysis • What’s the meaning of “own” in “John owns a restaurant.”? Robust & general NLP tends to be “shallow” while “deep” understanding doesn’t scale up

How Much NLP is Really Needed? Tasks in order of increasing dependency on NLP (earlier tasks are “easier” and allow more “workarounds”): • Classification/Retrieval • Summarization/Extraction/Topic Mining • Translation/Dialogue • Question Answering

Workaround Example I: ELIZA • ELIZA plays the role of a therapist:
Person : Men are all alike.
ELIZA : In what way?
Person : They're always bugging us about something or other.
ELIZA : Can you think of a specific example?
Person : Well, my boyfriend made me come here.
ELIZA : Your boyfriend made you come here?
Person : He says I'm depressed much of the time.
ELIZA : I am sorry to hear that you are depressed.
Person : It's true. I am unhappy.
ELIZA : Do you think coming here will help you not to be unhappy?
Person : I need some help, that much seems certain.
ELIZA : What would it mean to you if you got some help?
Person : Perhaps I could learn to get along with my mother.
ELIZA : Tell me more about your family
• How does it work? Heuristic rules based on keyword matching! E.g., “I remember X” => “Do you often think of X?”; “always” => “Can you think of a specific example?” • It keeps the dialogue going on, but NEVER makes any claim! • A breakdown example: Person : Failure is the mother of success. ELIZA : Tell me more about your family • No NLP, but useful. Perhaps we should call this NLP? • Statistical NLP often has a similar flavor with “SOFT” rules LEARNED from data
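
The keyword-matching idea can be sketched as a short list of regular-expression rules. This is a rough illustration, not the original ELIZA program. The first two rules and the “mother” trigger come from the slide. The extra keywords “father” and “family”, the fallback reply, and the name `eliza_reply` are assumptions. Unlike ELIZA, this sketch does not swap pronouns (“my” stays “my”).

```python
import re

# (pattern, response template) pairs; the first matching rule wins.
RULES = [
    (re.compile(r"\bI remember (.+)", re.IGNORECASE), "Do you often think of {0}?"),
    (re.compile(r"\balways\b", re.IGNORECASE), "Can you think of a specific example?"),
    # "father" and "family" are assumed extra keywords; the slide only shows "mother".
    (re.compile(r"\b(mother|father|family)\b", re.IGNORECASE), "Tell me more about your family"),
]
DEFAULT_REPLY = "Please go on."  # hypothetical fallback, not from the slide

def eliza_reply(utterance):
    """Return the response of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            captured = match.group(1).rstrip(".!?") if match.groups() else ""
            return template.format(captured)
    return DEFAULT_REPLY

print(eliza_reply("They're always bugging us about something or other."))
print(eliza_reply("Failure is the mother of success."))
```

The second call reproduces the breakdown example: the rule fires on the keyword “mother” even though the sentence has nothing to do with family.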

Workaround Example II: Statistical Translation • Learn how to translate Chinese to English from many example translations • Intuitions: • If we have seen all possible translations, then we simply look up • If we have seen a similar translation, then we can adapt • If we haven’t seen any example that’s similar, we try to generalize what we’ve seen • All these intuitions are captured through a probabilistic model • Noisy channel: English Speaker → English Words (E), with P(E) → Noisy Channel, with P(C|E) → Chinese Words (C) → Translator, P(E|C)=? → English Translation
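
In the noisy channel view, the translator picks the English sentence E that maximizes P(E|C), which by Bayes' rule is proportional to P(E) P(C|E). The sketch below shows only that decision rule. The candidate sentences and all probability values are invented for illustration, and a real system would estimate both models from example translations rather than store them in dictionaries.

```python
# Toy probabilities, invented purely for illustration.
language_model = {            # P(E): how fluent is the English candidate?
    "I like tea": 0.6,
    "I tea like": 0.1,
    "me like tea": 0.3,
}
channel_model = {             # P(C | E): how likely is the Chinese input given E?
    ("我喜欢茶", "I like tea"): 0.5,
    ("我喜欢茶", "I tea like"): 0.5,
    ("我喜欢茶", "me like tea"): 0.4,
}

def translate(chinese):
    """Return the English candidate E maximizing P(E) * P(C | E)."""
    def score(english):
        return language_model[english] * channel_model.get((chinese, english), 0.0)
    return max(language_model, key=score)

print(translate("我喜欢茶"))
```

The channel model alone cannot separate “I like tea” from “I tea like” here. The language model P(E) breaks the tie in favor of the fluent sentence.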

NLP for Text Retrieval and Mining • Must be general, robust & efficient → Shallow NLP • “Bag of words” representation tends to be sufficient for most search and mining tasks (but not all!) • Some text retrieval techniques can naturally address NLP problems (e.g., word sense disambiguation based on other words in query) • However, deeper NLP is needed for complex search tasks • Most useful NLP techniques are usually based on statistical language models (to capture patterns in text data) and supervised machine learning (to leverage human expertise)

Access to Relevant Text Data: Text Retrieval • Big Text Data → Natural Language Content Analysis → Text Access → Small Relevant Data → User • Pull: Search Engine (Querying + Browsing) • Push: Recommender System

Two Modes of Text Access: Pull vs. Push • Pull Mode (search engines) • Users take initiative • Ad hoc information need • Push Mode (recommender systems) • Systems take initiative • Stable information need or system has good knowledge about a user’s need

Pull Mode: Querying vs. Browsing • Querying • User enters a (keyword) query • System returns relevant documents • Works well when the user knows what keywords to use • Browsing • User navigates into relevant information by following a path enabled by the structures on the documents • Works well when the user wants to explore information, doesn’t know what keywords to use, or can’t conveniently enter a query
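
As a rough illustration of the querying path (keyword query in, ranked documents out), the sketch below scores documents with a simple TF-IDF sum over query terms. The scoring formula, the function names, and the three software-engineering documents are assumptions for illustration. The talk does not prescribe a particular ranking function.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, documents):
    """Rank documents by a simple TF-IDF score summed over query terms."""
    doc_tokens = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    scores = []
    for index, counts in enumerate(doc_tokens):
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in doc_tokens if term in c)  # document frequency
            if df:
                score += counts[term] * math.log((n_docs + 1) / df)
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    return [index for score, index in sorted(scores, key=lambda s: (-s[0], s[1]))]

docs = [
    "Bug report: crash when opening a large file",
    "Forum discussion about code review tools",
    "Bug report: memory leak in the file parser",
]
print(rank("memory leak", docs))
```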

Information Seeking as Sightseeing • Sightseeing: Know address of an attraction? • Yes: take a taxi and go directly to the site • No: walk around or take a taxi to a nearby place then walk • Information seeking: Know exactly what you want to find? • Yes: use the right keywords as a query and find the information directly • No: browse the information space or start with a rough query and then browse

Text Mining and Analytics • Text mining ≈ Text analytics • Turn text data into high-quality information or actionable knowledge • Minimizes human effort (on consuming text data) • Supplies knowledge for optimal decision making • Related to text retrieval, which is an essential component in any text mining system • Text retrieval can be a preprocessor for text mining • Text retrieval is needed for knowledge provenance

Humans as Subjective & Intelligent “Sensors” • Sensor: Real World → Sense → Report → Data • Thermometer: Weather → 3C, 15F, … • Geo Sensor: Locations → 41°N and 120°W … • Network Sensor: Networks → 01000100011100 • “Human Sensor”: Real World → Perceive → Express → text data

Unique Value of Text Data • Useful to all big data applications • Especially useful for mining knowledge about people’s behavior, attitude, and opinions • Directly express knowledge about our world: Small text data are also useful! • Data → Information → Knowledge (Text Data)

Opportunities of Text Mining Applications • Real World → (Perceive) → Observed World → (Express, in English) → Text Data • 1. Mining knowledge about language • 2. Mining content of text data • 3. Mining knowledge about the observer (Perspective) • 4. Infer other real-world variables (predictive analytics) • Additional inputs: + Non-Text Data, + Context

TextScope to enhance human perception TextScope Microscope Telescope • Intelligent Interactive Retrieval & Text Analysis • for Task Support and Decision Making

TextScope in Action: intelligent interactive decision support • Real World → Sensor 1 … Sensor k → Non-Text Data + Text Data → Joint Mining of Non-Text and Text → Multiple Predictors (Features) → Predictive Model → Predicted Values of Real World Variables → Prediction → Optimal Decision Making • TextScope components: Natural language processing, Interactive information retrieval, Interactive text analysis, Domain Knowledge, Learning to interact

TextScope = Intelligent & Interactive Information Retrieval + Text Mining • Interface mock-up: Task Panel (Topic Analyzer, Opinion, Prediction, Event Radar, …), Search Box, MyFilter1, MyFilter2, …, Select Time, Select Region, My WorkSpace (Project 1, Alert A, Alert B, ...) • Sample document: Microsoft (MSFT), Google, IBM (IBM) and other cloud-computing rivals of Amazon Web Services are bracing for an AWS "partnership" announcement with VMware expected to be announced Thursday. …

Sample Project 1: User-Centered Adaptive IR (UCAIR) • A novel retrieval strategy emphasizing • user modeling (“user-centered”) • search context modeling (“adaptive”) • interactive retrieval • Implemented as a personalized search agent that • sits on the client side (owned by the user) • integrates information around a user (1 user vs. N sources as opposed to 1 source vs. N users) • collaborates with other agents • goes beyond search toward task support

Non-Optimality of Document-Centered Search Engines • Query = Jaguar (as of Oct. 17, 2005) • Top results by sense: Car, Car, Software, Car, Animal, Car • Mixed results, unlikely optimal for any particular user

The UCAIR Project • Query “jaguar” → Personalized search agent → Search Engines → WEB • The personalized search agent draws on: Query History, Viewed Web pages, Desktop Files, Email, ...

Potential Benefit of Personalization • Results by sense: Car, Car, Software, Car, Animal, Car • Suppose we know: 1. Previous query = “racing cars” vs. “Apple OS” 2. “car” occurs far more frequently than “Apple” in pages browsed by the user in the last 20 days 3. User just viewed an “Apple OS” document

Intelligent Re-ranking of Unseen Results When a user clicks on the “back” button after viewing a document, UCAIR reranks unseen results to pull up documents similar to the one the user has viewed
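
A minimal sketch of this re-ranking idea, assuming bag-of-words vectors and cosine similarity as the notion of “similar”. UCAIR's actual model is not described on the slide, so the similarity measure, the function names, and the example snippets below are all illustrative assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rerank_unseen(viewed_doc, unseen_results):
    """Order unseen results by similarity to the document the user just viewed."""
    viewed = vectorize(viewed_doc)
    return sorted(unseen_results, key=lambda doc: cosine(viewed, vectorize(doc)), reverse=True)

viewed = "Jaguar car dealer with new car models"
unseen = [
    "Jaguar the big cat animal facts",
    "Apple Mac OS Jaguar software update",
    "Jaguar car prices and car reviews",
]
print(rerank_unseen(viewed, unseen)[0])
```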

UCAIR Outperforms Google [Shen et al. 05] • Metric: precision at N documents • UCAIR toolbar available at http://sifaka.cs.uiuc.edu/ir/ucair/
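
For reference, precision at N is the fraction of the top N returned documents that are relevant. The sketch below uses made-up document IDs and relevance judgments. It does not reproduce any numbers from the UCAIR evaluation.

```python
def precision_at_n(ranked_doc_ids, relevant_doc_ids, n):
    """Fraction of the top-n ranked documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_doc_ids[:n]
    hits = sum(1 for doc_id in top_n if doc_id in relevant_doc_ids)
    return hits / n

# Invented ranking and relevance judgments, for illustration only.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_at_n(ranking, relevant, 5))
```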

Future: Personal Information Agent • Diagram labels: Desktop, WWW, Intranet, Email, IM, E-COM, Blog, Sports, Literature, … • User Profile, Personal Content Index, Frequently Accessed Info, Active Info Service, Security Handler, Task Support

Sample Project 2: Multi-Resolution Topic Map for Browsing • Promoting browsing as a “first-class citizen” • Multi-resolution topic map for browsing • Enable a user to find information through navigation • Very useful when a user can’t formulate effective queries or uses a small screen device • Search log as information footprints • Organize search log into a topic map • Allow a user to follow information footprints of previous users • Enable social surfing

Querying vs. Browsing

Information Seeking as Sightseeing • Know the address of an attraction site? • Yes: take a taxi and go directly to the site • No: walk around or take a taxi to a nearby place then walk around • Know what exactly you want to find? • Yes: use the right keywords as a query and find the information directly • No: browse the information space or start with a rough query and then browse • When a query fails, browsing comes to the rescue…

Current Support for Browsing is Limited • Hyperlinks • Only page-to-page • Mostly manually constructed • Browsing step is very small • Web directories (e.g., ODP) • Manually constructed • Fixed categories • Only support vertical navigation • Beyond hyperlinks? Beyond fixed categories? How to promote browsing as a “first-class citizen”?

Sightseeing Analogy Continues… • Region • Zoom in • Zoom out • Horizontal navigation

Topic Map for Touring Information Space • Topic regions • Multiple resolutions • Zoom in • Zoom out • Horizontal navigation

Topic-Map based Browsing Demo

How can we construct such a multi-resolution topic map? Multiple possibilities…

Search Logs as Information Footprints • Footprints in information space:
User 2722 searched for "national car rental" [!] at 2006-03-09 11:24:29
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.valoans.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://benefits.military.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.avis.com)
User 2722 searched for "enterprise rent a car" [!] at 2006-04-05 23:37:42 (found http://www.enterprise.com)
User 2722 searched for "meineke car care center" [!] at 2006-05-02 09:12:49 (found http://www.meineke.com)
User 2722 searched for "car rental" [!] at 2006-05-25 15:54:36
User 2722 searched for "autosave car rental" [!] at 2006-05-25 23:26:54 (found http://eautosave.com)
User 2722 searched for "budget car rental" [!] at 2006-05-25 23:29:53
User 2722 searched for "alamo car rental" [!] at 2006-05-25 23:56:13
……

Information Footprints → Topic Map • Challenges • How to define/construct a topic region • How to control granularities/resolutions of topic regions • How to connect topic regions to support effective browsing • Two approaches • Multi-granularity clustering • Query editing
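
One way to picture multi-granularity clustering is to cluster the logged queries at two similarity thresholds: a loose one for coarse topic regions and a strict one for fine regions. The sketch below is a toy under stated assumptions, not the method used in the project: it uses Jaccard overlap of query words, single-link merging, and thresholds (0.3 and 0.5) chosen by hand for this small example. The queries are taken from the search log shown earlier.

```python
def jaccard(a, b):
    """Word-overlap similarity between two query strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(queries, threshold):
    """Single-link clustering: merge clusters with any query pair at or above threshold."""
    clusters = [[q] for q in queries]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(jaccard(x, y) >= threshold for x in clusters[i] for y in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters

queries = [
    "national car rental",
    "budget car rental",
    "alamo car rental",
    "military car rental benefits",
    "meineke car care center",
]
coarse = cluster(queries, threshold=0.3)  # zoomed out: fewer, larger regions
fine = cluster(queries, threshold=0.5)    # zoomed in: more, smaller regions
print(len(coarse), len(fine))
```

At the coarse level all car rental queries fall into one region and the car care query stays apart. At the fine level the military benefits query splits off into its own region.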

Collaborative Surfing • New queries become new footprints • Clickthroughs become new footprints • Navigation trace enriches map structures • Browse logs offer more opportunities to understand user interests and intents

Sample Project 3: Contextual Text Mining • Documents are often associated with context (meta-data) • Direct context: time, location, source, authors,… • Indirect context: events, policies, … • Many applications require “contextual text analysis”: • Discovering topics from text in a context-sensitive way • Analyzing variations of topics over different contexts • Revealing interesting patterns (e.g., topic evolution, topic variations, topic communities)
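
As a very rough stand-in for analyzing topic variations over contexts, the sketch below groups documents by a context label (here, a year) and lists the words that are most distinctive for each context. Raw word counts replace real topic models, and the documents, function names, and scoring rule are all invented for illustration.

```python
from collections import Counter, defaultdict

documents = [  # (context, text) pairs; the data is invented for illustration
    ("2016", "crash on startup after update"),
    ("2016", "startup crash with large project"),
    ("2017", "slow build after update"),
    ("2017", "build is slow on large project"),
]

def term_counts_by_context(docs):
    """Count word occurrences separately for each context label."""
    counts = defaultdict(Counter)
    for context, text in docs:
        counts[context].update(text.lower().split())
    return counts

def distinctive_terms(counts, context, k=2):
    """Terms whose count in `context` most exceeds their count in all other contexts."""
    others = Counter()
    for other_context, c in counts.items():
        if other_context != context:
            others.update(c)
    diff = {t: n - others[t] for t, n in counts[context].items()}
    return sorted(diff, key=lambda t: (-diff[t], t))[:k]

counts = term_counts_by_context(documents)
print(distinctive_terms(counts, "2016"))
print(distinctive_terms(counts, "2017"))
```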
