IST 511 Information Management: Information and Technology
Information extraction, data mining, metadata
Dr. C. Lee Giles, David Reese Professor, College of Information Sciences and Technology
The Pennsylvania State University, University Park, PA, USA
giles@ist.psu.edu

Presentation Transcript


  1. IST 511 Information Management: Information and Technology Information extraction, data mining, metadata Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology The Pennsylvania State University, University Park, PA, USA giles@ist.psu.edu http://clgiles.ist.psu.edu Special thanks to E. Agichtein, K. Borne, S. Sarawagi, C. Lagoze,

  2. Last time • What are probabilities • What is information theory • What is probabilistic reasoning • Definitions • Why important • How used – decision making • Decision trees • Impact on information science

  3. Today • What is information extraction • What is data mining • Text mining as subfield • What is metadata • Impact on information science

  4. Tomorrow • Topics used in IST • Digital libraries • Scientometrics, bibliometrics • Digital humanities

  5. Theories in Information Sciences • Enumerate some of these theories in this course. • Issues: • Unified theory? • Domain of applicability • Conflicts • Theories here are • Very algorithmic • Some quantitative • Some qualitative • Quality of theories • Occam's razor • Subsumption of other theories (all can use machine learning) • Text mining is a special case of data mining • Natural language processing uses data mining methods • Theories: natural language processing

  6. Science Paradigms • Thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: a theoretical branch using models, generalizations • Last few decades: a computational branch simulating complex phenomena • Today: data science (eScience) unifies theory, experiment, and simulation • Data captured by instruments or generated by simulators • Processed by software • Information/knowledge stored in computers • Scientist analyzes databases/files using data management and statistics

  7. Information extraction, data mining and natural language processing • Natural language processing is the processing and understanding of human language by machines • Information extraction can be considered a subclass; also known as knowledge extraction • Data mining is the process of discovering new patterns in large data sets • Text mining is the data mining of text • Text analytics generally refers to the tools used • Information extraction is the process of extracting and labeling relevant data from large data sets, usually text • Large means too large to process manually

  8. The Value of Unstructured Text Data • "Unstructured" text data is the primary form of human-generated information • Business and government reports, blogs, web pages, news, scientific literature, online reviews, … • Need to extract information and give it structure to effectively manage, search, mine, store and utilize this data • Information Extraction: a maturing and active research area • Software and companies exist • Intersection of Computational Linguistics, Machine Learning, Data Mining, Databases, and Information Retrieval • Active crawling for text data

  9. Example: Answering Queries Over Text • Source text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… • Query: Select Name From PEOPLE Where Organization = 'Microsoft'
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Software Foundation
Result: Bill Gates, Bill Veghte • (from William Cohen's IE tutorial, 2003)

  10. Information extraction from text or PDFs • Same source text, query, and extracted PEOPLE table as in the previous slide, with the extracted records stored as XML or in a database • For extraction of OAI metadata from academic documents, see CiteSeerX: citeseerx.ist.psu.edu • (William Cohen's IE tutorial, 2003)
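To make the "query over extracted text" step concrete, here is a minimal sketch using Python's standard sqlite3 module. It assumes the (Name, Title, Organization) tuples have already been produced by an upstream extractor; it only loads them into a relational table so the slide's SQL query can actually run.

```python
import sqlite3

# Extracted tuples, assumed to come from an upstream IE system (not built here).
people = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", people)

# The query from the slide:
for (name,) in conn.execute(
        "SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'"):
    print(name)  # Bill Gates, Bill Veghte
```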

  11. Information Extraction Tasks • Extracting entities and relations: this talk • Entities: named (e.g., Person) and generic (e.g., disease name) • Relations: entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company) • Events: can be composed from multiple relation tuples • Common extraction subtasks: • Preprocess: sentence chunking, syntactic parsing, morphological analysis • Create rules or extraction patterns: hand-coded, machine learning, and hybrid • Apply extraction patterns or rules to extract new information • Postprocess and integrate information • Co-reference resolution, deduplication, disambiguation

  12. Entities • Wikipedia: An entity is something that has a distinct, separate existence, although it need not be a material existence. • Features: • Permanent vs transient • Unique vs common • Animate vs inanimate • Small vs large • Mobile vs sessile • Place vs thing • Abstract vs real • BIO labels (begin/inside/outside tagging) • Digital mention or reference

  13. Example: Extracting Entities from Text • Useful for data warehousing, data cleaning, web data integration
Address: 4089 Whispering Pines Nobel Drive San Diego CA 92122 → House number = 4089; Building = Whispering Pines; Road = Nobel Drive; City = San Diego; State = CA; Zip = 92122
Citation example: Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002

  14. Entity Disambiguation • The task of clustering and linking similar entities within a document or across documents • Labels, sometimes complex, are given to these entities • Sometimes includes the task of extracting or finding those entities (information extraction, focused crawling, etc.)

  15. Hand-Coded Methods • ContactPattern ← RegularExpression(Email.body, "can be reached at") • Easy to construct in some cases • e.g., to recognize prices, phone numbers, zip codes, conference names, etc. • Intuitive to debug and maintain • Especially if written in a "high-level" language • Can incorporate domain knowledge • Scalability issues: • Labor-intensive to create • Highly domain-specific • Often corpus-specific • Rule matches can be expensive [IBM Avatar]
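A hand-coded rule like the ContactPattern above can be written directly as a regular expression. The sketch below uses the cue phrase from the slide's rule; the phone-number pattern and the sample email text are illustrative assumptions, not from the deck.

```python
import re

# Hypothetical hand-coded ContactPattern: a US-style phone number that
# follows the cue phrase "can be reached at".
CONTACT_PATTERN = re.compile(
    r"can be reached at\s+(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})")

email_body = "If anything comes up, Sarah can be reached at (814) 555-0143."
match = CONTACT_PATTERN.search(email_body)
if match:
    print("Extracted contact:", match.group(1))  # (814) 555-0143
```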

  16. Entity Disambiguation by some other name? • record linkage • merge/purge processing or list washing • data matching • object identity problem • named entity resolution • duplicate detection • record matching • instance identification • deduplication • coreference resolution • reference reconciliation • database hardening • Closely related to Natural Language Processing

  17. Entity Disambiguation Applications • Speech understanding • Question/answering • Health records • Criminal activities • Finance records • Semantic web applications • Scientific discovery and search • Semantic search • Others?

  18. Entity Tagging • Identifying mentions of entities (e.g., person names, locations, companies) in text • MUC (1997): Person, Location, Organization, Date/Time/Currency • ACE (2005): more than 100 more specific types • Hand-coded vs. Machine Learning approaches • Best approach depends on entity type and domain: • Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries • Syntactic (e.g., phone numbers, zip codes): regular expressions • Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc. • “Almost solved” for common/typical entity types
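For the "semantic" entity types (person and company names), a trained tagger is typically used. As a sketch, the snippet below runs spaCy's off-the-shelf English NER model; this assumes spaCy and the en_core_web_sm model are installed and is not the specific tooling discussed in the course.

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bill Gates founded Microsoft in Albuquerque in 1975.")

# Each entity mention comes back with a span and a type label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Bill Gates PERSON, Microsoft ORG, 1975 DATE
```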

  19. Machine Learning Methods • Example (from AliBaba): "The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300." • Can work well when training data is plentiful and easy to construct • Can capture complex patterns that are hard to encode with hand-crafted rules • e.g., determine whether a review is positive or negative • extract long, complex gene names • Non-local dependencies
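As a minimal illustration of the review-polarity task mentioned above, here is a sketch using scikit-learn's Naive Bayes classifier on a tiny hypothetical training set; a real system would need far more labeled data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled reviews (hypothetical).
reviews = ["great product, loved it", "terrible, waste of money",
           "works perfectly", "broke after one day",
           "excellent value", "awful experience"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)
print(model.predict(["loved the excellent build", "awful, broke quickly"]))
```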

  20. Representation Models [Cohen and McCallum, 2003] • Illustrated on the sentence "Abraham Lincoln was born in Kentucky." • Lexicons: test whether a candidate string is a member of a dictionary (Alabama, Alaska, …, Wisconsin, Wyoming) • Classify pre-segmented candidates: a classifier asks "which class?" for each candidate phrase • Sliding window: classify each window of tokens, trying alternate window sizes • Boundary models: classifiers predict BEGIN and END boundaries • Finite state machines: find the most likely state sequence • Context free grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S) • …and beyond • Any of these models can be used to capture words, formatting, or both.
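To make the sliding-window idea concrete, here is a toy Python sketch: the "classifier" is just a lexicon lookup (an illustrative stand-in for a trained model), applied to every window of one to three tokens.

```python
# Minimal sliding-window sketch over the slide's example sentence.
LOCATIONS = {"kentucky", "pennsylvania"}  # stand-in for a trained classifier

def sliding_windows(tokens, sizes=(1, 2, 3)):
    """Yield (start_index, window) for every n-gram of the given sizes."""
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            yield i, tokens[i:i + n]

tokens = "Abraham Lincoln was born in Kentucky .".split()
for start, window in sliding_windows(tokens):
    if " ".join(window).lower() in LOCATIONS:
        print("LOCATION at token", start, ":", " ".join(window))
```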

  21. (Person) Name Disambiguation • A person can be referred to in different ways, with different attributes, across multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity • Large number of mentions and entities • Consider three types of person name ambiguities: • Aliases - one person with multiple aliases, name variations, or a changed name, e.g. CL Giles & Lee Giles, Superman & Clark Kent • Common names - more than one person shares a name, e.g. Jian Huang - 118 papers in DBLP • Typographical errors - resulting from human input or automatic extraction • Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline
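A common first step in name disambiguation is blocking: grouping mentions by a cheap key before any expensive pairwise comparison. The sketch below uses a (last name, first initial) key on hypothetical mentions. Note that it already fails to put "Lee Giles" and "C. Lee Giles" in the same block, which is exactly why real systems also compare coauthors, venues, and topics.

```python
from collections import defaultdict

# Hypothetical name mentions from a bibliographic database.
mentions = ["C. L. Giles", "Lee Giles", "C. Lee Giles", "Jian Huang", "J. Huang"]

def block_key(name):
    """Cheap blocking key: (last name, first initial), lowercased."""
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())

blocks = defaultdict(list)
for m in mentions:
    blocks[block_key(m)].append(m)

# "Jian Huang" and "J. Huang" share a block; "Lee Giles" does not join
# "C. Lee Giles" - a limitation of key-based blocking.
print(dict(blocks))
```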

  22. Popular Machine Learning Methods For details: [Feldman, 2006 and Cohen, 2004] • Naive Bayes • SRV [Freitag 1998], Inductive Logic Programming • Rapier [Califf and Mooney 1997] • Hidden Markov Models [Leek 1997] • Maximum Entropy Markov Models [McCallum et al. 2000] • Conditional Random Fields [Lafferty et al. 2001] • Scalability • Can be labor intensive to construct training data • At run time, complex features can be expensive to construct or process (batch algorithms can help: [Chandel et al. 2006] )

  23. Data mining? • Process of semi-automatically analyzing large data sets and databases to find patterns that are: • valid: hold on new data with some certainty • novel: non-obvious to the system • useful: should be possible to act on the item • understandable: humans should be able to interpret the pattern

  24. Evolution of Data Mining <http://www.thearling.com/text/dmwhite/dmwhite.htm>

  25. Data Mining is Ready for Prime Time • Data mining is ready for general application because it engages three technologies that are now sufficiently mature: • Massive data collection & delivery • Powerful multiprocessor computers • Sophisticated data mining algorithms

  26. Organizational Reasons to use Data Mining • Most organizations already collect and refine massive quantities of data. • Their most important information is in their data warehouses. • Data mining moves beyond the analysis of past events … to predicting future trends and behaviors that may be missed because they lie outside the experts’ expectations. • Data mining tools can answer complex business questions that traditionally were too time-consuming to resolve. • Data mining tools can explore the intricate interdependencies within databases in order to discover hidden patterns and relationships. • Data mining allows decision-makers to make proactive, knowledge-driven decisions.

  27. A Key Concept for Data Mining • Data Mining delivers actionable data: • data that support decision-making • data that lead to knowledge and understanding • data with a purpose • i.e., Data do not exist for their own sake. • The Data Warehouse is a corporate asset (whether in business, marketing, banking, science, telecommunications, entertainment, computer security, or security).

  28. Data Mining - the up side • Data mining is everywhere: • Huge scientific databases (NASA, Human Genome,…) • Corporate databases (OLAP) • Credit card usage histories (Capital One) • Loan applications (Credit Scoring) • Customer purchase records (CRM) • Web traffic analysis (Doubleclick) • Network security intrusion detection (Silent Runner) • The hunt for terrorists • The NBA!

  29. Data Mining - the down side • Data mining is a pejorative in the business database community ("data dredging") • They prefer to call it Knowledge Discovery, or Business Intelligence, or CRM (Customer Relationship Management), or Marketing, or OLAP (On-Line Analytical Processing) • Legal issues in many countries • The Data Mining Moratorium Act of 2003 • debated within the U.S. Congress • privacy concerns • directed primarily against the DARPA TIA Program (Total Information Awareness)

  30. Characteristics of The Information Age: • Data “Avalanche” • the flood of Terabytes of data is already happening, whether we like it or not • our present techniques of handling these data do not scale well with data volume • Distributed Digital Archives • will be the main access to data • will need to handle hundreds to thousands of queries per day • Systematic Data Exploration and Data Mining • will have a central role • statistical analysis of “typical” events • automated search for “rare” events

  31. The Data Flood is Everywhere • Huge quantities of data are being generated in all business, government, and research domains: • Banking, retail, marketing, telecommunications, other business transactions ... • Scientific data: genomics, astronomy, biology, etc. • Web, text, and e-commerce

  32. Data Growth Rate • Figure: worldwide data volume in exabytes, showing 10-fold growth in 5 years • Drivers include: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances • Source: IDC, 2008

  33. What is Data Mining? • Data mining is defined as "an information extraction activity whose goal is to discover hidden facts contained in (large) databases." • Data mining is used to find patterns and relationships in data. (EDA = Exploratory Data Analysis) • Patterns can be analyzed via 2 types of models: • Descriptive: describe patterns and create meaningful subgroups or clusters. • Predictive: forecast explicit values, based upon patterns in known results. • How does this become useful (not just bits of data)? … through KNOWLEDGE DISCOVERY • Data → Information → Knowledge → Understanding / Wisdom!

  34. Historical Note: Many Names of Data Mining • Data Fishing, Data Dredging: 1960- • used by statisticians (as a bad name) • Data Mining: 1990- • used by DB & business communities • in 2003 - bad image because of DARPA TIA • Knowledge Discovery in Databases: 1989- • used by AI & Machine Learning communities • also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, … • Currently, Data Mining and Knowledge Discovery seem to be used interchangeably.

  35. Relationship with other fields • Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on • scalability in the number of features and instances • algorithms and architectures, whereas the foundations of the methods and formulations are provided by statistics and machine learning • automation for handling large, heterogeneous data

  36. Some basic operations • Predictive: • Regression • Classification • Collaborative Filtering • Descriptive: • Clustering / similarity matching • Association rules and variants • Deviation detection
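As one concrete instance of a descriptive operation from the list above, the sketch below clusters hypothetical customers by age and annual spend using scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: (age in years, annual spend in dollars).
X = np.array([[23, 1200], [25, 1500], [47, 8300],
              [52, 9100], [31, 2100], [49, 7800]])

# Partition into two clusters (descriptive operation: clustering).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. young/low-spend vs. older/high-spend groups
```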

  37. Data Mining Examples • Classic textbook example of data mining (legend?): data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time. • Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. • A financial institution discovered that credit applicants who used pencil on the form were much more likely to default on their debts than those who filled out the application in ink. • Credit card companies recommend products to cardholders based on analysis of their monthly expenditures. • Airline purchase transaction logs revealed that the 9-11 hijackers bought one-way airline tickets with the same credit card. • Astronomers examined objects with extreme colors in a huge database to discover the most distant quasars ever seen.
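The diapers-and-beer story reduces to simple support/confidence arithmetic over transactions. Here is that computation on a small hypothetical set of market baskets.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diapers", "beer", "milk"}, {"diapers", "beer"}, {"diapers", "bread"},
    {"beer", "chips"}, {"milk", "bread"}, {"diapers", "beer", "chips"},
]

n = len(transactions)
diapers = sum("diapers" in t for t in transactions)          # 4 baskets
both = sum({"diapers", "beer"} <= t for t in transactions)   # 3 baskets

# support: fraction of all baskets containing both items;
# confidence: fraction of diaper baskets that also contain beer.
print(f"support(diapers, beer) = {both / n:.2f}")             # 0.50
print(f"confidence(diapers -> beer) = {both / diapers:.2f}")  # 0.75
```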

  38. Data Mining Application: Marketing • Sales Analysis • associations between product sales: • beer and diapers • strawberry pop tarts and beer (and hurricanes) • Customer Profiling • data mining can tell you what types of customers buy what products • Identifying Customer Requirements • identify the best products for different customers • use prediction to find what factors will attract new customers

  39. Data Mining Application: Fraud Detection • Auto Insurance Fraud • Association Rule Mining can detect a group of people who stage accidents to collect on insurance • Money Laundering • Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions • Banking: Loan Fraud • Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud

  40. The Necessity of Data Mining • Enormous interest in these data collections. • The environment to exploit these data does not exist! • 1 Terabyte at 100 Mbits/sec takes 1 day to transfer. • Hundreds to thousands of queries per day. • Data will reside at multiple locations, in many different formats. • Existing analysis tools do not scale to Terabyte data collections. • The need is acute! A solution will not just happen.
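The transfer estimate above is easy to verify with a few lines of arithmetic:

```python
# Worked version of the slide's estimate: 1 TB over a 100 Mbit/s link.
terabyte_bits = 1e12 * 8   # 1 terabyte = 8e12 bits
link_bps = 100e6           # 100 megabits per second

seconds = terabyte_bits / link_bps
print(f"{seconds / 3600:.1f} hours")  # ~22.2 hours, i.e. roughly one day
```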

  41. What is Knowledge Discovery? • Knowledge discovery refers to “finding out new knowledge about an application domain using data on the domain usually stored in a database.” • Application domains: scientific, customer purchase records, computer network logs, web traffic logs, financial transactions, census data, basketball play-by-play histories, ... • Why are Data Mining & Knowledge Discovery such hot topics? --- because of the enormous interest in these huge databases and their potential for new discoveries. • In large databases, Data Mining and Knowledge Discovery come in two flavors: • Event-based mining • Relationship-based mining

  42. Event-Based Mining • (Event-based mining is based upon events or trends in data.) • Four distinct orthogonal categorizations: • Known events / known models - use existing models (descriptive models) to locate known phenomena of interest either spatially or temporally within a large database. • Known events / unknown models - use clustering properties of data to discover new relationships and patterns among known phenomena. • Unknown events / known models - use known associations and relationships (predictive models) among parameters that describe a phenomenon to predict the presence of previously unseen examples of the same phenomenon within a large complex database. • Unknown events / unknown models - use thresholds or trends to identify transient or otherwise unique ("one-of-a-kind") events and therefore to discover new phenomena.  Serendipity!
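The "unknown events / unknown models" case is often approximated with a simple threshold. Below is a minimal sketch, assuming hypothetical sensor readings and a two-standard-deviation cutoff.

```python
from statistics import mean, stdev

# Hypothetical sensor readings; one "one-of-a-kind" event is hidden inside.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 27.4, 10.2]
mu, sigma = mean(readings), stdev(readings)

# Flag anything more than 2 standard deviations from the mean.
for x in readings:
    if abs(x - mu) / sigma > 2:
        print("anomalous event:", x)  # flags 27.4
```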

  43. Relationship-Based Data Mining (based upon associations & relationships among data items) • Spatial associations -- identify events or objects at the same physical location, or at related locations (e.g., urban versus rural data). • Temporal associations -- identify events or transactions occurring during the same or related periods of time (e.g., periodically, or N days after event X). • Coincidence associations -- use clustering techniques to identify events that are co-located (that coincide) within a multi-dimensional parameter space.

  44. User Requirements for a Data Mining System (what features must a DM system have for users?) • Cross-Identification - refers to the classical problem of associating the objects listed in one database with the objects listed in another. • Cross-Correlation - refers to the search for correlations, tendencies, and trends between parameters in multi-dimensional data, usually across databases. • Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a single database. • Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to one or more databases in the hope of making a serendipitous discovery of new events/objects or a new class of events/objects.
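Nearest-neighbor identification maps directly onto standard neighbor-search libraries. Here is a sketch using scikit-learn's NearestNeighbors on hypothetical 2-D catalog attributes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical objects in a 2-D parameter space (two catalog attributes).
X = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80],
              [0.88, 0.79], [0.50, 0.50]])

nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, idx = nn.kneighbors([[0.12, 0.21]])
print(idx[0], dist[0])  # the two catalog entries closest to the query point
```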

  45. Representative Data Mining Architecture <http://www.thearling.com/text/dmwhite/dmwhite.htm>

  46. Data leads to Knowledge leads to Understanding • Data → Information → Knowledge → Understanding / Wisdom! • EXAMPLE: • Data = 00100100111010100111100 (stored in database) • Information = ages and heights of children (metadata) • Knowledge = the older children tend to be taller • Understanding = children's bones grow as they get older
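The age/height example can be carried through numerically. With hypothetical measurements, the "knowledge" step is just a correlation:

```python
import numpy as np

# Hypothetical measurements: the "information" layer of the slide's example.
ages_years = [4, 6, 8, 10, 12]
heights_cm = [102, 115, 128, 139, 149]

# The "knowledge" layer: older children tend to be taller.
r = np.corrcoef(ages_years, heights_cm)[0, 1]
print(f"correlation(age, height) = {r:.3f}")  # close to 1.0
```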

  47. Astronomy Example • Data: (a) imaging data (ones & zeroes) (b) spectral data (ones & zeroes) • Information (catalogs / databases): • measure brightness of galaxies from images (e.g., 14.2 or 21.7) • measure redshift of galaxies from spectra (e.g., 0.0167 or 0.346) • Knowledge: Hubble Diagram → redshift-brightness correlation → redshift = distance • Understanding: the Universe is expanding!!

  48. Goal of Data Mining • The end goal of data mining is not the data themselves, but the new knowledge and understanding that are revealed in the process = Business Intelligence (BI). (Remember what we said about the business community's opinion of D.M.) • This is why the research field is usually referred to as KDD = Knowledge Discovery in Databases.
