Web Data Mining

Web Data Mining, Academic Year 2006/2007

The Course • Code: nw451 • Acronym: MDW • Credits: 6 • Schedule: Wednesday and Friday 16:00-18:00, room B • Office hours: • Request an appointment by e-mail • c/o ISTI, Area Ricerca CNR, località San Cataldo, Pisa, entrance 19

Instructors • Raffaele Perego raffaele.perego@isti.cnr.it, tel. 0503152993 • Claudio Lucchese claudio.lucchese@isti.cnr.it, tel. 0503152967 • Fabrizio Silvestri fabrizio.silvestri@isti.cnr.it, tel. 0503153011 • Diego Puppin diego.puppin@isti.cnr.it, tel. 0503153011 • Antonio Panciatici antonio.panciatici@isti.cnr.it, tel. 0503152967

Course objectives • The World Wide Web (WWW) has changed the way information is conceived, made accessible, and managed. • Discovering previously unknown, non-trivial, and relevant information on the Web is increasingly important and increasingly difficult. • Web mining has therefore become fundamental for optimizing strategic tools such as e-commerce sites, search engines, and directories. • The course aims to provide tools and knowledge in this field.

Course contents • Introduction • Data Mining, Knowledge Discovery, and the Web • Search engines • Crawling, indexing, querying • Web Content Mining • Similarity, clustering, text classification • Web Structure Mining • Social networks, ranking, etc. • Web Usage Mining • Recommender systems, etc. • Advanced topics (?!)

Course materials • Textbook • Mining the Web: Discovering Knowledge from Hypertext Data. S. Chakrabarti. Morgan Kaufmann, 2003. • Recommended books • Managing Gigabytes. I.H. Witten, A. Moffat, and T.C. Bell. Morgan Kaufmann, 1999. • Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison Wesley, 1999. • Lecture slides and articles • Published at http://malvasia.isti.cnr.it/~raffaele/webmining

Course materials • Acknowledgements • Chakrabarti and Ramakrishnan • For the slides accompanying the textbook, downloadable from: http://www.cse.iitb.ac.in/~soumen/mining-the-web/ • Fosca Giannotti and Dino Pedreschi • For the introductory slides borrowed from the TDM course • KDNUGGETS (http://www.kdnuggets.com) • Ferragina, Attardi, Garcia Molina, etc. • Internet :-)

Exam • Prerequisites (recommended) • AA270 – TDM – Data Mining Techniques – first semester. • Exam format • Passing the exam requires the successful completion of a project (individual or group?) and an oral discussion of the course contents (seminar on an article of the student's choice?).

Introduction • Data Mining and Knowledge Discovery • Hypertext and a brief history of the Web • Web Mining

What is DM?

Motivations for DM • Data explosion problem: • Automated data collection tools, mature database technology, and the Internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories. • We are drowning in information, but starving for knowledge! (John Naisbitt) • Data mining: • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from large amounts of data

Motivations for DM • Abundance of business and industry data • Competitive focus - Knowledge Management • Inexpensive, powerful computing engines • Strong theoretical/mathematical foundations • machine learning & logic • statistics • database management systems • Etc.

Sources of Data (e.g.) • Business Transactions • widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20M transactions per day, credit card records!!) • most important problem: effective use of the data in a reasonable time frame for competitive decision-making • e-commerce data • Scientific Data • data generated through a multitude of experiments and observations • examples: geological data, satellite imaging data, NASA earth observations, CERN HEP • the rate of data collection far exceeds the speed at which we can analyze the data • Financial Data • company information • economic data (GNP, price indexes, etc.) • stock markets

Sources of Data (e.g.) • Personal / Statistical Data • government census • medical histories • customer profiles • demographic data • data and statistics about sports and athletes • World Wide Web and Online Repositories • Billions of Web documents, images, video, etc. • emails, news, messages • link structure of the hypertext from millions of Web sites • Web usage data (from server/proxy logs, network traffic, and user registrations) • online databases, and digital libraries

Classes of DM applications • Database analysis and decision support • Market analysis • target marketing, customer relationship management, market basket analysis • Risk analysis • Forecasting, customer retention, quality control, competitive analysis. • Fraud detection • Text mining • E.g. mining opinions from emails and documents

Classes of DM applications • THE WEB!! • Searching: Google, Ask Jeeves, Yahoo, etc. • Social network analysis • Web advertising • E.g. IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc. • Watch for the PRIVACY pitfall! • Many others … • Sports. IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat. • Astronomy. JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

What is KDD? A process! • The selection and processing of data for: • the identification of novel, accurate, and useful patterns, and • the modeling of real-world phenomena. • Data mining is a major component of the KDD process • automated discovery of patterns and development of predictive and explanatory models.

The KDD process (figure) • Data Sources • Data Consolidation • Consolidated Data • Warehouse • Selection and Preprocessing • Prepared Data • Data Mining • Patterns & Models (e.g. p(x)=0.02) • Interpretation and Evaluation • Knowledge

The KDD Process in Practice • KDD steps can be merged or combined • Data Selection + Data Transformation = Data Consolidation • Data Cleaning + Data Integration = Data Preprocessing • KDD is an Iterative Process • art + engineering rather than science

The virtuous cycle (figure) • Identify Problem or Opportunity • Act on Knowledge • Measure effect of Action • Labels on the cycle: Problem, Knowledge, Strategy, Results

The steps of the KDD process • Learning the application domain: • relevant prior knowledge and goals of application • Data consolidation: Creating a target data set • Selection and Preprocessing • Data cleaning : (may take 60% of effort!) • Data reduction and projection: • find useful features, dimensionality/variable reduction, invariant representation. • Choosing data mining methods • E.g., classification, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Interpretation and evaluation: analysis of results. • visualization, transformation, removing redundant patterns, … • Use of discovered knowledge

Roles in the KDD process

Major Data Mining Tasks • Classification: predicting an item class • Clustering: finding clusters in data • Associations: e.g. A & B & C occur frequently • Visualization: to facilitate human discovery • Summarization: describing a group • Deviation Detection: finding changes • Estimation: predicting a continuous value • Link Analysis: finding relationships • …

Classification • Learn a method for predicting the instance class from pre-labeled (classified) instances • Many approaches: Statistics, Decision Trees, Neural Networks, ...
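As an illustration of learning from pre-labeled instances, here is a minimal sketch of a nearest-neighbour classifier. This technique is chosen only for brevity and is not one of the approaches named on the slide; the data points, the labels "A" and "B", and the function name are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nearest_neighbor_predict(train, query):
    """Return the class of the pre-labeled instance closest to `query`."""
    _features, label = min(train, key=lambda pair: dist(pair[0], query))
    return label


# Hypothetical pre-labeled (classified) instances: (features, class).
labeled = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.3), "B"),
]

# Predict the class of two unseen instances.
predictions = [nearest_neighbor_predict(labeled, q) for q in [(0.9, 1.1), (5.1, 4.9)]]
print(predictions)
```

The "learned method" here is simply the stored training set plus a distance function; decision trees or neural networks would replace it with an explicit model fitted to the same kind of labeled data.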

Clustering • Find “natural” grouping of instances given un-labeled data
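One common way to look for such groupings is k-means. The slide does not name an algorithm, so the following is only a sketch under that assumption; the points and the choice of initial centroids are hypothetical.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical un-labeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping emerges only from distances between instances, which is what distinguishes clustering from classification.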

Association Rules & Frequent Itemsets • Transactions (table not reproduced) • Frequent Itemsets: • Milk, Bread (4) • Bread, Cereal (3) • Milk, Bread, Cereal (2) • … • Rules: • Milk => Bread (66%)
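The counts and the rule confidence above can be computed as in the following sketch. The slide's transaction table is not available, so the transactions below are hypothetical, made up so that the resulting counts match the figures quoted on the slide. The enumeration is brute force, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (the original table is not available).
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Milk"},
    {"Milk"},
    {"Bread", "Cereal"},
    {"Eggs"},
]


def itemset_counts(transactions, max_size=3):
    """Count in how many transactions each itemset (up to max_size items) occurs."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[frozenset(itemset)] += 1
    return counts


def confidence(counts, antecedent, consequent):
    """Confidence of antecedent => consequent: support(both) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return counts[both] / counts[frozenset(antecedent)]


counts = itemset_counts(transactions)
# Keep itemsets of at least two items that occur in at least two transactions.
frequent = {s: n for s, n in counts.items() if n >= 2 and len(s) >= 2}
milk_implies_bread = confidence(counts, {"Milk"}, {"Bread"})
print(frequent)
print(milk_implies_bread)
```

With these made-up transactions, Milk appears in 6 transactions and Milk together with Bread in 4, so the confidence of Milk => Bread is 4/6, which the slide reports as 66%.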

Visualization & Data Mining • Visualizing the data to facilitate human discovery • Presenting the discovered results in a visually "nice" way

Summarization • Describe features of the selected group • Use natural language and graphics • Usually in combination with deviation detection or other methods • Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."

Data Mining Central Quest • Find true patterns and avoid overfitting

Overfitting • Finding seemingly significant but really random patterns due to searching too many possibilities • Violation of Occam’s razor • the explanation of any phenomenon should make as few assumptions as possible • lex parsimoniae • entia non sunt multiplicanda praeter necessitatem
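The effect of "searching too many possibilities" can be reproduced with purely random data. The sketch below is an illustrative experiment, not part of the original slides: labels and features are random coin flips, so no feature carries real information, yet the best of many candidates can look predictive on a small training set. All names, sizes, and the seed are assumptions.

```python
import random


def accuracy(feature, labels):
    """Fraction of positions where the binary feature equals the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)


def best_random_feature(n_features, n_train, n_test, seed=0):
    """Pick the random feature that fits the training labels best.

    Returns (training accuracy, held-out accuracy) of that feature.
    Everything is random, so any apparent pattern is spurious.
    """
    rng = random.Random(seed)
    train_labels = [rng.randint(0, 1) for _ in range(n_train)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    best_train, best_test = 0.0, 0.0
    for _ in range(n_features):
        train_feature = [rng.randint(0, 1) for _ in range(n_train)]
        test_feature = [rng.randint(0, 1) for _ in range(n_test)]
        train_acc = accuracy(train_feature, train_labels)
        if train_acc > best_train:
            best_train = train_acc
            best_test = accuracy(test_feature, test_labels)
    return best_train, best_test


few = best_random_feature(n_features=1, n_train=20, n_test=1000)
many = best_random_feature(n_features=2000, n_train=20, n_test=1000)
print("1 candidate:     train=%.2f held-out=%.2f" % few)
print("2000 candidates: train=%.2f held-out=%.2f" % many)
```

One would expect the best training accuracy to rise as more candidates are searched, while the held-out accuracy of the selected feature stays close to chance. That gap is what evaluating on data not used for the search is meant to expose.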

Hypertexts and the Web

World Wide Web • Hypertext documents • Text • Links • Web • billions of documents • authored by millions of diverse people • edited by no one in particular • distributed over millions of computers, connected by a variety of media

History of Hypertext • Citation • Hyperlinking • Branching, non-linear discourse, nested commentary • Ramayana - one of the great epic poems of India; attributed to the sage Valmiki, it recounts the life and exploits of Lord Rama. • Mahabharata- an epic poem that recounts the struggle between the Kauravas and Pandavas over the disputed kingdom of Bharata, the ancient name for India • Talmud - compilation of Jewish oral teachings, assembled in written form in the early centuries of the Christian era • Dictionary, encyclopedia • self-contained networks of textual nodes • joined by referential links

Hypertext systems • Memex, 1945 [Vannevar Bush, US President Roosevelt's science advisor] • stands for “memory extension” • Aim: to create and help follow hyperlinks across documents • "photoelectrical-mechanical storage and computing device that could store vast amounts of information, in which a user had the ability to create links of related text and illustrations. This trail could then be stored and used for future reference. Bush believed that using this associative method of information gathering was not only practical in its own right, but was closer to the way the mind ordered information."

Hypertext systems • Hypertext, term coined by Ted Nelson in a 1965 paper to the ACM 20th national conference: • [...] By 'hypertext' I mean nonsequential writing - text that branches and allows choice to the reader, best read at an interactive screen.

Hypertext systems • The first hypertext-based system was developed in 1967 by a team of researchers led by Dr. Andries van Dam at Brown University. • The research was funded by IBM and the first hypertext implementation, Hypertext Editing System, ran on an IBM/360 mainframe. • IBM later sold the system to the Houston Manned Spacecraft Center which reportedly used it for the Apollo space program documentation

Hypertext systems • Xanadu hypertext, by Ted Nelson, 1981: • In the Xanadu scheme, a universal document database (docuverse), would allow addressing of any substring of any document from any other document. "This requires an even stronger addressing scheme than the Universal Resource Locators used in the World-Wide Web." [De Bra] • Additionally, Xanadu would permanently keep every version of every document, thereby eliminating the possibility of a broken link. Xanadu would only maintain the current version of the document in its entirety.

World-wide Web • Initiated at CERN in 1989 • By Tim Berners-Lee, now W3C director: • “W3 was originally developed to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups. Originally aimed at the High Energy Physics community, it has spread to other areas and attracted much interest in user support, resource discovery and collaborative work areas. It is currently the most advanced information system deployed on the Internet, and embraces within its data model most information in previous networked information systems.”

World-wide Web • GUIs • Berners-Lee (WorldWideWeb - 1990) • Erwise and Viola (1992), Midas (1993) • Mosaic (1993) • a hypertext GUI for the X-window system • HTML: markup language for rendering hypertext • HTTP: Hypertext Transfer Protocol for sending HTML and other data over the Internet • CERN HTTPD: server of hypertext documents

The early days of the Web: CERN HTTP traffic grows 1000-fold between 1991 and 1994 (image courtesy W3C)

The early days of the Web: The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)

1994: the landmark year • Foundation of the “Mosaic Communications Corporation” (later Netscape) • First World-Wide Web conference • MIT and CERN agreed to set up the World-wide Web Consortium (W3C).

The Web • A populist, participatory medium • number of writers ≈ number of readers. • enables near-zero-cost dissemination of information • Abundance and authority crisis • liberal and informal culture of content generation and dissemination. • Very little uniform civil code. • redundancy and non-standard form and content. • millions of qualifying pages for most broad queries • Example: java or kayaking • no per se authoritative information about the reliability of a site

Problems due to uniform accessibility • little support for adapting to the background of specific users. • commercial interests routinely influence the operation of Web search • Users pay for connection costs, not for content • Profit depends on ads, sales, etc. • “Search Engine Optimization” !!

What is Web Mining? Discovering interesting and useful information from Web content, structure and usage Examples: • Web search, e.g. Google, Yahoo, MSN, Ask, … • Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog) • eCommerce : • Recommendations: e.g. Netflix, Amazon • improving conversion rate: next best product to offer • Advertising, e.g. Google Adsense • Fraud detection: click fraud detection, … • Improving Web site design and performance

How does it differ from “classical” Data Mining? • The web is not a relation • Textual information and linkage structure • Usage data is huge and growing rapidly • Google’s usage logs are bigger than their web crawl • Data generated per day is comparable to largest conventional data warehouses • Content and structure data rich in features and patterns • spontaneous formation and evolution of • topic-induced graph clusters • hyperlink-induced communities • Ability to react in real-time to usage patterns • No human in the loop Reproduced from Ullman & Rajaraman with permission

How big is the Web ? • Number of pages • Technically, infinite • Because of dynamically generated content • Lots of duplication (30-40%) • Best estimate of “unique” static HTML pages comes from search engine claims • Google = 8 billion, Yahoo = 20 billion • Lots of marketing hype Reproduced from Ullman & Rajaraman with permission

96,854,877 web sites (Sept 2006) • Figure: Total Sites Across All Domains, August 1995 - September 2006 • Source: http://news.netcraft.com/archives/web_server_survey.html

The web as a graph • Pages = nodes, hyperlinks = edges • Ignore content • Directed graph • High linkage • 8-10 links/page on average • Power-law degree distribution Reproduced from Ullman & Rajaraman with permission
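The graph view can be sketched with an adjacency list. The four-page web below is entirely hypothetical and far too small to show a power law; it only illustrates how pages become nodes, hyperlinks become directed edges, and degree statistics are derived from them.

```python
from collections import Counter

# Hypothetical tiny web: page -> pages it links to (directed edges).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "a.html"],
}

# Out-degree: number of hyperlinks leaving each page.
out_degree = {page: len(targets) for page, targets in links.items()}

# In-degree: number of hyperlinks pointing at each page.
in_degree = Counter(target for targets in links.values() for target in targets)
for page in links:
    in_degree.setdefault(page, 0)  # pages nobody links to still count as nodes

average_out_degree = sum(out_degree.values()) / len(links)

# Degree distribution: how many pages have in-degree 0, 1, 2, ...
in_degree_distribution = Counter(in_degree.values())

print(out_degree)
print(dict(in_degree))
print(average_out_degree)
print(sorted(in_degree_distribution.items()))
```

On a real crawl the same distribution, plotted on log-log axes, is where the power-law shape mentioned above would be looked for; page content is ignored throughout, as the slide states.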
