WEB MINING AND APPLICATIONS Pallavi Tripathi 105956127 Vaishali Kshatriya 105951122 Mehru Anand 106113525 Minnie Virk 106113516
REFERENCES • Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber • Presentation Slides of Prof. Anita Wasilewska • http://www.cs.rpi.edu/~youssefi/research/VWM/ • http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf • http://www.galeas.de/webimining.html • http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps CSE:634 Web Mining
CITATIONS • Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert, Visual Web Mining, 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004. • Amir H. Youssefi, David Duke, Ephraim P. Glinert, and Mohammed J. Zaki, Toward Visual Web Mining, 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.
With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to rely on automated tools to find the information resources they want, and to track and analyze their usage patterns. These factors create the need for server-side and client-side intelligent systems that can effectively mine for knowledge. http://www.galeas.de/webimining.html
WHAT IS WEB MINING? Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
AREAS OF CLASSIFICATION • WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions. • WEB STRUCTURE MINING is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. • WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns from web access logs. • In addition to these three web mining types, there are other helpful approaches to web knowledge discovery, such as information visualization, which helps us understand the complex relationships and structures of many search results. http://www.galeas.de/webimining.html
TOPICS COVERED Today’s presentation covers the following algorithms related to various aspects of Web Mining: • SPADE algorithm and its applications in Visual Web Mining • Sentiment Classification • Community Trawling Algorithm
VISUAL WEB MINING Visual Web Mining applies information visualization techniques to the results of Web Mining, in order to amplify the perception of extracted patterns and to visually explore new ones in the web domain. The application domains are Web Usage Mining and Web Content Mining. http://www.cs.rpi.edu/~youssefi/research/VWM/
APPROACH USED • Make personalized results for targeted web surfers • Use data mining algorithms to extract new insights and measures • Employ a database server and a relational query language to submit specific queries against the data • Use visualization to obtain an overall picture http://www.cs.rpi.edu/~youssefi/research/VWM/
SPADE OVERVIEW • Proposed by Mohammed J. Zaki • Sequential PAttern Discovery using Equivalence classes • An algorithm based on Apriori for fast discovery of frequent sequences • Needs three database scans to extract sequential patterns • Given: a database of customer transactions, each of which has a sequence-id or customer-id, a transaction-time and the items involved in the transaction. • The aim is to obtain typical behaviors according to the user's viewpoint. http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
DEFINITIONS • Item: can be considered as the object bought by a customer, or the page requested by the user of a website, etc. • Itemset: the set of items that are grouped by timestamp. • Data Sequence: the sequence of itemsets associated with a customer. • Sequential Mining: discovering frequent sequences over time of attribute sets in large databases. • Frequent Sequential Pattern: a sequence whose statistical significance in the database is above a user-specified threshold. http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
SPADE ALGORITHM • The first scan finds the frequent items • The second scan finds the frequent sequences of length 2 • The last scan associates with each frequent sequence of length 2 a table of the corresponding sequence ids and itemset ids in the database • Based on this representation in main memory, the support of a candidate sequence of length k is obtained by joining the tables of the frequent sequences of length (k-1) that generate this candidate http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
[Figure: data sequences of 4 customers] http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
AN EXAMPLE • With a minimum support of “50%”, a sequential pattern is considered frequent if it occurs in the data sequences of at least 2 customers (2/4). • In this case a maximal sequential pattern mining process will find three patterns: S1: (“Camera,DVD”)(“DVD-R,DVD-Rec”) S2: (“DVD-R,DVD-Rec”)(“Videosoft”) S3: (“Memory Card”)(“USB”) http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
Determining Support [Figure: original id-list database and the suffix join on id-lists] http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
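The idea of determining support by joining id-lists can be sketched in a few lines. This is a simplified illustration, not Zaki's full algorithm: it only joins the id-lists of two single items into the id-list of the sequence "a, later b", and the toy database, the function names and the (sequence id, event id) encoding are assumptions made for the example.

```python
from collections import defaultdict

def build_id_lists(database):
    """Map each item to its id-list: a set of (sequence_id, event_id) pairs."""
    id_lists = defaultdict(set)
    for sid, eid, items in database:
        for item in items:
            id_lists[item].add((sid, eid))
    return id_lists

def temporal_join(id_list_a, id_list_b):
    """Id-list of the sequence 'a, later b' inside the same data sequence."""
    earliest_a = {}
    for sid, eid in id_list_a:
        earliest_a[sid] = min(eid, earliest_a.get(sid, eid))
    # Keep the occurrences of b that come after the earliest a in that sequence.
    return {(sid, eid) for sid, eid in id_list_b
            if sid in earliest_a and eid > earliest_a[sid]}

def support(id_list):
    """Support is the number of distinct sequence ids in the id-list."""
    return len({sid for sid, _ in id_list})

# Hypothetical toy database: (sequence_id, event_id, items in that event).
database = [
    (1, 1, {"Camera"}), (1, 2, {"DVD-R"}),
    (2, 1, {"Camera"}), (2, 3, {"DVD-R"}),
    (3, 1, {"DVD-R"}), (3, 2, {"Camera"}),
    (4, 1, {"USB"}),
]
id_lists = build_id_lists(database)
camera_then_dvdr = temporal_join(id_lists["Camera"], id_lists["DVD-R"])
min_support_count = 0.5 * 4  # 50% of 4 customers
print(support(camera_then_dvdr) >= min_support_count)
```

Once the id-lists are in memory, the support of a longer candidate is computed from the id-lists alone, without rescanning the database.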
ADVANTAGES • Uses simple join operations on id table • No complicated hash tree structures used • No overhead of generating and searching subsequences • Cuts down on I/O operations by limiting itself to three scans http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps
The Visual Web Mining framework provides a prototype implementation for applying information visualization techniques to these results. http://www.cs.rpi.edu/~youssefi/research/VWM/
SYSTEM ARCHITECTURE http://www.cs.rpi.edu/~youssefi/research/VWM
• A robot (webbot) is used to retrieve the pages of the website • Web server log files are downloaded and processed • The Integration Engine is a suite of programs for data preparation, i.e. extracting, cleaning, transforming and integrating data, loading it into the database, and later generating graphs in XGML. http://www.cs.rpi.edu/~youssefi/research/VWM
• User sessions are extracted from the web logs; each session yields results that are roughly related to a specific user • The user sessions are converted into a format suitable for Sequence Mining • The outputs are frequent contiguous sequences with a given minimum support • These are imported into a database • Different queries are executed against this data. http://www.cs.rpi.edu/~youssefi/research/VWM
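As a rough illustration of the session extraction step, the sketch below groups log entries per user and starts a new session after a period of inactivity. The slides do not say how sessions are delimited, so the 30-minute timeout, the (user, timestamp, page) tuple format and the function name are all assumptions.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def extract_sessions(log_entries, timeout=SESSION_TIMEOUT):
    """Turn (user, timestamp, page) entries into per-user page sequences."""
    by_user = defaultdict(list)
    for user, timestamp, page in log_entries:
        by_user[user].append((timestamp, page))

    sessions = []
    for user, visits in by_user.items():
        visits.sort()  # order each user's requests by time
        current = [visits[0][1]]
        for (prev_time, _), (time, page) in zip(visits, visits[1:]):
            if time - prev_time > timeout:
                # Too long since the last request: close the session.
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions
```

Each resulting page sequence can then be fed to a sequence miner as one data sequence.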
APPLICATIONS • Designing different visualization diagrams and exploring frequent patterns of user access on a website • Classifying web pages into two classes, hot and cold, which attract high and low numbers of visitors respectively • A webmaster can make exploratory changes to the website structure and analyze the resulting change in user access patterns in the real world. http://www.cs.rpi.edu/~youssefi/research/VWM/
Sentiment Classification Vaishali Kshatriya 105951122
References • The Sentimental Factor: Improving Review Classification via Human-Provided Information. Philip Beineke, Shivakumar Vaithyanathan and Trevor Hastie • Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (July 2002) • http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm • http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances • Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
Sentiment Classification • Sentiment classification is the task of labeling a review document according to the polarity of its prevailing opinion.
Online Shopping
Topical vs. Sentiment Classification Topical Classification • Classifying documents into various subjects, for example Mathematics, Sports, etc. • Compares individual words (unigrams) across subject areas (Bag-of-Words approach). Example: “score”, “referee”, “football” => Sports Sentiment Classification • Classifying documents according to their overall sentiment, positive vs. negative, e.g. like vs. dislike, recommended vs. not recommended • More difficult than traditional topical classification, and may need more linguistic processing, e.g. “you will be disappointed” and “it is not satisfactory” http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
Challenges • Dependence on the context of the document: “unpredictable” plot vs. “unpredictable” performance • Negations have to be captured: • The movie was not that bad. • The pictures taken by the cell phone are not of the best quality. • Subtle expressions: • “How can someone sit through the entire movie?” http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances
Unsupervised review classification (Turney, ACL-02) • Input: a written review • Output: a classification (i.e. positive or negative) • Step 1: Use a part-of-speech tagger to identify phrases • Step 2: Estimate the semantic orientation of each extracted phrase • Step 3: Assign the review to a class (either recommended or not recommended) Citation: Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
Step 1: Extract the phrases • A part-of-speech tagger is applied to the review • Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table, where JJ: Adjective and NN: Noun [Table: part-of-speech tag patterns] Citation: Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
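The phrase extraction step can be sketched as below. The input is assumed to be already tagged (a list of (word, tag) pairs with Penn Treebank tags); no tagger is called. The five tag patterns are written down from Turney's paper as best recalled and should be checked against the paper; the function names are hypothetical.

```python
NOUNS = {"NN", "NNS"}
ADVERBS = {"RB", "RBR", "RBS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}

def matches_pattern(first, second, third):
    """True if two consecutive tags (plus the following tag) fit a pattern."""
    third_not_noun = third not in NOUNS
    if first == "JJ" and second in NOUNS:
        return True                       # adjective + noun
    if first in ADVERBS and second == "JJ" and third_not_noun:
        return True                       # adverb + adjective, not before a noun
    if first == "JJ" and second == "JJ" and third_not_noun:
        return True                       # adjective + adjective, not before a noun
    if first in NOUNS and second == "JJ" and third_not_noun:
        return True                       # noun + adjective, not before a noun
    if first in ADVERBS and second in VERBS:
        return True                       # adverb + verb
    return False

def extract_phrases(tagged_words):
    """Return the two-word phrases whose tags match one of the patterns."""
    phrases = []
    for i in range(len(tagged_words) - 1):
        (word1, tag1), (word2, tag2) = tagged_words[i], tagged_words[i + 1]
        # The third tag is None when the phrase ends the review.
        tag3 = tagged_words[i + 2][1] if i + 2 < len(tagged_words) else None
        if matches_pattern(tag1, tag2, tag3):
            phrases.append(f"{word1} {word2}")
    return phrases
```

Only the first two words of a match are extracted; the third tag is used as a filter, so "very friendly staff" yields "friendly staff" rather than "very friendly".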
Step 2: Estimate the semantic orientation • Uses PMI-IR (Pointwise Mutual Information and Information Retrieval) • The PMI between two words, word1 and word2, is defined as: PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) ) • The Semantic Orientation (SO) of a phrase is calculated as: SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”) • SO is positive when the phrase is more strongly associated with “excellent” and negative when it is more strongly associated with “poor”. Citation: Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
Step 2 (cont’d) • PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents). • The experiment uses AltaVista. Citation: Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
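When probabilities are replaced by hit counts, the p(phrase) terms in the two PMI values cancel, so SO reduces to a ratio of four counts. The sketch below computes that ratio. It takes the counts as plain numbers (no search engine is queried), and the small smoothing constant added to the co-occurrence counts to avoid division by zero is an assumption of this example, as are the parameter names.

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) estimated from search-engine hit counts.

    hits_near_excellent: documents where the phrase occurs near "excellent"
    hits_near_poor:      documents where the phrase occurs near "poor"
    hits_excellent:      documents containing "excellent"
    hits_poor:           documents containing "poor"
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)
```

A phrase that co-occurs with "excellent" more often than with "poor", relative to how common the two reference words are, gets a positive score.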
Step 3: Assign a Class • Calculate the average SO of the phrases in the review, and classify the review as recommended if the average is positive and as not recommended if it is negative. [Table: reviews of a bank] Citation: Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
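The class assignment is a simple average over the phrase scores. A minimal sketch follows; the function name is hypothetical, and treating an average of exactly zero as "not recommended" is an assumption, since the slide only covers the positive and negative cases.

```python
def classify_review(phrase_orientations):
    """Label a review from the SO scores of its extracted phrases."""
    if not phrase_orientations:
        raise ValueError("review has no extracted phrases")
    average = sum(phrase_orientations) / len(phrase_orientations)
    # Assumption: an average of exactly 0 falls into "not recommended".
    label = "recommended" if average > 0 else "not recommended"
    return label, average
```

For example, phrase scores of 2.0, -0.5 and 1.5 average to 1.0, so the review is labeled recommended even though one phrase is negative.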
Drawbacks • Sentiment classification is useful, but it does not find what the reviewer liked or disliked. • A negative sentiment on an object does not imply that the user did not like anything about the product • Similarly, a positive sentiment does not imply that the user liked everything about the product • The solution is to go to the sentence and feature level http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features
Feature-based opinion mining and summarization (Hu and Liu ‘04) • Interested in what reviewers liked and disliked • Since the number of reviews of an object can be large, the goal is to produce a simple summary of the reviews • The summary can be easily visualized and compared http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features
Three main tasks: • Step 1: Identify and extract the object features that have been commented on in each review • Step 2: Determine whether the opinion on each feature is positive, negative or neutral • Step 3: Group synonyms of features • Produce a feature-based summary! http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features
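The last two steps, grouping synonyms and producing the summary, can be sketched as a simple count per feature. The sketch assumes steps 1 and 2 have already produced (feature, polarity) pairs; the synonym map, the sample data and the function name are hypothetical and do not reproduce Hu and Liu's actual method.

```python
from collections import Counter, defaultdict

def summarize(opinions, synonyms):
    """Count opinions per feature after mapping synonyms to one name."""
    summary = defaultdict(Counter)
    for feature, polarity in opinions:
        canonical = synonyms.get(feature, feature)
        summary[canonical][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}

# Hypothetical output of steps 1 and 2 for a camera.
opinions = [("picture", "positive"), ("photo", "positive"),
            ("battery", "negative"), ("picture", "negative")]
synonyms = {"photo": "picture"}
print(summarize(opinions, synonyms))
```

The per-feature counts are what makes the summary easy to visualize and to compare across products.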
Online Shopping
Summary • Sentiment classification classifies reviews as good or bad • Unsupervised review classification extracts phrases from the review, estimates their semantic orientation and assigns a class to the review • The solution to the shortcomings of sentiment classification is feature-based opinion extraction
Discovering Web communities on the web Mehru Anand (106113525)
References • Inferring Web Communities from Link Topology (1998), David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext. • Trawling the web for emerging cyber-communities (1999), Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks. • Finding Related Pages in the World Wide Web (1999), Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks. • A System for Collaborative Web Resource Categorization and Ranking, Maxim Lifantsev. • Web Mining: A Bird’s Eye View, Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO, madrias@umr.edu Web Mining: A Bird Eye View WISE-30 by Sanjay Kumar Madria
Introduction • Introduction of the cyber-community • Methods to measure the similarity of web pages on the web graph • Methods to extract meaningful communities through the link structure
What is a cyber-community? • A community on the web is a group of web pages sharing a common interest • E.g. a group of web pages talking about pop music • E.g. a group of web pages interested in data mining • Main properties: • Pages in the same community should be similar to each other in content • The pages in one community should differ from the pages in another community • Similar to a cluster
Two different types of communities • Explicitly-defined communities • These are the well-known ones, such as the resources listed by Yahoo! • E.g. the directory categories Arts, Music, Painting, Classic, Pop • Implicitly-defined communities • These are communities that are unexpected or invisible to most users • E.g. the group of web pages interested in a particular singer
Two different types of communities • The explicit communities are easy to identify • E.g. Yahoo!, InfoSeek, Clever System • To extract the implicit communities, we need to analyze the web graph objectively • In research, people are more interested in the implicit communities
Similarity of web pages • Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes • Method I: • For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A • This is not a good measure. Consider the home pages of IBM and Microsoft: they are related, but need not link to each other.
Similarity of web pages • Method II (from bibliometrics) • Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B • Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
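Both bibliometric measures are easy to compute from an adjacency map of the web graph. The sketch below assumes the graph is given as a dictionary from each page to the set of pages it links to; the page names and the function names are made up for the example.

```python
def cocitation(links, a, b):
    """Number of pages that link to both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages that both a and b link to."""
    return len(links.get(a, set()) & links.get(b, set()))

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}
print(cocitation(links, "A", "B"), bibliographic_coupling(links, "A", "B"))
```

Here A and B never link to each other, so Method I would call them unrelated, while both measures from Method II report a nonzero similarity.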
Methods of clustering • Clustering methods based on co-citation analysis: • Methods derived from HITS (Kleinberg) • Methods using the co-citation matrix • All of them can discover meaningful communities, but they are very expensive to apply to the whole World Wide Web, with its billions of web pages.
Trawling the Web for emerging cyber-communities, Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada. Pages: 1481 - 1493. Year of Publication: 1999. ISSN: 1389-1286. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins
A cheaper method • The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins • IBM Almaden Research Center • They call their method communities trawling (CT) • They implemented it on a graph of 200 million pages, and it worked very well
Basic idea of CT • Definition of communities: dense directed bipartite subgraphs • Bipartite graph: nodes are partitioned into two sets, F (fans) and C (centers) • Every directed edge in the graph is directed from a node u in F to a node v in C • The graph is dense if many of the possible edges between F and C are present
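The notion of density used in this definition can be made concrete with a small sketch: it measures the fraction of possible fan-to-center edges that are actually present. The slide does not say how many edges count as "many", so no threshold is fixed here; the function name and the sample graph are assumptions.

```python
def density(fans, centers, edges):
    """Fraction of the possible fan -> center edges that are present."""
    if not fans or not centers:
        raise ValueError("both F and C must be non-empty")
    present = sum(1 for f in fans for c in centers if (f, c) in edges)
    return present / (len(fans) * len(centers))

# Hypothetical example: 3 fans, 2 centers, 5 of the 6 possible edges.
fans = {"f1", "f2", "f3"}
centers = {"c1", "c2"}
edges = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"),
         ("f2", "c2"), ("f3", "c1")}
print(density(fans, centers, edges))
```

Edges that do not run from F to C are simply ignored by the count, which matches the requirement that every edge of the bipartite subgraph goes from a fan to a center.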