Data Mining on the Web via Cloud Computing
COMS E6125 Web Enhanced Information Management
Presented by Hemanth Murthy
Data Mining on the Web via Cloud Computing
• Introduction to:
  • Web Mining
  • Cloud computing infrastructure
  • Apache’s Hadoop
• Web Usage Mining using Hadoop HDFS and Map/Reduce technologies
What is Web Mining?
• What is Web Mining: data mining techniques applied to the Web, for example to
  • discover what users are looking for on the internet,
  • deduce the type of information users want,
  • structure the data available on the Web.
• Why Web Mining:
  • the amount of information available on the Web is enormous.
  • users find it difficult to locate and use that information.
  • content providers find it hard to classify and catalog documents.
Types of Web Mining
• Web mining types:
  • Web usage mining.
  • Web content mining.
  • Web structure mining.
• Web usage mining applies data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.
• Web content mining is the automatic search of information available online; it mines the content of Web data.
• Web structure mining is concerned with how that content is described and organized.
More on Web Usage Mining
• Preprocessing.
  • converts the usage, content, and structure information in the available data sources into the data abstractions needed for pattern discovery.
  • regarded as the most difficult task in Web Usage Mining.
• Pattern discovery.
  • uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition.
• Pattern analysis.
  • the discovery phase produces many redundant rules and patterns.
  • the main objective is to filter these out so that the remaining patterns are easier to analyze.
  • typical tools: SQL queries and visualization techniques such as graphing patterns.
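
The preprocessing step can be illustrated with a minimal Python sketch. The slides do not name a log format or a session rule, so everything specific here is an assumption made for illustration: the input is taken to be Common Log Format, sessions are split after 30 minutes of inactivity, and the names `parse_line`, `sessionize`, and `SAMPLE_LOG` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Common Log Format lines (the slides do not specify a format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Assumed inactivity timeout that separates two sessions of the same host.
SESSION_GAP = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, path) for a well-formed line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("path")


def sessionize(lines):
    """Group requests into per-host sessions, split on inactivity gaps."""
    records = sorted(
        (r for r in map(parse_line, lines) if r is not None),
        key=lambda r: (r[0], r[1]),
    )
    sessions = []
    last_seen = {}  # host -> (time of last request, index of its open session)
    for host, timestamp, path in records:
        previous = last_seen.get(host)
        if previous is None or timestamp - previous[0] > SESSION_GAP:
            sessions.append((host, [path]))  # start a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(path)  # extend the open session
        last_seen[host] = (timestamp, index)
    return sessions


# Made-up sample data, including one malformed line that is dropped.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2009:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2009:13:57:10 -0700] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2009:14:00:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2009:15:30:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'malformed line',
]

if __name__ == "__main__":
    for host, pages in sessionize(SAMPLE_LOG):
        print(host, pages)
```

The resulting sessions are the kind of data abstraction that a pattern discovery algorithm would take as input.
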
Cloud Computing
• Builds on existing commodity resources.
  • reduces the cost of services.
  • lets teams concentrate on deploying services faster.
  • gives more flexibility.
• Virtualization is used as the standard unit of deployment.
  • provides an abstraction between the hardware and the software running on it.
  • enables loose coupling of resources.
• Services are delivered over the network.
HDFS - Hadoop Distributed File System
• Data-parallel but process-sequential.
• Data is processed in a batch-oriented fashion.
• Data is communicated through the distributed file system, so latency is an issue; HDFS is designed for high throughput rather than low latency.
• At Facebook, jobs that used to take more than a day now finish in less than a day with Hadoop.
Important characteristics of HDFS
• Hardware Failure.
• Streaming Data Access.
• Large Data Sets.
• Moving Computation is Cheaper than Moving Data.
Web Mining, HDFS and Map/Reduce
• HDFS can be the storage backbone for Web Mining applications.
• HDFS replicates data across several nodes in the cluster for robustness and for data recovery after a failure.
• Map/Reduce: a framework for realizing distributed computing (the Compute Cloud).
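
The map/reduce programming model can be sketched in plain Python. This is a single-process illustration of the map, group-by-key, and reduce phases, not the Hadoop API; the function names and the sample requests are hypothetical, and a real job would run the map and reduce tasks in parallel across the cluster over data stored in HDFS.

```python
from collections import defaultdict


def map_request(record):
    """Map phase: emit (page, 1) for every (host, page) request record."""
    _host, path = record
    yield path, 1


def reduce_counts(path, counts):
    """Reduce phase: sum all counts emitted for one page."""
    return path, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Run mapper and reducer in one process, grouping values by key in between."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)  # stands in for the shuffle step
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Made-up usage records: (host, requested page).
REQUESTS = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.1", "/products.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.3", "/index.html"),
]

if __name__ == "__main__":
    print(run_map_reduce(REQUESTS, map_request, reduce_counts))
```

For the sample data this counts three requests for `/index.html` and one for `/products.html`, a simple usage statistic of the kind Web usage mining starts from.
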
Web Mining & HIVE
• Developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop HDFS and Map/Reduce.
• Next-generation infrastructure designed to provide data processing tools that:
  • enable easy data summarization
  • support ad hoc querying and analysis of large volumes of data
• Allows users to embed custom map/reduce functions.
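
The kind of summarization query described above can be illustrated with a small Python sketch. It does not use Hive: Python's standard `sqlite3` module stands in for a SQL-style query engine, and the table name `page_views`, its columns, and the sample rows are all assumptions made for this example. Hive would run a comparable query as Map/Reduce jobs over data stored in HDFS.

```python
import sqlite3


def summarize_page_views(rows):
    """Return (path, views, distinct visitors) per page, most viewed first."""
    connection = sqlite3.connect(":memory:")  # in-memory stand-in database
    try:
        connection.execute("CREATE TABLE page_views (host TEXT, path TEXT)")
        connection.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        cursor = connection.execute(
            "SELECT path, COUNT(*) AS views, COUNT(DISTINCT host) AS visitors "
            "FROM page_views GROUP BY path ORDER BY views DESC, path ASC"
        )
        return cursor.fetchall()
    finally:
        connection.close()


# Made-up usage records: (host, requested page).
PAGE_VIEWS = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.1", "/products.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.1", "/index.html"),
]

if __name__ == "__main__":
    for path, views, visitors in summarize_page_views(PAGE_VIEWS):
        print(path, views, visitors)
```
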
Web Usage Mining Architecture using HDFS, Map/Reduce and HIVE
• How Apache Hadoop can be used in Web Usage Mining:
  • HDFS serves as the Storage Cloud.
  • The Map/Reduce framework serves as the Compute Cloud.
  • Hive can be used to format the data.
References
• HDFS: http://hadoop.apache.org/hdfs
• Map/Reduce: http://hadoop.apache.org/mapreduce
• Web Mining: Information and Pattern Discovery on the World Wide Web: http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
• Ashish Thusoo: Hive - A Petabyte Scale Data Warehouse using Hadoop: http://www.facebook.com/note.php?note_id=89508453919
• Dhruba Borthakur: Hadoop Introduction: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction
• Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data