Search Engine using Web Mining

Search Engine using Web Mining COMS E6125.001 Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)

Web Mining Data mining efforts associated with the Web are known as Web Mining. Web Usage Mining, one branch of it, is the process of applying data mining techniques to the discovery of usage patterns from Web data.

Classification of Web Mining • Content Mining: the discovery of useful information from Web content, including text, images, audio, and video. Web content mining research includes resource discovery from the Web, document categorization and clustering, and information extraction from Web pages. • Structure Mining: aims to understand the structure of the Web as a whole. Web link structure has been widely used to infer important information about Web pages, and citations (linkages) among Web pages are usually indicators of high relevance or good quality. The term in-links refers to the hyperlinks pointing to a page, and the term out-links refers to the hyperlinks found in a page. • Usage Mining: the discovery of usage patterns from Web data, such as the logs that record how visitors browse a site.
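The in-link and out-link terminology used in structure mining can be illustrated with a short Python sketch. The page names and the `link_counts` helper are hypothetical and only show how the two sets are derived from a list of hyperlinks; this is not a method taken from the presentation.

```python
from collections import defaultdict

def link_counts(links):
    """Return (in_links, out_links): dicts mapping a page to a set of pages."""
    in_links = defaultdict(set)
    out_links = defaultdict(set)
    for source, target in links:
        out_links[source].add(target)  # hyperlink found in `source`
        in_links[target].add(source)   # hyperlink pointing to `target`
    return in_links, out_links

# Hypothetical link graph: (page containing the link, page linked to)
example_links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("d.html", "c.html"),
]

in_links, out_links = link_counts(example_links)
print(sorted(in_links["c.html"]))   # pages that link to c.html
print(sorted(out_links["a.html"]))  # pages that a.html links to
```

Under the assumption that citations indicate quality, a page such as c.html with many in-links would be treated as more relevant than a page with none.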

Data Source The usage data collected at the different sources represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns. • Server Level Collection • Client Level Collection • Proxy Level Collection

Server Level Collection • A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. • The data recorded in server logs reflects the access of a Web site by multiple users. These logs can be stored in various formats, such as the Common Log Format or the Extended Log Format. • Tracking individual users is not an easy task because of the stateless connection model of HTTP. Cookies are tokens generated by the Web server for individual client browsers in order to track site visitors automatically.

Server Level Collection (contd.) • Page views served from a cache are not recorded in a server log. In addition, information passed through the POST method is not available in a server log.
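As an illustration of what a server log records, the following Python sketch parses one line in the Common Log Format. The regular expression, the `parse_clf_line` helper, and the sample line are assumptions made for this example; real logs may carry additional fields, as in the Extended Log Format.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request field is "METHOD path protocol"; no request body is logged.
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    return entry

# Hypothetical sample line
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
entry = parse_clf_line(sample)
print(entry["host"], entry["method"], entry["path"], entry["status"])
```

The request field holds only the method, path, and protocol, which is consistent with the point above that data sent through POST does not appear in the log.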

Client Level Collection • Client-level collection can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. • Client-side data collection requires user cooperation, either in enabling JavaScript and Java applets or in voluntarily using the modified browser.

Proxy Level Collection • A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides. • Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.

Pattern Discovery • Sequential pattern discovery finds inter-transaction patterns in which the presence of a set of items is followed by another item in the timestamp-ordered transaction set. In Web server transaction logs, a visit by a client is recorded over a period of time. • The discovery of sequential patterns in Web server access logs allows Web-based organizations to predict user visit patterns and helps in targeting advertising at groups of users based on these patterns. By analyzing this information, the Web mining system can determine temporal relationships.
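A minimal Python sketch of the idea: count how often one page is followed later by another within timestamp-ordered sessions, and keep the pairs that reach a minimum support. This is a simplified stand-in limited to pairs of pages, not a full sequential pattern mining algorithm, and the `frequent_ordered_pairs` helper, the session data, and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sessions, min_support):
    """Return {(a, b): count} for pairs where page a is followed later by
    page b in at least min_support (a fraction) of the sessions."""
    counts = Counter()
    for pages in sessions:
        seen = set()
        # combinations keeps input order, so a is visited before b
        for a, b in combinations(pages, 2):
            if a != b:
                seen.add((a, b))
        counts.update(seen)  # each session counts a pair at most once
    threshold = min_support * len(sessions)
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Hypothetical sessions: pages visited by one client, ordered by timestamp
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/about"],
    ["/products", "/home", "/cart"],
    ["/home", "/cart"],
]

patterns = frequent_ordered_pairs(sessions, min_support=0.75)
print(patterns)
```

With these sessions, only the pair ("/home", "/cart") appears in at least three of the four sessions, so it is the only pattern kept at a support of 0.75.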

Pattern Analysis • Pattern analysis filters out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. • The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. • Content and structure information can be used to filter out patterns containing pages of a certain usage type or content type, or pages that match a certain hyperlink structure.
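The query-based filtering described above can be sketched with SQL run through Python's built-in `sqlite3` module. The `patterns` and `pages` tables, their columns, and the sample rows are hypothetical; they assume discovered patterns were stored with a support value and that each page has a content type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patterns (first_page TEXT, next_page TEXT, support REAL)"
)
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content_type TEXT)")

# Hypothetical output of the pattern discovery phase
conn.executemany("INSERT INTO patterns VALUES (?, ?, ?)", [
    ("/home", "/cart", 0.75),
    ("/home", "/products", 0.5),
    ("/home", "/about", 0.25),
    ("/products", "/cart", 0.5),
])
# Hypothetical content types for each page
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("/home", "navigation"),
    ("/products", "product"),
    ("/cart", "checkout"),
    ("/about", "info"),
])

# Keep patterns with enough support that lead to a page of a given type
rows = conn.execute(
    """
    SELECT p.first_page, p.next_page, p.support
    FROM patterns AS p
    JOIN pages AS g ON g.url = p.next_page
    WHERE p.support >= ? AND g.content_type = ?
    ORDER BY p.support DESC
    """,
    (0.5, "product"),
).fetchall()
print(rows)
conn.close()
```

Changing the `WHERE` clause is how an analyst would express a different notion of "interesting", for example filtering on the content type of the first page instead.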

Application of Web Mining • Counter-Terrorism • E-Commerce • Security Threat and many more

Future Scope of Web Mining • One obstacle in Web mining research has been the difficulty of creating suitable test collections that can be reused by researchers. A test collection is important because it allows researchers to compare different algorithms using a standard test-bed under the same conditions, without being affected by such factors as Web page changes or network traffic variations. • Although textual documents are comparatively easy to index, retrieve, and analyze, operations on multimedia files are much more difficult to perform. With multimedia content on the Web growing rapidly, mining it has become a challenging problem. Various machine-learning techniques have been employed to address this issue, and research in pattern recognition and image analysis has been adapted for the study of multimedia documents on the Web.

Conclusion • As the Web and its usage continue to grow, so does the opportunity to analyze Web data and extract all manner of useful knowledge from it. • Web Mining is still in its initial stage and should continue to develop as the Web evolves. One future research direction for Web Mining is multimedia data mining. In addition to textual documents such as HTML, MS Word, PDF, and plain text files, the Web contains a large number of multimedia documents such as images, audio, and video.

Thank You
