On the use of side information for mining text data

On the Utilization Aspect of Document Data for Mining Text Knowledge
IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 6, June 2014: On the use of side information for mining text data

A Software/Manufacturing Research Company Run by a Microsoft Most Valuable Professional
• VenkatesanPrabu .J, Managing Director: Microsoft Web Developer Advisory Council team member and a well-known Microsoft Most Valuable Professional (MVP) for the years 2008 to 2014.
• LakshmiNarayanan.J, General Manager: BlackBerry Server Admin, Oracle 10g SQL Expert.
• Arunachalam.J: Electronic Architect, Human Resource Manager.

Abstract
• In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes.
• However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process, or can add noise to the process.
• Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach.
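The idea of pairing a partitioning step with a probabilistic model of side information can be sketched roughly as follows. This is a minimal illustration and not the paper's algorithm: the function names, the product-based scoring rule, the add-one smoothing, and the toy documents with a single "source host" attribute are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Summed term counts of a cluster (scale does not affect cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def side_posterior(side, labels, k):
    """Estimate P(cluster | side attribute) from the current labels,
    using add-one smoothing (an assumption of this sketch)."""
    counts = {}
    for attr, lab in zip(side, labels):
        counts.setdefault(attr, [1] * k)[lab] += 1
    return {a: [c / sum(cs) for c in cs] for a, cs in counts.items()}

def cluster(texts, side, labels, k, iterations=5, use_side=True):
    """Partitioning loop: reassign each document to the cluster that
    maximizes text similarity times the side-information posterior."""
    vecs = [Counter(t.lower().split()) for t in texts]
    for _ in range(iterations):
        cents = [centroid([v for v, l in zip(vecs, labels) if l == j])
                 for j in range(k)]
        post = side_posterior(side, labels, k)
        labels = [
            max(range(k),
                key=lambda j: cosine(v, cents[j]) * (post[s][j] if use_side else 1.0))
            for v, s in zip(vecs, side)
        ]
    return labels

# Hypothetical toy data: the last document is textually ambiguous.
texts = ["apple banana", "apple fruit", "car engine", "engine wheel", "engine apple"]
side = ["food.example", "food.example", "auto.example", "auto.example", "auto.example"]
start = [0, 0, 1, 1, 0]

with_side = cluster(texts, side, start, k=2)
text_only = cluster(texts, side, start, k=2, use_side=False)
```

In this toy run the ambiguous last document stays in cluster 0 when only text is used and moves to cluster 1 when the side attribute is taken into account. Multiplying the two scores is one simple choice; because noisy side information could equally pull a document the wrong way, a practical system would need some way to weight or filter the attributes.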

Existing System
• The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.
• Stop words are words which are filtered out prior to, or after, processing of natural language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always applied. Some tools specifically avoid removing them in order to support phrase search.
• In most cases, morphological variants of words have similar semantic interpretations and can be considered equivalent for the purpose of information retrieval (IR) applications. For this reason, a number of so-called stemming algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. Thus, the key terms of a query or document are represented by stems rather than by the original words.
• This not only means that different variants of a term can be conflated to a single representative form; it also reduces the dictionary size, that is, the number of distinct terms needed to represent a set of documents. A smaller dictionary saves storage space and processing time.
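As a rough illustration of the two preprocessing steps above, the following sketch filters stop words and applies a toy suffix-stripping stemmer, then compares dictionary sizes. The stop-word list, the suffix list, and the function names are assumptions made for the example; the stemmer is deliberately naive and is not the Porter algorithm or any other published stemmer.

```python
import re

# Assumed, illustrative stop-word list; real tools each ship their own.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "to", "in"}

# Assumed suffixes for a toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_dictionary(docs, stem=True):
    """Collect the set of distinct terms needed to represent the documents."""
    vocab = set()
    for doc in docs:
        for tok in remove_stop_words(tokenize(doc)):
            vocab.add(naive_stem(tok) if stem else tok)
    return vocab

docs = ["The miner is mining text", "Miners mined the texts on the web"]
raw_vocab = build_dictionary(docs, stem=False)
stemmed_vocab = build_dictionary(docs, stem=True)
print(len(raw_vocab), len(stemmed_vocab))
```

On these two sentences the unstemmed dictionary holds seven terms and the stemmed one four, since "miner"/"miners", "mining"/"mined", and "text"/"texts" are each conflated to one form. The stems need not be real words ("mining" becomes "min" here); they only need to be consistent across variants.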

Proposed System
• The proposed system performs a comparative analysis between the URL and the document. Supporting links are crawled by analyzing the URL.
• Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.
• Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on.
• In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'. Other search engines remove some of the most common words, including lexical words
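Two of the points above can be sketched with the Python standard library: collecting the links inside a document as side information to be crawled later, and seeing how stop-word removal affects phrase queries. The class and function names, the example URLs, and the stop-word list are hypothetical; the slides do not specify how the proposed system implements either step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the document's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return the links found in an HTML document, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def same_host(url_a, url_b):
    """Compare two URLs by host, one simple URL-versus-document signal."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc

# Assumed stop-word list, taken from the examples in the text above.
STOP_WORDS = {"the", "is", "at", "which", "on"}

def strip_stop_words(query):
    """Remove stop words from a whitespace-separated query."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

doc_url = "http://site.example/index.html"
page = ('<p>See <a href="/paper.html">the paper</a> and '
        '<a href="http://other.example/x">a mirror</a>.</p>')
links = extract_links(page, doc_url)
internal = [u for u in links if same_host(u, doc_url)]

print(links)
print(strip_stop_words("The Who"), strip_stop_words("The The"))
```

With this list, the query "The Who" shrinks to the single token "who" and "The The" disappears entirely, while "Take That" is untouched; a longer stop-word list that includes "that" would damage it as well. This is why some tools skip the filter when phrase search matters.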

System Architecture
• Hardware requirements:
  Processor: Core 2 Duo
  Speed: 2.2 GHz
  RAM: 2 GB
  Hard disk: 160 GB
• Software requirements:
  Platform: .NET (VS2010), ASP.NET, .NET Framework 4.0
  Database: SQL Server 2008 R2

Architecture Diagram

