Create Presentation
Download Presentation

Download Presentation
## Concept Ontology For Text Classification

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Concept Ontology For Text Classification**SwaranikaPalit**Content**Introduction Shrinkage in a hierarchy of Classes Hierarchical classification using few words Enhanced word clustering for Hierarchical Text Classification Hierarchical Classification Of Web Content References**Motivation**Huge volumes of worldwide accessible information Need to assimilate and profitably use these large information Manual organizing may not be feasible : expensive ,time constraint and involvement of large no of documents Automated Classification: Combination of information retrieval technology and machine learning**Text Classification Task**• Text Classification is the task of automatically sorting a set of documents into categories from a predeﬁned set. • Given a text document, decide which of several pre-defined categories it belongs to. • The count of each word gives the document a certain probability of belonging to each category. • Task of approximating unknown target function Φ : D × C → {T, F } (that describes how documents ought to be classiﬁed, according to a supposedly authoritative expert) by means of a function Φˆ : D×C → {T, F } called the classiﬁer, where C = {c1 ,..., c|C| } is a predeﬁned set of categories and D is a (possibly inﬁnite) set of documents. If Φ(dj , ci ) = T,thendjis called a positive example (or a member) of ci , while if Φ(dj , ci ) = F it is called a negative example of ci .**Hierarchical Text Classification**Motivation: • Categories are treated independently and relationship among them is not exploited in flat classifier • Dealing with larger problems by divide-conquer approach Hierarchy = = is-a-relationship Sportspersons Cricketer Sachin Tendulkar**Statistical Approaches For learning text classification**Each document di is represented as a vector of word counts, generated from a multinomial distribution:**Modeling Document Classes**Documents that have similar word count vectors belong to the same category and Categories or classes are modeled as clusters of documents in word-count space. Mixture Model of Classes: The overall probability of a document is a mixture of multinomial distributions of the probability of the document belonging to each class.**Problem**Typically represent documents as vectors of words Rely on learned word statistics, so these approaches are data-intensive Require large number of hand-labeled training documents per class to achieve high classification accuracy**Question**How can we scale up statistical learning algorithm 1)To task with large number of classes 2) Sparse Learning data per class Introduction Of Advanced Techniques……….**Shrinkage in Hierarchy Of Classes**Apply a technique from Statistics called “ Shrinkage” which provides improved estimates of parameters which would otherwise be uncertain due to sparse training data. This method exponentially reduces amount of computation for classification with a small amount of accuracy sacrificing Specially applicable when hierarchy is large and training data is sparse It exploits hierarchy by shrinking parameter estimates in data-sparse children towards the estimates of data-rich parents. Creates new parameter estimates in a child by linear interpolation of all hierarchy nodes from child to root.**Shrinkage in Statistics**Given several different estimates of the same quantity and the estimates can be improved by scaling and shifting them all toward a common value, thereby shrinking the variance of each.**“grow**corn tractor…” (Crops) Testing Data: Crops Evolution Magnetism Relativity Irrigation Botany Categories: selection mutation Darwin Galapagos DNA... corn wheat silo farm grow... corn tulips splicing grow... water grating ditch farm tractor... Training Data: ... ... Text Classification Automatically placing documents in their correct categories.**“corn**grow tractor…” The Idea: “Shrinkage” We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. (Crops) Science Testing Data: Agriculture Biology Physics Crops Botany Evolution Relativity Irrigation Magnetism Categories: corn wheat silo farm grow... selection mutation Darwin Galapagos DNA... water grating ditch farm tractor... corn tulips splicing grow... Training Data: ... ...**Bayesian Learning Framework Approach**Assume that text data is generated by parametric model and use training data to calculate estimates of model parameters. With the help of these estimates, we classify new text documents by Bayes rules and calculate the posterior probability that a class would have generated the test document in question. Classification will then become to select the most probable class. Apply the Bayes assumption: Probability of each word event in a document is independent of word’s context given the class and of its position in document.**Bayesian Text Classification**• Given document di, find the category or classes cj, that it belongs to. Two parameters must be learned: • The probability that the category cj generated document di. • The prior probability of category cj**Continued…….**N(wt,di) is the count of the number of times wt occurs in document di and P(cj|di) ={0.1} as the document’s class llevel.The estimate of probability of word wt in class cj is as :**Shrinkage for Text Classification**Use shrinkage to better estimate the probability of a word given a class, For each node in a tree a maximum likelihood(ML) estimate based on the data associated with the node is calculated. Improved estimate of each leaf node is achieved by shrinking its ML estimate towards ML estimate of all of its parents. In term of statistical modeling, unigram model is build for each node and each leaf model is smoothed by linearly interpolating it with all models found along the path to root. The estimate along the path from leaf to roof represents a tradeoff between specificity and reliability.**Question**Given a set of ML estimates along the path from leaf to root , how do we decide on the weights for interpolating (mixing) them? Shrinkage Algorithm……..Learning Mixture weights ‘λ’: Shrinkageimproves the estimated probability of word wtgiven class cj, by bringing it closer to for super-classes k, through linear combination:**Shrinkage Algorithm**• Calculate initial parameter estimates using word count • Initialize weights for the super-classes to any normalized value, such as λ ji= 1/k. • Repeat until convergence: Expectation step Maximization step • Expectation Step: Use the current l’s to estimate the degree to which each node was likely to have generated the words in held out documents. • Maximization Step: Use the estimates to recalculate new values for the λ’s.**Experimental Results**The shrinkage algorithm is tested on three hierarchical sets of documents: Company web pages, classified by industry sector. Usenet articles, classified into 20 newsgroup categories. Web pages from the Yahoo “science” and “health” hierarchies. From the industry sector figure, it shows that : Larger vocabulary sizes generally performs better , shrinkage improves classification accuracy of 74% , Hierarchical feature selection improves the naive Bayes classifier around 5000 words, Shrinkage performs better than naive Bayes for all vocabulary sizes**Yahoo Science**• Accuracy does not improve with vocabulary size for either classifier. • Accuracy is actually reduced for naïve Bayes above 10,000 words. • The lack of improvement may be an artifact of the experimental setup, which can only handle hierarchies of depth two. • The second graph shows accuracy averaged over classes instead of documents. • Shrinkage shows a much greater performance improvement over naive Bayes. • This is due to the nature of the Yahoo training set, in which the majority of documents were lumped into a few classes**NewsgroupShrinkage Helps more when training data is sparse**In this experiment, the vocabulary size is held constant, while the size of the training set varies. Hierarchical feature selection does not improve either classifier and is not shown. Shrinkage performs better than naive Bayes, but by a smaller margin, due to lower fan-out. Shrinkage performs better than naive Bayes by a greater margin for smaller training sets.**Hierarchical Classification using few words**Motivation: Categorize the different documents according to their topic, where topics are organized in a hierarchy of increasing specificity. Facing with large number of classes and huge number of relevant features to distinguish them in context of text classification Restriction to use simple classifier due to computational cost and tendency of complex models to over fit. Approach: Divide the classification task into a set of smaller classification tasks, each of which corresponds to some split in classification hierarchy Each of subtask is significantly simpler than original task .**Key Note**The classifier at each node in hierarchy is needed to be distinguished only between smaller number of categories, so it its possible to use small set of features. With small feature set , models are robust and less subject to over fitting and provides better accuracy even for simple classifiers. Basic point is integration of features with hierarchical structure along with its selection. In Hierarchical approach, any document only sees a small fraction of features through out the process and the feature it sees are divided so as to focus the attention of classifier on the features relevant to the classification subtask at hand. Applicable to medical domain also-to classify the disease that a patient has based on symptoms and test results**Probabilistic Framework**Construct a hierarchical set of classifiers, each based on its own set relevant features. Two main subroutines: (1)Feature Selection Algorithm for deciding on the appropriate feature set at each decision point. (2)Supervised Learning Algorithm for constructing a classifier on this decision. Basic idea is our model of world is represented as a probabilistic distribution over the space of possible states of the world. A state of world is described via some set of random variables, so that each state is an assignment of values to these variables.**Bayesian Classifier**A Bayesian network allows to provide compact descriptions of complex distributions over a range number of variables. It uses a directed cyclic graph to encode conditional independence assumptions about the domain and these assumptions allow the distribution to be described as a product of small local interaction models. Bayesian classifier is a Bayesian network applied to classification domain. It contains node C for class variable and a node Xi for each of the features. For specific instance x, it allows to compute the probability P(C =ck|X = x) for each possible class ck. Bayes Optimal classification is achieved by selecting class ck for which this probability is maximum.**Naïve Bayesian classifier restricts the domain features to**be conditionally independent of each other , given the class variable. But it is clearly unrealistic in text domain so approaches proposed for augmenting Bayesian classifier with limited interactions between feature variables i.e allowing to have some parents beyond the class variables for each node.**Feature Selection**Employs information theoretic measures to determine a subset of original domain features that seem to best capture the class distribution in the data. For each feature Xi, it determines expected cross-entropy: where X_i is the set of all domain features except Xi. Then eliminates the features Xi for which is minimized The feature eliminated least disrupts the original conditional class distribution The process can be iterated to eliminate as many feature as desire. Since it assumes conditional independence of features , it is a fast process to update P(C|X) to P(C|X_i),after elimination of Xi. So, it is very applicable to text domain with many features.**Experimental Results**Two subsets of Reuters collection Hier1 and Hier2 were extracted and were split 70%/30% into class stratified training and testing sets. Apply a single pass of Zipf’s law-based feature selection method which eliminates all words appearing fewer than 10 or more than 1000 times in the training corpus. Both hierarchical and flat classification schemes are ran on the datasets without employing any probabilistic feature selection. Original numbers of features in each dataset and results are given in the following table:**Two important phenomenon are observed:**In Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than simple flat method. In Hier2 dataset, the large number of features and small dataset size allows for more expressive KDB algorithm to overfit the training data. The initial results provide an empirical motivation for integration of feature selection. Next , feature set had been reduced to 20 and then 10 features i.e. the hierarchical method actually examines a large set of features and also for flat method 50 most relevant features had been used. The results are shown in the table: In every run using feature selection , an improvement in accuracy was found ,even though 1248 features were eliminated from Hier1 dataset. Hierarchical method clearly outperforms the flat classification method while considering a direct comparison of the 10 and 20 runs. Hierarchical method produces 8-41% fewer errors than flat methods for Hier1 and more modest relative gains for Hier2.**The results reveal that feature selection stage does serve**to focus the algorithm on the features relevant to the local classification task. The last table shows that the set of 10 features found to be most discriminating at each level of hierarchy learned for the Hier1 dataset. At the top level of hierarchy, high level terms are selected from various major topics. More specific words distinguishing the subtopics are seen.**Enhanced Word Clustering For Hierarchical Text**Classification Typically document vectors are formed using a vector-space or bag-of-words mode and is extremely high dimensional. This high dimensionality can be serve as obstacle for classification algorithm based on SVM, k-nearest neighbor. Distributional clustering of words/features is a way to reduce dimensionality where each word cluster is treated as single feature. Feature clustering is more efficient than feature selection for lower number of features. For small training set and noisy feature ,word clustering increases the classification accuracy. Algorithms used are agglomerative in nature yielding suboptimal word clusters at a high computational cost .**Distributional Word Clustering**C is a discrete random variable that takes on values from the set of classes C={c1,c2….cl} and W be the random variable that ranges over the set of words W={w1,…wm} The joint distribution p(C,W) can be estimated from the training set Let, we cluster the words into k clusters W1,W2,…Wk. Here we will consider only “hard clustering” where each word belongs to exactly one word cluster, as we are concentrating on reducing the number of features and model size: Random variable W ranges over the word clusters.**To judge the quality of the word clusters we will use**information-theoretic nature. The information about C captured by W is measured by mutual information I(C;W).Ideally ,in formation of word clusters exact preservation of mutual information is expected. But clustering usually lowers mutual information. This motivates to find a clustering that minimizes the decrease in mutual information, I(C;W) - I(C;W ) Introduce Divisive algorithm for word clustering based on Kullbak-Leibler(KL) divergences.**Classifying using Word Clusters**We can adapt Naïve Bayes method which will translate into using word clusters instead of words. Naïve Bayes classifier for word parameters: Estimate new parameters p(Ws|ci) for word clusters similar to p(wt|ci) as:**Estimate of p(wt|ci) for individual words are relatively**poor ,the corresponding word cluster parameters p(Ws|ci) provides more robust estimates which results in higher classification score. Naïve Bayes rule for classifying a test document can be written as SVM can similarly used with word clusters**Divisive Clustering achieves higher classification accuracy**than the best feature selection method when training data is sparse. The previous figures plot fraction of mutual information lost against number of clusters for both Divisive and agglomerative algorithm in 20Ng and Dmoz datasets. Fraction of mutual information lost due to clustering words is given as: The term in the numerator is precisely the global objective function that Divisive Clustering attempts to minimize. Less mutual information is lost in Divisive Clustering compared to agglomerative at all number of clusters,though the difference is more pronounced at lower number of clusters**Figure 5 shows classification accuracies on the 20**Newsgroups data set for the algorithms considered. The horizontal axis indicates the number of features/clusters used in the classification model while the vertical axis indicates the percentage of test documents that were classified correctly Divisive Clustering (SVM as well as NB) achieves significantly better results at lower number of features than feature selection using Information Gain and Mutual Information With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy The largest gain occurs when the number of clusters equals the number of classes. In Figure 6, the classification accuracy is plotted on 20Ng data using Naive Bayes when the training data is sparse. 2% of the available data is taken, that is 20 documents per class, for training and tested on the remaining 98% of the documents The results are averages of 5-10 trials Divisive Clustering obtains better results than Information Gain at all number of features. It also achieves a significant 12% increase over the maximum possible accuracy achieved by Information Gain. This is in contrast to Figure 5 where Information Gain eventually catches up as the number of features increases When the training data is small the word by class frequency matrix contains many zero entries. By clustering words, more robust estimates of word class probabilities are obtained which lead to higher classification accuracies**Hierarchical classification Of Web Content**In Hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. Classification problem is decomposed into a set of smaller problems corresponding to hierarchical splits in the tree by utilizing hierarchical structure. Use of a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation. Each sub-problem is smaller than the original problem and is sometimes possible to use a much smaller set of features for each. Hierarchical structure can also be used to set the negative set for discriminative training and at classification time to combine information from different levels.**Use Of SVM for Web Content**• SVM learning model is found to be efficient and effective for text classification. • The efficiency of SVMs for both initial learning and real-time classification make them applicable to large dynamic collections like web content. • Hierarchical structure is used for two purposes • Train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat non hierarchical case). • Combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second level comparisons are made**Classifying web search results:**• Classification techniques are used to automatically organize search results into existing hierarchical structures • Classification models are learned offline using a training set of human-labeled documents and web categories • Classification offers two advantages compared to clustering – • run time classification is very efficient • manually generated category names are easily understood Constraints: • To support goal of automatically classifying web search results two constraints • Use of just the short summaries returned from web search engines. since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates like abstracts that some have worked with. • Focus on the top levels of the hierarchy since we believe that many search results can be usefully disambiguated at this level. • Develop an interface that tightly couples search results and category structure are found to have large preference and performance advantages for automatically classified search results