AN OPTIMIZED APPROACH FOR kNN TEXT CATEGORIZATION USING P-TREES Imad Rahal and William Perrizo Computer Science Department North Dakota State University Fargo, ND imad.rahal@ndsu.nodak.edu
Outline • The Text Categorization problem • The P-tree technology • Vector Space Model • Proposed Solution • Intervalization (discretization) • P-tree representation • Similarity measures • Categorization algorithm • Performance analysis study
Text categorization problem • Text Categorization (also called topic spotting or text classification) is the process of assigning categories or labels to documents based entirely on their contents • Problems • text has no explicit structure, unlike other data (e.g. relational data) • information is described freely in the documents • (after introducing structure) huge number of features
Motivation • Increase in the number of text documents (on the Internet!) • Medical articles • Research publications • E-mails • News reports (e.g. Reuters) • Others • Most algorithms fail to scale up because of the curse of dimensionality • Most algorithms suffer from relatively low accuracy
The P-tree technology • Tree-like data structure that stores numeric (and categorical) relational data in bit-compressed format by • splitting each attribute into bits • representing each bit position by a P-tree
P-trees are characterized by • 1-time creation cost • Compression • High-speed processing (ANDing, no DB scans) • The latest benchmark on P-tree ANDing has shown a speed of 6 ms for two 1320x1320 images (i.e. two bit sequences each containing 1.6 million bits represented using P-trees)
We have 8 P-trees in total for each attribute shown in the previous example: • PA,7 PA,6 PA,5 PA,4 PA,3 PA,2 PA,1 and PA,0 • To query for a certain attribute value, say Attribute A = 1110 0001, we do the following: • PA,1110 0001 = PA,7 & PA,6 & PA,5 & P’A,4 & P’A,3 & P’A,2 & P’A,1 & PA,0 • We can also use varying bit precision. To query for A = 001 (3-bit precision), we do the following: • PA,001 = P’A,2 & P’A,1 & PA,0
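To make the bit-slice querying concrete, here is a minimal sketch in Python. It uses plain integers as uncompressed bitmasks in place of real P-trees (which store each bit column as a compressed quadrant tree and AND without DB scans); the helper names and sample values are illustrative assumptions, not part of the original implementation.

```python
# Minimal sketch of bit-sliced querying in the spirit of P-trees.
# Each "P-tree" is simplified to a plain Python bitmask over the rows;
# attribute values below are illustrative only.

def build_bit_slices(values, num_bits=8):
    """Return one bitmask per bit position: bit r of slice j is 1
    iff bit j of values[r] is 1."""
    slices = [0] * num_bits
    for row, v in enumerate(values):
        for j in range(num_bits):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices

def query_equal(slices, target, num_bits=8, num_rows=0):
    """AND together P_j (or its complement) according to the bits of target,
    mimicking P_A,v = P_A,7 & ... & P'_A,0."""
    all_rows = (1 << num_rows) - 1          # pure-1 mask over all rows
    result = all_rows
    for j in range(num_bits):
        p_j = slices[j]
        result &= p_j if (target >> j) & 1 else (all_rows & ~p_j)
    return result

values = [0b11100001, 0b00100000, 0b11100001]
slices = build_bit_slices(values)
mask = query_equal(slices, 0b11100001, num_rows=len(values))
print(bin(mask))  # rows 0 and 2 match -> 0b101
```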
Vector Space Model • Each document is represented as a vector whose dimensions are the terms in the initial document collection • Each vector coordinate corresponds to a term and holds a numeric value representing that term's relevance to the document. Usually higher values imply higher relevance
Three popular weighting schemes are: Binary, TF, and TF*IDF. • The binary scheme uses the values 1 and 0 to indicate whether or not a term exists in the document • The term frequency (TF) scheme counts the occurrences of a term in a document. Usually these counts are normalized to overcome the problems associated with varying document lengths
The TF*IDF scheme multiplies the coordinate value derived by the TF scheme by a global weight called the IDF. The IDF measure for term t is defined as log(N/Nt) where N is the total number of documents and Nt is the number of documents containing t. Cosine normalization is usually applied to the resulting vectors
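A minimal sketch of the three weighting schemes over a toy document collection; the function and variable names are illustrative assumptions only.

```python
# Minimal sketch of the binary, TF, and TF*IDF (with cosine normalization)
# weighting schemes; the toy documents are illustrative only.
import math
from collections import Counter

docs = [["ptree", "knn", "text"], ["knn", "knn", "kernel"], ["text", "text", "ptree"]]
N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(1 for d in docs if t in d) for t in vocab}      # N_t per term

def weight(doc, scheme="tfidf"):
    tf = Counter(doc)
    if scheme == "binary":
        vec = [1.0 if t in tf else 0.0 for t in vocab]
    elif scheme == "tf":
        vec = [tf[t] / len(doc) for t in vocab]               # length-normalized TF
    else:  # TF*IDF with IDF = log(N / N_t)
        vec = [tf[t] * math.log(N / df[t]) if t in tf else 0.0 for t in vocab]
    norm = math.sqrt(sum(w * w for w in vec)) or 1.0          # cosine normalization
    return [w / norm for w in vec]

print(weight(docs[0], "tfidf"))
```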
Proposed solution • Model 1: Classification over the binary representation is not accurate but fast • Model 2: Classification using exact counts (tf, idf, normalized tf…) is more accurate but slower (very high dimensional space) • This can be viewed as a concept hierarchy
Work along this hierarchy by using intervals • Better speed than Model 2 (approaching Model 1) • Better accuracy than Model 1 (approaching Model 2)
An example • say we’re using TF (values normalized in the range [0,1]) • divide the range into 4 intervals: None, Low, Medium, High • Each interval will be represented by a string of bits (we have four intervals so we need 2 bits) • None = “00”, Low = “01”, Medium = “10” and High = “11” (note the order among them) • Each bit position will be represented by a P-tree; so we have 2 P-trees for every dimension
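A minimal sketch of this intervalization step, mapping normalized TF values to 2-bit interval codes; the boundaries below follow the 4-interval set used later in the performance study, and the helper name is an assumption.

```python
# Minimal sketch of intervalization: mapping normalized TF values in [0, 1]
# into 2-bit interval codes (None/Low/Medium/High). The boundaries follow
# the 4-interval set from the performance study: [0,0], (0,0.25],
# (0.25,0.75], (0.75,1].

def intervalize(tf_value):
    """Return the 2-bit code for a normalized TF value."""
    if tf_value == 0.0:
        return 0b00   # None
    if tf_value <= 0.25:
        return 0b01   # Low
    if tf_value <= 0.75:
        return 0b10   # Medium
    return 0b11       # High

# Each of the two bit positions would then become one P-tree per term.
for v in (0.0, 0.1, 0.5, 0.9):
    print(v, format(intervalize(v), "02b"))
```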
kNN Algorithm • Used to find the k most similar points (referred to as the k neighbours) to some given point P in some space and then assign a proper class to P using the class labels of the k neighbours • Usually proceeds by selecting the neighbours first (selection phase) and then assigning the class label (voting phase)
Categorization Algorithm: Selection Phase • Initialize a P-tree, Pnn, to contain only pure-1 quadrants (i.e. all entries in it are 1’s) – the identity P-tree for ANDing • Order the set S of all term P-trees in descending order, from term P-trees representing higher interval values in dnew to those representing lower ones • For every term P-tree, Pt, in S do the following • AND Pnn with Pt • If the root count of the result is less than k, expand Pt by removing the rightmost bit from the interval value (i.e. intervals 01 and 00 become 0, and intervals 10 and 11 become 1). This can be done by recalculating Pt while disregarding the rightmost-bit P-tree. Repeat this step until the root count of Pnn AND Pt is at least k – this is guaranteed to happen at the latest when all bits are disregarded. • Else, put the result in Pnn • Loop • End of selection phase (a code sketch follows below)
[Figure: Pnn is built by successively ANDing the ordered term P-trees P3, P7, P6, P4, P5, P1, P2]
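A minimal sketch of the selection phase, again over plain-integer bitmasks standing in for P-trees; root_count, term_ptree, and the data layout are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of the selection phase: Pnn starts as the pure-1 mask and is
# ANDed with each term P-tree in order; when the root count would drop below k,
# the term's interval value is widened (rightmost interval bit dropped) first.

def root_count(mask):
    """Number of 1 bits, i.e. the root count of the (uncompressed) P-tree."""
    return bin(mask).count("1")

def term_ptree(bit_ptrees, interval_code, all_rows, bits=2):
    """AND the per-bit term P-trees (bit_ptrees[1] = high bit, bit_ptrees[0] =
    low bit) to select documents whose interval value matches interval_code
    on its top `bits` bits; bits=0 matches everything."""
    result = all_rows
    for j in range(2 - bits, 2):
        p_j = bit_ptrees[j]
        result &= p_j if (interval_code >> j) & 1 else (all_rows & ~p_j)
    return result

def selection_phase(dnew_terms, bit_ptrees_by_term, all_rows, k):
    """dnew_terms: list of (term, interval_code) ordered from higher to lower
    interval value in dnew."""
    pnn = all_rows                              # pure-1 "identity" P-tree
    for term, code in dnew_terms:
        bits = 2
        cand = pnn & term_ptree(bit_ptrees_by_term[term], code, all_rows, bits)
        while root_count(cand) < k and bits > 0:
            bits -= 1                           # drop the rightmost interval bit
            cand = pnn & term_ptree(bit_ptrees_by_term[term], code, all_rows, bits)
        pnn = cand
    return pnn                                  # neighbourhood of dnew
```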
Categorization Algorithm: Voting Phase • For every class ci, loop through the dnew vector and do the following for every term tj in dnew: • Get the P-tree representing the neighbouring documents (Pnn from the selection phase) having the same interval value for tj (Ptj) and class ci (Pi). This can be done by calculating Presult = Ptj AND Pnn AND Pi • If the term under consideration has interval value Ij, multiply the root count of Presult by (Ij+1) //if we want to neglect Ij=“00” then don’t add 1 • Add the result to the counter of ci, w(ci) • Loop • Select the class ck having the largest counter w(ck) as the class of dnew • End of voting phase
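A matching sketch of the voting phase; it reuses root_count and term_ptree from the selection-phase sketch above, and class_ptrees (one bitmask per class) is an assumed input.

```python
# Minimal sketch of the voting phase: for each class, each term of dnew
# contributes the root count of (P_tj AND Pnn AND P_ci), weighted by its
# interval value I_j + 1.

def voting_phase(dnew_terms, bit_ptrees_by_term, class_ptrees, pnn, all_rows):
    scores = {}
    for ci, p_ci in class_ptrees.items():
        w = 0
        for term, code in dnew_terms:
            p_tj = term_ptree(bit_ptrees_by_term[term], code, all_rows)
            p_result = p_tj & pnn & p_ci
            w += root_count(p_result) * (code + 1)   # weight by interval value
        scores[ci] = w
    # the class with the largest counter is assigned to dnew
    return max(scores, key=scores.get)
```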
Performance analysis study • Compared accuracy and speed to cosine-similarity kNN, and accuracy to the string kernels approach by Lodhi et al. (Journal of Machine Learning Research, Feb. 2002) • Speed • Used synthetic document x term matrices with different sizes
Accuracy: • Followed the sampling approach described in the string kernels paper • Tested over a subset of the Reuters-21578 collection (analysis over the whole dataset is still underway) • Experimented on four classes, namely: acquisition, earn, corn, and crude. We used k=3 and a 4-interval value set, I0=[0,0], I1=(0,0.25], I2=(0.25,0.75] and I3=(0.75,1]. • Averaged precision (not shown), recall (not shown) and F1 measures (2pr/(p+r)) for our approach and cosine kNN and compared with string kernels
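For reference, a tiny sketch of the averaged F1 computation used in the comparison; the per-class precision/recall numbers are made up for illustration and are not the paper's results.

```python
# Minimal sketch of the averaged F1 measure, F1 = 2pr / (p + r).

def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall) per class -- illustrative values only
per_class = {"acq": (0.95, 0.92), "earn": (0.97, 0.96),
             "corn": (0.85, 0.80), "crude": (0.88, 0.86)}
avg_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)
print(round(avg_f1, 3))
```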
Compared to the kNN approach, we show much better results in terms of speed and accuracy • The reason for the improvement in speed is mainly related to the • complexity of the selection phase: O(n) vs O(mn), where m is the size of the dataset – number of rows – and n is the number of dimensions • and P-tree ANDing speed
As for accuracy, • the kNN approach uses the angle between the vectors and considers all terms • Our approach uses ANDing to compare the closeness of the value of each term and to ignore unneeded terms (those whose ANDing yields fewer than k neighbours)
As for the kernels approach, it would not be appropriate to compare speeds here because the two approaches are fundamentally different. • Example-based vs Eager • Context-sensitive vs Context-insensitive • In general, results were very comparable
The precision, recall and F1 measurements spread over a wider range in the other two approaches than in ours, which indicates that our P-tree-based approach’s accuracy is less variable across categories or classes, leading to more stable results in general
Drawbacks • Needs tuning • We need to decide upon the number of intervals and their ranges ahead of time (analysis for varying those is still underway) • Since this is a kNN algorithm, k must also be known ahead of time
Conclusion • We have shown • Higher accuracy • The use of sequential ANDing in selection • Very fair voting • Use of a closed neighbourhood (in case the root count is greater than k) – refer to Maleq Khan’s thesis (Dec. 2001) for previous work
Better space utilization • reduced, compressed space • Reduced space due to intervalization (from 8 bits to 2 bits, a reduction by a factor of 4) • Compression due to the use of P-trees • Higher speed • Due to P-trees • No DB scans • Based on the AND operation, which is among the fastest computer instructions
Future direction • Solve the problem of the arbitrary ANDing order for term P-trees having the same values – information gain? • Test the effects of varying the number of intervals and their boundaries over different datasets • Analyze speed and accuracy results over large datasets (the whole Reuters collection)