Automatic Detection of Social Tag Spams Using a Text Mining Approach
Hsin-Chang Yang, Associate Professor, Department of Information Management, National University of Kaohsiung
Outline • Introduction • Association Discovery by SOM • Tag Spam Detection • Experimental Results • Conclusions
Social Bookmarking: Why? • Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits: • Less effort in Web page annotation • Improved retrieval precision • Simpler Web page classification
How does a folksonomy work? • Simple • A user ($u_i$) annotates a Web page ($o_j$) with a set of tags ($T_{ij}$). • Generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and each tag $t_{ij} \in T$.
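The tuple representation above can be illustrated with a small sketch. The `Post` type, the user names, and the example URLs are hypothetical and chosen only for illustration; the slides do not prescribe any particular data structure.

```python
from typing import FrozenSet, NamedTuple


class Post(NamedTuple):
    """One annotation (u_i, o_j, T_ij): a user tags a Web page with a tag set."""
    user: str
    resource: str
    tags: FrozenSet[str]


# Hypothetical example data.
posts = [
    Post("alice", "https://example.org/som", frozenset({"som", "clustering"})),
    Post("bob", "https://example.org/som", frozenset({"neural-network"})),
    Post("alice", "https://example.org/spam", frozenset({"spam", "detection"})),
]

# The sets U (users), O (Web pages), and T (tags) follow from the tuples.
users = {p.user for p in posts}
resources = {p.resource for p in posts}
all_tags = set().union(*(p.tags for p in posts))
```

Storing each tag set as a `frozenset` keeps the tuples hashable, which is convenient if posts are later grouped or deduplicated.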
Characteristics of Folksonomy • Collaboration • Semantic relatedness • Possibility of spam
Tag Spams • Tags that are unrelated or improperly related to the content/semantics of the annotated Web pages. • They arise for advertisement or promotional purposes. • They mislead users and degrade retrieval results.
System Architecture • Inputs: Web pages and tags • Preprocessing → Web page vectors and tag vectors • SOM training → synaptic weight vectors • Labeling → page clusters and tag clusters • Association discovery → page/tag associations
Preprocessing • Bag-of-words approach • Each Web page $P_i$ is transformed into a binary vector $\mathbf{P}_i$. • $T_i$, the tag list of $P_i$, is transformed into a binary vector $\mathbf{T}_i$.
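The binary bag-of-words step above can be sketched as follows. This is a minimal illustration under stated assumptions: the tokenizer, the helper names, and the toy pages are hypothetical, and the slides do not say whether stop-word removal or stemming was applied. Separate vocabularies are built for pages and tag lists, which is consistent with the two different vocabulary sizes reported in the experiments.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a fixed vector position (sorted order)."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_binary_vector(text, vocab):
    """Return a 0/1 vector with 1 where a vocabulary word occurs in the text."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


# Hypothetical toy data: two pages and their tag lists.
pages = ["Python tutorial for beginners", "Cheap flights and hotel deals"]
tag_lists = ["python programming tutorial", "travel flights deals"]

page_vocab = build_vocabulary(pages)
tag_vocab = build_vocabulary(tag_lists)
page_vectors = [to_binary_vector(p, page_vocab) for p in pages]
tag_vectors = [to_binary_vector(t, tag_vocab) for t in tag_lists]
```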
SOM Training • The page vectors $\mathbf{P}_i$ and the tag vectors $\mathbf{T}_i$ were used to train two self-organizing maps separately. • Two maps, $M_P$ and $M_T$, were obtained after the training.
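A generic self-organizing map training loop might look like the sketch below. It is a textbook-style SOM, not necessarily the variant used in this work: the Gaussian neighborhood, the linear decay schedules, the initial learning rate `lr0`, and the random initialization are all assumptions, as are the function names.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (Euclidean)."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda n: dist2(weights[n]))


def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols map and return one weight vector per neuron."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Random initial weights in [0, 1), matching the range of binary inputs.
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighborhood
        for x in vectors:
            br, bc = divmod(best_matching_unit(weights, x), cols)
            for n, w in enumerate(weights):
                r, c = divmod(n, cols)
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                # Pull the neuron toward x, scaled by its grid distance to the BMU.
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical toy tag vectors; a tiny 3 x 3 map keeps the demo fast.
tag_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
tag_map = train_som(tag_vectors, rows=3, cols=3, epochs=20)
```

The same function would be called once with the page vectors and once with the tag vectors to obtain the two maps.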
Labeling • We labeled each Web page on $M_P$ by finding its most similar neuron. A page cluster map (PCM) was obtained after all pages were labeled. • The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
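The labeling step can be sketched as a nearest-neuron assignment. Squared Euclidean distance is assumed here as the similarity measure, since the slides only say "most similar"; the hand-set weights and the function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def most_similar_neuron(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda n: squared_distance(weights[n], x))


def label(vectors, weights):
    """Assign each vector to its closest neuron.

    Returns the per-vector labels and a cluster map from neuron index to
    the indices of the vectors labeled to that neuron.
    """
    labels = []
    cluster_map = defaultdict(list)
    for idx, x in enumerate(vectors):
        n = most_similar_neuron(weights, x)
        labels.append(n)
        cluster_map[n].append(idx)
    return labels, dict(cluster_map)


# Hypothetical two-neuron map with hand-set weights, for illustration only.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
page_vectors = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
page_labels, pcm = label(page_vectors, weights)
# page_labels -> [0, 1, 1]; pcm -> {0: [0], 1: [1, 2]}
```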
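Association Discovery • Finding associations between page clusters and tag clusters. • We used a voting scheme to find the associations: each training pair ($P_i$, $T_i$) adds one vote (+1) to the association between the PCM cluster of $P_i$ and the TCM cluster of $T_i$.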
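One way to read the voting scheme is sketched below: every training pair casts one vote for the combination of its page cluster and its tag cluster. The `min_votes` cutoff that turns vote counts into related/unrelated decisions is a hypothetical parameter; the slides do not state how votes are thresholded.

```python
from collections import Counter


def discover_associations(page_labels, tag_labels, min_votes=1):
    """Count votes for (page cluster, tag cluster) pairs.

    page_labels[i] and tag_labels[i] are the clusters of the i-th training
    page and of its tag list. Pairs with at least min_votes votes are
    treated as related (min_votes is an assumed parameter).
    """
    votes = Counter(zip(page_labels, tag_labels))
    related = {pair for pair, count in votes.items() if count >= min_votes}
    return votes, related


# Hypothetical labels for five training pairs.
page_labels = [0, 0, 1, 1, 1]
tag_labels = [2, 2, 3, 3, 2]
votes, related = discover_associations(page_labels, tag_labels, min_votes=2)
# votes[(0, 2)] -> 2, votes[(1, 3)] -> 2, votes[(1, 2)] -> 1
# related -> {(0, 2), (1, 3)}
```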
Architecture of Tag Spam Detection • Inputs: incoming Web page and incoming tag list • Preprocessing → incoming page vector and incoming tag vector • Labeling (using the PCM and TCM) → labeled page cluster and labeled tag cluster • Spam detection (using the page/tag associations) → tag spams
Spam Detection • Two types of tag spam detection • Document-scope detection (post-level detection): the whole tag list is identified as spam. • Tag-scope detection (tag-level detection): individual tags are identified as spam. • Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively. • Let $P_I$ and $T_I$ be labeled to page cluster $p$ on the PCM and tag cluster $k$ on the TCM, respectively.
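Document-Scope Detection • Relatedness between the page cluster and the tag cluster: [formula missing] • Q: neighborhood of [cluster symbol missing] • $A = [a_{ij}]$ is the correlation matrix between the PCM and the TCM: $a_{pk} = 1$ if page cluster $p$ and tag cluster $k$ are related; otherwise $a_{pk} = 0$. • D: geometric distance between two clusters • $T_I$ is identified as spam if [condition missing]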
Tag-Scope Detection • A tag is spam if it is inconsistent with the other tags in the same tag cluster. • Let $T_i = \{t_{ij}\}$ be a tag list and [definition of $W$ missing]. • An incoming tag $t_{Ij} \in T_I$ is spam if $t_{Ij} \notin W$.
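The tag-scope rule can be sketched as a set-membership test. Because the definition of W did not survive in this transcript, the sketch assumes that W is the union of all tags in the training tag lists labeled to the same tag cluster as the incoming tag list; that choice, the function names, and the toy data are hypothetical.

```python
def cluster_vocabulary(training_tag_lists, tag_labels, cluster):
    """Assumed W: union of tags from training tag lists labeled to `cluster`."""
    words = set()
    for tags, lab in zip(training_tag_lists, tag_labels):
        if lab == cluster:
            words.update(tags)
    return words


def spam_tags(incoming_tags, w):
    """Return the incoming tags that do not appear in W, in their original order."""
    return [t for t in incoming_tags if t not in w]


# Hypothetical training tag lists and their tag-cluster labels.
training_tag_lists = [
    ["python", "tutorial"],
    ["python", "code"],
    ["travel", "flights"],
]
tag_labels = [0, 0, 1]

# Suppose the incoming tag list was labeled to tag cluster 0.
incoming = ["python", "code", "casino"]
w = cluster_vocabulary(training_tag_lists, tag_labels, cluster=0)
flagged = spam_tags(incoming, w)
# flagged -> ["casino"]
```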
Experimental Result • Dataset • 1500 Web page / tag list pairs collected from www.delicious.com • Each pair was inspected manually at both the post level and the tag level • 583 distinct Web pages • Vocabulary sizes • Web pages: 13437 • Tag lists: 5157 • Average number of tags per page: 4.7
Experimental Result • Parameters • map sizes • PCM: 10 × 10 • TCM: 10 × 10 • training epochs • PCM: 400 • TCM: 200 • : 0.7
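Experimental Result • Number of training / test data: 1000 / 500 • Confusion matrix for document-scope detection (counts as they appear in the formulas): true positives 118, false negatives 44, false positives 65, true negatives 273 • Accuracy = (118 + 273) / 500 = 78.2% • Recall = 118 / (118 + 44) = 72.8% • Precision = 118 / (118 + 65) = 64.5%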
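The three figures above follow directly from the confusion-matrix counts, as this short check shows. The function name is hypothetical; the counts are the ones that appear in the formulas, with spam treated as the positive class.

```python
def detection_metrics(tp, fn, fp, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


metrics = detection_metrics(tp=118, fn=44, fp=65, tn=273)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 78.2%
# recall: 72.8%
# precision: 64.5%
```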
Further Result of Document-Scope Detection • Result after 10-fold cross validation • Confusion matrix (average counts as they appear in the formulas): true positives 123.1, false negatives 43.6, false positives 62.3, true negatives 271 • Accuracy = (123.1 + 271) / 500 = 78.8% • Recall = 123.1 / (123.1 + 43.6) = 73.8% • Precision = 123.1 / (123.1 + 62.3) = 66.4%
Further Result of Tag-Scope Detection • Result after 10-fold cross validation • Confusion matrix (values are average numbers of tags per page): true positives 1.4, false negatives 0.4, false positives 0.7, true negatives 2.2 • Accuracy = (1.4 + 2.2) / 4.7 = 76.6% • Recall = 1.4 / (1.4 + 0.4) = 77.8% • Precision = 1.4 / (1.4 + 0.7) = 66.7%
Conclusions • A novel scheme for tag spam detection based on text mining. • Relatedness between Web pages and tags was discovered using self-organizing maps. • The scheme uses only the content of Web pages, not user behavior.