Automatic Detection of Social Tag Spams Using a Text Mining Approach

Hsin-Chang Yang, Associate Professor, Department of Information Management, National University of Kaohsiung

Outline • Introduction • Association Discovery by SOM • Tag Spam Detection • Experimental Results • Conclusions

Social Bookmarking: Why? • Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits: • Less effort in Web page annotation • Improved retrieval precision • Simpler Web page classification

How Does Folksonomy Work? • Simple • A user (ui) annotates a Web page (oj) with a set of tags (Tij). • Generally represented as a set of tuples (ui, oj, Tij), where ui ∈ U, oj ∈ O, and each tag tij ∈ T.
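The tuple representation above can be sketched in a few lines of Python. The users, pages, and tags below are made-up illustrative values, not data from the study, and the helper name `tags_by_page` is hypothetical.

```python
from collections import defaultdict

# Hypothetical posts: each tuple is (user u_i, page o_j, tag set T_ij).
posts = [
    ("u1", "o1", {"python", "tutorial"}),
    ("u2", "o1", {"python", "programming"}),
    ("u1", "o2", {"news"}),
]

# The sets U (users), O (pages), and T (all tags) follow from the tuples.
U = {user for user, _, _ in posts}
O = {page for _, page, _ in posts}
T = set().union(*(tags for _, _, tags in posts))

# Collaboration: tags from different users accumulate on the same page.
tags_by_page = defaultdict(set)
for _, page, tags in posts:
    tags_by_page[page] |= tags
```

Grouping by page shows how several users jointly describe one resource, which is also what makes it possible for a single user to inject unrelated tags.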

Characteristics of Folksonomy • Collaboration • Semantic relatedness • Possibility of spam

Tag Spams • Tags that are unrelated or improperly related to the content/semantics of the annotated Web pages. • Arise for advertisement or promotional purposes. • Mislead users and degrade retrieval results.

System Architecture • Web pages and tags → Preprocessing → Web page vectors and tag vectors → SOM training → Synaptic weight vectors → Labeling → Page clusters and tag clusters → Association discovery → Page/tag associations

Preprocessing • Bag of words approach • Web page Pi is transformed to a binary vector Pi. • Ti, which is the tag list of Pi, is transformed to a binary vector Ti.
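A minimal sketch of the binary bag-of-words transformation, assuming a simple lowercase alphanumeric tokenizer. The slides do not say how text was tokenized or whether stop words were removed, so `tokenize`, `build_vocabulary`, and `to_binary_vector` are illustrative names and choices only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Sorted list of all distinct tokens seen in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def to_binary_vector(text, vocabulary):
    """1 where the vocabulary term occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]

pages = ["Red apple", "green apple"]
vocabulary = build_vocabulary(pages)
page_vectors = [to_binary_vector(p, vocabulary) for p in pages]
```

The same functions would be applied to tag lists with a separate vocabulary, since the slides report different vocabulary sizes for pages and tags.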

SOM Training • The page vectors Pi and the tag vectors Ti were used to train two self-organizing maps separately. • Two maps, MP and MT, were obtained after training.
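For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of standard SOM training. The Gaussian neighborhood, the linear decay schedules, and the parameters `lr0` and `seed` are assumptions made for illustration; the slides give only map sizes and epoch counts, not these details.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron closest to x in squared Euclidean distance."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(x, w))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols map; returns weights[r][c] as a list of floats."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)  # assumed shrinking neighborhood
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the neuron toward the input, scaled by neighborhood.
                        w[k] += lr * h * (x[k] - w[k])
    return weights

# Toy binary vectors standing in for page (or tag) vectors.
data = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
som = train_som(data, rows=3, cols=3, epochs=20)
```

Under this sketch, MP and MT would be two separate calls to `train_som`, one on the page vectors and one on the tag vectors.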

Labeling • We labeled each Web page on MP by finding its most similar neuron. A page cluster map (PCM) was obtained after all pages were labeled. • The same approach was applied to all tag lists on MT to obtain the tag cluster map (TCM).
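The labeling step can be sketched as a nearest-neuron assignment. This sketch assumes "most similar" means smallest squared Euclidean distance, which the slides do not state, and it stores neurons as a flat list for brevity; `most_similar_neuron` and `label_map` are hypothetical names.

```python
from collections import defaultdict

def most_similar_neuron(weights, x):
    """Index of the neuron whose weight vector is closest to x (assumed Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda n: sum((a - b) ** 2 for a, b in zip(x, weights[n])),
    )

def label_map(weights, vectors):
    """Map each neuron index to the list of item indices labeled to it."""
    clusters = defaultdict(list)
    for i, x in enumerate(vectors):
        clusters[most_similar_neuron(weights, x)].append(i)
    return dict(clusters)

# Two toy neurons and three toy items.
toy_weights = [[0.0, 0.0], [1.0, 1.0]]
toy_items = [[0.0, 0.0], [1.0, 0.9], [0.1, 0.0]]
cluster_map = label_map(toy_weights, toy_items)
```

Each neuron with at least one item becomes a cluster, so the resulting dictionary plays the role of the PCM (for pages) or the TCM (for tag lists).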

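Association Discovery • Finding associations between page clusters and tag clusters. • We used a voting scheme to find the associations. • (Figure: a page Pi labeled to a cluster on the PCM and its tag list Ti labeled to a cluster on the TCM add one vote, +1, to that pair of clusters.)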
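One way to read the voting scheme is as a count over cluster pairs. In the sketch below, `page_labels[i]` and `tag_labels[i]` are the cluster labels of the i-th page and its tag list. The `min_votes` cutoff is a hypothetical parameter, since the slides do not say how vote counts are turned into associations.

```python
from collections import Counter

def vote_associations(page_labels, tag_labels, min_votes=1):
    """Count one vote per (page cluster, tag cluster) pair and keep frequent pairs."""
    votes = Counter(zip(page_labels, tag_labels))
    related = {pair for pair, count in votes.items() if count >= min_votes}
    return votes, related

# Toy labels: three training pairs, two of which agree on clusters (0, 2).
votes, related = vote_associations([0, 0, 1], [2, 2, 3], min_votes=2)
```

The set `related` corresponds to the nonzero entries of a page-cluster by tag-cluster association matrix.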

Architecture of Tag Spam Detection • Incoming Web page and incoming tag list → Preprocessing → Incoming page vector and incoming tag vector → Labeling (using PCM and TCM) → Labeled page cluster and labeled tag cluster → Spam detection (using page/tag associations) → Tag spams

Spam Detection • Two types of tag spam detection • Document-scope detection (post-level detection) • The whole tag list is identified as spam. • Tag-scope detection (tag-level detection) • Individual tags are identified as spam. • Let PI and TI be the incoming Web page and its tag list, respectively. • Let PI and TI be labeled to a page cluster and a tag cluster, respectively.

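Document-Scope Detection • Relatedness between the page cluster and the tag cluster: (formula not preserved in this transcript) • Q: neighborhood of (symbol not preserved) • A = [aij] is the correlation matrix between PCM and TCM; apk = 1 if page cluster p and tag cluster k are related, otherwise apk = 0 • D: geometric distance between two clusters • TI is identified as spam if (condition not preserved in this transcript)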

Tag-Scope Detection • A tag is spam if it is inconsistent with the other tags in the same tag cluster. • Let Ti = {tij} be a tag list and W (definition not preserved in this transcript) • An incoming tag tIj ∈ TI is spam if tIj ∉ W.
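The tag-level rule can be illustrated with a small sketch. It rests on an assumption that should be stated plainly: the definition of W is missing from this transcript, so the code takes W to be the union of all tags in the training tag lists labeled to the same tag cluster as the incoming tag list. The function names are hypothetical.

```python
def cluster_tag_set(training_tag_lists):
    """W (assumed): union of the tags of all training tag lists in one tag cluster."""
    W = set()
    for tags in training_tag_lists:
        W |= set(tags)
    return W

def spam_tags(incoming_tags, W):
    """Tags of the incoming list that do not occur in W, in their original order."""
    return [tag for tag in incoming_tags if tag not in W]

# Toy cluster containing two training tag lists.
W = cluster_tag_set([["python", "code"], ["code", "tutorial"]])
flagged = spam_tags(["python", "casino"], W)
```

Under this reading, a tag is flagged only because it never co-occurs with the cluster's known tags, so rare but legitimate tags could also be flagged.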

Experimental Result • Dataset • 1500 Web page / tag list pairs collected from www.delicious.com • Each pair was manually inspected at both the post level and the tag level • 583 distinct Web pages • Vocabulary sizes • Web pages: 13437 • Tag lists: 5157 • Average number of tags per page: 4.7

Experimental Result • Parameters • Map sizes • PCM: 10 × 10 • TCM: 10 × 10 • Training epochs • PCM: 400 • TCM: 200 • Parameter (symbol not preserved in this transcript): 0.7

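Experimental Result • Number of training / test data: 1000 / 500 • Confusion matrix for document-scope detection (values implied by the formulas: true positives 118, false negatives 44, false positives 65, true negatives 273) • Accuracy = (118 + 273) / 500 = 78.2% • Recall = 118 / (118 + 44) = 72.8% • Precision = 118 / (118 + 65) = 64.5%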

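Further Result of Document-Scope Detection • Result after 10-fold cross validation • Confusion matrix (values implied by the formulas: true positives 123.1, false negatives 43.6, false positives 62.3, true negatives 271) • Accuracy = (123.1 + 271) / 500 = 78.8% • Recall = 123.1 / (123.1 + 43.6) = 73.8% • Precision = 123.1 / (123.1 + 62.3) = 66.4%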

Further Result of Tag-Scope Detection • Result after 10-fold cross validation • Confusion matrix, in average number of tags per page (values implied by the formulas: true positives 1.4, false negatives 0.4, false positives 0.7, true negatives 2.2) • Accuracy = (1.4 + 2.2) / 4.7 = 76.6% • Recall = 1.4 / (1.4 + 0.4) = 77.8% • Precision = 1.4 / (1.4 + 0.7) = 66.7%
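The accuracy, recall, and precision figures on the three result slides can be recomputed from the counts that appear in their formulas. The sketch below treats spam as the positive class, which is what the formulas imply; the function name `detection_metrics` is illustrative.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, recall, and precision (as percentages) with spam as the positive class."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "recall": 100.0 * tp / (tp + fn),
        "precision": 100.0 * tp / (tp + fp),
    }

# Counts taken from the formulas on the slides.
single_split = detection_metrics(tp=118, tn=273, fp=65, fn=44)
doc_scope_cv = detection_metrics(tp=123.1, tn=271, fp=62.3, fn=43.6)
tag_scope_cv = detection_metrics(tp=1.4, tn=2.2, fp=0.7, fn=0.4)

for name, result in [("single split", single_split),
                     ("document scope, 10-fold", doc_scope_cv),
                     ("tag scope, 10-fold", tag_scope_cv)]:
    print(name, {key: round(value, 1) for key, value in result.items()})
```

In each case the four counts sum to the stated total (500 test pairs, or 4.7 tags per page), so the reported percentages are internally consistent.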

Conclusions • A novel scheme for tag spam detection based on text mining. • Relatedness between Web pages and tags was discovered using self-organizing maps. • The scheme uses only the content of Web pages rather than user behavior.

Thanks for your attention.
